How Do Neural Networks Learn?

by Carson

You might have heard of the term “neural networks”, the framework behind most of modern machine learning and artificial intelligence. You need data to train a neural network, but how does it process that data to learn from it? Let’s find out in this article.

Defining the Problem

To know how to train your neural network, you should first define the problem mathematically. Is it a regression problem, where you try to predict the numerical value of a target label? Or is it a classification problem, where you predict the class that a data point belongs to?

This step matters because neural network training is an optimization process. Therefore, you need an objective function that your model can minimize or maximize. Specifically, in machine learning, these objectives are by convention “loss functions” that are minimized throughout training. In other words, a small loss value indicates that your neural network is well-trained and performing well.

For regression problems, you want to minimize the deviation of the model predictions from the ground truth values. Common loss functions include MSE (mean squared error), MAE (mean absolute error), and RMSE (root mean squared error). Here’s the MSE, for example:

MSE = (1/n) Σ (y_i − y_hat_i)², summed over i = 1 to n

Where y_i is the ground truth value, y_hat_i is the predicted value, and n is the number of samples used in the calculation of the MSE loss.
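The MSE formula above can be written out directly in a few lines of Python. This is a minimal sketch (the function name `mse` is just for illustration):

```python
# Mean squared error: the average of the squared differences between
# predictions and ground-truth values.
def mse(y_true, y_pred):
    n = len(y_true)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / n

# Predictions off by 1 and by 3 give MSE = (1 + 9) / 2 = 5.0
print(mse([2.0, 5.0], [3.0, 8.0]))  # → 5.0
```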

For classification problems, you want your model to predict a probability distribution over classes that resembles the ground truth distribution, which is a one-hot distribution over the correct category. This means that the correct class gets 100% confidence while all other classes get 0% confidence. Therefore, in this case, the cross-entropy loss is suitable.
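With a one-hot target, the cross-entropy loss reduces to the negative log of the probability the model assigns to the correct class. Here is a minimal sketch (the function name is illustrative):

```python
import math

# Cross-entropy between a one-hot ground-truth distribution and the
# model's predicted probabilities. With a one-hot target, only the
# correct class contributes: loss = -log(p_correct).
def cross_entropy(target_one_hot, predicted_probs):
    return -sum(t * math.log(p)
                for t, p in zip(target_one_hot, predicted_probs) if t > 0)

# The correct class gets 80% confidence: loss = -ln(0.8) ≈ 0.223
print(cross_entropy([0, 1, 0], [0.1, 0.8, 0.1]))
```

Note that the loss is 0 only when the model assigns 100% probability to the correct class, and it grows without bound as that probability approaches 0.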

Minimizing the Loss

The optimization problem that the training process tries to solve is very complicated: it has to account for every single parameter of the neural network as well as every single data point in your training dataset. Therefore, it is not practical to use symbolic solving methods, where you find a set of parameters at which the derivative of the loss with respect to every parameter is zero. (This means that when you change any parameter by an infinitesimally small amount, the loss function does not change.) Instead, a numerical method known as gradient descent is used to solve this problem.

Picture a ball rolling down a simple valley. If the surface the ball lies on is sloping (not flat), it rolls down in the direction of the slope. If there is friction, then after the ball passes the lowest point and rises up the other side, it reaches a lower height than before, and then rolls down again. Eventually, the ball settles at the lowest spot of the valley.

Gradient Descent

The numerical method of gradient descent takes a similar approach. It is an iterative procedure that updates the weights, step by step, from their initial values to the trained model weights. By analyzing the gradient of the loss function with respect to the parameters (how much the loss value changes when one of the parameters changes slightly), the optimizer computes the direction of steepest ascent and steps in the opposite direction (the negative of the gradient) to lower the loss value.

Specifically, the update rule of gradient descent is:

w_(t+1) = w_t − η∇f(w_t)

Where w_t is the current set of weights, w_(t+1) is the updated set of weights, η is the learning rate (the proportionality constant controlling the size of the weight update relative to the gradient), and ∇f(w_t) is the gradient of the loss function with respect to the weights.
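The update rule can be seen in action on a toy one-dimensional problem. This sketch minimizes f(w) = (w − 3)², whose gradient is 2(w − 3), so the minimum is at w = 3:

```python
# Gradient of the toy loss f(w) = (w - 3)^2
def grad(w):
    return 2 * (w - 3)

w = 0.0      # initial weight
eta = 0.1    # learning rate

# Repeatedly apply w_(t+1) = w_t - eta * grad(w_t)
for _ in range(100):
    w = w - eta * grad(w)

print(w)  # converges to ≈ 3.0, the minimum of the loss
```

Each step shrinks the distance to the minimum by a constant factor here, which is why the iterate homes in on w = 3 so quickly.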

How Are the Gradients Obtained?

Looking at the equation, you may notice an important detail that wasn’t talked about: how are the gradients obtained? The gradients are obtained via backpropagation, which utilizes the chain rule in calculus.

This is also why the entire calculation process of the neural network must be differentiable — it allows gradients to run through each layer and thus allows the network to learn by adjusting its weights via the calculated gradients.
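The chain rule at work can be shown on the smallest possible “network”: a single weight. In this sketch, the prediction is y = w·x and the loss is L = (y − t)², so the chain rule gives dL/dw = (dL/dy)·(dy/dw) = 2(y − t)·x:

```python
x, t = 2.0, 10.0   # input and target
w = 1.0            # the single weight

y = w * x                # forward pass
loss = (y - t) ** 2      # loss calculation
dL_dy = 2 * (y - t)      # backward pass: gradient of loss w.r.t. output
dL_dw = dL_dy * x        # chain rule: propagate the gradient to the weight

print(loss, dL_dw)  # 64.0 and -32.0
```

The negative gradient (−32) says the loss decreases if w increases, which makes sense: the prediction (2.0) is far below the target (10.0). In a deep network, backpropagation simply repeats this chain-rule step layer by layer, from the output back to the input.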

Practical Methods for Gradient Descent

In practical applications, there are many variants of gradient descent that boost the quality or convergence speed (in terms of actual computation) over the dataset. For example, instead of using the entire dataset, one can take a step of gradient descent for every batch of data sampled from the dataset. This is called stochastic gradient descent (SGD), and since each batch serves as a rough approximation of the entire training dataset, SGD can make the model converge even before it has scanned through the entire dataset once.
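Mini-batch SGD can be sketched on a toy linear model y = w·x. The data below is generated from the true weight w = 4 (the setup is illustrative, not a real training pipeline), and each update uses the MSE gradient over one small sampled batch:

```python
import random

random.seed(0)
# (input, target) pairs generated from the true weight w = 4
data = [(x / 100, 4.0 * x / 100) for x in range(1, 101)]

w = 0.0          # initial weight
eta = 0.5        # learning rate
batch_size = 8

for step in range(200):
    batch = random.sample(data, batch_size)  # a noisy sample of the dataset
    # Gradient of the mean squared error over the batch w.r.t. w
    g = sum(2 * (w * x - y) * x for x, y in batch) / batch_size
    w -= eta * g

print(w)  # close to the true weight 4.0
```

Each batch gradient is only an approximation of the full-dataset gradient, so individual steps are noisy, but on average they point downhill and the weight still converges.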

Going back to the ball-rolling-down-a-valley analogy, you can see that the ball gets lower and lower over time as it approaches the lowest point in the valley. Something similar happens in the convergence process of neural networks: not only do the gradients shrink in value as the network converges, but the learning rate is also typically decreased over time according to a schedule.
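One common way to decrease the learning rate over time is exponential decay. This is a minimal sketch of such a schedule (the function name and constants are illustrative, not from any particular framework):

```python
# Exponential learning-rate decay: the step size shrinks by a constant
# factor each step, mirroring the ball settling into the valley.
def decayed_lr(initial_lr, decay_rate, step):
    return initial_lr * (decay_rate ** step)

print(decayed_lr(0.1, 0.9, 0))   # 0.1 at the start of training
print(decayed_lr(0.1, 0.9, 10))  # ≈ 0.0349 after 10 steps
```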

There are also smarter variants than plain SGD that allow a neural network to converge even faster and to a better solution. The famous Adam optimizer, for example, dynamically adjusts the effective learning rate of each parameter based on running estimates of its gradients, fostering fast and stable convergence.
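A minimal sketch of the standard Adam update for a single parameter is below: it keeps exponential moving averages of the gradient (m) and the squared gradient (v), corrects their initialization bias, and scales each step by m_hat / sqrt(v_hat). The toy loss f(w) = (w − 3)² is just for demonstration:

```python
import math

def adam_step(w, grad, m, v, t, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad           # moving average of the gradient
    v = b2 * v + (1 - b2) * grad ** 2      # moving average of the squared gradient
    m_hat = m / (1 - b1 ** t)              # bias-corrected first moment
    v_hat = v / (1 - b2 ** t)              # bias-corrected second moment
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# Minimize f(w) = (w - 3)^2 starting from w = 0
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 1001):
    w, m, v = adam_step(w, 2 * (w - 3), m, v, t)

print(w)  # should end close to 3.0
```

Dividing by sqrt(v_hat) is what makes the step size adaptive: parameters with consistently large gradients take proportionally smaller steps, and vice versa.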

Conclusion

In this article, we have walked through the process of training a neural network: the forward pass (feeding the input through the network), the loss calculation, the backpropagation, and the weight update.
