Vanishing Gradient Problem in Deep Learning – Step-by-Step Guide


Deep learning and deep neural networks are powerful and popular algorithms, and a large part of their success can be attributed to the careful design of the network architecture. Deep learning approaches are built on neural networks, a subset of machine learning often referred to as artificial neural networks (ANNs) or simulated neural networks (SNNs). Their structure and nomenclature are modeled after the human brain, mimicking the way biological neurons signal one another. Figure 1.1 depicts the layers of an artificial neural network (ANN): an input layer, one or more hidden layers, and an output layer. Each node, or artificial neuron, is connected to other nodes and has an associated weight and threshold. If a node's output rises above the specified threshold value, the node is activated and passes data on to the next layer of the network; otherwise, it passes nothing on to the next layer.
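As a rough sketch of this idea (the inputs, weights, and threshold below are illustrative values, not from the article), a single node can be modelled as a weighted sum of its inputs plus a bias, which only "fires" if it clears the threshold:

```python
import numpy as np

def neuron(inputs, weights, bias, threshold=0.0):
    """A single artificial neuron: weighted sum plus bias,
    passed on only if it exceeds the threshold."""
    z = np.dot(weights, inputs) + bias
    return 1.0 if z > threshold else 0.0

# Illustrative values: two inputs feeding one node.
x = np.array([0.5, -0.2])
w = np.array([0.8, 0.4])
print(neuron(x, w, bias=0.1))  # fires (1.0): 0.8*0.5 + 0.4*(-0.2) + 0.1 = 0.42 > 0
```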


Vanishing Gradient Problem

Deep networks are now widely used for tasks such as image analysis, and these models are built by stacking many layers on top of one another. One of the main issues in training such systems is the vanishing gradient: the gradient shrinks toward zero and the weights effectively stop being updated. As more layers with certain activation functions are added, the gradients of the loss function approach zero, making the network hard to train.

Why the Vanishing Gradient Problem Occurs

The sigmoid function, for example, squashes a wide range of inputs into a tiny output space between 0 and 1. As a result, even a large change in the input produces only a small change in the sigmoid's output, and its derivative becomes correspondingly small.

The sigmoid function and its derivative illustrate this: as |x| increases, i.e., as the sigmoid's input grows very large or very small, the derivative approaches zero.
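As a minimal sketch of this behaviour (the sample points are arbitrary), the snippet below evaluates the sigmoid and its derivative with NumPy; notice how the derivative collapses toward zero as |x| grows:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # peaks at 0.25 when x = 0

for x in [0.0, 2.0, 5.0, 10.0]:
    print(f"x = {x:5.1f}  sigmoid = {sigmoid(x):.5f}  derivative = {sigmoid_derivative(x):.5f}")
# The derivative falls from 0.25 at x = 0 to roughly 4.5e-5 at x = 10.
```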

Importance of Dealing with the Vanishing Gradient Problem

This isn’t a major issue for shallow networks with only a few layers that use these activations. With many more layers, however, the gradient can become too small for training to succeed. Neural network gradients are computed using backpropagation. Simply put, backpropagation finds the network's derivatives by moving from the final layer back to the first; by the chain rule, the derivatives of each layer are multiplied together down the network (from the last layer to the first) to obtain the derivatives of the earliest layers.

When n hidden layers use an activation like the sigmoid, however, n tiny derivatives are multiplied together, so the gradient decays exponentially as we propagate down to the earliest layers. If the gradient is too small, the weights and biases of those initial layers are barely updated in each training step. Because these early layers are often essential for recognising the fundamental components of the input data, this can hurt the accuracy of the whole network.
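A quick back-of-the-envelope sketch (illustrative only, ignoring the weights) shows how fast this product shrinks: the sigmoid's derivative never exceeds 0.25, so the factor contributed by n sigmoid layers is at most 0.25 raised to the power n.

```python
# Best-case bound on the gradient factor contributed by n sigmoid layers.
max_sigmoid_grad = 0.25
for n_layers in [2, 5, 10, 20]:
    print(f"{n_layers:2d} layers -> gradient factor <= {max_sigmoid_grad ** n_layers:.2e}")
# 2 layers  -> <= 6.25e-02
# 10 layers -> <= 9.54e-07
# 20 layers -> <= 9.09e-13
```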

Possible Solutions

The simplest approach is to use other activation functions, such as ReLU, that do not produce a small derivative. Another option is residual networks, which add residual (skip) connections back to earlier layers. The residual connection adds the block's input, x, directly to its output, giving F(x) + x. Because this skip connection bypasses the activation functions that “squash” the derivatives, the block's overall derivative stays larger.
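As a hedged sketch of the idea (a generic residual block in PyTorch; the layer sizes and the two-Linear body are illustrative choices, not a specific published architecture):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """A minimal residual block: the output is F(x) + x, so the gradient
    can flow through the identity path even if F's derivatives are tiny."""
    def __init__(self, dim):
        super().__init__()
        self.block = nn.Sequential(
            nn.Linear(dim, dim),
            nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, x):
        return self.block(x) + x  # the skip connection adds x directly

x = torch.randn(4, 16)
print(ResidualBlock(16)(x).shape)  # torch.Size([4, 16])
```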

Batch normalisation layers can also help. As mentioned above, the problem arises when a vast input space is mapped to a tiny output space, which makes the derivatives vanish; this is most pronounced when |x| is large. Batch normalisation reduces the issue by normalising the input so that |x| does not reach the outer, flat regions of the sigmoid function.
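As an illustrative PyTorch sketch (the layer sizes are arbitrary), a batch normalisation layer can be placed just before the saturating activation:

```python
import torch
import torch.nn as nn

# BatchNorm1d re-centres and re-scales each batch of pre-activations,
# keeping |x| away from the sigmoid's flat tails.
model = nn.Sequential(
    nn.Linear(32, 64),
    nn.BatchNorm1d(64),  # normalise before the saturating activation
    nn.Sigmoid(),
    nn.Linear(64, 10),
)

x = torch.randn(8, 32)
print(model(x).shape)  # torch.Size([8, 10])
```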

Conclusion

We covered a few things in this article. We began with the basic architecture of a neural network, then explained the vanishing gradient problem, why it matters, and possible solutions.

FAQs

Which layer is most effective at preventing gradient vanishing?

Residual neural networks, or ResNets (not to be confused with recurrent neural networks), are among the most effective approaches to the vanishing gradient problem. Networks whose architecture includes skip connections, also called residual connections, are referred to as ResNets.

How can exploding gradients be prevented?

Generally speaking, exploding gradients can be prevented by carefully designing the network, for example by using a small learning rate, scaling the input data, and choosing a conventional loss function. Even so, exploding gradients can remain a problem in recurrent networks with many input time steps.
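As a rough PyTorch sketch of those suggestions (the model, data, and hyperparameters are purely illustrative): standardise the inputs and pair a conventional loss with a small learning rate.

```python
import torch
import torch.nn as nn

x = torch.randn(64, 8) * 50 + 10          # raw, badly scaled features
x = (x - x.mean(dim=0)) / x.std(dim=0)    # scale the input data

model = nn.Linear(8, 1)
loss_fn = nn.MSELoss()                    # conventional loss function
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)  # small learning rate

y = torch.randn(64, 1)
loss = loss_fn(model(x), y)
loss.backward()
optimizer.step()
```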

Which activation function exhibits the vanishing gradient problem most frequently?

The sigmoid activation is the most common culprit: as shown above, its derivative approaches zero as |x| grows, so stacking many sigmoid layers makes the backpropagated gradient vanish. ReLU-style activations largely avoid this problem.
