Deep Residual Learning for Image Recognition Explained with Working


Deep Residual Network Definition

He et al. proposed the deep residual network (ResNet) for image recognition; the paper appeared on arXiv in late 2015 and was published at CVPR 2016. A ResNet is a deep convolutional neural network (CNN) in which the input to a block of layers is added to that block's output. Because of this skip connection, the network is easier to train and performs better as it gets deeper. The ResNet architecture has been applied to several tasks, including image classification, object detection, and semantic segmentation. Furthermore, because ResNets are built from repeated blocks of layers, their depth can be adjusted by stacking more or fewer blocks. The model's success is attributed to a number of factors, including a large receptive field that captures more context around each pixel of an image, the separation of the localization and classification stages, efficient computation at higher levels, effective encoding schemes with simple arithmetic operations, and improved accuracy as features are extracted deeper in the network. Each issue is explained in the following sections.
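The core idea above, adding a block's input to its output, can be sketched in a few lines. This is a minimal illustration in plain Python, not the authors' implementation: the names `residual_block` and `residual_fn` are hypothetical, vectors stand in for feature maps, and `residual_fn` stands in for the stacked convolutional layers F.

```python
from typing import Callable, List

Vector = List[float]


def relu(v: Vector) -> Vector:
    """Element-wise ReLU."""
    return [max(0.0, a) for a in v]


def residual_block(x: Vector, residual_fn: Callable[[Vector], Vector]) -> Vector:
    """Return relu(F(x) + x), where F is `residual_fn` (hypothetical stand-in
    for the stacked layers inside the block)."""
    fx = residual_fn(x)
    if len(fx) != len(x):
        # The identity shortcut only works when shapes match.
        raise ValueError("F(x) and x must have the same length")
    return relu([f + a for f, a in zip(fx, x)])


def zero_fn(v: Vector) -> Vector:
    """A residual function that has learned 'nothing': F(x) = 0."""
    return [0.0 for _ in v]
```

With `zero_fn`, the block simply passes a non-negative input through unchanged, which is why adding such a block should not make a network worse: the layers only need to learn the difference from the identity.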

Issues with Traditional Convolutional Networks

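Traditional (plain) CNNs have a significant flaw: each stack of layers must learn the complete mapping from its input to its output, rather than only the change relative to its input. As such networks get deeper, they become slow and expensive to train, and the original paper reports that very deep plain networks can even show higher training error than shallower ones.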

Solution with Residual Network

ResNets are a class of neural networks proposed as an alternative to conventional CNNs. In particular, ResNets make use of skip connections (explained later), which let them be much deeper than conventional CNNs while remaining trainable; the 34-layer ResNet in the original paper also needs far fewer operations than VGG-19. Any neural network architecture can use skip connections, and convolutional neural networks benefit in particular because features computed in earlier layers can be reused by later ones.

Network Architectures

In this section, we discuss three architectures and show how the residual network differs from traditional networks. Figure 1.1 below shows three networks: VGG-19 and the 34-layer plain network are traditional deep neural networks, whereas the 34-layer residual network is a residual neural network. The residual network clearly differs in its skip connections, and the same pattern generalizes to other depths.

Starting from the plain network described above, we insert shortcut connections to turn it into its residual counterpart. When the input and output have the same dimensions, identity shortcuts can be used directly (solid line shortcuts in Fig. 1.2). When the dimensions increase (dotted line shortcuts in Fig. 1.1), we consider two options: (A) The shortcut still performs identity mapping, with extra zero entries padded to account for the increased dimensions. This option adds no new parameters. (B) A projection shortcut uses 1×1 convolutions to match the dimensions. For both options, when the shortcuts go across feature maps of different sizes, they are performed with a stride of 2.
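The two shortcut options can be sketched as follows. This is a simplified illustration in plain Python under stated assumptions: feature maps are nested lists indexed as [channel][row][column], and the function names (`subsample`, `shortcut_zero_pad`, `shortcut_projection`) are hypothetical, not taken from any library.

```python
from typing import List

FeatureMap = List[List[List[float]]]  # indexed as [channel][row][col]


def subsample(x: FeatureMap, stride: int = 2) -> FeatureMap:
    """Keep every `stride`-th row and column of each channel."""
    return [[row[::stride] for row in channel[::stride]] for channel in x]


def shortcut_zero_pad(x: FeatureMap, out_channels: int, stride: int = 2) -> FeatureMap:
    """Option A: identity shortcut with stride, plus all-zero extra channels.

    Adds no learnable parameters.
    """
    y = subsample(x, stride)
    extra = out_channels - len(y)
    if extra < 0:
        raise ValueError("out_channels must be >= number of input channels")
    h, w = len(y[0]), len(y[0][0])
    return y + [[[0.0] * w for _ in range(h)] for _ in range(extra)]


def shortcut_projection(x: FeatureMap, weights: List[List[float]], stride: int = 2) -> FeatureMap:
    """Option B: 1x1 convolution with stride.

    `weights[o][i]` is the (learnable) weight from input channel i to output
    channel o. A strided 1x1 convolution equals subsampling followed by a
    per-pixel linear map over channels.
    """
    y = subsample(x, stride)
    h, w = len(y[0]), len(y[0][0])
    return [
        [[sum(wo[i] * y[i][r][c] for i in range(len(y))) for c in range(w)]
         for r in range(h)]
        for wo in weights
    ]
```

Both functions halve the spatial size and raise the channel count, so their output can be added to the output of the stacked layers. Option A is parameter-free, while option B learns one weight per pair of input and output channels.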


Figure 1.1 shows the VGG architecture, the 34-layer plain network, and the 34-layer residual network. All of them are deep convolutional neural networks. VGG comes in several versions, such as VGG16 and VGG19. Both follow the same design and differ only in depth: VGG16 has 16 weight layers and VGG19 has 19.

Key points about VGG19 and the 34-layer Plain Network

The following key points are common to both VGG19 and the 34-layer plain network.

  • The network accepts a 224×224 RGB image as input, so the input tensor has shape (224, 224, 3).
  • Preprocessing computes the mean RGB value over the training set and subtracts it from each pixel of an image.
  • Convolutions use a 3×3 kernel with a stride of 1 pixel, whereas max pooling uses a 2×2 window with a stride of 2.
  • Spatial padding is used so that convolutions preserve the image's spatial resolution.
  • The Rectified Linear Unit (ReLU) is used as the activation function, in contrast to earlier networks that used tanh, sigmoid, etc.
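The arithmetic behind these points can be checked with a short sketch. It uses the standard output-size formula for convolution and pooling; the padding of 1 pixel for the 3×3 convolution, the count of five pooling stages, and the example pixel and mean values are assumptions made for illustration, and the helper names are hypothetical.

```python
from typing import Tuple


def conv_output_size(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Spatial output size of a convolution or pooling layer along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def subtract_mean(pixel: Tuple[float, float, float],
                  mean_rgb: Tuple[float, float, float]) -> Tuple[float, ...]:
    """Subtract the per-channel dataset mean from one RGB pixel."""
    return tuple(p - m for p, m in zip(pixel, mean_rgb))


# A 3x3 convolution with stride 1 and 1 pixel of padding keeps 224 -> 224.
same_size = conv_output_size(224, kernel=3, stride=1, padding=1)

# A 2x2 max pooling with stride 2 halves the resolution: 224 -> 112.
pooled = conv_output_size(224, kernel=2, stride=2)

# Five pooling stages in a row (an assumption matching VGG-style networks).
size = 224
for _ in range(5):
    size = conv_output_size(size, kernel=2, stride=2)
# size is now 7

# Illustrative values only; real means are computed from the training set.
centered = subtract_mean((120.0, 110.0, 100.0), (100.0, 100.0, 100.0))
```

This shows why padding matters: without it, each 3×3 convolution would shrink the feature map by 2 pixels per axis, and only the pooling layers are meant to reduce resolution.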

Skip function in the residual block

The third part of Figure 1.1 shows the residual architecture, whose building block is illustrated in Figure 1.2. In summary, skip connections in deep networks are connections that send a layer's output to later layers in the neural network that are not directly adjacent to the layer the output came from.

Image Recognition using ResNet (Residual Networks)

The ResNet model was proposed to make very deep networks trainable, a difficulty often attributed to the vanishing gradient problem. The idea is to use a skip connection that carries a block's input forward to the following layer, so that the layers in between only have to learn the residual and the model can keep training as it gets deeper. Thanks to ResNet, CNN models can be made much deeper.
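A toy calculation illustrates why a skip connection helps gradients survive. It assumes each layer is a scalar function with a small local derivative d. For a plain layer y = f(x) the gradient is scaled by d, whereas for a residual layer y = x + f(x) it is scaled by 1 + d. The function names are hypothetical and the numbers are chosen only for illustration.

```python
from typing import List


def plain_gradient(local_grads: List[float]) -> float:
    """Gradient through a chain of plain layers: product of the d values."""
    g = 1.0
    for d in local_grads:
        g *= d
    return g


def residual_gradient(local_grads: List[float]) -> float:
    """Gradient through a chain of residual layers: product of (1 + d)."""
    g = 1.0
    for d in local_grads:
        g *= 1.0 + d
    return g


# Twenty layers, each with a small local derivative.
shrinking = plain_gradient([0.1] * 20)       # on the order of 1e-20
preserved = residual_gradient([0.01] * 20)   # stays close to 1
```

In the plain chain the gradient shrinks multiplicatively with depth, while in the residual chain the identity path contributes the constant 1 to each factor, so the signal reaching early layers does not vanish.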

ResNets come in different versions, such as ResNet50 and ResNet101, named after their number of layers. They are commonly used models for image classification: given an unseen image, a trained model can predict the class of the image with high accuracy.


We cover different topics such as:

  1. What is a deep residual network?
  2. Why was the residual block introduced?
  3. What are the benefits?
  4. How can residual networks be used in image classification?


What purposes do residual networks serve?

Residual networks have become increasingly popular for image recognition and classification because they mitigate vanishing and exploding gradients when additional layers are added to a deep neural network. That said, a ResNet with a thousand layers currently has little practical use.

Why would ResNet use skip connections?

ResNet uses shortcuts, or skip connections, to jump over some layers. Typical ResNet models are built from double- or triple-layer skips that contain ReLU nonlinearities, with batch normalization in between.
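The ordering of layers inside a double-layer skip can be sketched as below. This is a simplified sketch in plain Python, not a faithful ResNet implementation: dense (fully connected) layers stand in for convolutions, batch normalization omits its learnable scale and shift, and all function names are hypothetical.

```python
import math
from typing import List

Batch = List[List[float]]  # indexed as [sample][feature]


def linear(x: Batch, w: List[List[float]]) -> Batch:
    """Dense layer without bias; w[o][i] maps input feature i to output o."""
    return [[sum(wo[i] * row[i] for i in range(len(row))) for wo in w] for row in x]


def batch_norm(x: Batch, eps: float = 1e-5) -> Batch:
    """Normalize each feature to zero mean and unit variance over the batch."""
    n, d = len(x), len(x[0])
    means = [sum(row[j] for row in x) / n for j in range(d)]
    variances = [sum((row[j] - means[j]) ** 2 for row in x) / n for j in range(d)]
    return [[(row[j] - means[j]) / math.sqrt(variances[j] + eps) for j in range(d)]
            for row in x]


def relu(x: Batch) -> Batch:
    """Element-wise ReLU."""
    return [[max(0.0, v) for v in row] for row in x]


def two_layer_residual_block(x: Batch, w1: List[List[float]], w2: List[List[float]]) -> Batch:
    """layer -> BN -> ReLU -> layer -> BN -> add input -> ReLU."""
    out = relu(batch_norm(linear(x, w1)))
    out = batch_norm(linear(out, w2))
    summed = [[o + a for o, a in zip(row_out, row_in)] for row_out, row_in in zip(out, x)]
    return relu(summed)
```

Note that the input is added before the final ReLU, so the skip path itself stays free of any weight layer. If the second layer's weights are all zero, the block reduces to applying ReLU to its input.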
