Convolutional Neural Networks

Amima Shifa
Aug 13, 2021
CNN for Digits Classification

What is image classification?

Image classification is a fundamental task that attempts to comprehend an entire image as a whole. The goal is to classify the image by assigning it to a specific label. Typically, image classification refers to images in which only one object appears and is analyzed. In contrast, object detection involves both classification and localization, and is used to analyze more realistic cases in which multiple objects may appear in an image.

A breakthrough in building models for image classification came with the discovery that a convolutional neural network (CNN) could be used to progressively extract higher- and higher-level representations of the image content. Instead of preprocessing the data to derive features like textures and shapes, a CNN takes just the image’s raw pixel data as input and “learns” how to extract these features, and ultimately infer what object they constitute.

Convolutional Neural Network (CNN):

A convolutional neural network is an approach to image recognition and processing that is specifically designed to operate on pixel data.

To start, the CNN receives an input feature map: a three-dimensional matrix where the size of the first two dimensions corresponds to the height and width of the image in pixels. The size of the third dimension is 3, corresponding to the 3 channels of a color image: red, green, and blue. The CNN comprises a stack of modules, each of which performs three operations.
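
As a rough illustration, here is a minimal NumPy sketch of such an input feature map. The 28x28 image size and the random pixel values are arbitrary assumptions chosen only to show the shape, not values taken from the article:

```python
import numpy as np

# Minimal sketch: one 28x28 RGB image as a 3-D input feature map.
# The size and the random values are illustrative assumptions only.
image = np.random.rand(28, 28, 3)

print(image.shape)       # (28, 28, 3) -> height, width, channels
print(image.shape[-1])   # 3 channels: red, green, blue
```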

A convolution extracts tiles of the input feature map, and applies filters to them to compute new features, producing an output feature map, or convolved feature (which may have a different size and depth than the input feature map). Convolutions are defined by two parameters:

  • Size of the tiles that are extracted (typically 3x3 or 5x5 pixels).
  • The depth of the output feature map, which corresponds to the number of filters that are applied.

During a convolution, the filters (matrices with the same dimensions as the tiles) effectively slide over the input feature map’s grid horizontally and vertically, one pixel at a time, extracting each corresponding tile.

For each filter-tile pair, the CNN performs element-wise multiplication of the filter matrix and the tile matrix, and then sums all the elements of the resulting matrix to get a single value. Each of these resulting values for every filter-tile pair is then output in the convolved feature matrix.
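
The multiply-and-sum step above can be sketched directly in NumPy. The function name, the 5x5 input, and the 3x3 filter values below are illustrative assumptions, not part of the original description:

```python
import numpy as np

def convolve2d(feature_map, kernel):
    """Single-channel convolution with stride 1 and no padding (sketch only)."""
    kh, kw = kernel.shape
    out_h = feature_map.shape[0] - kh + 1
    out_w = feature_map.shape[1] - kw + 1
    output = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            tile = feature_map[i:i + kh, j:j + kw]   # extract the tile under the filter
            output[i, j] = np.sum(tile * kernel)     # element-wise multiply, then sum
    return output

feature_map = np.arange(25, dtype=float).reshape(5, 5)   # toy 5x5 input
kernel = np.array([[0., 1., 0.],
                   [1., -4., 1.],
                   [0., 1., 0.]])                        # toy 3x3 filter
print(convolve2d(feature_map, kernel))                   # 3x3 convolved feature
```

Each output value corresponds to one filter-tile pair; a 5x5 input and a 3x3 filter therefore produce a 3x3 convolved feature.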

During training, the CNN “learns” the optimal values for the filter matrices that enable it to extract meaningful features (textures, edges, shapes) from the input feature map. As the number of filters (output feature map depth) applied to the input increases, so does the number of features the CNN can extract. However, the tradeoff is that filters consume the majority of the resources expended by the CNN, so training time also increases as more filters are added. Additionally, each filter added to the network provides less incremental value than the previous one, so engineers aim to construct networks that use the minimum number of filters needed to extract the features necessary for accurate image classification.
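
To make the two convolution parameters concrete, here is a hedged sketch of a single convolutional layer in Keras (assuming TensorFlow is installed; the 32 filters and 3x3 kernel are example values, not values from the text):

```python
import tensorflow as tf

# Sketch only: one convolutional layer whose arguments mirror the two
# parameters described above. The specific numbers are illustrative.
conv = tf.keras.layers.Conv2D(
    filters=32,               # depth of the output feature map (number of filters)
    kernel_size=(3, 3),       # size of the extracted tiles
    activation="relu",
    input_shape=(28, 28, 3),  # height, width, and 3 color channels
)
```

Increasing `filters` lets the layer extract more features at the cost of more computation, which is the tradeoff described above.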

Some Important CNN Architectures:

  • LeNet-5: The LeNet-5 architecture is perhaps the most widely known CNN architecture. It was created by Yann LeCun in 1998 and widely used for handwritten digit recognition (MNIST).
  • AlexNet: The AlexNet CNN architecture won the 2012 ImageNet ILSVRC challenge by a large margin: it achieved a 17% top-5 error rate, while the second-best entry achieved 26%. It was developed by Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. It is quite similar to LeNet-5, only much larger and deeper, and it was the first to stack convolutional layers directly on top of each other, instead of stacking a pooling layer on top of each convolutional layer.
  • VGG-16: As seen with AlexNet, CNNs were getting deeper and deeper, and the most straightforward way to improve the performance of a deep neural network is to increase its size. The Visual Geometry Group (VGG) introduced VGG-16, which has 13 convolutional and 3 fully connected layers and retains the ReLU activation function from AlexNet.
  • ResNet: Last but not least, the winner of the ILSVRC 2015 challenge was the residual network (ResNet), developed by Kaiming He et al., which delivered an astounding top-5 error rate under 3.6% using an extremely deep CNN composed of 152 layers. The key to being able to train such a deep network is skip connections: the signal feeding into a layer is also added to the output of a layer located a bit higher up the stack. A simplified sketch of such a skip connection follows this list.
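
To illustrate the skip connection mentioned for ResNet, here is a minimal Keras sketch of a simplified residual block. The `residual_block` helper name, layer sizes, and toy input shape are assumptions for illustration, not ResNet's exact configuration:

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters=64):
    """Simplified residual block: the input skips over two conv layers
    and is added to their output. All sizes here are illustrative."""
    shortcut = x
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.Add()([shortcut, y])        # skip connection: add input to output
    return layers.Activation("relu")(y)

inputs = tf.keras.Input(shape=(32, 32, 64))   # toy input with 64 channels
outputs = residual_block(inputs)
model = tf.keras.Model(inputs, outputs)
model.summary()
```

The addition lets gradients flow through the shortcut path, which is what makes very deep stacks of such blocks trainable.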

A Few Applications of CNNs:

  • Image processing and recognition
  • Pattern recognition
  • Speech recognition
  • Natural language processing
  • Video analysis
