How Do Machine Learning Systems Process Images?

by Carson

Often, machine learning systems need to take in image data. They might need to classify an image, identify the objects in it, or generate a description of it. But how do these neural networks do it? How do they process images and turn them into numbers they can work with? Let’s explore this in this article.

Converting Images Into Numbers

The conversion of images into numbers happens before the image is fed into the neural network, not inside it. In fact, an image on a computer is already a bunch of numbers, just like any other file or data structure; otherwise, it could not be stored on a digital computer at all.

But we still need to convert the image file into a usable form: it needs to be decompressed. Most images are stored in a compressed format that a neural network can’t process directly, so the file is expanded out into individual pixel values. Typically, the result is a C×H×W tensor, where C is the number of color channels (usually 3 for the input layer), H is the height, and W is the width. For most computer vision neural networks, the image is also resized so that H equals W.
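As a minimal sketch of this step, here is how an image might be decoded into a C×H×W tensor and resized with Pillow and torchvision (the file name "photo.jpg" and the 224×224 target size are just stand-ins):

```python
from PIL import Image
import torchvision.transforms.functional as TF

img = Image.open("photo.jpg").convert("RGB")   # decompress the file into raw pixels
tensor = TF.pil_to_tensor(img)                 # uint8 tensor of shape (C, H, W)
print(tensor.shape)                            # e.g. torch.Size([3, 480, 640])

tensor = TF.resize(tensor, [224, 224])         # resize so that H equals W
```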

The pixel values, which range from 0 to 255, are typically rescaled to the range 0 to 1. Finally, the data is normalized to a specific mean and standard deviation. After that, the image is ready for the feature extraction process described below.
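Continuing the sketch above, the rescaling and normalization could look like this (the ImageNet channel statistics below are a common choice, not a requirement):

```python
import torch

x = tensor.float() / 255.0   # rescale 0-255 pixel values to the 0-1 range

# Normalize each channel to a chosen mean and standard deviation.
mean = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
std = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)
x = (x - mean) / std
```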

Extracting Features From an Image

The primary mechanism (and the most valuable component) of a computer vision neural network is the feature extraction process. It produces a summary of the features of the image that can be used for classification or other downstream tasks, much like how we notice the color, texture, and shape of an object in order to recognize it.

Convolutional Neural Networks

One of the most common ways this is done is by passing a convolution kernel over the image. The effect of that convolution is best explained with a diagram:

An illustration of the convolution process: a filter scans across the input matrix, producing the value of one output cell at each position

Basically, the convolution kernel is slid over each pixel of the image, and it can detect different types of features, such as lines, sharp edges, or other patterns that are learned from data by a deep neural network known as a convolutional neural network (CNN).
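To make this concrete, here is a small sketch that slides a hand-written 3×3 vertical-edge (Sobel-style) kernel over the normalized image `x` from the earlier snippet; in a real CNN, kernels like this are learned from data rather than set by hand:

```python
import torch
import torch.nn.functional as F

# A hand-written 3x3 kernel that responds to vertical edges.
kernel = torch.tensor([[-1., 0., 1.],
                       [-2., 0., 2.],
                       [-1., 0., 1.]]).reshape(1, 1, 3, 3)

gray = x.mean(dim=0, keepdim=True).unsqueeze(0)   # (1, 1, 224, 224) grayscale input
edges = F.conv2d(gray, kernel, padding=1)         # slide the kernel over every pixel
```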

The convolutional kernel slid over the image is usually far smaller than the image itself (common kernel sizes are 3×3, 5×5, or 7×7 on images that are at least hundreds of pixels wide), so the early convolutional layers emphasize the extraction of local features. As the image passes through more convolutional layers and pooling layers (which reduce the height and width of the feature maps, bringing distant parts of the image closer together), the layers start extracting global features in addition to local ones, helping the network understand the overall structure of the image for downstream tasks.
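A toy convolutional feature extractor illustrating this shrinking of height and width (the channel counts and layer sizes here are arbitrary, and `x` is the normalized 3×224×224 tensor from before):

```python
import torch.nn as nn

# Each conv/pool stage halves H and W while increasing the channel count,
# so deeper layers "see" larger regions of the original image.
features = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 224 -> 112
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 112 -> 56
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 56 -> 28
)
feature_map = features(x.unsqueeze(0))   # (1, 64, 28, 28)
```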

Finally, the output of passing the image through this convolutional feature extractor, known as the “feature map”, is obtained. For classification problems, an adaptive pooling layer is usually applied to fix the size of the feature map. The feature map is then flattened into a vector and passed into a classification head (e.g., a linear layer) for the final classification.
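Continuing the toy extractor above, the pooling, flattening, and classification head could look like this (assuming 10 target classes, an arbitrary choice for illustration):

```python
import torch.nn as nn

head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),   # (1, 64, 28, 28) -> (1, 64, 1, 1), regardless of input H and W
    nn.Flatten(),              # (1, 64, 1, 1) -> (1, 64)
    nn.Linear(64, 10),         # one logit per class
)
logits = head(feature_map)     # (1, 10)
```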

Vision Transformers

Another approach to extracting features from an image uses attention-based transformer models. First, the image is split into patches, either by direct splitting or through convolutional layers (an approach studied in the paper “Early Convolutions Help Transformers See Better”). Then these patches are flattened, and image embeddings are obtained by applying a linear transformation to each flattened patch.

An illustration of the patch embedding process used in vision transformers
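A common way to implement patch embedding is a convolution whose kernel size and stride both equal the patch size, which splits the image into non-overlapping patches and linearly projects each one. A minimal sketch, with an illustrative patch size of 16 and embedding width of 192:

```python
import torch
import torch.nn as nn

patch_size, embed_dim = 16, 192
patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)

image = torch.randn(1, 3, 224, 224)          # dummy image batch
patches = patch_embed(image)                 # (1, 192, 14, 14)
tokens = patches.flatten(2).transpose(1, 2)  # (1, 196, 192): a sequence of patch embeddings
```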

This converts an image modeling problem into a sequence modeling problem. In language modeling (the original use case for transformers), the attention mechanism lets the word embeddings (the model’s internal representations of words) take on richer meanings given their context. A similar thing happens in vision transformers, where image embeddings influence each other through the attention layers. This allows the model to capture both local and global features from the get-go, since even very distant image embeddings can influence each other in the first attention step. This architecture is known as a vision transformer (ViT).

For classification problems, instead of flattening the whole feature map, usually only one column of it is used: the one corresponding to the [CLS] token, a special marker for classification. Through the attention mechanism, the embedding at the [CLS] position comes to capture a good summary of the entire image.
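A minimal sketch of this idea, continuing from the `tokens` tensor above (positional embeddings are omitted for brevity, and the layer sizes and class count are illustrative, not taken from any particular model):

```python
import torch
import torch.nn as nn

embed_dim, num_classes = 192, 10
cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))   # learned [CLS] embedding
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=embed_dim, nhead=4, batch_first=True),
    num_layers=2,
)
classifier = nn.Linear(embed_dim, num_classes)

# Prepend the [CLS] token to the patch sequence and run the encoder.
seq = torch.cat([cls_token.expand(tokens.size(0), -1, -1), tokens], dim=1)  # (1, 197, 192)
out = encoder(seq)                 # attention mixes information between all positions
logits = classifier(out[:, 0])     # classify from the [CLS] position only: (1, 10)
```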

More Recent Approaches

More recently, other approaches, mostly built on the vision transformer pipeline, have been proposed in research. For example, one paper proposed using an LSTM (long short-term memory network) instead of an attention-based transformer for the sequence modeling part of the vision transformer pipeline.

And since convolutional neural networks and vision transformers have different inductive biases (CNNs assume that images should be modeled hierarchically, from fine local features to coarse global ones, while vision transformers do not build in that assumption), attempts have been made to modify CNNs and ViTs so that each reaps the benefits of the other. ConvNeXt is one example: it gradually modifies a ResNet (a type of CNN) according to the design choices found in a ViT.

Components for Downstream Tasks

After the feature extraction process, a final component of the network handles the downstream task, which is what gets applied in real-life applications.

Among these tasks, classification is conceptually the easiest: it assigns an image one label out of several possible labels. It therefore suffices to apply a dimension-reducing linear transformation to the flattened feature map (converting it into a set of logits) and then select the index with the highest logit as the predicted class. For instance, if the logits in a 5-class problem correspond to probabilities of [0.7, 0.2, 0.05, 0.03, 0.02] after a softmax, then class 0 (with a predicted probability of 0.7) is chosen.
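In code, that selection step is just a softmax followed by an argmax (the logit values below are made up so that the resulting probabilities roughly match the example above):

```python
import torch

logits = torch.tensor([[2.3, 1.0, -0.4, -0.9, -1.2]])   # made-up 5-class logits
probs = torch.softmax(logits, dim=-1)       # roughly [0.7, 0.2, 0.05, 0.03, 0.02]
predicted_class = probs.argmax(dim=-1)      # tensor([0])
```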

Semantic segmentation can be viewed as a pixel-wise classification problem: each pixel is assigned a class so that it is grouped into the type of object it belongs to. With convolutional feature extractors, this is typically done by taking an intermediate feature map and applying further convolutions and upscaling operations to produce an output of shape n×H×W, with n channels (where n is the number of classes) and the same spatial size as the input image.
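A toy segmentation head along these lines, assuming a 1×64×28×28 feature map, 5 classes, and a 224×224 input image (all illustrative numbers): a 1×1 convolution produces per-class scores, and upsampling restores the input resolution so every pixel gets a class.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

num_classes = 5
seg_head = nn.Conv2d(64, num_classes, kernel_size=1)

feature_map = torch.randn(1, 64, 28, 28)   # stand-in for an intermediate feature map
scores = seg_head(feature_map)             # (1, 5, 28, 28): per-class scores per location
scores = F.interpolate(scores, size=(224, 224), mode="bilinear", align_corners=False)
per_pixel_class = scores.argmax(dim=1)     # (1, 224, 224): one class index per pixel
```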

Conclusion

In this article, we’ve talked about how neural networks process images. This largely boils down to two parts: the feature extraction process, usually handled by a feature extractor pretrained on a large dataset, and the component for the downstream task.
