Image and Video Representations for Deep learning with FastAI, OpenCV, and matplotlib.

Table of Contents


Image and video representation

Color Image

— loading a color image

What are pixels?

Making slices of an image

Separating the channels of an image

Image Transformation from Color to Grayscale

— The Average Method

— The Weighted Average

— The Luminosity Method

Why grayscale representations are important



Deep learning is most often applied to unstructured data, that is, data that isn’t organized in a pre-defined fashion or lacks a specific data model. Text, video, image, and audio data all fall into this category. As a beginner in deep learning, you are likely to wonder how pictures (or images) fed to a network are handled by a computer. I, too, was a bit puzzled by the less intuitive representation when I started out and was not particularly comfortable with it for a while. In traditional machine learning, where tabulated data is most commonly used, each example (or sample) is simply a row of feature values in an m x n data matrix. This is a structured representation and very convenient to work with.

In this article, I will focus solely on image data and show how to use some of the modules from the FastAI library, along with custom-defined functions, to explore and manipulate images. A sample of tabulated data is displayed below:

Sample tabulated data containing the features of three houses

However, once you make the progression to deep learning (specifically, computer vision), you are presented with a new reality: a representation involving higher order matrices (aka tensors) — an abrupt change from tabulated to less intuitive representation. Before delving into the nitty-gritty on the effect of such representation on model performance, it is best to understand first how images or videos are handled by a computer.

Image and Video representation

In the memory of the computer, an image is a collection of pixels arranged in a two- or higher-dimensional grid, represented using squares [link]. A video is a sequence of frames, and a frame is an image, hence, a collection of pixels. Those of us who started our data science journey working with tabulated data can become attached to the convenient representation of data in a rectangular array, in which each row represents an example (or instance or sample of data) and each column a feature.

Sample image from mnist dataset: left: render; right: array representation in a dataframe

As a result, confusion (sometimes, resentment) can arise once we begin to work with images or videos because these data have no explicitly defined features. However, since images are composed of pixels, and different locations contain different pixels, it is safe to think of the pixel locations as the features i.e. each pixel location denotes a feature and the value (of that feature) is the pixel intensity (or intensities) at the given location. For example, the pixels that make up the gular fold of the reptile in the figure below must occupy certain positions relative to each other so that when rendered, they appear collectively as a gular fold.

The pixels that make up any part of an image must occupy certain positions relative to each other so that when rendered, they appear collectively as that part.
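To make the "pixel locations as features" idea concrete, here is a minimal sketch. It assumes a randomly generated 28 x 28 array as a stand-in for an MNIST-style digit; the variable names are illustrative only.

```python
import numpy as np

# Hypothetical stand-in for a 28 x 28 grayscale digit (intensities 0-255).
rng = np.random.default_rng(seed=0)
digit = rng.integers(0, 256, size=(28, 28), dtype=np.uint8)

# Flatten row by row: each of the 784 pixel locations becomes one feature.
features = digit.reshape(-1)
print(features.shape)  # (784,)

# Feature index k maps back to the pixel location (row, col) = divmod(k, 28).
row, col = 5, 10
k = row * 28 + col
print(features[k] == digit[row, col])  # True
```

Flattening keeps every intensity but discards the explicit neighborhood structure, which is one reason vision models usually work on the grid itself.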

Color Image

A color image has a more complex representation than a grayscale image. A grayscale image is stored as a two-dimensional array (rows x columns), whereas a color image adds a third dimension for the channels: three channels for RGB images and four for RGBA images (A is the alpha, or transparency, channel). RGB images can be thought of as cuboids, with each slice along the depth representing a channel. Hence, color images are rank 3 tensors.
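As a rough illustration of how these representations compare, the sketch below builds empty NumPy arrays for a grayscale, an RGB, and an RGBA image. The 4 x 6 size is an arbitrary assumption chosen for readability.

```python
import numpy as np

height, width = 4, 6  # arbitrary small size for illustration

gray = np.zeros((height, width), dtype=np.uint8)     # rows x columns
rgb = np.zeros((height, width, 3), dtype=np.uint8)   # rows x columns x 3 channels
rgba = np.zeros((height, width, 4), dtype=np.uint8)  # rows x columns x 4 channels

for name, array in [("gray", gray), ("rgb", rgb), ("rgba", rgba)]:
    # ndim is the rank of the array; shape lists the size of each axis.
    print(name, array.ndim, array.shape)
```

Note that the alpha channel lengthens the last axis from 3 to 4 but leaves the rank at 3.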

Loading a color image

There are many ways to load an image into an IDE; one of the popular methods is to use Pillow. Make sure to pass the location of the image as an argument to the loading function. Below is a simple color image:


To explore the sample image, let’s start by viewing the shape attribute of the image like so:
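The original sample image is not available here, so the following sketch makes some assumptions: it writes a solid-color 259 x 194 stand-in image to a temporary file, loads it with Pillow's Image.open(), and converts it to a NumPy array so that a shape attribute is available. The file name and color are hypothetical.

```python
import os
import tempfile

import numpy as np
from PIL import Image

# Stand-in for the sample picture: a solid-color RGB image saved to disk.
# Pillow takes the size as (width, height).
tmp_dir = tempfile.mkdtemp()
image_path = os.path.join(tmp_dir, "sample.png")  # hypothetical file name
Image.new("RGB", (259, 194), color=(30, 120, 200)).save(image_path)

# Load the file by passing its location, then convert to a NumPy array.
with Image.open(image_path) as pil_image:
    image = np.array(pil_image)

print(image.shape)  # (194, 259, 3)
```

Pillow reports size as (width, height), while the array shape is ordered (rows, columns, channels), so the first two numbers appear swapped between the two.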

In the shape attribute of the image, the first two numbers (194, 259, _) represent the number of rows and columns, respectively. The third number (_, _, 3) is the number of channels. (The rank of the array is the length of the shape, which is also 3 here.) All RGB color images have three channels: Red, Green, and Blue.

Pixel (picture element)

A pixel is one of the small dots or squares that make up an image on a computer screen. Each pixel in an RGB color image is a rank 1 tensor of size 3 corresponding to the red, green, and blue values of the pixel.

Let’s check out the RGB values of the first pixel in the sample image:
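A minimal sketch of that lookup, assuming a synthetic array with the same shape as the sample image (the pixel values below are made up for illustration):

```python
import numpy as np

# Synthetic stand-in with the sample image's shape: rows x columns x channels.
image = np.zeros((194, 259, 3), dtype=np.uint8)
image[0, 0] = (30, 120, 200)  # hypothetical RGB values for the first pixel

# Indexing with (row, column) returns the pixel: a rank 1 array of size 3.
first_pixel = image[0, 0]
print(first_pixel.tolist())  # [30, 120, 200]
print(first_pixel.shape)     # (3,)

# The three entries are the red, green, and blue intensities.
red, green, blue = first_pixel.tolist()
```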

As the output above shows, the pixel is an array of 3 elements holding the red, green, and blue values, respectively.

Making slices of the image

Cropping (or slicing) is a useful preprocessing and augmentation step. It helps our model generalize better because the object(s) of interest are sometimes not wholly visible in an image, or the images in our training data are not all at the same scale. One simple way to crop an image is array slicing.

Note: Slice indices must be integers, hence, while trying to compute indices, we must use floor division or cast the result of each division (or product) to an integer as shown below:

Row slices
Column slices
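Row and column slicing can be sketched as follows, assuming a synthetic array in place of the sample image. The split points (halves and a central region) are arbitrary choices for illustration.

```python
import numpy as np

image = np.zeros((194, 259, 3), dtype=np.uint8)  # synthetic stand-in
rows, cols = image.shape[:2]

# Row slices: floor division keeps the indices integral.
top_half = image[: rows // 2]
bottom_half = image[rows // 2 :]

# Column slices: the first axis is kept whole with ':'.
left_half = image[:, : cols // 2]
right_half = image[:, cols // 2 :]

# Fractional bounds must be cast to int before they are used as indices.
center = image[int(rows * 0.25) : int(rows * 0.75),
               int(cols * 0.25) : int(cols * 0.75)]

print(top_half.shape, bottom_half.shape)  # (97, 259, 3) (97, 259, 3)
print(left_half.shape, right_half.shape)  # (194, 129, 3) (194, 130, 3)
print(center.shape)                       # (97, 130, 3)
```

Basic slicing returns views rather than copies, so call .copy() on a crop if it will be modified independently of the original.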

Alternatively, one can utilize the open source implementation of random crop available from frameworks like FastAI, PyTorch, and TensorFlow. The FastAI implementation of RandomCrop() can be found in the link below.
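For intuition only, a random crop can also be sketched in a few lines of NumPy. This is not the FastAI implementation; the function name random_crop and its parameters are hypothetical.

```python
import numpy as np


def random_crop(image, crop_height, crop_width, rng):
    """Return a randomly positioned crop_height x crop_width window of image."""
    rows, cols = image.shape[:2]
    if crop_height > rows or crop_width > cols:
        raise ValueError("crop size must not exceed the image size")
    # Pick the top-left corner so that the window fits inside the image.
    top = int(rng.integers(0, rows - crop_height + 1))
    left = int(rng.integers(0, cols - crop_width + 1))
    return image[top : top + crop_height, left : left + crop_width]


rng = np.random.default_rng(seed=42)
image = np.zeros((194, 259, 3), dtype=np.uint8)  # synthetic stand-in
crop = random_crop(image, 128, 128, rng)
print(crop.shape)  # (128, 128, 3)
```

Passing the random generator in explicitly keeps the crop reproducible, which is handy when debugging an augmentation pipeline.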

Separating the Channels of an Image

Each of the channels that make up a color image can be separated by using either the OpenCV split() method or array indexing.

Note: If you use the matplotlib library to plot a single-channel image from a 2D array, matplotlib applies pseudocoloring by default (the viridis colormap since matplotlib 2.0, jet in earlier versions) unless a different colormap is set. To avoid pseudocoloring, I wrote a modified version of FastAI’s show_images() named Dim_Imsshow() that checks the dimension of the image first and applies an appropriate colormap based on the dimension.
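The idea of checking the dimension before plotting can be sketched roughly as below. This is not the author's function; the helper names choose_cmap and show are hypothetical, and matplotlib is imported only when plotting is actually requested.

```python
import numpy as np


def choose_cmap(image):
    """Return 'gray' for 2D (single-channel) arrays and None for color arrays."""
    return "gray" if np.ndim(image) == 2 else None


def show(image):
    """Plot an image without pseudocoloring single-channel data."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting

    # For 3D (RGB/RGBA) arrays matplotlib ignores the colormap anyway.
    plt.imshow(image, cmap=choose_cmap(image))
    plt.axis("off")
    plt.show()


print(choose_cmap(np.zeros((4, 4))))     # gray
print(choose_cmap(np.zeros((4, 4, 3))))  # None
```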

The OpenCV method for splitting is shown below:

Alternatively, array indexing can be combined with list comprehension to carry out the separation.
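A minimal sketch of the indexing approach, assuming a synthetic RGB array with a constant value per channel so the result is easy to check. With OpenCV installed, cv2.split(image) should give the same planes in the array's own channel order (BGR if the image was loaded with cv2.imread).

```python
import numpy as np

# Synthetic RGB image: every pixel is (30, 120, 200).
image = np.zeros((194, 259, 3), dtype=np.uint8)
image[:, :, 0] = 30   # red plane
image[:, :, 1] = 120  # green plane
image[:, :, 2] = 200  # blue plane

# One 2D array per channel, taken along the last axis.
red, green, blue = [image[:, :, c] for c in range(image.shape[2])]

print(red.shape, green.shape, blue.shape)  # (194, 259) (194, 259) (194, 259)
print(int(red[0, 0]), int(green[0, 0]), int(blue[0, 0]))  # 30 120 200
```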

Image Transformation from Color to Grayscale

Having explored some of the attributes of the image, such as the shape, pixel intensities, and channels, let’s carry out a simple transformation from color to grayscale. Bear in mind that there are several other transformations that can be applied to an image. However, in this article, we will focus only on the RGB to grayscale transformation, which involves taking an Rᵐˣⁿˣ³ (aka 3D) color image to an Rᵐˣⁿ (aka 2D) representation.

Again, there are several approaches to the 3D to 2D transformation. The popular methods are the average method, the weighted average, and the luminosity method. All three methods are computationally inexpensive, i.e., they all have linear time complexity in the number of pixels. Also, each of these methods transforms the color image in a unique way.

The Average Method

This is the simplest of all the approaches and involves taking the arithmetic mean of the red, green, and blue colors at every location. This approach does not correctly account for the relative contributions by the three colors.

Grayscale = 1/3 * (R + G + B)

The Weighted Average

The weighted average takes into account the relative contributions of the different colors. It is the grayscale conversion algorithm used by OpenCV’s cvtColor() and Pillow’s convert('L'). The formula is given below:

Grayscale = 0.299 * R + 0.587 * G + 0.114 * B

Thus, green makes the greatest contribution, followed by red, and then blue.

The Luminosity Method

This is a more sophisticated version of the average method. It also computes an average of the values, but does so in a manner that accounts for human perception. Through studies, psychologists observed that humans are more sensitive to green than to the other colors and, as a result, assigned green the greatest weight. The grayscale values are calculated using the formula below:

Grayscale = 0.21 * R + 0.72 * G + 0.07 * B

The implementation for the three methods is given below:
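The original implementation is not reproduced here, so the following is a minimal NumPy sketch of the three formulas above, not the author's code. The function name to_grayscale, the WEIGHTS table, and the rounding to uint8 are assumptions.

```python
import numpy as np

# Per-channel (R, G, B) weights taken from the three formulas above.
WEIGHTS = {
    "average": (1 / 3, 1 / 3, 1 / 3),
    "weighted": (0.299, 0.587, 0.114),
    "luminosity": (0.21, 0.72, 0.07),
}


def to_grayscale(image, method="weighted"):
    """Convert an (m, n, 3) RGB array to an (m, n) grayscale array."""
    if method not in WEIGHTS:
        raise ValueError(f"unknown method: {method!r}")
    rgb = np.asarray(image, dtype=np.float64)[..., :3]
    # Weighted sum over the channel axis: (m, n, 3) @ (3,) -> (m, n).
    gray = rgb @ np.array(WEIGHTS[method])
    return np.clip(np.rint(gray), 0, 255).astype(np.uint8)


# Usage: a single pure-red pixel under each method.
pixel = np.array([[[255, 0, 0]]], dtype=np.uint8)
for method in WEIGHTS:
    print(method, int(to_grayscale(pixel, method)[0, 0]))
# average 85
# weighted 76
# luminosity 54
```

The work is a single pass over the pixels, which is where the linear time complexity comes from.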

A client program exploiting the above transformer function:


Why are grayscale representations important?

In applications where the color information of the data is of no importance, such as edge detection or descriptor extraction, grayscale images are preferred because they simplify the algorithm and reduce computational requirements.

Moreover, in many image processing applications, color information does not help us identify important edges or features. There are a few exceptions, such as edges where the step change is in hue rather than in brightness and is therefore hard to detect in grayscale. For such applications, color information is useful, and RGB or another color space is preferred.
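A small worked example of such a hue-only edge, using the weighted average formula from earlier. The two colors are assumptions chosen so that their gray values nearly coincide.

```python
import numpy as np

weights = np.array([0.299, 0.587, 0.114])  # weighted average formula

red = np.array([255.0, 0.0, 0.0])    # pure red
green = np.array([0.0, 130.0, 0.0])  # a darker green, chosen to match red's gray value

gray_red = float(red @ weights)      # about 76.2
gray_green = float(green @ weights)  # about 76.3

# In RGB space the two colors are far apart, yet their gray values almost coincide.
color_distance = float(np.linalg.norm(red - green))
gray_difference = abs(gray_red - gray_green)

print(round(color_distance, 1), round(gray_difference, 3))
```

An edge between these two regions is obvious in color but nearly vanishes after conversion, so an edge detector running on the grayscale image would likely miss it.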

Grayscale representations are often preferred mainly for the simplicity they offer.


Generally, computers represent images as a collection of pixels and videos as a sequence of frames (or images). Images can be transformed for simplicity (e.g. color-to-grayscale transformation) or for adversarial perturbation, which increases data diversity and improves the generalizing ability of a model. In the next article, I will be exploring a number of image color spaces and evaluating their effects on model performance.

Accuracy is affected by color space [source]


