Face Mask Detector using Deep Learning (PyTorch) and Computer Vision (OpenCV)

Overview

The World Health Organization (WHO) reports suggest that the two main routes of transmission of the COVID-19 virus are respiratory droplets and physical contact. Respiratory droplets are generated when an infected person coughs or sneezes. Any person in close contact (within 1 m) with someone who has respiratory symptoms (coughing, sneezing) is at risk of being exposed to potentially infective respiratory droplets. Droplets may also land on surfaces where the virus could remain viable; thus, the immediate environment of an infected individual can serve as a source of transmission (contact transmission). Wearing a medical mask is one of the prevention measures that can limit the spread of certain respiratory viral diseases, including COVID-19. In this study, medical masks are defined as surgical or procedure masks that are flat or pleated (some are shaped like cups); they are affixed to the head with straps. They are tested for balanced high filtration, adequate breathability and optionally, fluid penetration resistance. The study analyses a set of video streams/images to identify people who are compliant with the government rule of wearing medical masks. This could help the government to take appropriate action against people who are non-compliant.

Objective

To identify the person on image/video stream wearing face mask with the help of computer vision and deep learning algorithm by using the PyTorch library.

Approach

Train Deep learning model (MobileNetV2)
Apply mask detector over images / live video stream

Flowchart

Data At Source

The raw images used for the current study were downloaded from PyImageSearch article and the majority of the images were augmented by OpenCV. The set of images were already labeled “mask” and “no mask”. The images that were present were of different sizes and resolutions, probably extracted from different sources or from machines (cameras) of different resolutions.

Data Preprocessing

Preprocessing steps as mentioned below was applied to all the raw input images to convert them into clean versions, which could be fed to a neural network machine learning model.

Resizing the input image (256 x 256)
Applying the color filtering (RGB) over the channels (Our model MobileNetV2 supports 2D 3 channel image )
Scaling / Normalizing images using the standard mean of PyTorch build in weights
Center cropping the image with the pixel value of 224x224x3
Finally Converting them into tensors (Similar to NumPy array)

Deep Learning Frameworks

To implement this deep learning network we have the following options.

TensorFlow
Keras
PyTorch
Caffee
MxNet
Microsoft Cognitive ToolKit

We are using the PyTorch because it runs on Python, which means that anyone with a basic understanding of Python can get started on building their deep learning models.and also it has the following advantage compared with TensorFlow 1. Data Parallelism 2. It looks like a Framework.

In PyTorch, I have used the following module to develop the algorithm,

PyTorch DataLoader - which is used to load the data from the Image Folder
PyTorch DataSets ImageFolder - which is used to locate the image sources and also have a predefined module to label the target variable
Pytorch Transforms - helped to apply the preprocessing steps over the source image while reading from the source folder
PyTorch Device - identifies the running system capabilities like CPU or GPU power to train the model. It will help us to switch the system usage
Pytorch TorchVision - it will help us to load the libraries which are created before. Like pre-trained models, image sources and so on. It is one of core in PyTorch
PyTorch nn - it is one of the core modules. This module helps us to build our own Deep Neural Network (DNN) models. It has all the libraries needed to build the model. Like Linear layer, Convolution layer with 1D, conv2d, conv3d, sequence, CrossEntropy Loss (loss function), Softmax, ReLu and so on.
PyTorch Optim - help us to define the model optimizer. it will help the model to learn the data well. For example Adam, SDG and so on
Pytorch PIL - helps to load the image from the source
PyTorch AutoGrad - another important module, it provides automatic differentiation for all operations on Tensors. For example, a single line of code .backward() to calculate the gradients automatically. This is very useful while implementing the backpropagation in DNN

Image Classification Algorithm from PyTorch

CNN (Convolutional Neural Network) has many versions of pre-trained and well-architected networks for example AlexNet, ResNet, Inception, LeNet, MobileNet and so on. In our case I have chosen the MobileNetV2 due to its lightweight and very efficient mobile-oriented model.

MobileNetV2

MobileNetV2 builds upon the ideas from MobileNetV1 [1], using depthwise separable convolution as efficient building blocks. However, V2 introduces two new features to the architecture: 1) linear bottlenecks between the layers, and 2) shortcut connections between the bottlenecks1. The basic structure is shown below.

MobileNetV2 Architecture

The typical MobilenetV2 architecture has as many layers listed below, In Pytorch we can use the models library in TorchVision to create the MobileNetV2 model instead of defining/building our own model.
The weights of each layer in the model are predefined based on the ImageNet dataset. The weights indicate the padding, strides, kernel size, input channels and output channels.

Why choose MobileNetV2?

Based on ImageNet dataset MobileNetV2 outperforms MobileNetV1 and ShuffleNet (1.5) with comparable model size and computational cost. And also it will perform well for the smaller dataset.

Image: Performance over ImageNet dataset

Face Mask Detector

Training with Face Mask

As described above I have used the augmented data which is a smaller dataset. I have also applied all the necessary preprocessing requisites applied on the top of source images using the PyTorch transforms, PIL and DataSets.
The sample face mask training files look like as below.

Image 1 . training set example

Model Overwritten

MobileNetV2 was chosen as an algorithm to build a model that could be deployed on a mobile device. A customized fully connected layer which contains four sequential layers on top of the MobileNetV2 model was developed. The layers are

Average Pooling layer with 7x7 weights
Linear layer with ReLu activation function
Dropout Layer
Linear layer with Softmax activation function with the result of 2 values.

The final layer softmax function gives the result of two probabilities each one represents the classification of “mask” or “not mask”.

Accuracy Overview

The data set has been divided into two sets, likely a training and validation set. The accuracy of image classifier over the training set vs validation.

Image 2: Training vs Validation Accuracy

Test Sample

For the reference I have attached a single image prediction result.

Image: Test result

Face Mask Detection in Webcam stream

The flow to identify the person in the webcam wearing the face mask or not. The process is two-fold.

To identify the faces in the webcam
Classify the faces based on the mask.

Identify the Face in the Webcam

To identify the faces a pre-trained model provided by the OpenCV framework was used. The pre-trained model is based on Single-Shot-Multibox Detector (SSD) and employs a ResNet-10 Architecture as the backbone. The model was trained using web images. OpenCV provides 2 models for this face detector:

Floating-point 16 version of the original Caffe implementation.
8 bit quantized version using Tensorflow.

In this example I have used the Caffe model in this face mask detector.

SSD Basic Architecture

Single-shot MultiBox Detector is a one-stage object detection algorithm. This means that, in contrast to two-stage models, SSDs do not need initial object proposals generation step. This makes it, usually, faster and more efficient. This model will return the array of faces detected on the video frame.

Face Mask classifier

Subsequent to face detection, the mask detector model was used to identify the frame face covered by mask or not. The flow of this process described in the image below.

Face Mask identification on Stream

The live stream test video is ,

Conclusion

The current study used OpenCV, Pytorch and CNN to detect whether people were wearing face masks or not. The models were tested with images and real-time video streams. Even though the accuracy of the model is around 70%, the optimization of the model is a continuous process and we are building a highly accurate solution by tuning the hyper parameters. MobileNetV2 was used to build the mobile version of the same. This specific model could be used as a use case for edge analytics.

I had taken the reference from this article.

Thanks for spending your valuable time. Add your feedbacks in the comment section.

Search This Blog

RamaKavanan