
Crowd Counting


Model Overview

Crowd Counting using CSRNet

Problem Statement

The problem of crowd counting has been studied for research purposes for quite some time. For too long, precious lives have been lost in stampedes, and in countries that host huge festivals, crowd management is a serious issue. One well-known example is the 'Kumbh Mela' in India. It may be almost impossible to control crowds at that scale, but the responsible authorities can always take precautionary measures before the risk occurs. Counting people in dense crowds is beyond human ability, but we can build intelligent machines and programs that do it easily; Artificial Intelligence can achieve this.


There is a wide range of applications for crowd counting, such as political rallies, sports events, and other large social events.


Crowd counting is a difficult problem, especially in dense crowds, for two main reasons:



  1. There is often clutter, overlap, and occlusion.

  2. In a perspective view, it is difficult to account for the shape and size of an object relative to the background.


There have been many previous studies on crowd counting, and a lot of algorithms have been proposed in the literature for tackling this problem. Most of them use some form of convolutional neural network together with density map estimation: the network predicts a density map over the input image, which is then summed to obtain the object count.

What were the previous methods used to tackle crowd counting?






Broadly speaking, there are currently four methods we can use for counting the number of people in a crowd:


1. Detection-based methods


Here, we use a sliding-window detector to identify people in an image and count how many there are. The methods used for detection require well-trained classifiers that can extract low-level features. Although these methods work well for detecting faces, they do not perform well on crowded images, where most of the target objects are not clearly visible.

2. Regression-based methods


The detection approach above struggles to extract useful low-level features in dense scenes. Regression-based methods come up trumps here: we first crop patches from the image and then, for each patch, extract low-level features; a regression model then maps these features directly to a count.

3. Density estimation-based methods


We first create a density map for the objects. Then, the algorithm learns a linear mapping between the extracted features and their object density maps. We can also use random forest regression to learn a non-linear mapping.

This is the most accurate method for dense crowds, and the method used by our model of choice: CSRNet.


 
Density map (right)

4. CNN-based methods


Ah, good old reliable convolutional neural networks (CNNs). Instead of looking at patches of an image, we build an end-to-end regression method using a CNN: it takes the entire image as input and directly generates the crowd count. CNNs work really well with regression and classification tasks, and they have also proved their worth in generating density maps.


CSRNet, the technique we will implement here, deploys a deeper CNN for capturing high-level features and generating high-quality density maps without expanding the network complexity. Let's understand what CSRNet is before jumping into the coding section.

What is the architecture of CSRNet (Congested Scene Recognition Network)?

We choose VGG-16 as the front end of CSRNet because of its strong transfer-learning ability and its flexible architecture, which makes it easy to concatenate a back end for density map generation.

VGG-16 uses a variety of activation functions (used to transform node outputs between layers) and other hyper-parameters (layer/node configuration), including ReLU and SoftMax. ReLU (Rectified Linear Unit) is standard in machine learning. SoftMax is less common but noteworthy: its goal is to turn numbers into probabilities and polarise the classification, outputting one value for each node in the output layer. This is useful in the last layer of an image classifier because it provides a probability for every class. Since our goal is to output a density map instead of a one-hot classification, CSRNet's back-end structure replaces the fully connected and SoftMax layers with more convolutional layers.
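The sketch below shows what this architecture looks like in PyTorch (the framework we assume throughout). The front end keeps the first ten convolutional layers of VGG-16, and the back end uses six 3×3 dilated convolutions (rate 2) plus a 1×1 output layer, following the paper's configuration; treat it as an illustrative sketch rather than the exact training code.

```python
import torch.nn as nn
from torchvision import models

class CSRNet(nn.Module):
    """Sketch of CSRNet: VGG-16 front end + dilated-convolution back end."""
    def __init__(self):
        super().__init__()
        # Front end: the first 10 convolutional layers of VGG-16
        # (up to conv4_3, i.e. only three max-pooling stages are kept).
        vgg = models.vgg16(pretrained=True)
        self.frontend = nn.Sequential(*list(vgg.features.children())[:23])

        # Back end: 3x3 convolutions with dilation rate 2 and no pooling,
        # so the 1/8 resolution from the front end is preserved.
        def dilated(in_c, out_c):
            return [nn.Conv2d(in_c, out_c, 3, padding=2, dilation=2),
                    nn.ReLU(inplace=True)]
        self.backend = nn.Sequential(
            *(dilated(512, 512) + dilated(512, 512) + dilated(512, 512) +
              dilated(512, 256) + dilated(256, 128) + dilated(128, 64)))

        # 1x1 convolution producing the single-channel density map.
        self.output_layer = nn.Conv2d(64, 1, kernel_size=1)

    def forward(self, x):
        x = self.frontend(x)   # spatial size drops to 1/8 of the input
        x = self.backend(x)    # resolution unchanged from here on
        return self.output_layer(x)
```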


The output of this VGG-16 front end is 1/8 of the original input size. If we continued to stack more convolutional and pooling layers (the basic components of VGG-16), the output would shrink even further, and it would be hard to generate high-quality density maps. This is why CSRNet uses dilated convolutional layers in the back end instead.

Dilated Convolutions


In dilated convolution, a small k × k kernel is enlarged to an effective size of k + (k − 1)(r − 1) with dilation rate r. For example, a 3 × 3 kernel with r = 2 covers a 5 × 5 receptive field while still using only nine parameters.


For maintaining the resolution of the feature map, dilated convolution shows distinct advantages over the convolution + pooling + deconvolution scheme.

Figure: comparison between dilated convolution and the max-pooling + convolution + upsampling pipeline. A 3 × 3 Sobel kernel is used in both operations, with a dilation rate of 2.

The figure above takes a crowd image and processes it with the two approaches separately, generating outputs of the same size. In the first approach, the input is downsampled by a max-pooling layer with factor 2 and then passed to a convolutional layer with a 3 × 3 Sobel kernel. Since the generated feature map is only 1/2 of the original input, it has to be upsampled by a deconvolutional layer (bilinear interpolation). In the other approach, we use dilated convolution and adapt the same 3 × 3 Sobel kernel into a dilated kernel with rate 2. The output shares the same dimensions as the input (meaning pooling and deconvolutional layers are not required). Most importantly, the output from the dilated convolution contains more detailed information (see the portions zoomed in on).
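This resolution-preserving property is easy to verify in a few lines of PyTorch; the tensor sizes below are illustrative, not taken from the model:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 64, 64)   # dummy single-channel image

# Dilated 3x3 convolution with rate 2: effective kernel size is
# 3 + (3 - 1)(2 - 1) = 5, so padding=2 keeps the output at input size.
dilated = nn.Conv2d(1, 1, kernel_size=3, padding=2, dilation=2)
print(dilated(x).shape)         # torch.Size([1, 1, 64, 64])

# Pool -> conv -> upsample reaches the same output size, but the
# downsampling step throws detail away along the way.
pooled = nn.MaxPool2d(2)(x)                                # 32 x 32
conved = nn.Conv2d(1, 1, kernel_size=3, padding=1)(pooled)
up = nn.functional.interpolate(conved, scale_factor=2,
                               mode='bilinear', align_corners=False)
print(up.shape)                 # torch.Size([1, 1, 64, 64])
```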



  1. The architecture is based on the fact that dilated convolutions support an exponential expansion of the receptive field without loss of resolution or coverage.

  2. They allow a larger receptive field at the same computation and memory cost, while also preserving resolution.

  3. Pooling and strided convolutions enlarge the receptive field in a similar way, but both reduce the resolution.

  4. Dilated convolution preserves the resolution/dimensions of the data at the output layer, because the layers are dilated instead of pooled.

  5. The 1D causal variant also maintains the ordering of data: when the prediction of an output depends on previous inputs, the structure of the dilated causal convolution preserves that ordering.


The dilated convolutions help produce our density maps:



Previous Results


 
For some more context: MAE is the Mean Absolute Error, i.e. the average of the absolute counting error over the evaluated images. MSE is the Mean Squared Error; it differs from MAE in that it puts more emphasis on large errors.
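As commonly defined in the crowd-counting literature (the CSRNet paper uses the same definitions, where 'MSE' is in fact a root-mean-squared quantity), with C_i the predicted count for image i, C_i^GT the ground-truth count, and N the number of test images:

```latex
\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|C_i - C_i^{GT}\right|,
\qquad
\mathrm{MSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(C_i - C_i^{GT}\right)^{2}}
```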


What is the dataset used?

We used the ShanghaiTech dataset, a large-scale crowd counting dataset containing 1,198 annotated images with a total of 330,165 people, each annotated at the centre of the head.




As far as we know, this dataset is the largest one in terms of the number of annotated people. It consists of two parts: 482 images in Part A, randomly crawled from the Internet, and 716 images in Part B, taken on the busy streets of metropolitan Shanghai. The crowd density varies significantly between the two subsets, making accurate estimation more challenging than on most existing datasets. Both parts are divided into training and testing sets: 300 images of Part A are used for training and the remaining 182 for testing, while 400 images of Part B are used for training and 316 for testing.


Let's understand the code


What are all the libraries required?
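The exact imports depend on the implementation; assuming a PyTorch-based version like the one sketched above, they typically look like this:

```python
import glob                                # collecting image paths
import json                                # writing the training-path list
import os

import h5py                                # storing generated density maps
import numpy as np
import scipy.io as io                      # reading the .mat ground truth
import scipy.spatial                       # k-d tree for neighbour distances
from scipy.ndimage import gaussian_filter  # blurring head points into maps
import matplotlib.pyplot as plt
from PIL import Image

import torch
import torch.nn as nn
from torchvision import models, transforms
```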



What are the steps to be followed?



  1. Create a density map for all the images

  2. Train the model

  3. Save the weights

  4. Validate the model


How to create a density map for the images?


First, define the paths of the dataset.
part_A_train, part_A_test, part_B_train, and part_B_test contain the paths of the files in the dataset folder:

dataset
      |----part_A_final
                    |--train_data
                              |--images
                              |--ground-truth
                    |--test_data
                              |--images
                              |--ground-truth
      |----part_B_final
                    |--train_data
                              |--images
                              |--ground-truth
                    |--test_data
                              |--images
                              |--ground-truth
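A minimal way to collect these paths, assuming the folder layout above (adjust `root` to wherever the dataset is unpacked):

```python
import glob
import os

root = 'dataset'

part_A_train = os.path.join(root, 'part_A_final', 'train_data', 'images')
part_A_test  = os.path.join(root, 'part_A_final', 'test_data', 'images')
part_B_train = os.path.join(root, 'part_B_final', 'train_data', 'images')
part_B_test  = os.path.join(root, 'part_B_final', 'test_data', 'images')

# Every image we will generate a density map for.
image_paths = []
for folder in [part_A_train, part_A_test, part_B_train, part_B_test]:
    image_paths += glob.glob(os.path.join(folder, '*.jpg'))
```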


Then, define the function for Gaussian filter density:
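A sketch of that function, following the geometry-adaptive kernel idea from the CSRNet paper (sigma proportional to the average distance to the three nearest heads, with β = 0.3):

```python
import numpy as np
import scipy.spatial
from scipy.ndimage import gaussian_filter

def gaussian_filter_density(gt):
    """Convert a binary head-annotation map into a density map using
    geometry-adaptive Gaussian kernels."""
    density = np.zeros(gt.shape, dtype=np.float32)
    points = np.array(np.nonzero(gt)).T        # (row, col) of each head
    if len(points) == 0:
        return density

    # k-d tree over head positions; query self + up to 3 nearest heads.
    tree = scipy.spatial.KDTree(points.copy(), leafsize=2048)
    k = min(4, len(points))
    distances, _ = tree.query(points, k=k)

    for i, pt in enumerate(points):
        pt2d = np.zeros(gt.shape, dtype=np.float32)
        pt2d[pt[0], pt[1]] = 1.0
        if len(points) > 1:
            sigma = distances[i][1:k].mean() * 0.3   # beta = 0.3
        else:
            sigma = np.mean(gt.shape) / 4.0          # lone head: fixed sigma
        # Each head contributes a Gaussian that integrates to ~1, so the
        # density map sums (approximately) to the head count.
        density += gaussian_filter(pt2d, sigma, mode='constant')
    return density
```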



Then, for all the images in image_paths, let's create a density map:
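A sketch of that loop, assuming the ShanghaiTech annotation layout (a `GT_IMG_<n>.mat` file in the ground-truth folder for each `IMG_<n>.jpg`, with head coordinates stored under `image_info`):

```python
import h5py
import numpy as np
import scipy.io as io
from PIL import Image

for img_path in image_paths:
    # Matching .mat annotation file in the ground-truth folder.
    mat_path = (img_path.replace('images', 'ground-truth')
                        .replace('IMG_', 'GT_IMG_')
                        .replace('.jpg', '.mat'))
    mat = io.loadmat(mat_path)
    img = Image.open(img_path)

    # Binary map with a 1 at every annotated head centre.
    k = np.zeros((img.size[1], img.size[0]), dtype=np.float32)
    heads = mat['image_info'][0, 0][0, 0][0]   # (x, y) head coordinates
    for x, y in heads:
        if int(y) < k.shape[0] and int(x) < k.shape[1]:
            k[int(y), int(x)] = 1.0

    # Blur the head points into a density map and store it as HDF5.
    density = gaussian_filter_density(k)
    with h5py.File(img_path.replace('.jpg', '.h5'), 'w') as hf:
        hf['density'] = density
```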



This may take some time, because creating density maps for all the images is computationally expensive.

Output
 



For a better understanding of how to create a density map, refer here.



How to train the model?

We have to create a JSON file that contains the paths of all the training images. Then we can train our model.
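Creating the JSON file takes only a couple of lines (shown here for Part A; the output filename is an assumption):

```python
import glob
import json

train_paths = glob.glob('dataset/part_A_final/train_data/images/*.jpg')
with open('part_A_train.json', 'w') as f:
    json.dump(train_paths, f)
```

The training loop itself, in condensed form: CSRNet is trained with pixel-wise MSE loss between the predicted and ground-truth density maps, using plain SGD with a small fixed learning rate. `CrowdDataset` below is a hypothetical dataset class, assumed to yield normalised image tensors and ground-truth maps downsampled by 8 (the model's output stride) while preserving the total count; the hyper-parameters are illustrative.

```python
import json
import torch
import torch.nn as nn

model = CSRNet().cuda()
criterion = nn.MSELoss(reduction='sum')   # pixel-wise MSE over the map
optimizer = torch.optim.SGD(model.parameters(), lr=1e-6,
                            momentum=0.95, weight_decay=5e-4)

with open('part_A_train.json') as f:
    train_paths = json.load(f)
loader = torch.utils.data.DataLoader(
    CrowdDataset(train_paths),            # hypothetical dataset class
    batch_size=1, shuffle=True)

for epoch in range(15):
    total_loss = 0.0
    for img, target in loader:
        output = model(img.cuda())
        loss = criterion(output, target.cuda())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(f'epoch {epoch + 1}: loss {total_loss / len(loader):.2f}')
    torch.save(model.state_dict(), 'csrnet_checkpoint.pth')  # save weights
```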



Output of the 15th epoch




Validation Script and Output Visualization
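A minimal validation sketch under the same assumptions (the checkpoint name matches the training sketch above): it computes the MAE over the Part A test images by comparing the integrated density maps.

```python
import glob
import h5py
import numpy as np
import torch
from PIL import Image
from torchvision import transforms

# VGG-16 expects ImageNet-normalised inputs.
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

model = CSRNet().cuda()
model.load_state_dict(torch.load('csrnet_checkpoint.pth'))
model.eval()

mae = 0.0
test_paths = glob.glob('dataset/part_A_final/test_data/images/*.jpg')
for img_path in test_paths:
    img = transform(Image.open(img_path).convert('RGB')).cuda()
    with torch.no_grad():
        pred = model(img.unsqueeze(0)).sum().item()  # count = map integral
    with h5py.File(img_path.replace('.jpg', '.h5'), 'r') as hf:
        truth = np.asarray(hf['density']).sum()
    mae += abs(pred - truth)
print('MAE:', mae / len(test_paths))
```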




Let's predict on some crowd images
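Reusing `model` and `transform` from the validation script, a single-image prediction and heat-map visualisation looks like this (the image path is hypothetical):

```python
import matplotlib.pyplot as plt
import torch
from PIL import Image

img_path = 'dataset/part_A_final/test_data/images/IMG_100.jpg'  # hypothetical
img = transform(Image.open(img_path).convert('RGB')).cuda()

with torch.no_grad():
    density = model(img.unsqueeze(0)).squeeze().cpu().numpy()

print('Predicted count:', int(density.sum()))
plt.imshow(density, cmap=plt.cm.jet)   # heat map of where the crowd is
plt.axis('off')
plt.show()
```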

  

    

