VidAnomalyNet: An Efficient Anomaly Detection in Public Surveillance Videos Through Deep Learning Architectures

ABSTRACT


INTRODUCTION
Video surveillance and the automatic detection of anomalies have become an important research area. They are indispensable in modern urban and industrial applications that involve development, day-to-day operations and sustainability. Research of this kind contributes to the safety of citizens, improved security, efficiency, and real-time monitoring that supports well-informed decisions. Beyond industrial environments and cyber-physical systems, video surveillance plays a crucial role in areas of high population density. As urban populations grow rapidly in diversity and density, with people living in multi-storeyed buildings amid increased pedestrian, crowd and vehicular movement, video surveillance plays a pivotal role in human safety, security, and law and order, besides providing evidence that speeds up investigations by law-enforcement agencies [1].
The motivation for anomaly detection in videos stems from the need to enhance security, safety, and monitoring processes in various domains. The problem statement revolves around developing models that can automatically identify deviations from normal behavior in video streams, with the ultimate aim of preventing or responding to unexpected events effectively.
The continuous horizontal and vertical expansion of industrial and urban areas has led to rapidly growing usage of CCTV cameras. When several thousand such cameras are in operation, manual observation of the video streams is not feasible. It is also not ideal to inspect video footage only after an untoward incident has occurred. What is needed is a technology-driven, autonomous video surveillance process that monitors and analyses videos in real time. Towards this end, Artificial Intelligence (AI) has the wherewithal to support automatic analysis of surveillance videos in real time and to report its findings as incidents occur. As explored in previous studies [2][3][4], to mention a few, deep learning is an AI-based learning approach that has the capacity to serve the purpose of autonomous and comprehensive video surveillance. In particular, the detection of anomalous behaviours or incidents has to be given paramount importance, and many existing contributions towards this end are found in the literature.
Video Anomaly Networks (VANs) are designed to effectively capture spatiotemporal patterns in video sequences, making them more suitable for anomaly detection in dynamic environments than traditional methods or architectures. Specific features that contribute to the efficiency of VANs for learning and detection include the following.
Spatiotemporal Convolutional Layers: VANs often incorporate spatiotemporal convolutional layers, allowing them to simultaneously process spatial and temporal information in video frames. This is crucial for capturing the dynamic nature of video sequences.
Temporal Modeling: Incorporation of recurrent layers, such as Recurrent Neural Networks (RNNs) or Long Short-Term Memory (LSTM) units, enables VANs to model temporal dependencies and long-term patterns in video data. This is essential for understanding the context and continuity of events over time.
Unsupervised Learning Techniques: VANs often operate in an unsupervised learning setting, where they are trained on normal video data without explicit labels for anomalies. Unsupervised learning allows the model to learn the inherent patterns of normal behavior and detect anomalies based on deviations from these patterns.
Autoencoder Architectures: Autoencoders are commonly used in VANs for unsupervised learning. These architectures learn to reconstruct normal video frames and identify anomalies by measuring the reconstruction error. Autoencoders are effective in capturing and representing relevant features of the input data.
Adaptability to Different Anomalies: VANs are designed to be adaptable to different types of anomalies. The learned representations are expected to be generic enough to capture a wide range of abnormal behaviors, making the model versatile in various applications.
Attention Mechanisms: Some VANs incorporate attention mechanisms to focus on specific regions or frames of interest within the video sequence. This helps the model prioritize relevant information for anomaly detection, improving efficiency and accuracy.
Real-Time Processing: Efficient VAN architectures are designed to handle real-time processing of video streams, making them suitable for applications where timely anomaly detection is crucial, such as surveillance and security.
Evaluation Metrics: VANs are typically evaluated using metrics such as precision, recall, F1 score, and area under the receiver operating characteristic (ROC) curve. These metrics provide a quantitative measure of the model's efficiency in correctly identifying anomalies while minimizing false positives.
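The reconstruction-error idea behind autoencoder-based detection can be illustrated with a minimal sketch. It assumes that a trained autoencoder has already produced a reconstruction for each frame; the numbers below are made-up toy values, and the function names and the "mean plus k standard deviations" threshold rule are illustrative assumptions rather than part of any method discussed in this paper.

```python
from statistics import mean, pstdev

def reconstruction_error(frame, reconstruction):
    """Mean squared error between a flattened frame and its reconstruction."""
    return sum((x - y) ** 2 for x, y in zip(frame, reconstruction)) / len(frame)

def fit_threshold(normal_errors, k=3.0):
    """Hypothetical rule: flag errors above mean + k * std of normal-frame errors."""
    return mean(normal_errors) + k * pstdev(normal_errors)

# Toy errors measured on frames known to be normal (illustrative values only).
normal_errors = [0.010, 0.012, 0.011, 0.009, 0.013]
threshold = fit_threshold(normal_errors)

# A frame that the autoencoder reconstructs poorly receives a large error.
frame = [0.2, 0.8, 0.5, 0.1]
poor_reconstruction = [0.6, 0.3, 0.9, 0.5]
error = reconstruction_error(frame, poor_reconstruction)
is_anomalous = error > threshold
print(f"error={error:.4f}, threshold={threshold:.4f}, anomalous={is_anomalous}")
```

In practice the threshold would be tuned on held-out data, since a fixed multiple of the standard deviation trades false alarms against missed anomalies.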
Previous studies [5][6][7][8][9][10] explored convolutional autoencoders along with the optical flow method to detect anomalies in videos. Autoencoders are also used in this kind of research [11][12][13][14][15], where different autoencoder models are investigated to ascertain their merits in the detection of video abnormalities. One study takes a spatio-temporal approach to video analytics, exploiting autoencoders over both spatial and temporal dimensions. Another explores anomaly detection using a variational autoencoder and a Gaussian mixture on top of a convolutional approach; that method also focuses on the localization of anomalies. A further study combines autoencoding and attention mechanisms into a spatio-temporal autoencoding process, and autoencoders have also been used in critical infrastructure monitoring and analysis. This research found that autoencoders are good for anomaly detection. However, they tend to learn a large amount of information, and learning the relevant information sometimes becomes an issue. Autoencoders also need to be trained with a lot of data and with hyperparameter tuning. Many CNN variants [16][17][18][19], to mention a few, are used for anomaly detection in videos. CNN models are found to be suitable for image data analysis, and their learning capability makes them a preferred class of advanced neural network architectures for solving real-world problems. However, "one size does not fit all", as CNN models cannot directly provide optimal performance in every case considered. Our contributions in this paper are as follows.
(1) We propose a novel deep learning architecture known as VidAnomalyNet. The model is based on a CNN, as CNNs have been found highly successful in processing image data and efficient in feature map generation and optimization. Our architecture provides a more efficient learning process and detection of anomalies from surveillance videos.
(2) We propose a framework that makes use of our VidAnomalyNet architecture to enhance performance in anomaly detection from surveillance videos. The framework provides a set of reusable components that implement the intended functionality.
(3) We also propose an algorithm known as VidAnomalyNet for Automatic Anomaly Detection (VAAD). This algorithm is designed on top of the proposed deep learning architecture. It supports multi-class classification and detects four classes: normal, fire, accident and robbery. It can easily be extended to identify more kinds of anomalies.
(4) We also explore MobileNetV1 with transfer learning by adding new layers to the base model for video anomaly detection. The rationale is that MobileNet is good at processing imagery data. We compare the performance of the enhanced MobileNet-based architecture with that of our VidAnomalyNet. Our empirical study reveals that VidAnomalyNet outperforms MobileNetV1, with a highest accuracy of 96.35%.
The remainder of the paper is organized as follows. Section 2 reviews related work, which provides literature insights that help us ascertain research gaps and the need for a more appropriate deep learning architecture for automatic detection of anomalies from surveillance videos. The literature also sheds light on the merits and demerits of existing detection methods. Section 3 presents our proposed architecture, algorithm and underlying mechanisms, including dataset details. It describes the VidAnomalyNet architecture, whose layers are configured based on the performance observed in the empirical study. It also provides the details of MobileNetV1 and of how transfer learning is applied to improve this baseline model.
Ultimately, the choice of deep learning approach depends on the nature of the task, the characteristics of the data, and the computational resources available. Hybrid models and ensembles are also common, combining the strengths of different architectures for improved performance. It is important to consider the specific requirements and challenges of each application when selecting a deep learning approach.

RELATED WORKS
One study applied a Convolutional Long Short-Term Memory (Conv-LSTM) network to real-time crowd anomaly detection (AD). A Deep Learning (DL)-centric strategy was adopted to anticipate violent acts and to help stakeholders flag such activities in real time. Conv-LSTM was used both to detect violent actions and to capture the frame. The suggested system produced better accuracy at a quicker rate. However, due to the difficulty of classifying individual or group activities, accuracy was still inadequate [20].
Another work demonstrated an AD module and a human detection module that together made up a supervised Local Distinguish Ability improving Network (LDA-Net). An inhibitory loss function and embedding were devised to reduce misclassification in highly imbalanced datasets. The simulation results showed that the supervised LDA-Net produced state-of-the-art results. However, the addition of a new axis gave the model considerable computational complexity [21].
An online AD technique for surveillance videos that uses transfer learning as well as continual learning has also been proposed. The algorithm combined the Feature Extraction (FE) power of neural-network-centered methodologies with statistical detection methods. Simulation outcomes showed high accuracy for the resulting system. Nevertheless, it was still challenging to learn to detect abnormalities promptly. A further approach centered on bidirectional prediction built its loss function from the real target frame and its bidirectional prediction frame. Moreover, with a focus on the foregrounds of the prediction error map, an anomaly score estimation approach centered on the SW scheme was built. The experimental outcomes showed better AD with higher scores. However, the approach depends heavily on assumption-centric data generation, so the false alarm rate was high [22].
A convolutional neural network (CNN)-centric, lightweight solution to AD has also been outlined. For sequence learning, a residual attention-centric LSTM was trained with the extracted spatial CNN features, and it proved efficient in both recognition and AD. Extensive experiments were conducted to validate the efficiency of the developed paradigm. However, modelling based on normal activity was impractical because "normal" is such a general term that it is challenging to classify everything that falls under it [23].
A data-driven adaptive AD method for human activities has also been presented. Behaviour modelling was obtained using the Consensus Novelty Detection Ensemble, an ensemble of novelty detection systems that includes a One-Class SVM. The simulation results demonstrated the good performance of the designed system. A drawback of such deep systems is that they require more data and a lot of computing power [24].

METHODS
This section presents details of our methods, dataset used for empirical study, our deep learning architecture, mechanisms, algorithm defined and evaluation procedure used.

Dataset
Public surveillance videos are collected from [25] for the empirical study in this paper. These videos contain many realistic anomalies as well as normal instances. The dataset, known as the UCF-Crime dataset, covers 13 classes of real-world anomalies and is widely used by researchers in computer vision applications, for example in [26]. Out of the 13 classes we have extracted four: fire, accident, robbery and normal. The fire class videos show fires intentionally set to properties. The accident class consists of videos of traffic accidents involving cyclists, pedestrians and vehicles. The robbery class consists of videos of thieves taking money unlawfully by threat or force; this class does not include threats that involve shooting. The normal class contains indoor and outdoor scenes that do not show any occurrence of crime.
Figure 1 shows the data distribution for the 4 classes extracted from the collected dataset for the research carried out in this paper. Out of the total number of surveillance videos (n = 16853), we have taken 3123 normal instances, 1563 fire instances, 2654 accident instances and 2276 robbery instances. Figure 2 shows an excerpt from each class of the collected dataset.
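As a quick check on the class distribution, the short sketch below sums the four reported counts and computes each class's share. The counts are taken from the text; the variable names and the imbalance ratio (largest class divided by smallest class) are illustrative additions.

```python
# Instance counts for the four extracted classes, as reported in the text.
class_counts = {"normal": 3123, "fire": 1563, "accident": 2654, "robbery": 2276}

total = sum(class_counts.values())
shares = {name: count / total for name, count in class_counts.items()}

# Imbalance ratio: size of the largest class divided by the smallest class.
imbalance_ratio = max(class_counts.values()) / min(class_counts.values())

for name, share in shares.items():
    print(f"{name:>8}: {class_counts[name]:5d} ({share:.1%})")
print(f"total: {total}, imbalance ratio: {imbalance_ratio:.2f}")
```

The normal class is roughly twice the size of the fire class, which is one reason to examine class balance before training.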

Our methodology
We propose a methodology and a novel CNN-based deep learning architecture known as VidAnomalyNet for efficient detection of anomalies from surveillance videos. Before going into the technical details of VidAnomalyNet, we describe the overall methodology, which sets out the modus operandi of our approach. An overview of the proposed methodology for anomaly detection from surveillance videos is illustrated in Figure 3.
Only 4 classes of the UCF-Crime dataset are used in our experiments: fire, accident, robbery and normal. The extracted dataset (from now on simply called the UCF-Crime dataset or the dataset) contains 9616 instances covering all 4 classes. The videos of each class are split into a training set (75%) and a test set (25%). Afterwards, the data is analysed to determine whether it is balanced and whether it needs augmentation. Data augmentation is carried out to improve the quality of the training process. The training data is then given to our proposed novel deep learning architecture known as VidAnomalyNet (explained in Section 3.3).

Reference | Approach | Techniques | Dataset
[27] | Deep Learning | DCNN | UCSD, CUHK, ShanghaiTech
[28] | Deep Learning | DTM technique | UCSD, Mall, UMN and MED
[29] | Deep Learning | Neural Network models | Custom dataset
[30] | Deep Learning | SFE technique | TUT 2016
[31] | Deep Learning and Bio-Inspired | CRN along with AntHocNet | Custom dataset
[32] | Deep Learning | IIN | UCF-Crime, UCSD
[33] | AI | DCNN | Custom dataset
[34] | Deep Learning | DCNN methods | Seven benchmark datasets
[35] | Deep Learning | DNN | UCF-Crime
[36] | Deep Learning | LMNN | Custom dataset

The dataset is subjected to pre-processing, which results in a 75% training set and a 25% test set for each class. The system then involves two phases. Since this is supervised learning, it has a training phase and a testing phase. In the former, our novel deep learning model named VidAnomalyNet is built with the architecture shown in Figure 5. The model is compiled and trained with different optimizers and parameters, as presented in Table 1. Afterwards the trained model is saved for reuse. In the testing phase, the saved model is applied to the test data to predict anomalies. Each prediction is one of four classes: normal, accident, fire or robbery.
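The per-class 75%/25% split described above can be sketched as follows. This is a minimal illustration, not the authors' pre-processing code: the function name `stratified_split`, the fixed random seed and the placeholder sample identifiers are all assumptions, and a real pipeline would list video or frame paths instead.

```python
import random

def stratified_split(items_by_class, train_fraction=0.75, seed=42):
    """Split each class separately so both subsets keep the class proportions."""
    rng = random.Random(seed)
    train, test = [], []
    for label, items in items_by_class.items():
        shuffled = list(items)
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_fraction)
        train += [(item, label) for item in shuffled[:cut]]
        test += [(item, label) for item in shuffled[cut:]]
    return train, test

# Placeholder identifiers standing in for the real samples of each class.
counts = {"normal": 3123, "fire": 1563, "accident": 2654, "robbery": 2276}
samples = {label: [f"{label}_{i}" for i in range(n)] for label, n in counts.items()}

train_set, test_set = stratified_split(samples)
print(len(train_set), len(test_set))
```

Splitting class by class, rather than shuffling the pooled data, keeps the smaller classes represented in the test set in the same proportion as in training.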

Proposed VidAnomalyNet architecture
VidAnomalyNet is based on the CNN model, as CNNs are found suitable for image analysis. It is made up of several kinds of layers: convolutional 2D layers, batch normalization layers, max pooling 2D layers, a flatten layer, dense layers, dropout layers and activation layers. Figure 5 shows the layers, their output shapes and their numbers of parameters.
The VidAnomalyNet architecture was designed through our empirical study to maximize performance in anomaly detection from surveillance videos. Our model includes multiple layers. In the context of video anomaly detection, separable 2D convolution can be employed to process video frames efficiently. A traditional 2D convolution applies each filter/kernel to the input across both spatial dimensions (width and height) and all input channels simultaneously. Separable convolution, on the other hand, decomposes the standard 2D convolution into two consecutive operations: a depthwise convolution that filters each input channel separately over the spatial dimensions, followed by a pointwise (1x1) convolution that combines the resulting channels. Compared to a standard 2D convolution, this leads to computational efficiency, with results available in Table 2. Each separable convolution is followed by batch normalization and max pooling 2D with pool size (2, 2). The softmax layer is made up of a dense layer followed by an activation layer. VidAnomalyNet is finally built with input width 128, height 128 and depth 3, and with 4 output classes.
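To make the parameter savings of separable convolution concrete, the sketch below counts the trainable parameters of a standard 2D convolution and of a depthwise-plus-pointwise separable convolution. It assumes a depth multiplier of 1 and a single bias vector per layer, which is a common convention; the layer sizes in the example (3x3 kernel, 32 input channels, 64 filters) are hypothetical and are not taken from Figure 5.

```python
def conv2d_params(kernel, c_in, c_out, bias=True):
    """Parameters of a standard 2D convolution with a square kernel."""
    return kernel * kernel * c_in * c_out + (c_out if bias else 0)

def separable_conv2d_params(kernel, c_in, c_out, bias=True):
    """Parameters of a depthwise convolution followed by a 1x1 pointwise one."""
    depthwise = kernel * kernel * c_in   # one spatial kernel per input channel
    pointwise = c_in * c_out             # 1x1 convolution that mixes channels
    return depthwise + pointwise + (c_out if bias else 0)

# Hypothetical layer sizes chosen only for illustration.
standard = conv2d_params(kernel=3, c_in=32, c_out=64)
separable = separable_conv2d_params(kernel=3, c_in=32, c_out=64)
print(standard, separable, f"{standard / separable:.1f}x fewer parameters")
```

For this hypothetical layer the separable variant needs several times fewer parameters, and the gap widens as the kernel size and the number of filters grow.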
The model is trained with three optimizers, namely SGD, Adam and RMSProp. The loss function used is sparse categorical cross-entropy.
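The loss can be illustrated with a small pure-Python sketch of softmax followed by sparse categorical cross-entropy, where the true class is given as an integer index rather than a one-hot vector. The class order matches the label encoding used for the confusion matrices (0 normal, 1 fire, 2 accident, 3 robbery); the logit values are made-up numbers, and the function names are illustrative rather than those of any particular library.

```python
import math

CLASSES = ["normal", "fire", "accident", "robbery"]

def softmax(logits):
    """Convert raw scores into probabilities (shifted for numerical stability)."""
    shift = max(logits)
    exps = [math.exp(value - shift) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_categorical_cross_entropy(probabilities, label_index):
    """Negative log-probability assigned to the true class index."""
    return -math.log(probabilities[label_index])

# Toy scores from the final dense layer for a single frame (illustrative).
logits = [0.5, 2.0, 0.1, -1.0]
probs = softmax(logits)

loss_if_fire = sparse_categorical_cross_entropy(probs, CLASSES.index("fire"))
loss_if_robbery = sparse_categorical_cross_entropy(probs, CLASSES.index("robbery"))
print(f"predicted: {CLASSES[probs.index(max(probs))]}")
print(f"loss if fire: {loss_if_fire:.3f}, loss if robbery: {loss_if_robbery:.3f}")
```

The loss is small when the true class receives a high probability and grows as the model assigns it less probability, which is what the optimizers minimize during training.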
In the proposed VidAnomalyNet architecture, the separable convolution 2D layer is preferred because it reduces the number of parameters without compromising performance. The rationale is that it factorizes the convolution, which reduces the number of parameters and hence the model size and computation time. The separable 2D convolution in Eq. (3) combines the pointwise convolution in Eq. (1) and the depthwise convolution in Eq. (2).
Here the pointwise approach is a regular convolution with a 1x1 kernel. The depthwise variant applies a single kernel to each input channel, so on its own it does not mix information across channels. Max pooling 2D layers are used to provide spatial invariance and to optimize feature maps. The pool size is (2, 2), and processing takes place over pooling windows of that size. Pooling exploits subsampling to optimize feature maps, as expressed in Eq. (4).
Subsampling averages the inputs and multiplies the result by a trainable scalar denoted β. It then adds a trainable bias denoted b, and the result is passed through a non-linearity. The max pooling function is expressed in Eq. (5).
The max pooling layer applies a window function denoted u(x, y) to a given input patch and computes the maximum of the neighbourhood. The outcome is an optimized feature map.
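The effect of max pooling with pool size (2, 2) can be sketched on a small feature map. The function below is a minimal pure-Python illustration that assumes non-overlapping windows and even height and width; the 4x4 input values are made up for the example.

```python
def max_pool_2x2(feature_map):
    """Non-overlapping 2x2 max pooling over a 2D list with even height and width."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [
        [
            max(
                feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1],
            )
            for c in range(0, cols, 2)
        ]
        for r in range(0, rows, 2)
    ]

# Toy 4x4 feature map; each 2x2 window is reduced to its maximum value.
feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [3, 6, 4, 8],
]
pooled = max_pool_2x2(feature_map)
print(pooled)
```

Each pooling step halves the height and width of the feature map while keeping the strongest response in every window.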
Batch normalization is a normalization technique applied between the layers of the proposed VidAnomalyNet model. Instead of using the full data, statistics are computed over multiple batches in order to make the learning process easier, adapt to learning rates and speed up training. Batch normalization is performed as in Eq. (6).
Eq. (6) uses the mean and the standard deviation of the outputs of the neurons.
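A minimal sketch of batch normalization over one mini-batch of activations is given below. It follows the widely used formulation with a learnable scale `gamma`, a learnable shift `beta` and a small constant `eps` for numerical stability; these three parameters are assumptions of this sketch, and the paper's Eq. (6) may use a simpler form without them.

```python
from statistics import mean, pstdev

def batch_normalize(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a batch to zero mean and unit variance, then scale and shift."""
    mu = mean(values)
    variance = pstdev(values) ** 2
    return [gamma * (v - mu) / (variance + eps) ** 0.5 + beta for v in values]

# Toy outputs of one neuron across a mini-batch of four samples.
activations = [2.0, 4.0, 6.0, 8.0]
normalized = batch_normalize(activations)
print([round(v, 3) for v in normalized])
```

After normalization the batch has a mean close to zero and a standard deviation close to one, which keeps the inputs of the next layer in a stable range during training.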
Dense layers are also used in the proposed VidAnomalyNet model. Each neuron in a dense layer receives input from all neurons of the previous layer, hence the name. The dense layers classify a given image based on the output received from the convolutional layers.

Architecture of MobileNet
In our empirical study, MobileNet and MobileNet with transfer learning serve as the existing models against which the proposed VidAnomalyNet model is compared. The original MobileNet V1 model [37,38] is used first and is then further improved with transfer learning, as discussed in Section 3.5. MobileNet V1 is made up of lightweight CNN layers for computer vision applications. It makes use of depthwise separable convolutional filters, where a single convolution is performed on every input channel.
A pointwise convolution filter then combines the results of the depthwise convolution linearly using 1x1 convolutions, as illustrated in Figure 6. Figure 7 shows the architectural layers of MobileNet V1.
As presented in Figure 7, the MobileNet architecture includes depthwise convolutional layers, batch normalization and ReLU activation across a number of layers. We trained this model with the proposed methodology and obtained results for the detection of abnormal activities from surveillance videos. We then also experimented with MobileNetV1 with transfer learning to improve the training process, as discussed in Section 3.5.

MobileNet based transfer learning architecture
Transfer learning (TL) is a machine learning technique that preserves the knowledge gained while solving one problem and reuses it later for some other related problem. In other words, TL is used to reuse knowledge and speed up training and detection. We propose a transfer learning architecture that reuses MobileNetV1. As presented in Figure 7, some additional layers are added to exploit the existing architecture and improve its performance in video anomaly detection.
As presented in Figure 8, the additional layers are global average pooling layers, two dropout layers and two dense layers.
Combining mobile networks with transfer learning retains the efficiency of lightweight architectures while still achieving good performance on video anomaly detection tasks, even with limited labeled data for the specific application domain. This makes the approach particularly appealing for deploying anomaly detection models on resource-constrained devices and in scenarios where obtaining large amounts of labeled data is challenging.

Performance evaluation method
The models are evaluated with precision, recall, F1-score and accuracy, computed from the numbers of true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN). Precision, expressed in Eq. (7), is the ratio of correctly classified anomalous instances to all instances classified as anomalous, whether correctly or incorrectly.

\[ \text{Precision } (p) = \frac{TP}{TP + FP} \tag{7} \]

Recall is the ratio of correctly classified anomalous instances to all actual anomalous instances, that is, those correctly classified plus those that were missed. This measure is expressed in Eq. (8).

\[ \text{Recall } (r) = \frac{TP}{TP + FN} \tag{8} \]

F1-score is the harmonic mean of precision and recall. This measure is expressed in Eq. (9).

\[ \text{F1-score} = \frac{2 \, (p \cdot r)}{p + r} \tag{9} \]

Accuracy is yet another widely used metric for performance evaluation. It is expressed in Eq. (10).

\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{10} \]

All these metrics take a value between 0.0 and 1.0, reflecting the lowest and highest performance respectively.
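For a four-class problem, these quantities can be derived from a confusion matrix by treating each class in turn as the positive class. The sketch below is one common way to do this and is not taken from the paper's implementation: the confusion matrix holds made-up counts, rows are assumed to be true labels and columns predicted labels, per-class scores are macro-averaged, and overall accuracy is computed as the fraction of correct predictions.

```python
def per_class_counts(confusion, k):
    """TP, FP, FN, TN for class k; rows are true labels, columns predictions."""
    n = len(confusion)
    tp = confusion[k][k]
    fp = sum(confusion[r][k] for r in range(n)) - tp
    fn = sum(confusion[k]) - tp
    tn = sum(sum(row) for row in confusion) - tp - fp - fn
    return tp, fp, fn, tn

def evaluate(confusion):
    """Macro-averaged precision, recall and F1, plus overall accuracy."""
    n = len(confusion)
    precisions, recalls, f1_scores = [], [], []
    for k in range(n):
        tp, fp, fn, _ = per_class_counts(confusion, k)
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        precisions.append(p)
        recalls.append(r)
        f1_scores.append(2 * p * r / (p + r) if p + r else 0.0)
    total = sum(sum(row) for row in confusion)
    return {
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1_scores) / n,
        "accuracy": sum(confusion[k][k] for k in range(n)) / total,
    }

# Toy confusion matrix (0 normal, 1 fire, 2 accident, 3 robbery); made-up counts.
confusion = [
    [8, 1, 1, 0],
    [0, 9, 0, 1],
    [1, 0, 7, 2],
    [0, 0, 1, 9],
]
report = evaluate(confusion)
print({name: round(value, 3) for name, value in report.items()})
```

Macro-averaging gives every class equal weight regardless of its size; a weighted average would instead favour the larger classes.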

RESULTS AND DISCUSSIONS
This section presents experimental results of the proposed model, VidAnomalyNet, along with the existing MobileNetV1 model and its transfer learning variant. In all experiments the surveillance video dataset is divided, for each class, into 75% for training and 25% for testing. Each model is evaluated with three different optimizers: Adam, SGD and RMSProp. The batch size is 64 and the number of epochs is 100. The learning rate is 1e-2 with the SGD optimizer and 1e-3 (0.001) with the Adam and RMSProp optimizers. The models are implemented using Python 3.9 and Jupyter.

Data analysis and augmentation
All the samples used for the empirical study are pooled and analysed to determine whether data augmentation is needed. As presented in Figure 10, Figure 11 and Figure 12, data analysis and data augmentation are used to improve training quality before the deep learning models are applied to anomaly detection from surveillance videos.

Performance evaluation
This section presents the performance of the proposed VidAnomalyNet model and compares it with MobileNetV1 and MobileNetV1 with transfer learning.
Both accuracy and loss are used to visualize the performance of the proposed model. Higher accuracy and lower loss both denote better performance. Figure 13 presents training accuracy, validation accuracy, training loss and validation loss against the number of epochs.
As presented in Figure 14, the confusion matrix shows the prediction details of the proposed model in terms of TP, FP, TN and FN. It shows ground truth and predicted labels for all four classes. In the confusion matrix, 0 indicates the normal class, 1 indicates fire, 2 indicates accident and 3 indicates robbery.
As presented in Figure 15 and Figure 16, performance and related confusion matrix are provided for the proposed model with Adam optimizer.
As presented in Figure 17 and Figure 18, performance and related confusion matrix are provided for the proposed model with RMSProp optimizer.

Comparison with existing deep learning models
We compared the performance of our model VidAnomalyNet with the MobileNetV1 model and its transfer learning variant.
As presented in Table 3, the anomaly detection performance of VidAnomalyNet with public surveillance videos is provided with different optimizers.
As presented in Table 4, the anomaly detection performance of VidAnomalyNet using the SGD optimizer on public surveillance videos is compared with state-of-the-art models. As presented in Figure 21, the performance of the VidAnomalyNet model is evaluated with different optimizers, each of which influences the model's performance. With the Adam optimizer it showed precision 90%, recall 92%, F1-score 91% and accuracy 91%. With the RMSProp optimizer it showed precision 91%, recall 91%, F1-score 91% and accuracy 91%. With the SGD optimizer it exhibited precision 93%, recall 94%, F1-score 94% and accuracy 96.5%. The highest accuracy, 96.5%, is thus achieved when the VidAnomalyNet model is used with the SGD optimizer.
As presented in Figure 22, the performance of the VidAnomalyNet model with the SGD optimizer is compared against state-of-the-art models. MobileNetV1 is the existing model used in the experiments, both on its own and with transfer learning. MobileNetV1 achieved precision 78%, recall 87%, F1-score 81% and accuracy 92.10%. MobileNetV1 with transfer learning performed better than MobileNetV1 in precision (91%), F1-score (82%) and accuracy (95%), although its recall was lower (78%). The proposed VidAnomalyNet model with the SGD optimizer outperformed the existing models with precision 93%, recall 94%, F1-score 94% and accuracy 96.50%. The results show that transfer learning has an influence on MobileNetV1: with transfer learning, MobileNetV1 improved its detection accuracy. However, neither existing model exceeded the proposed model, which we attribute to the novel configuration of layers in VidAnomalyNet and the way they function. Therefore, the proposed model can be used in computer vision applications where public video surveillance is to be done in real time. Since it is a multi-class classifier with the highest accuracy among the compared models, its saved model can act as a real-time detector of anomalies in public surveillance videos.
Surveillance video anomaly detection has applications across many domains. Specific applications in which it plays a crucial role include the following.
(1) Security and Public Safety: identifying unusual activities or behaviors in public spaces, transportation hubs, and critical infrastructure to enhance overall security and public safety.
(2) Retail Loss Prevention: detecting suspicious activities such as shoplifting, fraud, or other unauthorized behaviors in retail environments to minimize losses.
(3) Crowd Monitoring: monitoring and detecting anomalous behaviors in crowded areas, such as events, stadiums, or public gatherings, to ensure crowd safety and manage potential security threats.
(4) Airport Security: identifying unusual or threatening behaviors, abandoned objects, or unauthorized access in airport terminals to enhance aviation security.
(5) Critical Infrastructure Protection: monitoring critical infrastructure sites, such as power plants, water facilities, or transportation networks, for any abnormal activities that could indicate security threats.
(6) Traffic Surveillance: detecting abnormal traffic patterns, accidents, or incidents in real time to improve traffic management and enhance road safety.
(7) Border Security: monitoring borders and detecting unusual or unauthorized crossings, smuggling activities, or other security threats.
(8) Perimeter Security: securing the perimeter of facilities, military bases, or sensitive areas by identifying intruders or suspicious activities.
(9) Banking and ATMs: detecting anomalous behavior around ATMs or within bank branches, such as skimming devices or suspicious transactions, to prevent fraud.

Figure 2. An excerpt from the dataset reflecting all four classes

Figure 3. Overview of the proposed methodology for anomaly detection from surveillance videos

The deep learning model is trained with the training videos covering the 4 classes. After training is complete, the model is persisted to secondary storage for reuse, so that it does not have to be rebuilt every time a new surveillance video arrives for anomaly detection. The saved model is then used to detect anomalies in the test videos. The proposed deep learning model VidAnomalyNet is then evaluated and compared against state-of-the-art models, namely MobileNetV1 and MobileNetV1 with transfer learning. The final outcome of anomaly detection is the classification of test videos into four classes: normal, accident, fire and robbery. Figure 4 shows the flow of the proposed methodology.

Figure 4. Flow of the proposed methodology

Figure 10. A sample picked for analysis

Figure 14. Confusion matrix reflecting predictions of the VidAnomalyNet model with SGD optimizer

Figure 15. Training and validation accuracy and loss of the VidAnomalyNet model with Adam optimizer

Figure 21. Performance of the VidAnomalyNet model with different optimizers

Figure 22. Performance of the VidAnomalyNet model compared against the state of the art

Table 1. Parameters for different optimizers used for model training

Table 2. Parameters for different optimizers used for model training

Table 3. Performance of VidAnomalyNet with different optimizers

Table 4. Performance of VidAnomalyNet with SGD optimizer compared against existing models