Background-Foreground Segmentation Using Multi-Scale Attention Net (MA-Net): A Deep Learning Approach

Vishruth Boraiah Gowda*, Gopalakrishna Madigondanahalli Thimmaiah, Megha Jaishankar, Chaitra Yuvaraj Lokkondra

Department of Information Science and Engineering, SJB Institute of Technology, Bengaluru 560060, India

Affiliated to Visvesvaraya Technological University, Belagavi 590018, Karnataka, India

Department of Artificial Intelligence and Machine Learning, SJB Institute of Technology, Bengaluru 560060, India

Department of Artificial Intelligence and Machine Learning, Ramaiah Institute of Technology, Bengaluru 560060, India

Department of Computer Science and Engineering, SJB Institute of Technology, Bengaluru 560060, India

Corresponding Author Email: vishruth1711@gmail.com
Page: 557-565 | DOI: https://doi.org/10.18280/ria.370304

Received: 21 February 2023 | Revised: 8 April 2023 | Accepted: 18 April 2023 | Available online: 30 June 2023

© 2023 IIETA. This article is published by IIETA and is licensed under the CC BY 4.0 license (http://creativecommons.org/licenses/by/4.0/).

Abstract: 

Background subtraction serves as a critical foundation for numerous computer vision tasks, and a variety of traditional techniques have been proposed. In recent years, deep neural network architectures have emerged as a promising approach, with the UNET architecture being a notable example. However, UNET is considered an older architecture with limitations. To address these limitations, a novel neural network technique called Multi-scale Attention Net (MA-Net) is proposed, which incorporates a self-attention mechanism for adaptively integrating local features with their global dependencies. The attention mechanism within MA-Net enables the capture of complex contextual dependencies. Two distinct blocks are developed for the MA-Net: the Position-wise Attention Block (PAB) and the Multi-scale Fusion Attention Block (MFAB). While the PAB models the interdependencies between features along the spatial dimensions, representing pixel dependencies in a global view, the MFAB capitalizes on multi-scale semantic feature fusion to capture channel dependencies between feature maps, effectively segmenting the foreground from the background. The proposed method is evaluated on the CDNET 2014 dataset and demonstrates improved performance under shadow, dynamic background, and illumination challenges. This study highlights the potential of MA-Net for advancing the field of background-foreground segmentation in computer vision tasks.

Keywords: 

background subtraction, self-attention mechanism, multi-scale feature fusion, deep neural network architecture, background-foreground segmentation

1. Introduction

Video surveillance has gained prominence in recent times, with a wide range of applications including intelligent traffic monitoring, optical motion capture, and human-computer interfaces. To achieve effective foreground detection in video surveillance, it is crucial to have a robust background model. Our work aims to design an efficient background model for this purpose. This process of designing a background model for identifying the foreground pixels is known as background subtraction, as illustrated in the study [1].

Background subtraction forms the basis of many computer vision pipelines and is used as a preprocessing step in applications such as object tracking, surveillance, and human-computer interaction. Traditional techniques for background subtraction have been proposed, including GMM (Gaussian mixture models) and KDE (kernel density estimation) [2, 3]. However, these methods are sensitive to variations in lighting and camera motion, and they have difficulty dealing with complex backgrounds and dynamic scenes.

More recently, deep learning-based techniques have been applied to background subtraction and have shown promising results [4-6].

Feature extraction is an important step in comparing the background image and the video frames. The right features must be chosen to reveal the necessary information. Most algorithms use grayscale values or the intensities of the three RGB color channels as features. In a few cases, pixel intensities are combined with other manually designed features, as in the studies [7, 8]. It is also important to select the appropriate feature region: features can be extracted from pixels, blocks, or patterns. However, pixel-wise features often produce noisy segmentations because they do not take local correlation into account, whereas block-wise or pattern-wise features may be insensitive to small changes.

Segmentation is a key step in processing each video frame against a background model. It is achieved by extracting features from corresponding pixels or regions in the current frame and the background model, and adopting a distance measure, such as the Euclidean distance, to determine how similar they are. Each pixel is then assigned a background or foreground label by comparing its similarity score against a threshold. The entire background subtraction system is composed of these building blocks, and its reliability is dependent on and limited by the performance of each individual block: if one module does not perform well, the overall system will not function optimally. Background subtraction has been extensively studied, resulting in a wide range of algorithms, and many of the best methods currently in use are based on algorithms developed long ago, some of which are described earlier.
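For illustration only, the pixel-wise decision rule described above can be sketched in a few lines of Python with NumPy; the threshold value is an arbitrary example rather than a recommended setting.

import numpy as np

def pixelwise_segmentation(frame, background, threshold=30.0):
    # Per-pixel Euclidean distance in RGB space between the current frame
    # and the background image (both H x W x 3 arrays).
    diff = frame.astype(np.float32) - background.astype(np.float32)
    distance = np.sqrt((diff ** 2).sum(axis=-1))
    # Pixels farther from the background model than the threshold are foreground (1).
    return (distance > threshold).astype(np.uint8)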

There are various segmentation models in the deep learning literature, such as UNet, UNet++, PAN, PSPNet, and MA-Net. However, only the UNet architecture has so far been used for designing a background subtraction model. With this in mind, we adopt the MA-Net architecture because of its recent success in medical image segmentation [9].

Specifically, our proposed MA-Net is composed of two main blocks: the PAB (Position-wise Attention Block) and the MFAB (Multi-scale Fusion Attention Block). The PAB models interdependencies between features along the spatial dimensions, representing pixel dependencies in a global view. The MFAB relies on fused multi-scale semantic features to capture channel dependencies between feature maps, effectively segmenting the foreground from the background. Our experiments on the CDNET 2014 dataset show that MA-Net outperforms state-of-the-art methods in challenging scenarios, particularly under shadow, illumination changes, and dynamic backgrounds. In summary, our contribution in this paper is a deep neural network architecture, MA-Net (Multi-scale Attention Net), that combines a self-attention mechanism with multi-scale feature fusion to adaptively integrate local features with their global dependencies for background subtraction. We illustrate our method's effectiveness on the CDNET 2014 dataset, and our results demonstrate that it performs better than a few of the state-of-the-art methods under challenging scenarios.

The remainder of this paper is organized as follows. Section 2 reviews related work. Section 3 presents the proposed approach and architecture. Section 4 discusses the results and a comparative study. Finally, Section 5 summarizes the discussion with a conclusion and an overview of future possibilities.

2. Related Work

Traditional background subtraction methods include GMM (Gaussian mixture models) and KDE (kernel density estimation) [2, 3]. These methods are sensitive to variations in lighting and camera motion, and they have difficulty dealing with complex backgrounds and dynamic scenes. Wang et al. [6] surveyed recent advances in background subtraction, including deep learning-based methods, reviewing architectures such as DCNNs (deep convolutional neural networks), RNNs (recurrent neural networks), and GANs (generative adversarial networks).

Piccardi [4] presented a review of background subtraction techniques, covering both traditional and deep learning-based methods, discussing the pros and cons of each and highlighting the potential of deep learning-based techniques for background subtraction.

Other related works include the use of the UNet architecture for background subtraction [10]. UNet is a popular architecture for image segmentation tasks, but it is considered an older architecture. Our proposed MA-Net builds on the UNet architecture but also incorporates a self-attention mechanism to adaptively integrate local features with their global dependencies, which is not present in UNet.

Another line of related work is the use of attention mechanisms in image segmentation tasks [11, 12]. These methods have shown that attention mechanisms can improve segmentation performance by selectively focusing on important regions of the input. In our proposed MA-Net, we also use an attention mechanism for background subtraction, but in a different way from previous works: we additionally incorporate a fusion of multi-scale features to capture channel dependencies between feature maps, which helps to segment the foreground from the background effectively.

In this paper, we propose a deep neural network architecture, MA-Net (Multi-scale Attention Net), that combines a self-attention mechanism with multi-scale feature fusion to adaptively integrate local features with their global dependencies for background subtraction. Our method differs from previous works in that the self-attention mechanism captures complex contextual dependencies, while the multi-scale feature fusion captures channel dependencies between feature maps. Our experimental results on the CDNET 2014 dataset demonstrate the effectiveness of our method in challenging scenarios.

3. Proposed Approach

The proposed approach is a deep neural network architecture called MA-Net (Multi-scale Attention Net) for background subtraction. It is based on the UNet architecture but also incorporates a self-attention mechanism for adaptively integrating local features with their global dependencies. The attention mechanism in MA-Net allows it to capture complex contextual dependencies, and the multi-scale feature fusion allows it to capture the channel dependencies between feature maps for effectively segmenting the foreground from the background. The detailed architecture of MA-Net is discussed in Section 3.1, and the architecture of the proposed model is shown in Figure 1.

Figure 1. Proposed MA-Net architecture

3.1 MA-Net architecture

The MA-Net (Multi-scale Attention Net) architecture proposed in this paper [13, 14] has two stages: the encoder stage and the decoder stage. The encoder stage is responsible for extracting features from the input image, while the decoder stage is responsible for reconstructing the foreground mask from the extracted features.

The encoder stage comprises several convolutional layers designed to learn multi-scale features from the input image, each followed by batch normalization and a ReLU activation. To capture the dependencies between the features, a self-attention mechanism is applied after the convolutional layers, allowing the network to adaptively integrate local features with their global dependencies. The self-attention mechanism is implemented using a PAB (Position-wise Attention Block) [15], which uses self-attention to model interdependencies between features along the spatial dimensions, i.e., the spatial dependencies between pixels in a global view. The PAB captures a wide range of rich spatial contextual information across local feature maps, as shown in Figure 2. Given a local feature map I $\in$ R^(H×W×1024) as input, it is fed into a 3 × 3 convolution layer to obtain I' $\in$ R^(H×W×2048). Then, 1 × 1 convolution layers are used to generate A $\in$ R^(H×W×256), B $\in$ R^(H×W×256), and C $\in$ R^(H×W×2048), respectively. A and B are reshaped to R^(N×256) and R^(256×N), where N is the number of pixels, and a matrix multiplication is performed between them. The softmax function is then applied to obtain the spatial attention map P $\in$ R^(N×N), where P_ji denotes the impact of the i-th position on the j-th position in the feature map, as shown in Eq. (1).

$P_{j i}=\frac{\exp \left(A_i \cdot B_j\right)}{\sum_{i=1}^N \exp \left(A_i \cdot B_j\right)}$           (1)

The decoder comprises several up-sampling layers, which reconstruct the foreground mask from the extracted features. Each up-sampling layer is followed by a convolutional layer and a batch normalization layer. To further improve the accuracy of the foreground mask, a self-attention mechanism is also used in the decoder stage. We implement it using an MFAB (Multi-scale Fusion Attention Block), which uses fused multi-scale semantic features to capture channel dependencies between feature maps; its detailed architecture is shown in Figure 3.

Figure 2. Position-wise attention block: used for segmenting the foreground from the background

Figure 3. Multi-scale fusion attention block: captures the channel dependencies between feature maps

Figure 4. The architecture of the Res-block in our proposed method includes convolution layers, Group Normalization (GN), and a residual connection

Meanwhile, C $\in$ R^(H×W×2048) is reshaped to C $\in$ R^(N×2048). A matrix multiplication is performed between the spatial attention map and C $\in$ R^(N×2048), and the result is reshaped to O' $\in$ R^(H×W×2048). After that, an element-wise sum operation is performed between I' and O'. Finally, the output O $\in$ R^(H×W×512) is obtained through a 3 × 3 convolution layer, as shown in Eq. (2).

$O_j=\alpha \sum_{i=1}^N\left(P_{j i} C_i\right)+I_j^{\prime}$           (2)
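For concreteness, a minimal PyTorch sketch of the position-wise attention computation of Eqs. (1) and (2) is given below. The channel sizes follow the text (a 1024-channel input, 2048 channels after the 3 × 3 convolution, 256-dimensional A and B, a 512-channel output), but the class and layer names are illustrative and not taken from the authors' implementation.

import torch
import torch.nn as nn

class PositionAttentionBlock(nn.Module):
    # Sketch of the PAB: spatial self-attention over a local feature map.
    def __init__(self, in_channels=1024, mid_channels=2048, key_channels=256, out_channels=512):
        super().__init__()
        self.conv_in = nn.Conv2d(in_channels, mid_channels, kernel_size=3, padding=1)    # I -> I'
        self.query = nn.Conv2d(mid_channels, key_channels, kernel_size=1)                # A
        self.key = nn.Conv2d(mid_channels, key_channels, kernel_size=1)                  # B
        self.value = nn.Conv2d(mid_channels, mid_channels, kernel_size=1)                # C
        self.conv_out = nn.Conv2d(mid_channels, out_channels, kernel_size=3, padding=1)  # O' -> O
        self.alpha = nn.Parameter(torch.zeros(1))                                        # learnable scale

    def forward(self, x):
        i_prime = self.conv_in(x)                                   # B x 2048 x H x W
        b, c, h, w = i_prime.shape
        n = h * w
        a = self.query(i_prime).view(b, -1, n).permute(0, 2, 1)     # B x N x 256
        k = self.key(i_prime).view(b, -1, n)                        # B x 256 x N
        attn = torch.softmax(torch.bmm(a, k), dim=-1)               # B x N x N, Eq. (1)
        v = self.value(i_prime).view(b, c, n)                       # B x 2048 x N
        out = torch.bmm(v, attn.permute(0, 2, 1)).view(b, c, h, w)  # attention-weighted sum
        out = self.alpha * out + i_prime                            # residual sum with I', Eq. (2)
        return self.conv_out(out)                                   # B x 512 x H x W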

The network produces a foreground mask as the final output, which indicates the pixels that belong to the foreground object.

In our approach, a deep neural network architecture is proposed for background subtraction that combines a ResNet50 encoder network with a self-attention mechanism in the decoder. To capture the spatial and channel dependencies of the feature maps, we adopt two mechanisms based on the self-attention concept: the PAB (Position-wise Attention Block) and the MFAB (Multi-scale Fusion Attention Block). The PAB captures spatial dependencies between feature maps, while the MFAB captures channel dependencies between feature maps by merging high- and low-level feature information.
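The paper does not specify the MFAB layer by layer; the sketch below assumes a squeeze-and-excitation style channel attention applied to the fusion of a high-level and a low-level feature map, which is the general idea conveyed by Figure 3, and every layer choice (kernel sizes, reduction ratio) is an assumption.

import torch
import torch.nn as nn

class MultiScaleFusionAttentionBlock(nn.Module):
    # Sketch of an MFAB-style block: fuse high- and low-level features,
    # then reweight channels with a squeeze-and-excitation attention.
    def __init__(self, high_channels, low_channels, out_channels, reduction=16):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(high_channels + low_channels, out_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )
        self.channel_attn = nn.Sequential(            # squeeze-and-excitation over channels
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(out_channels, out_channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels // reduction, out_channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, high, low):
        # Upsample the high-level map to the low-level map's spatial size before fusing.
        high = nn.functional.interpolate(high, size=low.shape[2:], mode="bilinear", align_corners=False)
        fused = self.fuse(torch.cat([high, low], dim=1))   # multi-scale feature fusion
        return fused * self.channel_attn(fused)            # channel-dependency reweighting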

The ResNet50 architecture is based on the residual learning framework, which aims to alleviate the problem of vanishing gradients in very deep networks by using skip connections that bypass some of the layers. This allows the network to learn residual functions, which are the differences between the desired output and the input, rather than learning the underlying mapping function from the input to the output. In our semantic segmentation model, the ResNet50 architecture is used as the encoder, which is responsible for extracting features from the input image.
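As a small illustration of this residual idea, and of the Res-block with Group Normalization sketched in Figure 4, a block of the following form can be used; the number of convolutions and the group size are assumptions rather than the authors' exact configuration.

import torch.nn as nn

class ResBlock(nn.Module):
    # Residual block in the spirit of Figure 4: two 3x3 convolutions with
    # Group Normalization and ReLU, plus an identity skip connection.
    def __init__(self, channels, groups=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.GroupNorm(groups, channels),   # channels must be divisible by groups
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.GroupNorm(groups, channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + x)    # learn the residual, add the input back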

Inspired by the position attention module in the study [16], we use the PAB to record the spatial interdependence between any two positions in the feature maps. The PAB captures a broad variety of rich spatial contextual information across local feature maps by using a 3 × 3 convolution layer, 1 × 1 convolution layers, matrix multiplication, and a softmax function. The resulting features are then passed to the decoder part of the network, which uses up-sampling layers to increase the spatial resolution and produce the final segmentation map.

Overall, our proposed approach combines the ResNet50 encoder network with a self-attention mechanism in the decoder part, capturing dependencies at both the spatial and channel levels of the feature maps, which improves the overall performance of background subtraction; the Res-block used in our method is shown in Figure 4.
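As a practical starting point, an MA-Net with a ResNet-50 encoder that is similar in spirit to the architecture described here can be instantiated with the open-source segmentation_models_pytorch package; this is a convenience sketch rather than the authors' implementation, and the parameter values are illustrative.

import torch
import segmentation_models_pytorch as smp

# MA-Net decoder on top of a ResNet-50 encoder; one output channel for the foreground mask.
model = smp.MAnet(
    encoder_name="resnet50",
    encoder_weights="imagenet",   # ImageNet-pretrained encoder weights
    in_channels=3,
    classes=1,
)

frames = torch.randn(8, 3, 224, 224)                        # a mini-batch of RGB frames
logits = model(frames)                                      # B x 1 x 224 x 224
foreground_mask = (torch.sigmoid(logits) > 0.5).float()     # binary foreground map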

4. Experimental Results

To achieve a better feature representation for background-foreground segmentation, we introduce a new network named MA-Net (Multi-scale Attention-Net) that incorporates a dual attention mechanism. In this section, the performance of the proposed MA-Net architecture is evaluated and compared with state-of-the-art methods on the CDNET 2014 dataset. CDNET 2014 is a challenging dataset for background subtraction, as it contains videos with various scenarios such as shadows, illumination changes, and dynamic backgrounds. In most background subtraction datasets, the number of background pixels is substantially larger than the number of foreground pixels. This class imbalance creates significant issues for commonly used loss functions such as mean-squared error and cross-entropy. A viable option for imbalanced binary datasets is the Jaccard index, which is a better choice as a loss function for probability maps: it measures the similarity between the predicted probability map and the ground-truth probability map. The Jaccard index used as a loss function for probability maps is given in Eq. (3).

$L(P, Q)=1-J(P, Q)=1-\frac{|P \cap Q|}{|P \cup Q|}$             (3)

where, P and Q are the predicted and ground truth probability maps, respectively, J(P, Q) is the Jaccard index, and L(P, Q) represents the loss function.
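For illustration, a soft (differentiable) version of this loss for probability maps can be written in a few lines of PyTorch; the small epsilon is an assumption added for numerical stability and is not specified in the paper.

import torch

def soft_jaccard_loss(pred, target, eps=1e-7):
    # L = 1 - |P ∩ Q| / |P ∪ Q|, computed softly on foreground-probability maps in [0, 1].
    intersection = (pred * target).sum()
    union = pred.sum() + target.sum() - intersection
    return 1.0 - (intersection + eps) / (union + eps)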

4.1 Dataset and evaluation details

To evaluate the performance of our proposed MA-Net (Multi-scale Attention-Net) network, we used the CDNet-2014 dataset. It contains 53 videos captured in natural settings, grouped into 11 categories covering various challenging scenarios such as night videos, dynamic backgrounds, and shadows. The spatial resolution of the videos ranges from 320×240 to 720×576 pixels. Every video is annotated within a region of interest in which each pixel is labeled as either 1) foreground, 2) background, 3) hard shadow, or 4) unknown motion.

The performance of our network is evaluated using a 4-fold cross-validation strategy, as it is intuitive and simple: within every category, all videos from the dataset are grouped into four folds as evenly as possible. The proposed BGS algorithm is trained on three of the folds and tested on the remaining fold, and this process is repeated for all four combinations. The categories, along with the videos chosen as training and testing samples, are listed in Table 1, where a "Yes" entry in a fold column indicates that the video is used for testing in that fold; the videos with blank entries in that category are used for training. Results obtained on the full CDNet-2014 dataset can be uploaded to the evaluation server for comparison with state-of-the-art methods.
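To make the protocol concrete, the fold assignment of Table 1 can be expressed as a per-category mapping and iterated over, as sketched below for two of the categories (the remaining categories are handled the same way); the dictionary contents mirror Table 1 and the helper function is purely illustrative.

# Per-category fold assignment in the spirit of Table 1 (two categories shown).
folds = {
    "baseline":     {"highway": 1, "pedestrians": 2, "office": 3, "pets2006": 4},
    "cameraJitter": {"badminton": 1, "traffic": 2, "boulevard": 3, "sidewalk": 4},
}

def split(test_fold):
    # Return (train_videos, test_videos) for one cross-validation round.
    train, test = [], []
    for category, videos in folds.items():
        for video, fold in videos.items():
            (test if fold == test_fold else train).append((category, video))
    return train, test

train_videos, test_videos = split(test_fold=1)   # train on folds 2-4, test on fold 1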

When the algorithm's performance is measured, pixels with unknown-motion labels are ignored and pixels with hard-shadow labels are treated as background. This treatment of hard shadows as background is in line with the CDNet-2014 change detection task. For evaluating performance, we use the metrics listed by CDNet-2014, namely F-score (F1), precision (Pr), and recall (Re).
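Under this protocol, per-frame counts and scores can be accumulated as in the sketch below; the numeric label codes are placeholders standing in for CDNet-2014's foreground, background, hard-shadow, and unknown-motion annotations, not the dataset's actual encoding.

import numpy as np

FOREGROUND, BACKGROUND, HARD_SHADOW, UNKNOWN = 1, 0, 2, 3   # placeholder label codes

def frame_counts(pred, gt):
    # Accumulate TP/FP/FN, ignoring unknown-motion pixels and treating
    # hard-shadow pixels as background, as in the CDNet-2014 protocol.
    valid = gt != UNKNOWN
    gt_fg = (gt == FOREGROUND) & valid        # hard shadow therefore counts as background
    pred_fg = (pred == 1) & valid
    tp = np.logical_and(pred_fg, gt_fg).sum()
    fp = np.logical_and(pred_fg, ~gt_fg).sum()
    fn = np.logical_and(~pred_fg, gt_fg).sum()
    return tp, fp, fn

def scores(tp, fp, fn, eps=1e-7):
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    return precision, recall, f1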

4.2 Training details

To train the MA-Net network for background-foreground segmentation, we used the Adam optimizer with a learning rate of 0.0005, β1 = 0.9, and β2 = 0.99. The mini-batch size was set to 8 and the number of epochs was set to 50. The loss function used in the training process was the binary cross-entropy loss, which is commonly used for binary classification problems. The binary cross-entropy loss is calculated as the negative log-likelihood of the binary classification problem and is given by Eq. (4):

$\mathrm{L}=-[\mathrm{y} * \log (\mathrm{p})+(1-\mathrm{y}) * \log (1-\mathrm{p})]$             (4)

where y is the ground truth label (0 for background and 1 for foreground), p is the predicted probability of the foreground, and log is the natural logarithm. The binary cross-entropy loss penalizes the model more when it predicts a higher probability for the wrong class. During training, data augmentation techniques were used to improve the robustness of the model, including random data augmentation at the beginning of each epoch, global illumination changes, and random Gaussian noise. The inputs were 224×224 pixel patches randomly cropped from the input frames, which allows parallel GPU processing during training. In the evaluation step, thresholding with a threshold of θ = 0.5 was applied to the output of the sigmoid layer of the network to obtain binary maps. No scaling or cropping was applied to the inputs during evaluation. The visual results obtained for different categories of the CDNet-2014 dataset are shown in Table 2 and Table 3, respectively.
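Putting these training choices together (Adam with a learning rate of 0.0005, β1 = 0.9, β2 = 0.99, batch size 8, 50 epochs, binary cross-entropy, and a 0.5 threshold at evaluation), a skeleton training loop might look as follows; the data loader and model construction are placeholders, not the authors' code.

import torch
import torch.nn as nn

def train(model, train_loader, epochs=50, device="cuda"):
    # Skeleton training loop with the optimizer and loss settings reported above.
    model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=5e-4, betas=(0.9, 0.99))
    criterion = nn.BCEWithLogitsLoss()                   # binary cross-entropy, Eq. (4)
    for epoch in range(epochs):
        model.train()
        for frames, masks in train_loader:               # 224x224 random crops, batch size 8
            frames, masks = frames.to(device), masks.to(device)
            optimizer.zero_grad()
            loss = criterion(model(frames), masks)
            loss.backward()
            optimizer.step()

@torch.no_grad()
def predict_mask(model, frame, threshold=0.5):
    # Binary foreground map: sigmoid of the network output thresholded at 0.5.
    return (torch.sigmoid(model(frame)) > threshold).float()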

4.3 Comparison with state of the art

The performance of the proposed MA-Net architecture is measured using several metrics: recall, accuracy, F1-score, and precision. The results obtained on the four folds are tabulated in Tables 4, 5, 6, and 7, and the overall result in Table 8 is obtained by averaging them. The proposed MA-Net architecture is compared with a few state-of-the-art techniques. The results in Table 8 demonstrate that the proposed MA-Net architecture outperforms the other state-of-the-art techniques in terms of recall, F1-score, accuracy, and precision, signifying its effectiveness in dealing with challenging scenarios such as dynamic backgrounds, shadows, and illumination changes. The results also illustrate that the proposed MA-Net architecture copes well with various types of noise, such as salt-and-pepper, speckle, and Gaussian noise, indicating that it handles noise effectively and maintains high performance. In summary, the experimental results signify the effectiveness of the proposed MA-Net architecture for background subtraction and its ability to handle challenging scenarios such as dynamic backgrounds, shadows, and illumination changes, as well as its robustness to different types of noise, as shown in Table 9.

Table 1. Categorization of the CDNet-2014 dataset into four folds

Category | Video | Fold-1 Test | Fold-2 Test | Fold-3 Test | Fold-4 Test
Baseline | Highway | Yes | – | – | –
Baseline | Pedestrians | – | Yes | – | –
Baseline | Office | – | – | Yes | –
Baseline | Pets2006 | – | – | – | Yes
Camera Jitter | Badminton | Yes | – | – | –
Camera Jitter | Traffic | – | Yes | – | –
Camera Jitter | Boulevard | – | – | Yes | –
Camera Jitter | Sidewalk | – | – | – | Yes
Shadow | Copy Machine | Yes | – | – | –
Shadow | Bus Station | – | Yes | – | –
Shadow | Cubicle | – | – | Yes | –
Shadow | People In Shade | – | – | Yes | –
Shadow | Bungalows | – | – | – | Yes
Shadow | Backdoor | – | – | – | Yes
Dynamic Background | Overpass | Yes | – | – | –
Dynamic Background | Fountain02 | Yes | – | – | –
Dynamic Background | Fountain01 | – | Yes | – | –
Dynamic Background | Boats | – | Yes | – | –
Dynamic Background | Canoe | – | – | Yes | –
Dynamic Background | Fall | – | – | – | Yes

Table 2. Proposed method visual results on standard dataset categories (Baseline, Camera Jitter, Dynamic Background, Low Frame Rate, Night), showing the current frame, the ground truth, and the prediction of the network for each category.

Table 3. Proposed method visual results on standard dataset categories (PTZ, Shadow, Snowfall, Thermal, Turbulence), showing the current frame, the ground truth, and the prediction of the network for each category.

Table 4. Results obtained from the evaluation metrics on all categories of Fold-1

Category | Test Accuracy | Test Precision | Test Recall | Test F-Score | Test TP | Test FP | Test TN | Test FN
Baseline | 0.9986 | 0.999619 | 0.985434 | 0.992476 | 1580094 | 603.0005 | 15512442 | 23356
Camera Jitter | 0.993275 | 0.859632 | 0.975335 | 0.913836 | 629889 | 102854 | 16914238 | 15929
Bad Weather | 0.986292 | 0.998001 | 0.579998 | 0.733636 | 337487 | 676.0005 | 17294592 | 244389
Dynamic Background | 0.997761 | 0.985296 | 0.946439 | 0.965477 | 1089813 | 16264 | 33641910 | 61675
Intermittent Object Motion | 0.985342 | 0.977089 | 0.757867 | 0.853628 | 743934 | 17444 | 16406484 | 237681
Low Framerate | 0.99915 | 0.893557 | 0.146079 | 0.251107 | 1276 | 152.0005 | 8941883 | 7459
Night Videos | 0.958127 | 0.488634 | 0.183775 | 0.267096 | 46667 | 48838 | 5813508 | 207268
PTZ | 0.970965 | 0.376545 | 0.899484 | 0.53086 | 272710 | 451533 | 15846218 | 30475
Shadow | 0.991531 | 0.973203 | 0.905536 | 0.938151 | 1141859 | 31441 | 16484728 | 119117
Thermal | 0.982596 | 0.917748 | 0.761219 | 0.832187 | 1516803 | 135942 | 33020306 | 475794
Turbulence | 0.998832 | 0.964105 | 0.680632 | 0.79794 | 41443 | 1543 | 17912985 | 19446
Full | 0.989159 | 0.901564 | 0.837148 | 0.868162 | 7404206 | 808421 | 1.98E+08 | 1440358

Table 5. Results obtained from the evaluation metrics on all categories of Fold-2

Category | Test Accuracy | Test Precision | Test Recall | Test F-Score | Test TP | Test FP | Test TN | Test FN
Baseline | 0.999418 | 0.983866 | 0.967578 | 0.975654 | 205379 | 3368.001 | 17405157 | 6882
Camera Jitter | 0.984665 | 0.965294 | 0.887087 | 0.924539 | 1614669 | 58054 | 15309642 | 205524
Bad Weather | 0.986301 | 0.9956 | 0.886606 | 0.937947 | 1815975 | 8026 | 15484073 | 232258
Dynamic Background | 0.997623 | 0.88194 | 0.951395 | 0.915352 | 339568 | 45456 | 26019379 | 17348
Intermittent Object Motion | 0.992784 | 0.837646 | 0.589735 | 0.692162 | 115075 | 22304 | 13967944 | 80055
Low Framerate | 0.994702 | 0.98418 | 0.82561 | 0.897948 | 95121 | 1529.001 | 3964452 | 20092
Night Videos | 0.976229 | 0.817799 | 0.490757 | 0.61341 | 373071 | 83118 | 18938496 | 387124
PTZ | 0.965026 | 0.357007 | 0.28851 | 0.319124 | 146191 | 263299 | 17066428 | 360520
Shadow | 0.995418 | 0.990207 | 0.898334 | 0.942035 | 642146 | 6351 | 16526547 | 72673
Thermal | 0.965639 | 0.932061 | 0.922012 | 0.927009 | 3776288 | 275260 | 12935925 | 319416
Turbulence | 0.997912 | 0.933114 | 0.537989 | 0.682488 | 40304 | 2889 | 17879802 | 34612
Full | 0.986614 | 0.922832 | 0.840429 | 0.879705 | 9160918 | 766045 | 1.76E+08 | 1739373

Table 6. Results obtained from the evaluation metrics on all categories of Fold-3

Category | Test Accuracy | Test Precision | Test Recall | Test F-Score | Test TP | Test FP | Test TN | Test FN
Baseline | 0.991454 | 0.907625 | 0.980571 | 0.942689 | 1230276 | 125213 | 16124753 | 24377
Camera Jitter | 0.956779 | 0.812174 | 0.80199 | 0.80705 | 1290449 | 298433 | 12368896 | 318610
Bad Weather | 0.991395 | 0.975804 | 0.808483 | 0.884299 | 470973 | 11678 | 13728628 | 111566
Dynamic Background | 0.995128 | 0.945234 | 0.980433 | 0.962512 | 1085086 | 62869 | 16178459 | 21656
Intermittent Object Motion | 0.961368 | 0.960358 | 0.619081 | 0.752849 | 84330 | 3481.001 | 1293531 | 51888
Low Framerate | 0.987601 | 0.982232 | 0.750637 | 0.850958 | 373746 | 6761.001 | 10054271 | 124159
Night Videos | 0.981729 | 0.664892 | 0.348118 | 0.456976 | 102184 | 51501 | 12946772 | 191349
PTZ | 0.992236 | 0.174975 | 0.956022 | 0.29581 | 29217 | 137761 | 17749138 | 1344.001
Shadow | 0.996261 | 0.988588 | 0.965322 | 0.976817 | 2729988 | 31513 | 31793931 | 98072
Thermal | 0.976745 | 0.971838 | 0.776347 | 0.863162 | 1291933 | 37438 | 15912881 | 372184
Turbulence | 0.999652 | 0.985066 | 0.692815 | 0.813489 | 13654 | 207.0005 | 17974836 | 6054
Full | 0.988308 | 0.920852 | 0.868245 | 0.893775 | 8702507 | 747987 | 1.66E+08 | 1320588

Table 7. Results obtained from the evaluation metrics on all categories of Fold-4

Category | Test Accuracy | Test Precision | Test Recall | Test F-Score | Test TP | Test FP | Test TN | Test FN
Baseline | 0.994933 | 0.78705 | 0.944173 | 0.858482 | 274557 | 74286 | 17499351 | 16234
Camera Jitter | 0.959826 | 0.37334 | 0.899115 | 0.527603 | 119888 | 201235 | 5009391 | 13452
Bad Weather | 0.997485 | 0.991376 | 0.957936 | 0.974369 | 852761 | 7418 | 16939704 | 37446
Dynamic Background | 0.984158 | 0.877624 | 0.947832 | 0.911378 | 1452159 | 202489 | 16091859 | 79926
Intermittent Object Motion | 0.974049 | 0.994486 | 0.797389 | 0.885098 | 164121 | 910.001 | 1435294 | 41702
Low Framerate | 0.991096 | 0.972456 | 0.917482 | 0.94417 | 1209866 | 34268 | 14716633 | 108815
Night Videos | 0.990564 | 0.924558 | 0.586284 | 0.717551 | 211353 | 17246 | 17256430 | 149143
PTZ | 0.996714 | 0.939039 | 0.919141 | 0.928983 | 223342 | 14499 | 10134292 | 19648
Shadow | 0.993938 | 0.989319 | 0.96282 | 0.975889 | 3201427 | 34565 | 22737303 | 123626
Thermal | 0.990517 | 0.922443 | 0.704822 | 0.799081 | 335013 | 28167 | 17261480 | 140303
Turbulence | 0.996071 | 0.984128 | 0.859583 | 0.917649 | 388651 | 6268 | 17293684 | 63488
Full | 0.991832 | 0.937168 | 0.914133 | 0.925507 | 8434630 | 565498 | 1.56E+08 | 792291

Table 8. Comparison of methods for unseen videos from CDNet-2014

Method | Recall | Precision | F-Score
Ours | 0.8649 | 0.9206 | 0.8917
BSUV-Net [16] | 0.8203 | 0.8113 | 0.7868
WisenetMD [18] | 0.8179 | 0.7535 | 0.7668
BSUV-Net 2.0 [11] | 0.8136 | 0.9011 | 0.8387
FgSegNet v2 [12] | 0.5119 | 0.4859 | 0.3715
SWCD [17] | 0.7839 | 0.7527 | 0.7583

Table 9. Comparison of methods according to the per-category F-Score for unseen videos from CDNet-2014

Method | Bad Weather | Low Frame Rate | Night | PTZ | Thermal | Shadow | Int. Obj. Motion | Camera Jitter | Dynamic Background | Baseline | Turbulence | Overall
Ours | 0.8825 | 0.7360 | 0.5137 | 0.5186 | 0.8553 | 0.9582 | 0.7959 | 0.7932 | 0.9386 | 0.9423 | 0.8028 | 0.8917
BSUV-Net 2.0 [11] | 0.8844 | 0.7902 | 0.5857 | 0.7037 | 0.8932 | 0.9562 | 0.8263 | 0.9004 | 0.9057 | 0.9620 | 0.8174 | 0.8387
FgSegNet v2 [12] | 0.3277 | 0.2482 | 0.2800 | 0.3503 | 0.6038 | 0.5295 | 0.2002 | 0.4266 | 0.3634 | 0.6926 | 0.0643 | 0.3715
BSUV-Net [16] | 0.8713 | 0.6797 | 0.6987 | 0.6282 | 0.8581 | 0.9233 | 0.7499 | 0.7743 | 0.7967 | 0.9693 | 0.7051 | 0.7868
SWCD [17] | 0.8233 | 0.7374 | 0.5807 | 0.4545 | 0.8581 | 0.8779 | 0.7092 | 0.7411 | 0.8645 | 0.9214 | 0.7735 | 0.7583
WisenetMD [18] | 0.8616 | 0.6404 | 0.5701 | 0.3367 | 0.8152 | 0.8984 | 0.7264 | 0.8228 | 0.8376 | 0.9487 | 0.8304 | 0.7535

5. Conclusion

We have developed a new deep-learning technique for background subtraction in previously unseen videos, together with a video-agnostic evaluation protocol that treats every video in the dataset as unseen. The cross-validation strategy was chosen because it is highly beneficial for the evaluation of BGS algorithms. The input to the network includes the current frame, two reference frames from different time scales, and semantic information for all three frames. To enhance the generalization capacity of the network, we formulated a simple yet effective illumination-change model. Our experimental results on the CDNet-2014 dataset show that the network outperforms current state-of-the-art video-agnostic background subtraction algorithms on all performance metrics, indicating the potential of deep-learning algorithms for unseen or unlabeled videos. As future work, we plan to explore different architectures and techniques to improve performance on challenging categories such as "PTZ" and "Night videos," and to investigate different background models for the reference frames. In this work, we focused on delivering a high-performance, supervised background subtraction algorithm, especially for unseen videos, without placing strong emphasis on computational complexity.

References

[1] Babaee, M., Dinh, D.T., Rigoll, G. (2018). A deep convolutional neural network for video sequence background subtraction. Pattern Recognition, 76: 635-649. https://doi.org/10.1016/j.patcog.2017.09.040

[2] Elgammal, A., Harwood, D., Davis, L. (2000). Non-parametric model for background subtraction. In: Vernon, D. (eds) Computer Vision — ECCV 2000. ECCV 2000. Lecture Notes in Computer Science, vol 1843. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45053-X_48

[3] Zivkovic, Z., Van Der Heijden, F. (2006). Efficient adaptive density estimation per image pixel for the task of background subtraction. Pattern Recognition Letters, 27(7): 773-780. https://doi.org/10.1016/j.patrec.2005.11.005

[4] Piccardi, M. (2004). Background subtraction techniques: a review. In 2004 IEEE International Conference on Systems, Man and Cybernetics (IEEE Cat. No. 04CH37583), The Hague, Netherlands, pp. 3099-3104. https://doi.org/10.1109/ICSMC.2004.1400815

[5] Oktay, O., Schlemper, J., Folgoc, L.L., Lee, M., Heinrich, M., Misawa, K., Mori, K., McDonagh, S., Hammerla, N.Y., Kainz, B., Glocker, B., Rueckert, D. (2018). Attention u-net: Learning where to look for the pancreas. arXiv preprint arXiv:1804.03999. https://doi.org/10.48550/arXiv.1804.03999

[6] Wang, F., Jiang, M., Qian, C., Yang, S., Li, C., Zhang, H., Wang, X., Tang, X. (2017). Residual attention network for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3156-3164.

[7] Heikkila, M., Pietikainen, M. (2006). A texture-based method for modeling the background and detecting moving objects. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(4): 657-662. https://doi.org/10.1109/TPAMI.2006.68

[8] Bilodeau, G.A., Jodoin, J.P., Saunier, N. (2013). Change detection in feature space using local binary similarity patterns. In 2013 International Conference on Computer and Robot Vision, Regina, SK, Canada, pp. 106-112. https://doi.org/10.1109/CRV.2013.29

[9] Fan, T., Wang, G., Li, Y., Wang, H. (2020). Ma-net: A multi-scale attention network for liver and tumor segmentation. IEEE Access, 8: 179656-179665. https://doi.org/10.1109/ACCESS.2020.3025372

[10] Lee, S.H., Lee, G.C., Yoo, J., Kwon, S. (2019). WisenetMD: Motion detection using dynamic background region analysis. Symmetry, 11(5): 621. https://doi.org/10.3390/sym11050621

[11] Tezcan, M.O., Ishwar, P., Konrad, J. (2021). BSUV-Net 2.0: Spatio-temporal data augmentations for video-agnostic supervised background subtraction. IEEE Access, 9: 53849-53860. https://doi.org/10.1109/ACCESS.2021.3071163

[12] Lim, L.A., Keles, H.Y. (2018). Foreground segmentation using convolutional neural networks for multiscale feature encoding. Pattern Recognition Letters, 112: 256-262. https://doi.org/10.1016/j.patrec.2018.08.002

[13] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.

[14] Fu, J., Liu, J., Tian, H., Li, Y., Bao, Y., Fang, Z., Lu, H. (2019). Dual attention network for scene segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3146-3154. 

[15] Ronneberger, O., Fischer, P., Brox, T. (2015). U-Net: convolutional networks for biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W., Frangi, A. (eds) Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015. MICCAI 2015. Lecture Notes in Computer Science(), vol 9351. Springer, Cham. https://doi.org/10.1007/978-3-319-24574-4_28

[16] Tezcan, O., Ishwar, P., Konrad, J. (2020). BSUV-Net: A fully-convolutional neural network for background subtraction of unseen videos. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Snowmass, CO, USA, pp. 2774-2783. https://doi.org/10.1109/WACV45572.2020.9093464

[17] Işık, Ş., Özkan, K., Günal, S., Gerek, Ö.N. (2018). SWCD: A sliding window and self-regulated learning-based background updating method for change detection in videos. Journal of Electronic Imaging, 27(2): 023002-023002. https://doi.org/10.1117/1.JEI.27.2.023002

[18] St-Charles, P.L., Bilodeau, G.A., Bergevin, R. (2016). Universal background subtraction using word consensus models. IEEE Transactions on Image Processing, 25(10): 4768-4781. https://doi.org/10.1109/TIP.2016.2598691