VADNet: Visual-Based Anti-Cheating Detection Network in FPS Games

ABSTRACT


INTRODUCTION
FPS games are a highly competitive genre with demanding requirements for reaction speed, and they have become one of the mainstream categories in the current gaming market [1][2][3][4]. At the same time, cheating has emerged as a major challenge in the FPS gaming industry, harming the gaming ecosystem and players' overall experience. Recent data indicates that in the first half of 2023, a cumulative 3.2 billion cheating detections were recorded in mobile games, a 40% year-over-year increase. Among these, the most prevalent type of cheat used in FPS games is the wallhack, accounting for 58.33% of the total, while the aimbot, despite representing only 8.33%, has the most significant impact on user experience, as shown in Figure 1. Detecting and identifying cheating has therefore become an urgent problem for the FPS gaming industry.
Shooting games differ significantly from card games and other genres that do not depend on real-time client-side calculations. To ensure a smooth gaming experience in shooting games, much of the gameplay calculation logic needs to be executed locally on the client side [5]. This makes it impractical to adopt server-side validation methods and lays the groundwork for cheating through hacks. Cross-process hacks with elevated privileges can extract crucial game logic data, enabling features such as wallhacks through external rendering processes. In this scenario, the game logic executes without any anomalies, and the game remains unaware of the hacking process. Alternatively, modifying shader data can influence the GPU rendering process, enabling features like perspective rendering, character coloring, and the removal of grass and trees. Such hacks are difficult for conventional anti-cheat measures to detect. Therefore, this study adopts a visual approach based on deep learning to discern and counteract them.
Challenge 1: Recognizing Visual Disparities The use of deep learning for visual detection of cheating has been identified as a promising approach [6][7][8], attributed to significant visual differences between normal players and cheaters. This method, which does not require additional privacy information and relies solely on reliable image data [9], faces the challenge of accurately identifying these visual differences.
Challenge 2: Quantifying Cheating Behavior The often subtle and difficult-to-detect behavioral differences manifested by cheating behavior necessitate real-time capabilities in cheating detection systems [10,11]. Furthermore, appropriate cheating detection indicators must be quantified and a reasonable detection system designed, so that the system can respond accurately and rapidly to various situations [12,13].
Challenge 3: Enhancing Credibility It is crucial for an anti-cheat system to detect cheating accurately without falsely identifying legitimate players as cheaters. The critical challenge lies in improving target detection and cheat classification accuracy, reducing false-positive rates to avoid negatively impacting legitimate players, and enhancing the overall credibility of the cheat detection system [14,15].
The contributions of this work are as follows: (1) A visual-based FPS game anti-cheat network has been developed, achieving comprehensive cheat detection through supervised learning on datasets of normal gaming and cheating scenarios. (2) By detecting and categorizing player behavior, a set of key performance indicators has been introduced. Monitoring the relative relationships between actions such as aiming frequency, effective aiming duration, and kill count captures the differences between cheating players and regular players more accurately. (3) Extensive experiments and analyses conducted on a real online FPS game dataset have demonstrated the network's effectiveness and potential in detecting players using cheats.
Reference [16] categorized methods for detecting game cheats into three types: player client detection, game network communication detection, and game remote server detection. In the case of PlayerUnknown's Battlegrounds (PUBG), the BattlEye system employed by the game continuously collects information on all processes in the user's memory and randomly extracts files from the player's computer for analysis. However, this inevitably raises a series of privacy concerns. Yu et al. [17] investigated the data packet commands that players' clients send to the server. They monitored traffic volumes during various time periods to identify changes in traffic when cheats were used compared to normal gameplay. However, cunning cheaters often employ techniques such as traffic obfuscation and encryption to bypass detection. Some games, like "Fantasy Westward Journey" and "Dungeon & Fighter," intermittently present image captchas or numeric quizzes during gameplay to determine whether users are present [18]. However, the effectiveness of these anti-cheat methods, based on pop-up event detection, has gradually diminished due to advancements in facial recognition and image processing technologies.

Deep learning methods
Traditional cheating detection systems often struggle to be effective against new vulnerabilities or sophisticated cheaters.
With the advancement of artificial intelligence (AI) technology, methods based on image processing and AI have been widely explored in the field of cheating detection. Galli et al. [19] developed AI agents resembling human players by analyzing human player behavior using various methods. However, this approach has low accuracy and requires human involvement, resulting in inefficiency. Spijkerman and Marie Ehlers [20] applied Support Vector Machines (SVMs), decision trees, and Naive Bayes machine learning models to analyze players' mouse and keyboard operations. By integrating learning features into SVM models, they achieved superior cheating detection results. However, relying solely on SVMs for cheating detection may overlook crucial information such as players' aiming frequency, hit rates, and pre-aim positions [21]. Some researchers have used Recurrent Neural Networks (RNNs) to detect cheating. However, the sequential computation of RNNs leads to high computational complexity during training and inference. This limitation restricts the scalability and accuracy of RNNs when dealing with long sequences or large-scale tasks [22][23][24].
In comparison to traditional image processing methods, deep learning can automatically learn and extract features from images without relying on manually designed feature extractors [25][26][27]. Through the combination and training of multiple layers of neural networks, deep learning excels at extracting advanced features from complex images. Deep learning has achieved significant success in areas such as image recognition, object detection, image enhancement, and image reconstruction. In the context of game cheating detection, deep learning's capabilities are evident, as exemplified by the automatic recognition of cheating behavior patterns, such as "aimbot," in players of games like PUBG. These models leverage extensive datasets of player behavior features to accurately identify key cheating indicators, providing essential auxiliary criteria for the detector's judgment [28][29][30].

Preliminary
This section provides an overview of object detection and the important definitions used in the employed models. Additionally, a concise summary of commonly used symbols is given in Table 1.

Notations
Batch Normalization (BN)
BN normalizes the mean and variance of features across the examples within each mini-batch, aiming to prevent issues such as vanishing or exploding gradients.
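As a minimal sketch of the normalization described above, the following Python function normalizes a single feature across a mini-batch and then applies a learnable scale and shift. The parameter names `gamma`, `beta`, and `eps` are conventional choices assumed here for illustration; they are not taken from the paper.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a mini-batch, then scale and shift it.

    gamma, beta and eps are illustrative names (assumptions), not the paper's notation.
    """
    n = len(batch)
    mean = sum(batch) / n
    # Biased (population) variance over the mini-batch.
    var = sum((v - mean) ** 2 for v in batch) / n
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in batch]

normalized = batch_norm([1.0, 2.0, 3.0, 4.0])
```

After normalization the values have approximately zero mean and unit variance, which keeps activations in a range where gradients are less likely to vanish or explode.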

Max pooling
Pooling layers downsample each input feature map using a 2×2 max-pooling window with a stride of 2, reducing the spatial dimensions of the input data.
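The 2×2, stride-2 pooling described above can be illustrated with a short pure-Python sketch. The feature map is represented as a list of rows purely for illustration; a real implementation would operate on tensors.

```python
def max_pool_2x2(fmap):
    """2x2 max pooling with stride 2 on a 2D feature map given as a list of rows."""
    rows, cols = len(fmap), len(fmap[0])
    return [
        [
            max(fmap[r][c], fmap[r][c + 1], fmap[r + 1][c], fmap[r + 1][c + 1])
            for c in range(0, cols - 1, 2)
        ]
        for r in range(0, rows - 1, 2)
    ]

pooled = max_pool_2x2([
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
])
```

Each output value is the maximum of one non-overlapping 2×2 window, so a 4×4 input becomes a 2×2 output.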

Effective targeting
The effective targeting time is the duration during which object detection simultaneously detects an aiming action and recognizes a person in the viewfinder.

METHODOLOGY
This section introduces the proposed vision-based anti-cheating detection model, as shown in Figure 2, which consists of four key modules: the data preprocessing module, backbone module, neck module, and head module. Images from the dataset are first concatenated and padded through the data preprocessing module. Subsequently, the backbone module splits the images and applies convolutional operations to extract features from them. Pooling operations are then used to merge the feature vectors. The neck module serves as a connecting module that further optimizes and fuses features to adapt them to the downstream tasks. Finally, the head module calculates the loss function and classifies the indicators of cheating through the classifier module for output.

Data preprocessing module
The data preprocessing module primarily involves scaling the input images to the network's input size and normalizing them. During the model training phase, this module employs Mosaic data augmentation to concatenate multiple images into a new complete image for data input, using random scaling, random cropping, and random arrangement. This approach not only enhances the training speed of the model but also reduces its memory requirements.
The module utilizes adaptive anchor box calculation based on the aspect ratio between the bounding boxes and the anchor boxes.

Backbone module
The backbone module introduces the focus module to segment the image, splitting the high-resolution image (feature map) into multiple low-resolution images or feature maps.
The input $x \in \mathbb{R}^{n \times c \times w \times h}$ passes through the focus layer to give $\hat{x}$, where the channel count is quadrupled compared to the original RGB three-channel mode. The final result $\hat{x}$ is a feature map with twofold down-sampling and no loss of information, as depicted in Figure 3. The formula for the subsequent convolution is as follows:
$$ y_j = f\Big(\sum_{i=1}^{c} W^{(i,j)} * \hat{x}_i + b_j\Big), \quad j = 1, \dots, c' $$
The input feature map comprises $c$ input channels ($\hat{x}_1 : \hat{x}_2 : \hat{x}_3 : \dots : \hat{x}_c$) and $c'$ output channels ($y_1 : y_2 : y_3 : \dots : y_{c'}$). The weight parameters of this CONV layer have the shape [filter_height, filter_width, in_channels, out_channels], denoted by $W^{(c,c')} \in \mathbb{R}^{k_h \times k_w \times c \times c'}$. Here, $f$ represents the activation function, and $b_{c'}$ is the bias term for the output feature maps of the same size.
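To make the focus operation concrete, the sketch below slices every channel of an image into four interleaved sub-images and stacks them along the channel axis, so the channel count is quadrupled while width and height are halved. The ordering of the four slices follows the widely used YOLOv5-style Focus layer, which is an assumption here since the paper does not specify it; `focus_slice` is a hypothetical helper name.

```python
def focus_slice(image):
    """Space-to-depth slicing: image[c][h][w] -> 4*c channels of size (h/2, w/2).

    The order of the four (row offset, column offset) pairs is an assumption
    borrowed from the common YOLOv5-style Focus layer.
    """
    out = []
    for dy, dx in ((0, 0), (1, 0), (0, 1), (1, 1)):
        for channel in image:
            # Take every second row starting at dy, every second column starting at dx.
            out.append([row[dx::2] for row in channel[dy::2]])
    return out

# One 4x4 channel with values 0..15.
single_channel = [[[r * 4 + c for c in range(4)] for r in range(4)]]
sliced = focus_slice(single_channel)
```

Every input pixel appears exactly once in the output, which is why the down-sampling loses no information.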
The output feature map then passes through several CSP1_X and CSP2_X layers. The structure of CSP1_X is illustrated in Figure 4. As shown in Figure 5, SPP performs max pooling on each feature map using three different sizes of pooling kernels to obtain predetermined feature map sizes. Finally, all feature maps are flattened into feature vectors and fused.
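The pooling-to-fixed-size idea behind SPP can be sketched as follows. This illustration follows the classic spatial pyramid pooling formulation (pooling into fixed grids and concatenating); the grid sizes `(1, 2, 4)` are hypothetical and not taken from the paper, and the authors' implementation may instead use fixed kernel sizes with stride 1.

```python
import math

def spp(fmap, bins=(1, 2, 4)):
    """Max-pool a 2D map into b x b grids for each b in bins and concatenate the results.

    The bin sizes are illustrative assumptions, not values from the paper.
    """
    h, w = len(fmap), len(fmap[0])
    vector = []
    for b in bins:
        for i in range(b):
            r0, r1 = (i * h) // b, math.ceil((i + 1) * h / b)
            for j in range(b):
                c0, c1 = (j * w) // b, math.ceil((j + 1) * w / b)
                vector.append(
                    max(fmap[r][c] for r in range(r0, r1) for c in range(c0, c1))
                )
    return vector

grid = [[r * 4 + c for c in range(4)] for r in range(4)]
features = spp(grid, bins=(1, 2))
```

The length of the output vector depends only on the bin sizes, not on the input resolution, which is what allows the pooled features to be flattened and fused.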

Neck module
The Neck network is a crucial component in object detection algorithms, responsible for further optimizing and fusing features extracted by the backbone network to better adapt to the requirements of object detection tasks, as illustrated in Figure 6. The FPN is a top-down process that transfers and fuses high-level feature information through upsampling to obtain feature maps for prediction, allowing the network to perform object detection at multiple scales. In the bottom-up stage, images are input into the backbone network, and features with different scales of information are extracted using CSP1_X, CSP2_X, and Conv. In the top-down stage, the feature maps obtained at higher levels are transmitted to the lower levels through upsampling, enriching the semantic information in the lower-level feature maps. Finally, in the Lateral Connection process, the features obtained by upsampling the higher-level feature maps are fused with the lower-level feature maps, and the fusion can be a simple addition or a 1×1 convolution. After fusion, a fused feature map is generated at each scale, forming a feature pyramid. This feature pyramid contains semantic information at different scales, enabling the model to detect objects at different scales.
Although FPN has already integrated shallow features once, it still cannot achieve satisfactory segmentation results. Therefore, PAN is introduced, enhancing the network's perception of multi-scale information through a mechanism that fuses features along both lateral and contextual paths. As shown in Figure 6, it involves downsampling N2 (N2 and P2 are the same feature map, so N2 already contains a considerable amount of low-level features). The downsampled feature map is then fused with P3 to obtain N3. Therefore, N3 contains more low-level features than P3, and this pattern continues for N4, N5, and so forth.
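The two fusion passes can be illustrated with a deliberately simplified sketch that uses one-dimensional lists as stand-ins for feature maps at three scales. Fusion is done by element-wise addition, upsampling by nearest-neighbor repetition, and downsampling by pairwise max; these choices, and all function names, are assumptions made for illustration rather than the paper's actual layers.

```python
def upsample(x):
    """Nearest-neighbor upsampling by a factor of 2."""
    return [v for v in x for _ in range(2)]

def downsample(x):
    """Stride-2 downsampling; pairwise max is used as a stand-in for a strided convolution."""
    return [max(x[i], x[i + 1]) for i in range(0, len(x) - 1, 2)]

def fuse(a, b):
    """Fusion by simple element-wise addition."""
    return [u + v for u, v in zip(a, b)]

def fpn_pan(c_low, c_mid, c_high):
    # Top-down pass (FPN): semantic information flows from high to low levels.
    p_high = c_high
    p_mid = fuse(c_mid, upsample(p_high))
    p_low = fuse(c_low, upsample(p_mid))
    # Bottom-up pass (PAN): low-level detail flows back to the higher levels.
    n_low = p_low
    n_mid = fuse(p_mid, downsample(n_low))
    n_high = fuse(p_high, downsample(n_mid))
    return n_low, n_mid, n_high

n_low, n_mid, n_high = fpn_pan([1, 1, 1, 1], [2, 2], [3])
```

In this toy run, the lowest output level equals the lowest pyramid level, while each higher output level accumulates the downsampled content of the level below it, mirroring how N3 carries more low-level information than P3.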

Head module
The model has three main loss functions: Classification Loss (cls_loss), responsible for determining whether the classification of anchor boxes matches the annotations; Localization Loss (box_loss), which measures the error between predicted boxes and annotated boxes (GIoU); and Confidence Loss (obj_loss), which calculates the network's confidence. Binary cross-entropy loss functions are employed for both classification and localization losses. In the bounding box loss, $w^{gt}$ and $h^{gt}$ are the width and height of the ground truth bounding box, and $w$ and $h$ are the width and height of the predicted bounding box. Additionally, the output end of the Head module adopts sigmoid as the activation function, addressing the issue of slow weight updates in the loss function.
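For reference, the standard form of the binary cross-entropy loss can be written with the symbols the paper uses for it ($x$ for the sample, $y$ for the label, $a$ for the predicted output, and $n$ for the total number of samples). The exact expression used by the authors is not recoverable from the text, so the following is the textbook form and should be read as an assumption:

```latex
L_{\mathrm{BCE}} = -\frac{1}{n} \sum_{x} \Big[ y \ln a + (1 - y) \ln (1 - a) \Big]
```

With a sigmoid output, the gradient of this loss with respect to the pre-activation is proportional to $a - y$, which is the usual reason this pairing avoids slow weight updates when the prediction is far from the label.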

Classifier module
The classifier analyzes how the ratios of aiming duration, aiming frequency, and kill effectiveness differ between normal players and those suspected of cheating, and quantifies data standards to ascertain whether a player uses cheats. The duration of effective aiming is calculated by recognizing the period during which the player aims and keeps the target within sight. The duration ratio is then obtained by dividing the duration of effective aiming by TA, the total time with the sight open, which is calculated from the continuous duration of the aiming action detected through target recognition. The ratio of effective aiming instances is the number of times effective aiming occurs divided by the total number of times the sight is opened.
The ratio of scoped kills is the number of kills divided by the number of times the sight is opened.
Normal players often scope ineffectively, i.e., with no one in sight. For cheating players, however, the ratios of effective scoping occurrences and effective scoping duration are significantly higher than those of normal players, and the scoped kill ratio is also higher. By quantifying and statistically analyzing the differences in these three indicators between normal players and cheaters, we can classify whether cheating is involved.
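The three indicators described above can be computed as in the following sketch. The field names, the `AimStats` container, and especially the decision thresholds are hypothetical; the paper does not state numeric thresholds, so the values below exist only to make the example runnable.

```python
from dataclasses import dataclass

@dataclass
class AimStats:
    scope_events: int      # number of times the sight was opened
    effective_events: int  # scope events during which a person was in sight
    scope_time: float      # TA: total time with the sight open, in seconds
    effective_time: float  # time with the sight open and a person in sight
    kills: int             # number of kills

def aim_ratios(s):
    """Return (effective-instance ratio, effective-duration ratio, scoped-kill ratio)."""
    if s.scope_events == 0 or s.scope_time == 0:
        return 0.0, 0.0, 0.0
    return (
        s.effective_events / s.scope_events,
        s.effective_time / s.scope_time,
        s.kills / s.scope_events,
    )

def is_suspicious(s, thresholds=(0.9, 0.9, 0.5)):
    """Flag a player only when all three ratios exceed the (hypothetical) thresholds."""
    return all(r > t for r, t in zip(aim_ratios(s), thresholds))

normal = AimStats(scope_events=20, effective_events=8, scope_time=60.0,
                  effective_time=15.0, kills=3)
cheater = AimStats(scope_events=20, effective_events=19, scope_time=60.0,
                   effective_time=57.0, kills=12)
```

Requiring all three ratios to be high at once is one conservative design choice for limiting false positives; in practice the thresholds would have to be derived from the statistical differences observed between normal and cheating players.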

Datasets
Our dataset consists of videos collected from online FPS games provided by a network company, ensuring privacy protection.Frames from the videos were processed and cut into numerous images, and the self-made dataset was created through preprocessing and labeling, resulting in training, testing, and validation sets.The basic statistics of the dataset are summarized in Table 2.

Experiment settings
The experiment runs on a device equipped with an NVIDIA Tesla T4 16GB GPU. The model is trained for 100 epochs with a batch size of 16, and the images are resized to 640×640 pixels. Multi-scale training is applied, treating the dataset as a single category. The SGD optimizer and synchronized batch normalization are utilized. A quarter-sized data loader is employed, and a cosine learning rate scheduler is used. Label smoothing is set to 0.0, and the early stopping patience is 100 epochs.

Evaluation metrics
Three common metrics are employed for evaluation, including precision, recall, and F1 score. Additionally, a confusion matrix is plotted to illustrate the correct recognition scenarios (see Table 3).
P (Precision) refers to the proportion of correctly predicted positive samples (TP) among all samples predicted as positive (TP+FP):
$$ P = \frac{TP}{TP + FP} $$
An increase in P indicates that the model's judgments of "predicted positive" are more reliable, meaning that the model more accurately identifies positive instances.

Recall
Recall, starting from the true positive labels, calculates the proportion of correctly predicted positive samples (TP) among all true positive samples (TP+FN):
$$ R = \frac{TP}{TP + FN} $$
A higher recall is desirable as it indicates that the model makes fewer false negatives and has a lower probability of missing actual positive instances.

F1-Score
$$ F_\beta = \frac{(1 + \beta^2) \, P \, R}{\beta^2 P + R} $$
When β=1, the term is referred to as the F1-Score, which assigns equal importance to Recall and Precision, amalgamating these two metrics into a single measure.
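The three metrics can be computed directly from the confusion-matrix counts, as in the short sketch below. The function names are illustrative, and the counts in the usage lines are made-up numbers rather than results from the paper.

```python
def precision(tp, fp):
    """Proportion of predicted positives that are correct."""
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    """Proportion of actual positives that are found."""
    return tp / (tp + fn) if tp + fn else 0.0

def f_beta(p, r, beta=1.0):
    """Weighted harmonic mean of precision and recall; beta=1 gives the F1-Score."""
    b2 = beta * beta
    denom = b2 * p + r
    return (1 + b2) * p * r / denom if denom else 0.0

# Illustrative counts only: 8 true positives, 2 false positives, 8 false negatives.
p = precision(8, 2)
r = recall(8, 8)
f1 = f_beta(p, r)
```

Because F1 is a harmonic mean, it is pulled toward the smaller of the two values, so a model cannot obtain a high F1 by maximizing precision alone.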

Experiment result
As shown in Figure 7, P is 1.00, indicating that, on this dataset, every sample the model predicted as positive was in fact positive. This suggests that the model has high confidence when predicting positive classes. R is 0.82, indicating that the model successfully identified 82% of actual positive samples, implying that the model is effective in capturing most positive instances in the dataset.
The P-R index of 0.729 mAP@0.5 means that the area under the Precision-Recall curve is 0.729, and the mean average precision at IOU 0.5 is 0.729.This indicates that the model performs relatively well in binary classification, maintaining high precision and recall simultaneously under certain thresholds.
The F1-Score of 0.63, being the harmonic mean of precision and recall, provides a comprehensive evaluation of the model's ability to balance false positives and false negatives.
As shown in Figure 8, the labels and predictions of batch 0 in the validation dataset agree closely, confirming strong recognition performance on the dataset. Figure 9 shows the box_loss, cls_loss, and dfl_loss on the training and validation sets. The box_loss calculates the error between the predicted box and the annotated box using Complete Intersection over Union (CIoU). The cls_loss computes whether the anchor box and its corresponding classification are correct, while the dfl_loss represents the distribution focal loss. mAP50(B) stands for the mean average precision of bounding-box (B) predictions at an IoU threshold of 0.50, and its formula is as follows:
$$ mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i $$
where $N$ is the number of classes.
The confusion matrix helps to understand the model's classification accuracy for each class and reveals which classes are prone to confusion. It shows that the confusion probability for the "ADS" class is relatively low, indicating high prediction accuracy. On the other hand, there is a higher confusion probability between the "head" class and the "person" class, which may in turn complicate the calculation of the effective aiming duration and the number of effective aiming instances.

CONCLUSION
In this study, deep learning visual algorithms were integrated to identify cheating in FPS games, leading to the acquisition of a series of quantifiable player metrics, such as effective aiming duration, through training on key labels.Comprehensive analyses and visualizations of the target detection results were conducted, leading to the conclusion that the model exhibits exceptional performance on crucial labels (such as ADS and nameplate), demonstrating high accuracy.The convergence curves for precision, recall, and F1-Score also indicated favorable performance, providing a reliable basis for subsequent calculations of metrics like effective aiming duration.
Nevertheless, limitations were observed, including notable fluctuations in the precision convergence curve, which suggest issues such as overfitting and potential for improvement in the recognition rate of the "head" label.Future research directions will be aimed at addressing these challenges to further enhance the model's performance.Efforts will be dedicated to finer parameter tuning, with an exploration of more suitable learning rates and adjustment parameters to mitigate model fluctuations and overfitting.To improve recognition of the head label, the introduction of advanced target detection techniques or adjustments to network structures will be considered to better capture features in the head region.Moreover, the ongoing optimization of the dataset, with the incorporation of more gaming scenarios and cheat variations, is aimed at enhancing the model's robustness.

Figure 1. Common cheating methods
This paper proposes a deep learning and visual analysis-based FPS anti-cheat model, aimed at detecting the use of cheats by analyzing player behavior in game videos. The logical patterns of players' in-game actions are examined, and visual recognition techniques are employed, allowing the model to effectively determine the presence of cheating behavior, irrespective of the cheat type involved. To achieve the anti-cheat objectives, three key challenges must be addressed: recognizing visual disparities, quantifying cheating behavior, and enhancing credibility.

Figure 2. Illustration of the proposed VADNet anti-cheating detection model

Figure 3. Process of the focus module
After this image is subjected to another convolution operation, it results in $y \in \mathbb{R}^{n \times c' \times \frac{w}{2} \times \frac{h}{2}}$, as depicted in Figure 3.

Figure 4. Schematic diagram of submodule construction
The Convolutional Block Layer (CBL) sums over all input channels, multiplying each channel by the corresponding weights and adding biases. In the Res (Residual) unit, feature extraction adds the input of the unit to the output of its convolutional layers.

Figure 6. FPN-PAN structure
In the binary cross-entropy loss, $x$ represents the sample, $y$ represents the label, $a$ represents the predicted output, and $n$ represents the total number of samples. The confidence loss calculation adopts the CIOU_Loss as the bounding box loss function, and the formula is as follows:
$$ L_{CIoU} = 1 - IoU + \frac{\rho^2(b, b^{gt})}{c^2} + \alpha v, \qquad v = \frac{4}{\pi^2} \left( \arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h} \right)^2, \qquad \alpha = \frac{v}{(1 - IoU) + v} $$
where $IoU$ is the intersection over union between the predicted box and the ground truth box, $\rho^2(b, b^{gt})$ represents the squared Euclidean distance between the center points of the predicted box and the ground truth box, and $c$ denotes the diagonal length of the smallest enclosing region that contains both the predicted box and the ground truth box.
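A compact sketch of the CIoU loss described above is given below, following the standard CIoU formulation. Boxes are assumed to be given as (center x, center y, width, height), and the small constant `eps` is added only to avoid division by zero; both are choices made for this illustration.

```python
import math

def ciou_loss(box, gt, eps=1e-9):
    """CIoU loss between a predicted box and a ground truth box, each (cx, cy, w, h)."""
    (x, y, w, h), (xg, yg, wg, hg) = box, gt
    # Corner coordinates of both boxes.
    x1, y1, x2, y2 = x - w / 2, y - h / 2, x + w / 2, y + h / 2
    xg1, yg1, xg2, yg2 = xg - wg / 2, yg - hg / 2, xg + wg / 2, yg + hg / 2
    # Intersection over union.
    inter_w = max(0.0, min(x2, xg2) - max(x1, xg1))
    inter_h = max(0.0, min(y2, yg2) - max(y1, yg1))
    inter = inter_w * inter_h
    union = w * h + wg * hg - inter
    iou = inter / (union + eps)
    # Squared distance between centers and squared diagonal of the enclosing box.
    rho2 = (x - xg) ** 2 + (y - yg) ** 2
    c2 = (max(x2, xg2) - min(x1, xg1)) ** 2 + (max(y2, yg2) - min(y1, yg1)) ** 2
    # Aspect-ratio consistency term and its weight.
    v = (4 / math.pi ** 2) * (math.atan(wg / hg) - math.atan(w / h)) ** 2
    alpha = v / ((1 - iou) + v + eps)
    return 1 - iou + rho2 / (c2 + eps) + alpha * v

same = ciou_loss((5.0, 5.0, 4.0, 2.0), (5.0, 5.0, 4.0, 2.0))
apart = ciou_loss((0.0, 0.0, 2.0, 2.0), (10.0, 0.0, 2.0, 2.0))
```

Identical boxes give a loss close to zero, while non-overlapping boxes give a loss above one because the center-distance penalty still provides a signal when the IoU is zero.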

Figure 7. Evaluation metrics on the dataset
(a) Label results of the validation dataset (b) Prediction results of the validation dataset

Figure 8. Visualization comparison of batch 0 labels and predictions in the validation dataset
$AP_i$ refers to the area under the Precision-Recall curve for class $i$, AP50 signifies that the IoU value is set to 50%, and AP50-95 indicates that the IoU values range from 50% to 95%. The calculation involves taking the mean of the AP values at these IoU levels. Looking at the convergence curves of the loss functions on the validation set: (1) Box loss, val_loss, and dfl_loss exhibit a trend of initially decreasing and then increasing. Their minimum values appear before training for 50 epochs.
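The AP and mAP definitions above can be illustrated with a small sketch. It approximates the area under a precision-recall curve by trapezoidal integration, which is a simplification: common evaluation tools use interpolated variants, and the paper does not say which one was used. The curve points and per-class values are made-up numbers.

```python
def average_precision(recalls, precisions):
    """Area under a precision-recall curve via the trapezoidal rule.

    Points must be sorted by increasing recall. This is a simplified stand-in
    for the interpolated AP used by common evaluation tools.
    """
    area = 0.0
    for i in range(1, len(recalls)):
        area += (recalls[i] - recalls[i - 1]) * (precisions[i] + precisions[i - 1]) / 2
    return area

def mean_average_precision(ap_per_class):
    """Mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)

ap = average_precision([0.0, 0.5, 1.0], [1.0, 1.0, 0.5])
m_ap = mean_average_precision([ap, 0.625])
```

Averaging over classes gives each class equal weight regardless of how many samples it has, so a poorly detected rare class lowers mAP as much as a poorly detected common one.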

Figure 9. Loss function convergence curve
Figure 10 shows the confusion matrix during the prediction process. Each row of this confusion matrix corresponds to the true class, and each column corresponds to the predicted class. Each element in the matrix indicates, among all samples that the model predicts as a given class, the percentage that belong to a specific true class.
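A confusion matrix with the layout described above (rows for true classes, columns for predicted classes) can be built and normalized per predicted class as follows. The class names match labels mentioned in the paper, but the label sequences are invented for illustration.

```python
def confusion_matrix(true_labels, pred_labels, classes):
    """Count matrix with rows indexed by true class and columns by predicted class."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_labels, pred_labels):
        matrix[index[t]][index[p]] += 1
    return matrix

def normalize_columns(matrix):
    """Divide each column by its sum, giving fractions per predicted class."""
    sums = [sum(row[j] for row in matrix) for j in range(len(matrix[0]))]
    return [
        [row[j] / sums[j] if sums[j] else 0.0 for j in range(len(row))]
        for row in matrix
    ]

classes = ["ADS", "head", "person"]
# Invented labels for illustration only.
y_true = ["ADS", "ADS", "head", "head", "person", "person"]
y_pred = ["ADS", "ADS", "head", "person", "person", "person"]
counts = confusion_matrix(y_true, y_pred, classes)
fractions = normalize_columns(counts)
```

In this toy example one "head" sample is predicted as "person", so the "person" column contains a non-zero off-diagonal entry, which is the kind of pattern the analysis of Figure 10 looks for.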

Figure 10. Confusion matrix
Initially, the aspect ratio is calculated between the bounding boxes and the 9 anchor boxes. If the maximum aspect ratio of the anchor boxes is greater than the minimum aspect ratio of the bounding boxes, it is considered a successful match. If the probability of a successful match is less than 98%, a genetic algorithm and k-means are employed to recalculate the anchors, and the anchor box with the highest success rate is saved. Adaptive image scaling techniques have also been incorporated.
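Since the scaling formula itself is not recoverable from the text, the sketch below shows one common form of adaptive image scaling, often called letterboxing: the image is resized with its aspect ratio preserved and then padded to the square network input. The 640-pixel target comes from the experiment settings; the function name and the symmetric padding are assumptions.

```python
def letterbox_params(width, height, target=640):
    """Return (scale, new_w, new_h, pad_x, pad_y) for an aspect-preserving resize with padding.

    Symmetric padding is an assumption; the paper does not specify the padding scheme.
    """
    # Use the smaller ratio so that the whole image fits inside the target square.
    scale = min(target / width, target / height)
    new_w, new_h = round(width * scale), round(height * scale)
    # Remaining space is split evenly between the two sides.
    pad_x, pad_y = (target - new_w) / 2, (target - new_h) / 2
    return scale, new_w, new_h, pad_x, pad_y

scale, new_w, new_h, pad_x, pad_y = letterbox_params(1920, 1080)
```

For a 1920×1080 frame the image is scaled to 640×360 and padded by 140 pixels at the top and bottom, which avoids distorting the shapes of on-screen objects.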

Table 2. Basic statistics of the dataset

Table 3. Different recognition scenarios