Brain Tumor MRI Segmentation Method Based on Segment Anything Model

ABSTRACT


INTRODUCTION
Brain tumors are among the tumors with exceedingly high incidence and mortality rates, constituting over 85% of all primary central nervous system tumors globally and accounting for approximately 2% to 3% of cancer-related deaths. Such tumors pose a significant threat to human health [1]. Consequently, the early diagnosis and treatment of brain tumors are crucial. In clinical practice, brain MRI is commonly employed for the examination and diagnosis of patients [2]. MRI, a noninvasive imaging technology, can clearly depict soft tissue lesions and is extensively used in the diagnosis and treatment of brain tumor diseases. To obtain accurate and comprehensive segmentation information, brain tumor segmentation typically requires multimodal MRI scan datasets with varying imaging parameters, as illustrated in Figure 1, which presents brain tumor images in different modalities. These modalities capture distinct pathological information and effectively complement one another.
The segmentation of brain tumors in MRI scans is a critical task in medical image segmentation [3]. The objective of brain tumor segmentation is to precisely locate different types of tumor regions within medical images, as demonstrated in Figure 2. The segmented areas include the necrotic tumor core (NCR), the peritumoral edema (ED), and the enhancing tumor (ET). These distinct regions provide vital references for clinical practice. Brain tumors are highly heterogeneous, exhibiting variable grayscale values and irregular shapes in MRI. Therefore, developing precise and reliable methods for brain tumor MRI segmentation remains a challenging endeavor.

Figure 1. Brain tumor images in different modalities
In recent years, with the advancement of AI technology and computational power, foundation models have increasingly played a significant role in the field of natural language processing (NLP), exemplified by ChatGPT (Chat Generative Pre-trained Transformer) and GPT-4.0 [4]. These large language models have gradually impacted the field of computer vision.

Figure 2. Brain tumor MRI segmentation tasks
Recently, Kirillov et al. [5] introduced the Segment Anything Model (SAM), a foundation model for image segmentation, achieving groundbreaking advancements in the field of computer vision. SAM is celebrated for its exceptional zero-shot transfer capabilities, enabling segmentation of any object within any image without the need for annotations, and it has demonstrated commendable results on natural images.
Several researchers have embarked on investigations into the capabilities of SAM in downstream image segmentation tasks. Ding et al. [6] applied SAM to the segmentation of very high resolution (VHR) remote sensing images, proposing the SAM-CD model for change detection (CD) in remote sensing image segmentation and achieving accuracy surpassing that of state-of-the-art (SOTA) methods. Ahmadi et al. [7] utilized SAM for the assessment of civil infrastructure, employing it to detect cracks in concrete structures. By integrating SAM with the U-net model, they obtained more accurate and comprehensive crack detection results. These studies indicate that fine-tuning and improvements to SAM can enhance its segmentation performance in downstream tasks. However, due to the complexity and specificity of medical image segmentation tasks, the suitability of SAM for medical image segmentation requires further exploration.
Therefore, this study investigates the segmentation performance of SAM in brain tumor MRI, examining its effectiveness for this task. To enhance the precision of brain tumor MRI segmentation and augment the generalizability of medical image segmentation, a method based on SAM is proposed. After the image encoder outputs features, the Transformer features are reshaped through feature mapping. CNN features are then obtained through three layers of 3×3 convolution operations. To better fuse local and global features, a Feature Fusion Block (FFB) is employed between the CNN and Transformer features for feature fusion and correction, yielding fused features with superior representational capability. Experimental results demonstrate that the proposed method achieves better segmentation accuracy than SAM alone.
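The reshape-then-convolve branch described above can be sketched in PyTorch as follows. The module name, channel width, and activation placement are illustrative assumptions, not the authors' exact configuration:

```python
import torch
import torch.nn as nn

class CNNBranch(nn.Module):
    """Map ViT token features back to a 2D feature map, then apply
    three 3x3 convolutions to extract local CNN features.
    Channel width and ReLU placement are assumptions."""
    def __init__(self, dim=256):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(dim, dim, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(dim, dim, kernel_size=3, padding=1),
        )

    def forward(self, tokens, h, w):
        # tokens: (B, h*w, C) Transformer output -> (B, C, h, w) map
        b, n, c = tokens.shape
        fmap = tokens.transpose(1, 2).reshape(b, c, h, w)
        return self.convs(fmap)
```

The resulting CNN features would then be passed, together with the original Transformer features, into the FFB described below in the Methodology.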

Foundation model theory
In recent years, AI large models have seen rapid development in the field of NLP. "AI large model" is short for "AI pre-training large model," encompassing both "pre-training" and "large model" and combining them into a new paradigm in AI. Specifically, models undergo pre-training on large-scale datasets, enabling them to support various applications directly, without fine-tuning or with only minimal data adjustment. In 2021, Bommasani et al. [8] proposed the concept of foundation models. Models based on self-supervised learning showcase diverse capabilities throughout the learning process. These capabilities provide both momentum and theoretical underpinnings for downstream applications, leading to the designation of these large models as foundation models.
The advent of foundation models has significantly enhanced the generalization capabilities of models, allowing them to process target tasks from different sources. Numerous milestone models have been introduced to date. In the NLP domain, the most renowned foundation models are the GPT series developed by OpenAI [9]. Adopting a pre-training plus fine-tuning approach, models trained on extensive corpora have demonstrated outstanding performance across a variety of NLP tasks, including text classification, machine translation, and summary generation. With the rapid development of NLP and multimodal fields, several emerging foundation models have also been proposed in the field of computer vision.

SAM
In April 2023, Kirillov et al. [5] introduced SAM, a foundation model for image segmentation. Designed and trained to be promptable, SAM facilitates zero-shot transfer to new image distributions and tasks, achieving instance segmentation without the need for annotations, and has produced commendable results on natural images [10]. The framework of SAM is depicted in Figure 3.
SAM frames segmentation in terms of three intertwined components: task, model, and data. Initially, SAM defines a segmentation task that is sufficiently universal to provide a robust pre-training objective. The model comprises an image encoder and a prompt encoder, combining these two sources of information within a lightweight mask decoder to predict segmentation masks. Subsequently, the model is trained on a diversified, large-scale dataset.
Following the introduction of SAM, numerous researchers have explored its segmentation capabilities. For instance, to enhance SAM's interactivity, Dai et al. [11] proposed SAMAug, which generates additional point prompts without requiring further manual intervention on SAM. To reduce SAM's inference time, Zhang et al. [12] introduced EfficientViT-SAM, replacing SAM's image encoder with EfficientViT and thoroughly evaluating it across a series of zero-shot benchmarks. EfficientViT-SAM offers significant improvements in performance and efficiency over all previous SAM variants. Song et al. [13] proposed the scalable bias-mode attention mask for SAM (BA-SAM) to enhance SAM's adaptability across different image resolutions without structural modifications. Through several rounds of fine-tuning on downstream tasks, BA-SAM achieves state-of-the-art accuracy across all datasets. Rajič et al. [14] effectively extended SAM's capabilities to the video domain, introducing the SAM-Point Tracking (SAM-PT) model for object tracking and segmentation in dynamic videos. SAM-PT utilizes sparse point selection and point propagation techniques to generate masks, leveraging local structural information unrelated to object semantics. Experimental results demonstrate that SAM-PT produces robust zero-shot performance on popular video object segmentation benchmarks, including DAVIS, YouTube-VOS, and MOSE.

Application of SAM in medical image segmentation
The advantages of SAM in the field of natural image segmentation are evident. The transferability and zero-shot segmentation capabilities of SAM are of significant importance for medical imaging, suggesting that SAM's application in medical image segmentation could effectively assist physicians in automated disease diagnosis and screening. Several researchers have explored the role of SAM in medical image segmentation, applying it to downstream segmentation tasks.
Zhang and Wang [15] evaluated the performance of SAM on the BraTS2019 dataset, finding that a gap remains between SAM and the SOTA models without any model fine-tuning. Zhang and Jiao [16] discussed the potential of SAM in future medical imaging, indicating that SAM does not yield satisfactory segmentation results on many publicly available medical image datasets. Mattjie et al. [17] explored the functionality of SAM in 2D medical imaging, validating its performance across six different datasets spanning four imaging modalities: X-ray, ultrasound, dermatoscopy, and colonoscopy. The results suggested that SAM could achieve better segmentation outcomes by increasing the number of prompt points and bounding boxes. Hu et al. [18] proposed SkinSAM, a method for skin cancer segmentation, which was validated on the HAM10000 dataset. By fine-tuning the model (ViT_b_finetuned), an average pixel accuracy of 0.945, an average Dice score of 0.8879, and an average Intersection over Union (IoU) score of 0.7843 were achieved.
These studies reveal significant variations in the effectiveness of SAM across different medical image segmentation tasks, highlighting the immense research potential in the domain of medical image segmentation.

METHODOLOGY
Due to the impact of noise, field shift effects, and other factors on MRI, the intensity values of the same tissue are often uneven. While SAM is capable of segmenting the ET in most cases, it struggles to effectively segment the NCR and the ED. Thus, enhancements were made to the foundational SAM in this study, enabling the extraction of deeper-level features.

FFB
The FFB principally consists of three steps: channel self-attention, spatial self-attention, and fusion, as illustrated in Figure 5.
Upon obtaining the channel self-attention feature $F_{ch}$, it is reshaped and then input for spatial self-attention weighting. Through parallel max and average pooling operations, $F_{max}$ and $F_{avg}$ are obtained from $F_{ch}$:

$$F_{max} = \mathrm{Max}(F_{ch}), \quad F_{avg} = \mathrm{Avg}(F_{ch}) \tag{3}$$

Subsequently, a 1×1 convolution is applied to the concatenation of $F_{max}$ and $F_{avg}$. Through a sigmoid operation, the spatial weight map $W_{sp}$ is derived:

$$W_{sp} = \sigma\left(\mathrm{Conv}_{1\times 1}\left([F_{max}; F_{avg}]\right)\right)$$

$W_{sp}$ is applied to the spatial self-attention input feature to obtain the spatially weighted feature $F_{sp}$:

$$F_{sp} = W_{sp} \otimes F_{ch}$$

where $\otimes$ represents element-wise multiplication, $\sigma(\cdot)$ denotes the sigmoid operation, and $\mathrm{Max}(\cdot)$ and $\mathrm{Avg}(\cdot)$ signify max and average pooling, respectively. Finally, for feature correction, the fusion module performs the final adjustment. The fusion input $F_{sp} \in \mathbb{R}^{C \times H \times W}$ undergoes a 1×1 convolution and max pooling for downsampling; after ReLU activation, it is upsampled through a 1×1 convolution and linear interpolation to restore the feature resolution. The fused output $F_{fuse}$ can be obtained by:

$$F_{fuse} = \mathrm{Up}\left(\mathrm{ReLU}\left(\mathrm{Down}(F_{sp})\right)\right)$$
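A minimal PyTorch sketch of the spatial self-attention and fusion steps described above, assuming a CBAM-style arrangement of the pooling and 1×1 convolutions; module names and exact layer choices are assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAttention(nn.Module):
    """Spatial weighting: channel-wise max and average pooling in
    parallel, a 1x1 convolution, a sigmoid, then element-wise
    multiplication with the input feature."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=1)

    def forward(self, f_ch):
        f_max, _ = f_ch.max(dim=1, keepdim=True)   # (B, 1, H, W)
        f_avg = f_ch.mean(dim=1, keepdim=True)     # (B, 1, H, W)
        w_sp = torch.sigmoid(self.conv(torch.cat([f_max, f_avg], dim=1)))
        return f_ch * w_sp                          # F_sp

class FusionCorrection(nn.Module):
    """Fusion/correction: 1x1 conv + max pooling for downsampling,
    ReLU, then 1x1 conv + bilinear interpolation back to the input
    resolution."""
    def __init__(self, dim):
        super().__init__()
        self.down = nn.Sequential(nn.Conv2d(dim, dim, 1), nn.MaxPool2d(2))
        self.up = nn.Conv2d(dim, dim, 1)

    def forward(self, f_sp):
        y = F.relu(self.down(f_sp))
        y = self.up(y)
        return F.interpolate(y, size=f_sp.shape[-2:], mode="bilinear",
                             align_corners=False)
```

Both modules preserve the spatial resolution of their input, so they can be dropped between the CNN and Transformer feature maps without changing downstream shapes.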

Dataset
The performance of the model was evaluated using the BraTS2021 dataset.BraTS2021 is a large-scale multimodal brain glioma MRI segmentation dataset comprising 2040 cases, including 1251 cases in the training set, 219 cases in the validation set, and the remainder in the test set.Each case contains four modalities: T1, T1ce, T2, and FLAIR, with each modality having dimensions of 240×240×155 (L×W×H).The annotations in BraTS2021 primarily include the ET, the ED, and the NCR.
Since only the training set provides ground-truth segmentation masks, making it more suitable for the segmentation method used in this study, the model was evaluated on the training set. A single MRI sequence was used as the input to assess the accuracy of model segmentation. Because physicians are more concerned with the tumor core (TC) location in clinical treatment, TC segmentation was examined on the contrast-enhanced T1-weighted (T1ce) sequence, considering the characteristics of each MRI modality.

Experimental process
The experiments were run on a workstation with 8 NVIDIA A100 GPUs, using Python 3.10.0, PyTorch 1.10.1, and CUDA 11.1 for local execution. The model was trained for 100 epochs. MRI voxel intensities were normalized to the range 0 to 255 by dividing by the maximum intensity of each 3D volume and then multiplying by 255.
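The per-volume normalization described above can be written in a few lines of NumPy; the function name is illustrative:

```python
import numpy as np

def normalize_volume(vol):
    """Scale voxel intensities of one 3D MRI volume to [0, 255] by
    dividing by the volume's maximum intensity and multiplying by
    255, as described in the text."""
    vol = vol.astype(np.float32)
    m = vol.max()
    return vol / m * 255.0 if m > 0 else vol
```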
Specifically, the MRI volumes were divided into two-dimensional slices along the axial plane. A pre-trained ViT-B encoder [19] was employed as the image encoder to compute all image embeddings. The AdamW optimizer (β1 = 0.9, β2 = 0.999) [20] was selected, with an initial learning rate of 1e-5 and a weight decay of 0.1. A cosine annealing learning rate scheduler was used to smoothly decrease the learning rate from its maximum to a minimum value (1e-7).
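The optimizer and scheduler configuration above maps directly onto PyTorch's built-ins; the `Linear` model here is only a stand-in for the actual SAM-based network:

```python
import torch

model = torch.nn.Linear(4, 2)  # stand-in for the SAM-based model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5,
                              betas=(0.9, 0.999), weight_decay=0.1)
# Anneal the learning rate from 1e-5 down to 1e-7 over 100 epochs.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=100, eta_min=1e-7)
```

Calling `scheduler.step()` once per epoch reproduces the smooth cosine decay to the 1e-7 floor.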
The Partial Encoder Fine-Tuning (PEFT) technique was applied to fine-tune the model, keeping the encoder parameters (image and prompt encoders) frozen and updating only the gradients of the decoder. This approach enhances the model's performance with limited data and computational resources by reducing the number of parameters that need to be optimized during training.
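A minimal sketch of this freezing strategy, assuming the submodule names used by the official SAM implementation (`image_encoder`, `prompt_encoder`, `mask_decoder`):

```python
import torch

def freeze_encoders(sam_model):
    """Freeze the image and prompt encoder parameters; leave only the
    mask decoder trainable, as in the PEFT setup described above."""
    for name, p in sam_model.named_parameters():
        p.requires_grad = name.startswith("mask_decoder")
```

After calling this, the optimizer can be built from `filter(lambda p: p.requires_grad, model.parameters())` so the frozen encoder weights are never updated.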

Evaluation metrics
To evaluate the accuracy of the segmentation model, the Dice Similarity Coefficient (DSC), Hausdorff Distance (HD), and Average Symmetric Surface Distance (ASSD) were employed as evaluation metrics.
The DSC measures the similarity between the predicted and true segmentation results. Similar to the IoU, its range is from 0 to 1, with 1 indicating maximum similarity between prediction and truth. The DSC is calculated as follows:

$$DSC(A, B) = \frac{2\,|A \cap B|}{|A| + |B|}$$

The HD first calculates the minimum distance from each point in set A to set B, and then selects the maximum value among these distances. To mitigate the influence of outliers, HD95 takes the 95th percentile of all these distances instead of the maximum. The directed HD is calculated using the formula below:

$$h(A, B) = \max_{a \in A} \min_{b \in B} d(a, b) \tag{9}$$

where $d(a, b)$ represents the distance between points $a$ and $b$.
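For small binary masks, both metrics can be computed with a dense pairwise-distance approach; this is an illustrative implementation, not the evaluation code used in the paper:

```python
import numpy as np
from scipy.spatial.distance import cdist

def dice(pred, gt):
    """DSC = 2|A ∩ B| / (|A| + |B|) for binary masks."""
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum())

def hd95(pred, gt):
    """95th percentile of the symmetric point-to-set distances
    between the two masks' foreground coordinates."""
    a = np.argwhere(pred)
    b = np.argwhere(gt)
    d = cdist(a, b)
    d_ab = d.min(axis=1)   # each point in A to its nearest point in B
    d_ba = d.min(axis=0)   # each point in B to its nearest point in A
    return np.percentile(np.hstack([d_ab, d_ba]), 95)
```

For full 240×240×155 volumes, a distance-transform-based implementation would be preferable to the quadratic `cdist` used here.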
The ASSD measures the degree of surface alignment in the segmentation results. It calculates the minimum distance from each point on the segmented surface to the ground truth surface, and vice versa, taking the average over these two sets of minimum distances. The smaller the value, the better the segmentation performance. The ASSD can be represented by the formula:

$$ASSD(A, B) = \frac{1}{|S(A)| + |S(B)|} \left( \sum_{v \in S(A)} d\left(v, S(B)\right) + \sum_{v \in S(B)} d\left(v, S(A)\right) \right)$$

where $S(A)$ denotes the surface voxels of set A, and $d(v, S(A))$ represents the shortest distance from any voxel $v$ to $S(A)$.
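Given the two sets of surface voxel coordinates, the ASSD formula above translates directly to code; this dense sketch assumes the surfaces have already been extracted as coordinate arrays:

```python
import numpy as np
from scipy.spatial.distance import cdist

def assd(surf_a, surf_b):
    """Average symmetric surface distance between two sets of surface
    voxel coordinates, each an array of shape (N, dim)."""
    d = cdist(surf_a, surf_b)
    total = d.min(axis=1).sum() + d.min(axis=0).sum()
    return total / (len(surf_a) + len(surf_b))
```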

Experimental results and analysis
This study evaluated the predictive accuracy of the model on three nested subregions: the ET, the TC (ET + NCR + NET, where NET denotes the non-enhancing tumor), and the whole tumor (WT) (i.e., TC + ED). The segmentation results of U-net, Unet++, ResUnet, TransUnet, and SAM on the BraTS2021 dataset were compared. To contrast the interactive performance of SAM, segmentation of brain tumors was conducted using 2 prompt points, 10 prompt points, and a fully automatic segmentation approach. The results with 10 prompt points surpassed those with 2 prompt points, indicating that a greater number of prompt points leads to improved final segmentation outcomes. The results are displayed in Table 1. The segmentation method of this study outperforms the original SAM prompt-point and automatic segmentation methods, with the DSC surpassing the best of the compared outcomes. SAM demonstrated the best segmentation effects on the TC, followed by the WT, and lastly the ET. This indicates that SAM exhibits superior performance on objects with clearer boundaries, as the TC has the clearest boundaries among the three regions. The proposed model's best segmentation effects were observed in the WT region, as shown in Figure 6.

CONCLUSION AND PROSPECT
The automatic segmentation of brain tumors is crucial for clinical diagnosis and treatment. This article adds convolution and feature mapping to the original architecture of SAM, allowing for the extraction of deeper image features. By fusing the convolved features with the Transformer features from the image encoder, precise segmentation of brain tumors has been achieved. Experimental verification shows that the method proposed in this paper achieves better accuracy in brain tumor segmentation than the original SAM. In addition, this study investigated the segmentation results for different numbers of SAM prompt points, indicating that increasing the number of prompt points helps improve segmentation results.
However, only a single MRI sequence was validated in this study, without considering the contextual information of adjacent slices. Future researchers can therefore explore the use of SAM for multimodal brain tumor segmentation, extending 2D segmentation results to 3D. This progress will help doctors diagnose and treat brain tumor patients before surgery.

Figure 3. Framework of the SAM

Figure 5. FFB

Initially, combined features $F_c \in \mathbb{R}^{2C \times H \times W}$ are obtained through the aggregation of the CNN features $F_{cnn}$ and the Transformer features $F_{trans}$. The channel self-attention feature $F_{ch} \in \mathbb{R}^{2C \times H \times W}$ is then obtained by passing $F_c$ through the channel self-attention module.

Figure 6. Visualization of segmentation results

As Figure 7 shows, for two heterogeneous tumor regions, more prompt points are needed; in this study, 2 positive sample points and 3 negative sample points were utilized. However, the segmentation results for the ET and ED were still suboptimal.

Figure 7. Visualization of segmentation results of two heterogeneous tumor regions

Table 1. Comparison of model segmentation results