Objective

The purpose of this model is promptable segmentation: unlike free-form, human-written text prompts, the prompts here indicate points or regions of interest in the image. During training, prompts are derived from the ground-truth masks paired with the original images; the prompt encoder embeds them, and the mask decoder uses these embeddings to produce the correct output. This design helps the model capture subtle signals in medical imaging. Beyond its primary function, each component of the model can also be used on its own; for example, the image encoder can be extracted to generate image embeddings for multimodal tasks.

Model Architecture

The model architecture is based on the Segment Anything Model (SAM). It can be divided into four parts: the image encoder, the prompt encoder, the connecting layer, and the mask decoder. During training, the pre-trained weights of the image encoder were frozen and only its adapter layers received gradients, which preserves the benefit of pre-training while substantially reducing the number of trainable parameters.
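As a concrete illustration of this training scheme, the sketch below freezes every parameter except the adapter layers. The assumption that adapter parameters can be identified by the substring "adapter" in their names is ours, not a detail stated in the source.

```python
import torch.nn as nn

def freeze_all_but_adapters(model: nn.Module) -> None:
    """Freeze all parameters except adapter layers, so only the
    adapters receive gradients during fine-tuning."""
    for name, param in model.named_parameters():
        # Assumption: adapter parameters contain "adapter" in their name.
        param.requires_grad = "adapter" in name

    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"trainable parameters: {trainable}/{total}")
```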

Model Architecture 2D

Model Architecture 3D

The prompt encoder takes points, boxes, or masks as input; during training, these prompts are derived from the paired input images and mask images in the training set. By updating the prompt encoder's parameters during training, the model learns to better capture the correspondence between the two.
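A minimal sketch of how a box prompt can be derived from a ground-truth mask during training. The jitter heuristic and function name are illustrative assumptions, not details confirmed by the source.

```python
import numpy as np

def box_prompt_from_mask(mask: np.ndarray, jitter: int = 5) -> np.ndarray:
    """Return a bounding-box prompt (x0, y0, x1, y1) enclosing the
    foreground of a binary mask, with small random jitter to mimic
    an imprecise user-drawn box."""
    ys, xs = np.nonzero(mask)
    x0, y0 = xs.min(), ys.min()
    x1, y1 = xs.max(), ys.max()
    h, w = mask.shape
    rng = np.random.default_rng()
    # Expand each side by up to `jitter` pixels, clipped to the image bounds.
    x0 = max(0, x0 - rng.integers(0, jitter + 1))
    y0 = max(0, y0 - rng.integers(0, jitter + 1))
    x1 = min(w - 1, x1 + rng.integers(0, jitter + 1))
    y1 = min(h - 1, y1 + rng.integers(0, jitter + 1))
    return np.array([x0, y0, x1, y1])
```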

The connecting layer fuses the image embeddings produced by the image encoder with the prompt embeddings, and the mask decoder then produces the corresponding segmentation of the image.
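This data flow can be sketched with the reference segment-anything package (module names and call signatures follow that library; the checkpoint filename, input tensors, and box coordinates are illustrative):

```python
import torch
from segment_anything import sam_model_registry

# Checkpoint filename is illustrative; substitute the weights you actually use.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")

image = torch.randn(1, 3, 1024, 1024)           # preprocessed input image
box = torch.tensor([[100., 100., 400., 400.]])  # (B, 4) box prompt

with torch.no_grad():
    image_embedding = sam.image_encoder(image)  # (1, 256, 64, 64)
    sparse_emb, dense_emb = sam.prompt_encoder(points=None, boxes=box, masks=None)
    low_res_masks, iou_pred = sam.mask_decoder(
        image_embeddings=image_embedding,
        image_pe=sam.prompt_encoder.get_dense_pe(),  # dense positional encoding
        sparse_prompt_embeddings=sparse_emb,
        dense_prompt_embeddings=dense_emb,
        multimask_output=False,
    )
```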

The model starts from weights pre-trained on large-scale, publicly available natural-image datasets and is then fine-tuned on medical segmentation datasets.
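A minimal fine-tuning loop under this scheme might look as follows. The `model`, `train_loader`, and `seg_loss` objects are placeholders (see the Loss Function section for a sketch of the loss), and `freeze_all_but_adapters` is the hypothetical helper shown earlier; the call signature of `model` is illustrative.

```python
import torch

# Placeholder fine-tuning loop: only adapter parameters are optimized.
freeze_all_but_adapters(model)
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)

for images, gt_masks, prompts in train_loader:
    pred_masks = model(images, prompts)   # illustrative call signature
    loss = seg_loss(pred_masks, gt_masks)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```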

Results

We have deployed the model as an online image segmentation service that can be used directly. Users upload images of interest, and the service automatically segments them and returns the segmented images, which can then be viewed in a local or online image viewer.

Methods

Network Structure

The network utilized in this study was built on the transformer architecture, which has demonstrated remarkable effectiveness in various domains. Specifically, the network incorporated a vision transformer (ViT)-based image encoder responsible for extracting image features, a prompt encoder for integrating user interactions (bounding boxes), and a mask decoder that generated segmentation results using the image embedding, prompt embedding, and output token.
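For inference with a bounding-box interaction, the reference segment-anything package exposes a high-level predictor; a usage sketch follows (the checkpoint filename, stand-in image, and box coordinates are illustrative):

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)

image = np.zeros((512, 512, 3), dtype=np.uint8)  # stand-in for a real HxWx3 image
predictor.set_image(image)                       # runs the ViT image encoder once

# Segment the region inside a user-supplied bounding box (x0, y0, x1, y1).
masks, scores, low_res_logits = predictor.predict(
    box=np.array([100, 100, 400, 400]),
    multimask_output=False,
)
```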

Loss Function

In this study, we employed the unweighted sum of cross-entropy loss and dice loss as our final loss function. This approach has demonstrated remarkable robustness across a wide array of medical image segmentation tasks.
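A minimal PyTorch sketch of this combined loss, assuming binary (foreground/background) segmentation with logits as input; the smoothing term `eps` in the Dice ratio is our addition for numerical stability.

```python
import torch
import torch.nn.functional as F

def ce_dice_loss(logits: torch.Tensor, target: torch.Tensor,
                 eps: float = 1e-6) -> torch.Tensor:
    """Unweighted sum of binary cross-entropy and Dice loss.

    logits: raw model outputs; target: binary ground truth of the
    same shape."""
    ce = F.binary_cross_entropy_with_logits(logits, target.float())
    probs = torch.sigmoid(logits)
    intersection = (probs * target).sum()
    dice = (2 * intersection + eps) / (probs.sum() + target.sum() + eps)
    return ce + (1 - dice)
```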

Let \( S \) and \( G \) represent the predicted segmentation result and the ground truth, respectively. Here, \( s_i \) and \( g_i \) denote the predicted segmentation and the ground truth for voxel \( i \), respectively.
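The equations themselves do not survive in this copy; with the symbols defined above and \( N \) denoting the number of voxels, the standard forms of the two terms and their unweighted sum are (the exact variant used by the authors is not shown here):

\[
\mathcal{L}_{\mathrm{CE}} = -\frac{1}{N}\sum_{i=1}^{N}\bigl[g_i \log s_i + (1 - g_i)\log(1 - s_i)\bigr],
\qquad
\mathcal{L}_{\mathrm{Dice}} = 1 - \frac{2\sum_{i=1}^{N} s_i g_i}{\sum_{i=1}^{N} s_i + \sum_{i=1}^{N} g_i},
\]

\[
\mathcal{L} = \mathcal{L}_{\mathrm{CE}} + \mathcal{L}_{\mathrm{Dice}}.
\]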