1) Data Collection
AFP Collection
AFP data was collected from antifungal peptide databases, including DBAASP, dbAMP, DRAMP, APD3, CAMPR4, and LAMP2. Peptide sequences containing 2-50 amino acids were selected, resulting in a total of 6,385 AFP sequences, of which 3,905 had associated MIC data. Among these, 3,294 MIC data points were measured against Candida albicans.
Non-AFP Collection
The non-AFP dataset was downloaded from UniProt after removing any entries annotated with the following keywords: antimicrobial, antibiotic, antiviral, antifungal, effector, or excreted. Peptides containing 3-50 amino acids were selected, resulting in a total of 3,828,616 non-AFP sequences.
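As an illustration, the filtering step might look like the sketch below. The file name is hypothetical, and the assumption that the keywords can be matched in the FASTA headers is a simplification (real UniProt exports keep keyword annotations in separate fields, so the matching should be adapted to the export format):

```python
# Minimal sketch of the non-AFP filtering step.
EXCLUDE = ("antimicrobial", "antibiotic", "antiviral",
           "antifungal", "effector", "excreted")

def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, seq = None, []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(seq)
                header, seq = line[1:], []
            else:
                seq.append(line)
        if header is not None:
            yield header, "".join(seq)

non_afp = [
    seq for header, seq in read_fasta("uniprot.fasta")  # hypothetical path
    if 3 <= len(seq) <= 50                               # length filter
    and not any(kw in header.lower() for kw in EXCLUDE)  # keyword filter
]
```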
2) Data Splitting
The collected data was split into training, test, and validation sets in an 8:1:1 ratio.
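For example, the split can be performed with scikit-learn's train_test_split; the random seed and the stratification on labels are illustrative choices, not taken from the original protocol, and `sequences`/`labels` are the assumed outputs of the collection step:

```python
# 8:1:1 split: hold out 20%, then split that half into test and validation.
from sklearn.model_selection import train_test_split

train_seqs, rest_seqs, train_y, rest_y = train_test_split(
    sequences, labels, test_size=0.2, random_state=42, stratify=labels)
test_seqs, val_seqs, test_y, val_y = train_test_split(
    rest_seqs, rest_y, test_size=0.5, random_state=42, stratify=rest_y)
```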
3) Data Preprocessing
AFP and non-AFP datasets were converted into fixed-size vectors, with the 20 standard amino acids (AA) represented as numeric values from 1 to 20:
If a sequence had fewer than 50 AAs, it was padded with zeros.
A label of 1 or 0 was added as a classification tag to indicate whether the sequence was an AFP or a non-AFP; for the quantitative dataset, the MIC value was used as the label instead.
Sequences were tokenized with the model's tokenizer, and sequence beginnings and ends were marked with [CLS] and [SEP] tokens.
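The two representations above can be sketched as follows. The amino-acid ordering in the vocabulary, the DeBERTa checkpoint name, the example peptide, and the space-separated-residue convention are assumptions for illustration:

```python
# Fixed-size integer encoding (residues -> 1..20, zero-padded to 50).
from transformers import AutoTokenizer

AA_VOCAB = {aa: i + 1 for i, aa in enumerate("ACDEFGHIKLMNPQRSTVWY")}
MAX_LEN = 50

def encode_fixed(seq):
    """Map a peptide (<= 50 residues) to a length-50 integer vector."""
    ids = [AA_VOCAB[aa] for aa in seq]
    return ids + [0] * (MAX_LEN - len(ids))

# Transformer-style tokenization; the checkpoint name is an assumption.
# Space-separating residues makes each amino acid its own token, and the
# tokenizer adds [CLS]/[SEP] automatically.
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
batch = tokenizer(
    " ".join("GIGKFLHSAKKFGKAFVGEIMNS"),          # example peptide
    padding="max_length", max_length=MAX_LEN + 2,  # room for [CLS]/[SEP]
    truncation=True, return_tensors="pt")
```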
The pre-trained DeBERTa model from the Hugging Face Transformers library was loaded and fine-tuned for the antifungal peptide sequence task.
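A minimal loading sketch follows; the exact checkpoint is not stated in the source, so microsoft/deberta-v3-base is an assumption. In the Transformers convention, num_labels=1 turns the classification head into a regression head:

```python
# Load two copies of the pre-trained backbone: one classifier, one regressor.
from transformers import AutoModelForSequenceClassification

clf_model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-base", num_labels=2)   # qualitative: AFP vs non-AFP
reg_model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-base", num_labels=1)   # quantitative: MIC regression
```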
Model Fine-Tuning
Based on previous reports (Ma et al., 2022), both a qualitative (classification) model and a quantitative (MIC regression) model were trained.
The models were trained using PyTorch with the Adam optimizer (default parameters otherwise), a batch size of 64, and a learning rate of 2×10⁻⁵.
Early stopping was applied to prevent overfitting: the validation loss was monitored, and training stopped once it began to increase.
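A minimal sketch of this loop, reusing clf_model from the loading step above and assuming DataLoaders (train_loader, val_loader, hypothetical names) that yield dicts of tokenized inputs including a "labels" tensor; the epoch cap is illustrative:

```python
import torch

optimizer = torch.optim.Adam(clf_model.parameters(), lr=2e-5)
best_val = float("inf")

for epoch in range(20):                       # upper bound on epochs
    clf_model.train()
    for batch in train_loader:
        loss = clf_model(**batch).loss        # HF returns loss when labels given
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    clf_model.eval()
    with torch.no_grad():
        val_loss = sum(clf_model(**b).loss.item() for b in val_loader)

    if val_loss >= best_val:                  # stop once validation loss rises
        break
    best_val = val_loss
    torch.save(clf_model.state_dict(), "best_model.pt")
```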
Additionally, 10-fold cross-validation was employed: the training set was split into ten parts, with one part used for validation and the remaining nine for training in each fold. Hyperparameters were kept consistent across folds, and the average training and validation losses of the 10 models were used to assess performance. After identifying the optimal hyperparameters, the entire dataset was used to train one final model.
Hyperparameters such as the learning rate and batch size were adjusted during training: the learning rate was tuned between 1×10⁻⁶ and 5×10⁻⁵, and the batch size between 32 and 128.
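A sketch of the combined 10-fold/grid protocol; the intermediate grid values within the stated ranges and the train_one_fold helper (a hypothetical wrapper around the training loop above, returning a fold's final validation loss) are assumptions:

```python
import numpy as np
from sklearn.model_selection import KFold

grid = [(lr, bs) for lr in (1e-6, 1e-5, 2e-5, 5e-5) for bs in (32, 64, 128)]
kf = KFold(n_splits=10, shuffle=True, random_state=42)

X = np.array(train_seqs, dtype=object)
y = np.array(train_y)

scores = {}
for lr, bs in grid:
    fold_losses = [
        train_one_fold(X[tr], y[tr], X[va], y[va], lr=lr, batch_size=bs)
        for tr, va in kf.split(X)
    ]
    scores[(lr, bs)] = np.mean(fold_losses)   # average loss over 10 folds

best_lr, best_bs = min(scores, key=scores.get)  # then retrain one final model
```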
Once training was complete, the models were saved and their weights frozen.
The trained models were combined according to the diagram below to construct the AFP Prediction pipeline:
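Assuming the standard two-stage arrangement implied by the qualitative and quantitative models (classify first, then predict MIC only for sequences passing the classifier), one plausible reading of the pipeline is:

```python
import torch

def predict_afp(seq):
    """Return a predicted MIC for an AFP candidate, or None if rejected."""
    batch = tokenizer(" ".join(seq), return_tensors="pt")
    with torch.no_grad():
        # Stage 1: qualitative screen.
        if clf_model(**batch).logits.argmax(-1).item() != 1:
            return None
        # Stage 2: quantitative MIC prediction for predicted AFPs.
        return reg_model(**batch).logits.item()
```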
We evaluated the qualitative model on the validation set using common classification metrics: accuracy, precision, recall, and F1 score. These metrics assess how well the model distinguishes AFP from non-AFP sequences.
To assess the quantitative model, we used regression metrics, including mean squared error (MSE), mean absolute error (MAE), and Pearson's correlation coefficient. These provide insight into how well the model predicts AFP MIC values.
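All of these metrics can be computed with scikit-learn and SciPy; y_true/y_pred and m_true/m_pred below are the assumed validation-set labels and predictions for the two models:

```python
from scipy.stats import pearsonr
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, balanced_accuracy_score,
                             matthews_corrcoef, mean_absolute_error,
                             mean_squared_error)

# Classification metrics for the qualitative model.
acc  = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred)
rec  = recall_score(y_true, y_pred)
f1   = f1_score(y_true, y_pred)
bacc = balanced_accuracy_score(y_true, y_pred)
mcc  = matthews_corrcoef(y_true, y_pred)

# Regression metrics for the quantitative (MIC) model.
mae  = mean_absolute_error(m_true, m_pred)
mse  = mean_squared_error(m_true, m_pred)
rmse = mse ** 0.5
r, _ = pearsonr(m_true, m_pred)
```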
For more detailed results and analysis, please refer to the [[Measurement]] section. Below is a summary of the key performance results:
Qualitative model (classification):
Accuracy: 99.43%
True Negative Rate (TNR): 99.66%
Balanced Accuracy: 93.40%
Recall: 87.14%
Precision: 82.75%
MCC: 0.85
F1 Score: 84.89%
Quantitative model (regression):
Pearson's Coefficient: 0.62
Mean Absolute Error (MAE): 0.40
Mean Squared Error (MSE): 0.37
Root Mean Squared Error (RMSE): 0.60
The pipeline was used to predict AFPs against C. albicans strain SC5314 from protein sequences in public databases such as NCBI, yielding 34 candidate antifungal peptide sequences with potentially good antimicrobial activity. Their predicted MIC values are shown in the table below: