1) Data Collection
AFP Collection
AFP data was collected from antifungal peptide databases, including DBAASP, dbAMP, DRAMP, APD3, CAMPR4, and LAMP2. Peptide sequences containing 2-50 amino acids were selected, resulting in a total of 6,385 AFP sequences, of which 3,905 had associated MIC data. Among these, 3,294 MIC data points were measured against Candida albicans.
Non-AFP Collection
The non-AFP dataset was downloaded from UniProt after removing any entries annotated with the following keywords: antimicrobial, antibiotic, antiviral, antifungal, effector, or excreted. Peptides containing 3-50 amino acids were selected, resulting in a total of 3,828,616 non-AFP sequences.
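As an illustration, the filtering step might look like the sketch below. The file name is hypothetical, and the assumption that the keywords can be matched in the FASTA headers is a simplification (real UniProt exports keep keyword annotations in separate fields, so the matching should be adapted to the export format):

```python
# Minimal sketch of the non-AFP filtering step.
EXCLUDE = ("antimicrobial", "antibiotic", "antiviral",
           "antifungal", "effector", "excreted")

def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, seq = None, []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(seq)
                header, seq = line[1:], []
            else:
                seq.append(line)
        if header is not None:
            yield header, "".join(seq)

non_afp = [
    seq for header, seq in read_fasta("uniprot.fasta")  # hypothetical path
    if 3 <= len(seq) <= 50                               # length filter
    and not any(kw in header.lower() for kw in EXCLUDE)  # keyword filter
]
```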
2) Data Splitting
The collected data was split into training, test, and validation sets in an 8:1:1 ratio.
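For example, the split can be performed with scikit-learn's train_test_split; the random seed and the stratification on labels are illustrative choices, not taken from the original protocol, and `sequences`/`labels` are the assumed outputs of the collection step:

```python
# 8:1:1 split: hold out 20%, then split that half into test and validation.
from sklearn.model_selection import train_test_split

train_seqs, rest_seqs, train_y, rest_y = train_test_split(
    sequences, labels, test_size=0.2, random_state=42, stratify=labels)
test_seqs, val_seqs, test_y, val_y = train_test_split(
    rest_seqs, rest_y, test_size=0.5, random_state=42, stratify=rest_y)
```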
3) Data Preprocessing
AFP and non-AFP datasets were converted into fixed-size vectors, with the 20 standard amino acids (AA) represented as numeric values from 1 to 20:
If a sequence had fewer than 50 AAs, it was padded with zeros.
A label of 1 or 0 was added as a classification tag to indicate whether the sequence was an AFP or a non-AFP; for the quantitative dataset, the MIC value was used as the label instead.
Sequences were tokenized with the model's tokenizer, and sequence beginnings and ends were marked with [CLS] and [SEP] tokens.
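The two representations above can be sketched as follows. The amino-acid ordering in the vocabulary, the DeBERTa checkpoint name, the example peptide, and the space-separated-residue convention are assumptions for illustration:

```python
# Fixed-size integer encoding (residues -> 1..20, zero-padded to 50).
from transformers import AutoTokenizer

AA_VOCAB = {aa: i + 1 for i, aa in enumerate("ACDEFGHIKLMNPQRSTVWY")}
MAX_LEN = 50

def encode_fixed(seq):
    """Map a peptide (<= 50 residues) to a length-50 integer vector."""
    ids = [AA_VOCAB[aa] for aa in seq]
    return ids + [0] * (MAX_LEN - len(ids))

# Transformer-style tokenization; the checkpoint name is an assumption.
# Space-separating residues makes each amino acid its own token, and the
# tokenizer adds [CLS]/[SEP] automatically.
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
batch = tokenizer(
    " ".join("GIGKFLHSAKKFGKAFVGEIMNS"),          # example peptide
    padding="max_length", max_length=MAX_LEN + 2,  # room for [CLS]/[SEP]
    truncation=True, return_tensors="pt")
```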
The pre-trained DeBERTa model from the Hugging Face Transformers library was loaded and fine-tuned for the antifungal peptide sequence task.
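A minimal loading sketch follows; the exact checkpoint is not stated in the source, so microsoft/deberta-v3-base is an assumption. In the Transformers convention, num_labels=1 turns the classification head into a regression head:

```python
# Load two copies of the pre-trained backbone: one classifier, one regressor.
from transformers import AutoModelForSequenceClassification

clf_model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-base", num_labels=2)   # qualitative: AFP vs non-AFP
reg_model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-base", num_labels=1)   # quantitative: MIC regression
```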
Model Fine-Tuning
Based on previous reports (Ma et al., 2022), both a qualitative (classification) model and a quantitative (MIC regression) model were trained.
The models were trained using PyTorch with the Adam optimizer (default parameters otherwise), a batch size of 64, and a learning rate of 2×10⁻⁵.
Early stopping was applied to prevent overfitting: the validation loss was monitored, and training stopped once it began to increase.
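A minimal sketch of this loop, reusing clf_model from the loading step above and assuming DataLoaders (train_loader, val_loader, hypothetical names) that yield dicts of tokenized inputs including a "labels" tensor; the epoch cap is illustrative:

```python
import torch

optimizer = torch.optim.Adam(clf_model.parameters(), lr=2e-5)
best_val = float("inf")

for epoch in range(20):                       # upper bound on epochs
    clf_model.train()
    for batch in train_loader:
        loss = clf_model(**batch).loss        # HF returns loss when labels given
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    clf_model.eval()
    with torch.no_grad():
        val_loss = sum(clf_model(**b).loss.item() for b in val_loader)

    if val_loss >= best_val:                  # stop once validation loss rises
        break
    best_val = val_loss
    torch.save(clf_model.state_dict(), "best_model.pt")
```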
Additionally, 10-fold cross-validation was employed: the training set was split into ten parts, with one part used for validation and the remaining nine for training in each fold. Hyperparameters were kept consistent across folds, and the average training and validation losses of the 10 models were used to assess performance. After identifying the optimal hyperparameters, the entire dataset was used to train one final model.
Hyperparameters such as the learning rate and batch size were adjusted during training: the learning rate was tuned between 1×10⁻⁶ and 5×10⁻⁵, and the batch size between 32 and 128.
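A sketch of the combined 10-fold/grid protocol; the intermediate grid values within the stated ranges and the train_one_fold helper (a hypothetical wrapper around the training loop above, returning a fold's final validation loss) are assumptions:

```python
import numpy as np
from sklearn.model_selection import KFold

grid = [(lr, bs) for lr in (1e-6, 1e-5, 2e-5, 5e-5) for bs in (32, 64, 128)]
kf = KFold(n_splits=10, shuffle=True, random_state=42)

X = np.array(train_seqs, dtype=object)
y = np.array(train_y)

scores = {}
for lr, bs in grid:
    fold_losses = [
        train_one_fold(X[tr], y[tr], X[va], y[va], lr=lr, batch_size=bs)
        for tr, va in kf.split(X)
    ]
    scores[(lr, bs)] = np.mean(fold_losses)   # average loss over 10 folds

best_lr, best_bs = min(scores, key=scores.get)  # then retrain one final model
```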
Once training was complete, the models were saved and their weights frozen.
The trained models were combined according to the diagram below to construct the AFP Prediction pipeline:
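Assuming the standard two-stage arrangement implied by the qualitative and quantitative models (classify first, then predict MIC only for sequences passing the classifier), one plausible reading of the pipeline is:

```python
import torch

def predict_afp(seq):
    """Return a predicted MIC for an AFP candidate, or None if rejected."""
    batch = tokenizer(" ".join(seq), return_tensors="pt")
    with torch.no_grad():
        # Stage 1: qualitative screen.
        if clf_model(**batch).logits.argmax(-1).item() != 1:
            return None
        # Stage 2: quantitative MIC prediction for predicted AFPs.
        return reg_model(**batch).logits.item()
```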
We evaluated the qualitative model on the validation set using common classification metrics: accuracy, precision, recall, and F1 score. These metrics assess how well the model distinguishes AFP from non-AFP sequences.
To assess the quantitative model, we used regression metrics, including mean squared error (MSE), mean absolute error (MAE), and Pearson's correlation coefficient. These provide insight into how well the model predicts AFP MIC values.
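All of these metrics can be computed with scikit-learn and SciPy; y_true/y_pred and m_true/m_pred below are the assumed validation-set labels and predictions for the two models:

```python
from scipy.stats import pearsonr
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, balanced_accuracy_score,
                             matthews_corrcoef, mean_absolute_error,
                             mean_squared_error)

# Classification metrics for the qualitative model.
acc  = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred)
rec  = recall_score(y_true, y_pred)
f1   = f1_score(y_true, y_pred)
bacc = balanced_accuracy_score(y_true, y_pred)
mcc  = matthews_corrcoef(y_true, y_pred)

# Regression metrics for the quantitative (MIC) model.
mae  = mean_absolute_error(m_true, m_pred)
mse  = mean_squared_error(m_true, m_pred)
rmse = mse ** 0.5
r, _ = pearsonr(m_true, m_pred)
```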
For more detailed results and analysis, please refer to the [[Measurement]] section. Below is a summary of the key performance results:
Qualitative model (classification):
Accuracy: 99.43%
True Negative Rate (TNR): 99.66%
Balanced Accuracy: 93.40%
Recall: 87.14%
Precision: 82.75%
MCC: 0.85
F1 Score: 84.89%
Quantitative model (regression):
Pearson's Coefficient: 0.62
Mean Absolute Error (MAE): 0.40
Mean Squared Error (MSE): 0.37
Root Mean Squared Error (RMSE): 0.60
The pipeline was used to predict AFPs against C. albicans strain SC5314 from protein sequences in public databases such as NCBI, yielding 34 candidate antifungal peptide sequences with potentially good antimicrobial activity. Their predicted MIC values are shown in the table below: