Overview of Multimodal Machine Learning Model

We created and trained a multi-stream neural network model that uses the following data sets to predict whether a molecule-E. coli promoter combination generates a fluorescent response:

1. Molecule-promoter fluorescence data collected by our research team

2. Molecule diagrams (chemical structures)

3. Protein promoter sequences

4. DNA nucleotide sequences (as matched to each protein promoter sequence)

This model follows three main steps: data cleaning, model building and training, and testing/output validation. As part of data cleaning, each DNA sequence string was encoded with a truncated hash function, and each protein promoter sequence was encoded with a label-encoding function, converting these string inputs into numerical inputs the model can process. We stored each group of data together with its corresponding molecular structure (as a PNG image) in an .npz archive for the model to read. The multi-stream neural network then processes both the image stream and the statistical stream to learn a correlation between the inputs and the fluorescence generated.
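As a rough sketch of the encoding step, assuming a hashlib-based truncated hash and scikit-learn's LabelEncoder (the hash width, sequences, and file names below are illustrative, not our exact pipeline):

```python
import hashlib
import numpy as np
from sklearn.preprocessing import LabelEncoder

def encode_dna(seq: str, n_hex_digits: int = 8) -> int:
    """Truncated hash: map a DNA string to a fixed-width integer."""
    digest = hashlib.md5(seq.encode("utf-8")).hexdigest()
    return int(digest[:n_hex_digits], 16)

# Label-encode the promoter sequence strings into integer codes.
promoter_seqs = ["MKLSAT", "MVHLTP"]          # hypothetical sequences
promoter_codes = LabelEncoder().fit_transform(promoter_seqs)

dna_codes = np.array([encode_dna(s) for s in ["ATGCGT", "TTGACA"]])

# Bundle the numeric features with the matching molecule diagram
# (the PNG loaded as an array) into one .npz archive per sample.
np.savez("sample_0.npz",
         image=np.zeros((64, 64, 3), dtype=np.float32),  # placeholder PNG
         promoter=promoter_codes[0], dna=dna_codes[0], label=1)
```

A two-stream network of the kind described can be sketched in Keras as follows (layer sizes and names are assumptions, not our exact architecture): a convolutional stream reads the molecule diagram, a dense stream reads the encoded sequence and fluorescence features, and the two are merged before the binary fluorescence output.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Image stream: the molecule diagram.
img_in = keras.Input(shape=(64, 64, 3), name="diagram")
x = layers.Conv2D(16, 3, activation="relu")(img_in)
x = layers.MaxPooling2D()(x)
x = layers.Flatten()(x)

# Statistical stream: encoded promoter, hashed DNA, fluorescence metadata.
num_in = keras.Input(shape=(3,), name="features")
y = layers.Dense(32, activation="relu")(num_in)

# Merge the streams and predict whether the pair fluoresces.
z = layers.concatenate([x, y])
z = layers.Dense(16, activation="relu")(z)
out = layers.Dense(1, activation="sigmoid")(z)

model = keras.Model(inputs=[img_in, num_in], outputs=out)
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
```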

Statistical Outputs of ML Model

Figure 1. Line graph plotting the validation loss and overall loss over the number of epochs (iterations) of the model without including DNA sequence data.


Figure 2. Line graph plotting the validation loss and overall loss over the number of epochs (iterations) of the model with DNA sequence data included.


These two figures show the training and validation loss. The teal line represents the model's error on the training data, which decreased over successive epochs (training iterations) in both figures. The orange line shows the loss on unseen data (the validation set); this metric helps identify overfitting, where the model memorizes the training data but fails to generalize to new examples. Since the training and validation losses decrease steadily together, the model is learning effectively rather than overfitting. The values at which the two losses plateau indicate the model's performance once optimization has converged toward a minimum.
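Curves like those in Figures 1 and 2 come directly from the per-epoch loss history; a minimal sketch of how they can be produced with Keras, reusing the model from the sketch above (the random arrays stand in for our real training data):

```python
import numpy as np
import matplotlib.pyplot as plt

# Stand-ins for the real training arrays (shapes match the model above).
train_images = np.random.rand(100, 64, 64, 3)
train_features = np.random.rand(100, 3)
train_labels = np.random.randint(0, 2, size=100)

# model.fit returns a History object whose .history dict stores the
# per-epoch training loss and validation loss.
history = model.fit([train_images, train_features], train_labels,
                    validation_split=0.2, epochs=50, verbose=0)

plt.plot(history.history["loss"], label="training loss (teal)")
plt.plot(history.history["val_loss"], label="validation loss (orange)")
plt.xlabel("epoch")
plt.ylabel("loss")
plt.legend()
plt.show()
```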

Figure 3. Bar chart of the class distribution, showing the percentage of data points in each class.


This chart shows the class distribution of the data. An ideal dataset would have images and metadata distributed evenly across both classes to ensure that the model can learn patterns from each class. Balance prevents the model from being biased toward one class over the other and helps it form accurate patterns for distinguishing them. As the graph above indicates, the model could be further improved by adding additional class types.
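The distribution in Figure 3 can be checked with a few lines of NumPy, assuming the binary labels are collected into a single array (the file name here is hypothetical):

```python
import numpy as np

labels = np.load("all_labels.npy")  # hypothetical aggregated label array
classes, counts = np.unique(labels, return_counts=True)
for c, n in zip(classes, counts):
    print(f"class {c}: {n} samples ({100 * n / labels.size:.1f}%)")
```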

Figure 4. ROC curve plotting the true positive rate against the false positive rate, with an area under the curve of 0.96.


The ROC curve above plots the true positive rate against the false positive rate for the testing data. The ideal point on this curve is (0, 1), where the false positive rate is zero and the true positive rate is 100%. The model chooses an operating point on this curve that maximizes the true positive rate while keeping the false positive rate as low as possible. The area under the curve is 0.96, which summarizes how well the model separates the two classes on the testing data. This is a visual representation of how much information the model provides about whether a given molecule-promoter pair fluoresces: a random guess would produce an area of 0.5, and at 0.96 the model sits well above that baseline.
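The curve and its area can be reproduced from the model's predicted probabilities on the test set; a sketch using scikit-learn, assuming held-out test arrays shaped like the training data:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# Predicted fluorescence probabilities on the held-out test set.
probs = model.predict([test_images, test_features]).ravel()

fpr, tpr, thresholds = roc_curve(test_labels, probs)
auc = roc_auc_score(test_labels, probs)  # 0.96 for our model

plt.plot(fpr, tpr, label=f"AUC = {auc:.2f}")
plt.plot([0, 1], [0, 1], linestyle="--", label="random guess (AUC = 0.5)")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```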

Figure 5. Bar graph depicting the training, validation, and testing accuracies, along with the sensitivity and specificity, of the model without DNA sequences, for the regular, dropout, and regularization configurations.


Figure 6. Bar graph depicting the training, validation, and testing accuracies, along with the sensitivity and specificity, of the model with DNA sequences included, for the regular, dropout, and regularization configurations.


Figure 7. Table of the percentage values used to generate the charts above.


The graphs above show the effect of adding dropout and regularization constraints to the model architecture. The general trend is that as we add dropout and regularization, the training accuracy decreases while the validation and testing accuracies increase. This is expected, because these techniques reduce overfitting: they keep the model from latching onto spurious patterns during training that would lead to wrong predictions during testing. Although adding dropout and regularization lowered the overall training accuracy, it significantly increased every other metric, proving to be a beneficial addition to the model.
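In Keras terms, these constraints amount to inserting Dropout layers and attaching weight penalties to the dense layers; a minimal sketch (the dropout rate and L2 strength are illustrative, not our tuned values):

```python
from tensorflow import keras
from tensorflow.keras import layers, regularizers

inp = keras.Input(shape=(3,))
# The L2 penalty shrinks large weights, and dropout randomly zeroes a
# fraction of activations each training step; both discourage the model
# from memorizing the training set.
h = layers.Dense(32, activation="relu",
                 kernel_regularizer=regularizers.l2(1e-4))(inp)
h = layers.Dropout(0.3)(h)
out = layers.Dense(1, activation="sigmoid")(h)
model_reg = keras.Model(inp, out)
```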

While the training accuracy for the model with and without nucleotide sequences shows little to no increase, the validation and testing accuracies increase substantially. The additional data allows the model to form more underlying connections, which supports more accurate classification that carries over to the validation and testing data.

Figure 8. Bar graph depicting the training, validation, and testing accuracies, along with the sensitivity and specificity, of the model with and without DNA sequences included, plotted on a scale from 75% to 95%.


The figures above depict the Receiver Operating Characteristic (ROC) curve and the confusion matrices of the model, both key tools for evaluating the model's effectiveness.

The ROC curve plots the true positive rate (y-axis) against the false positive rate (x-axis). An ideal ROC curve stays close to the top-left corner, which indicates a high TPR and a low FPR. The area under the curve (AUC) score, 0.96 in our model, quantifies this performance; since the score is very close to 1.0, the model discriminates between the two classes very well.

The confusion matrices (with and without the promoter-sequence DNA data) show the model's performance, including the counts and percentages of true and false positives and negatives (TP, FP, FN, TN). As is evident in the charts above, adding the promoter data in the form of DNA sequences to the original molecule-promoter fluorescence tables and molecule diagrams makes the model more accurate in training, validation, and testing, and improves its sensitivity and specificity.
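The reported sensitivity and specificity follow directly from the confusion-matrix counts; a sketch with scikit-learn, continuing from the ROC sketch above (the 0.5 decision threshold is an assumption):

```python
from sklearn.metrics import confusion_matrix

preds = (probs >= 0.5).astype(int)  # threshold the predicted probabilities
tn, fp, fn, tp = confusion_matrix(test_labels, preds).ravel()

sensitivity = tp / (tp + fn)  # true positive rate
specificity = tn / (tn + fp)  # true negative rate
print(f"sensitivity = {sensitivity:.2%}, specificity = {specificity:.2%}")
```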