Introduction
Computational modeling and software design can assist with understanding the Ann Arbor 1,4-dioxane plume as well as with creating tools capable of helping others in their sustainability efforts. Our computational modeling efforts are two-pronged. Our first task was to design models capable of determining the biodegradability of a given compound. Our second task was to investigate the physical properties of water with real world concentrations of dissolved 1,4-dioxane. All code, data, and molecular dynamics simulation files can be found on our gitlab.
Biodegradation Model
Motivation
Compounds used in industrial processes, like in the pharmaceutical industry, are sometimes released into the ecosystem with disastrous consequences1,2. These incidents, accidental or otherwise, can carry risks to humans1. Therefore, it is worthwhile to investigate if the compounds they utilize are biodegradable. Biodegradable compounds can be naturally degraded by common enzymes often found in the environment and thus present a safer option for protecting communities1,2.
There are a number of accepted tests for determining biodegradability. One of the most common is the Modified MITI Test2,3,4. The MITI test uses compounds in aqueous bacterial stock and tests the oxygen demand over a period of 28 days17. Due to the time constraint nature of the test as well as other issues, high throughput testing is difficult. Therefore, the use of machine learning (ML) can advantageously reduce the number of samples needed for testing3.
Previous Research
ML approaches have been explored for biodegradability testing. Most of the previous research has been traditional in nature, i.e. not involving deep learning, but a few instances of deep techniques have been implemented. These approaches enable the exploration of different feature generation methods as well as model architectures. The best currently known models are Quantitative Structure-Activity Relationship (QSAR) models, with the two most common techniques being Support Vector Machines (SVMs) and Graph Convolutional Networks (GCNs). However, the many attempts to model biodegradation seem to have a natural performance limit around improving the sensitivity.
Compiled Results of Previous Papers' Attempts to Predict Biodegradation
Paper Name | Year | Accuracy | AUC | Data source | Notes |
---|---|---|---|---|---|
Concomitant prediction of environmental fate and toxicity of chemical compounds1 | 2020 | Regressing Model | .879 | NITE | - |
Modeling the Biodegradability of Chemical Compounds Using the Online CHEmical Modeling Environment (OCHEM) 2 | 2014 | .876 | - | NITE + University Dataset | Required refinement of structures in preprocessing |
Multimodal
Deep Neural Networks using Both Engineered and Learned Representations for Biodegradability Prediction3 |
2018 | .875 | - | MITI Test (NITE) | Combines a neural network with feature based approach |
Modeling of ready biodegradability based on combined public and industrial data sources4 | 2019 | .750 | - | NITE, ECHA, VEGA, EPI Suite, OPERA, Solvay | Had access to large database compared to other papers |
A Comparative Study of the Performance for Predicting Biodegradability Classification: The Quantitative Structure-Activity Relationship Model vs the Graph Convolutional Network19 | 2022 | 0.84 | - | ECHA, NITE, VEGA, EPI Suite, OPERA | Tested a vast range of models including kNNs, SVMs, RFs, GB, and GCNs |
Prediction of biodegradability from chemical structure: Modeling of ready biodegradation test data16 | 2009 | 0.83 | - | NITE (MITI Test), | Studied fragments of the chemicals and their effects on the biodegradability |
Development of models predicting biodegradation rate rating with multiple linear regression and support vector machine algorithms20 | 2020 | 0.77 | - | NITE (MITI Test), | - |
Biodegradability Prediction of Fragrant Molecules by Molecular Topology14 | 2016 | 0.808 | - | Boethling | - |
Machine Learning
1. Dataset
Data provides an incredibly important contribution when designing high fidelity models. We believe that the data published by Lunghini et. al. is the most thorough and well processed available biodegradability dataset4. They aggregate data from the ECHA database, NITE, VEGA, EPI Suite, and OPERA for public availability. Overall, this data contains 2830 samples, with 1097 being biodegradable and 1733 being non-biodegradable4. We removed five molecules due to processing errors. We further split this into 1956, 559, 280 samples for training, testing, and validation purposes respectively (roughly a 70/20/10 split).
2. Machine Learning and Software
We sought to explore a number of ML and artificial intelligence (AI) paradigms in order to thoroughly explore possible solutions. Our first challenge was representing molecules in our ML models. Molecules are real-world objects with properties and structural relationships that are not easily described by numbers. Additionally, ML models are generally strict, such that the input they require is a vector of numbers often correlating to real world properties. To accomplish this stringent task, many software packages have been developed to sufficiently represent molecules.
These softwares, which include DRAGON, Mordred, RDkit Descriptors, etc., generate large numbers of features to describe molecules18. Features quantifying or describing structural or chemical components of molecules can be represented in tabular formats with a known dimensionality. Due to the shape of the data being both universal and known, models can be more easily developed around these data. To that end, we implement feature generation with Mordred, which created 1040 unique descriptors per molecule. We explored Random Forests (RFs), XGBoost, and Naive Bayes models. Additionally, we explored the application of Graph Neural Networks (GNNs), as they do not require strict input dimensions.
3. Tree Based Models
Both RF and XGBoost models are tree based, meaning they use a number of decision trees to drive their predictions. A decision tree splits the dataset at random intervals (nodes) and, as a sample meets a number of criterion, the model is able to generate a class prediction5. The predictive power of one decision tree is not ideal, which is where RF and XGBoost become helpful. RF models average a large number of independent trees, which is called bagging6. XGBoost implements boosting, which is where trees are iteratively designed to correct the errors from previous trees6.
Both RF and XGBoost models are considered ensemble models, a class of models that combines the predictions of multiple constituent models. In general, ensemble models are robust. We also explored the effects of intelligent feature selection to attempt increasing performance by presenting the models with the most relevant features. To this end, we investigated three feature selection techniques. Mutual information characterizes how the knowledge of one variable reduces the uncertainty in the item characterization7. Fisher Score attempts to quantify how well a certain feature is able to separate the class of all the samples in a dataset8. Finally, we investigated the impact of reducing the number of highly correlated features through multicollinearity analysis in an attempt to reduce redundant features9.
We present the results of both RF Models below:
The XGBoost Model results follow:
4. Naive Bayes and Bayesian Averaging
The second broad class of models we investigated were Naive Bayes models. Naive Bayes models are predicated on the famous Bayes Theorem, which provides a convenient framework for evaluating conditional probabilities. Naive Bayes models make assumptions that present critical challenges to the proper application of the model. First, Naive Bayes assumes that features are independent of each other - this is often not true and presents an intrinsic limitation for performance. Second, Naive Bayes models calculate probabilities based on a manually specified probability distribution. These distinct model classes include Gaussian Naive Bayes, Categorical Naive Bayes, and Bernoulli Naive Bayes based on their respective distribution titles15.
We chose to implement three Naive Bayes models. First, we implemented a Gaussian Naive Bayes across the entire dataset. We believe this represents our most naive modeling approach, as it makes no attempt to intelligently select features or model types. Secondly, we chose to design and implement a custom Naive Bayes architecture capable of combining multiple Naive Bayes models using unique assumed probability distributions. To do this, we had to narrow the number of features used, which will be explained more thoroughly in the next paragraph. Our third model was a Gaussian Naive Bayes, but implemented across the same features as the custom Naive Bayes for a more fair comparison.
To select features for custom Naive Bayes architecture, we designed a series of tests to determine if a distribution was binary, categorical, or within a reasonable margin of being Gaussian. Specifically for Gaussian distribution, we used a Shapiro-Wilk test, which technically can only disprove data being drawn from a normal distribution. However, for modeling purposes, we set the P-value threshold to 0.2. We believe that if data was greater than this threshold, it could be modeled somewhat-confidently with a Gaussian distribution. We were able to identify 49 Bernoulli features, 90 categorical features, and 0 Gaussian features. For combining model classes, we used Bayesian Averaging, a popular and simplistic technique to combine the outputs of multiple Bayesian classifiers. The simplicity of Bayesian Averaging lies in its definition as a linear combination of the outputs of multiple naive Bayes models, weighted by their respective posterior possibilities15.
Finally, it is worth noting that while we compared models across a variety of metrics, no truly fair comparison could be achieved. Categorical Naive Bayes are unable to make predictions on classes which they were not exposed to during the training phase. Therefore, three samples of the testing data had to be excluded from the testing set due to introducing previously unseen classes. While we don’t believe this severely modifies the testing metrics, it is a limitation.
5. Deep Learning and Graph Neural Networks
An alternative approach to modeling is achievable through deep learning. While many deep learning models such as multi-layer perceptrons (MLPs) and convolutional neural networks require geometrically constrained data, GNNs allow for graphs of any shape to be processed10. Graphs are a mathematical structure represented by a number of nodes and edges. Naturally, molecules can be represented through these structures where atoms are interpreted as nodes and bonds as edges. Furthermore, a feature vector can be associated with each node10. This framework suggests a very natural way to do computational work with molecules as their true structure (or something much closer to) can be represented to the model.
We tried a variety of graph neural networks and feature generators. Ultimately, we landed on the combination of a graph convolution model with features generated by ConvMolFeaturizer, which generates a feature for each atom12,28. Both of these classes were provided by the Deepchem package12,28.
The technique of transfer learning is often used in machine learning to increase performance. While this is usually done in the area of computer vision, we also explored its application in biodegradability. Essentially, transfer learning works by training a model on tangentially related data with more samples which gives the model increased inferencing abilities when applied to the data of interest. To make the model predict on the key data, the model must be fine-tuned with a subset of the actual data13.
To accomplish this for biodegradability, we trained a GraphConvModel using Tox21 data, a dataset that classifies molecules as toxic or non-toxic28. After training, we fixed the GNN weights and fed the intermediate vector into a MLP which we fine-tuned on the biodegradability data.
Conclusion and Discussion
These experiments reveal a few trends in machine learning. Generally, we see robust ensemble models (XGBoost and RF) outperform the rest. Even when performing feature selection, base models tend to perform on par or better than models trained with selected data. This may be due to a number of reasons. First, the decision tree basis of these models inherently makes the models consider the conditional set of features of a molecule. Individual features may not have very high predictive power, but a conditional sequence of features might. Additionally, because there are many voters in ensemble modeling, these models tend to be robust in our experience and show higher immunity to variation in the data. Finally, we note that XGBoost outperforms RF and all other model classes. We believe that this is likely due to the boosted nature of XGBoost. As noted in previous sections, the dataset is skewed. By iteratively correcting for errors, the XGBoost may be less likely to vote naively after training.
Our Naive Bayes models performed relatively poorly, however it is clear that the method of Bayesian averaging boosts performance substantially. One reason for the less impressive performance of the averaged Naive Bayes model is due to the massive reduction in the number of features utilized. Because most features’ distributions could not be characterized, the averaged model used roughly an order of magnitude less features than other model classes. This could hamper overall performance.
Unfortunately, our GNN models underperformed from what we were expecting. For the non-transfer learning approach, we did see learning occur and general convergence of performance after about 10 epochs of training. However, the model seems to be sensitive to epochs afterward and does not continue increasing performance metrics convincingly. This may be due to the lack of a larger dataset, variability in the dataset, or the network may simply have had trouble making sense of the feature vectors initially supplied. One of the downsides of deep-learning is that models often act as a black-box: we cannot know or make sense of what happens internally. Transfer learning performed worse. While the GNN exhibited convergence on the Tox21 dataset, the MLP performed about randomly on the biodegradability dataset. This could be due to lack of overlap between the datasets, not powerful enough feature representation being achieved on Tox21, or improper transfer scheme. While the transfer performance was underwhelming, it is not surprising, as the GNN performance on the biodegradability dataset was not especially effective, since our GNN did not perform especially well on the Tox21 dataset.
Overall, we believe that the base XGBoost model is our overall best. On the testing set data, we achieve the following performance: Accuracy : 0.855, Specificity: 0.891, Sensitivity: 0.799, AUC: 0.925, and MCC Score: 0.694. To further verify performance, we applied this model to the validation dataset and achieved the following performance: Accuracy: 0.875, Specificity: 0.918, Sensitivity: 0.809, AUC: 0.934, and MCC Score: 0.736. Availability of our model is through a reproducible training pipeline in the Gitlab repository, which allows anyone to recreate our best performing model. This is available through download and can be used to generate predictions on molecules characterized by the same Mordred features we use.
Molecular Dynamic Simulation of 1,4-Dioxane Dissolved in Water
In parallel with machine learning-based models to study potential biodegradability of compounds, our team also aimed to study any effect that 1,4-dioxane in our problem scope may have on the viscosity of water, thereby influencing the function of our proposed bioreactor. We hypothesized that the current highest concentration of 1,4-dioxane in water should not have an effect on the macroscopic physical properties of water. For this application, we used LAMMPS (Large-scale Atomic/Molecular Massively Parallel Simulator), a versatile and widely used molecular dynamics simulation software designed to model particles in a wide range of materials capable of simulating complex systems with millions of particles. Additionally, LAMMPS supports a wide variety of force fields and potential models, making it flexible and adaptable to a wide range of scientific research applications21.
Current steps
First, we used the Automated Topology Builder (ATB) to generate a structure for 1,4-dioxane that can be used for LAMMPS. Then, force field template (GROMOS 54) and molecule data file containing key information (charges, bond, pair, etc.) were extracted from ATB into a Linux environment. Moltemplate, a LAMMPS-associated software package, was used to generate structure files in a LAMMPS-readable format for 1,4-dioxane. An SPCE water data file was then added to generate a water box of 2.5 million H2O molecules for a 1,4-dioxane molecule to mirror the 1900 ppb of 1,4-dioxane in water (highest recorded concentration in Ann Arbor). Larger systems might be more accurate, though we expected minimal differences in the macro properties between water and water containing 1,4-dioxane of this concentration. Note that there is likely overlap, and moltemplate doesn’t seem to have a function to check for molecule overlap/energy errors. This will need to be manually corrected in LAMMPS later on. Another way is to using the 1,4-dioxane and H2O mol file and randomly generate designated numbers of each molecule in a simulation box in LAMMPS.
Users who wish to expand on this can imitate the following pipeline. Moltemplate will produce several outputs — take the system.data file and the parm.lammps files since they contain the simulation setup for LAMMPS input and the corresponding parameters for corresponding atoms, respectively.
In LAMMPS, group H2O and 1,4-dioxane based on their assigned atom numbers from moltemplate. Delete H2O molecules that overlap with 1,4-dioxane. The setup is now ready. Minimize and run the simulation. The method we used is the periodic perturbation method, the primary built-in viscosity calculation mode in LAMMPS. This method works by measuring the velocity profile in response to applied stress. A cosine-shaped periodic acceleration is added to the system via the fix accelerate/cos command, and the compute viscosity/cos command is used to monitor the generated velocity profile and remove the velocity bias before thermostatting. The reciprocal of eta (viscosity) is computed within the script, and printed out as v_invVis in thermo_style command. Then, eta is obtained from the reciprocal of time average of v_invVis to be about 0.75 eta, which is the typical value of water simulation at this time point. This value is expected to reach one at several hundred picoseconds, as is the case for water viscosity simulation using this method, which can be verified using devices with higher computational power as the wall time for a system of this size is significant on personal computers.
All code, data, and molecular dynamics simulation files can be found on our gitlab.