
Software

ChromoBlend

Introduction to Modelling

KlothY integrates a variety of complex systems with numerous interacting components. Within this framework, modelling is a fundamental tool, helping us make sense of these intricate systems and predict the outcomes of various biological processes. Our modelling approaches combine theoretical insights with hands-on experimental work, effectively bridging the gap between the wet lab and the dry lab. Throughout this project, we use various modelling techniques, including the development of software for predicting chromoprotein volumes, metabolic modelling to optimise bacterial cellulose (BC) production, and life cycle analysis (LCA) to evaluate the sustainability of our practices.

Our Aim

Accurately replicating colours using chromoproteins can be challenging. The process of mixing them often relies on trial and error, which can be time-consuming and inefficient, leading to inconsistent results. Recognising these challenges, we reached out to Professor Oliver Ebenhöh, a leading expert in mathematical modelling. During our discussions, he highlighted the potential of mathematical modelling in dye applications, particularly in the context of the Cyan, Magenta, Yellow and Key (CMYK) colour model, a subtractive model widely used in colour printing (Westland and Cheung, 2023). Inspired by this concept, we aimed to develop software capable of predicting the precise amounts of chromoproteins needed to create any desired hue.

Our approach focuses on the CMYK colour model, using aeCP597 for Cyan (Shkrob et al., 2005), eforCP for Magenta (Alieva et al., 2008), and fuGFP for Yellow, intentionally leaving out the K component due to the lack of a matching chromoprotein. By mixing these three chromoproteins, we can achieve a wide spectrum of colours, allowing the reproduction of any desired shade. Our software streamlines this process by allowing users to input a HEX code, a widely used digital colour format, and then calculating the precise volumes of each chromoprotein needed to replicate that colour. This allows labs and industries to consistently replicate specific colours without the need for manual adjustments or repeated trial and error.

Developing the Software

In order to create reliable software for predicting dye concentrations, we needed a dataset that accurately reflected the relationship between dye proportions and the final colour outcome. Rather than relying on literature values, we generated our own dataset by systematically mixing the three chromoproteins (aeCP597 for Cyan, eforCP for Magenta, and fuGFP for Yellow) and then measuring the resulting colours in terms of their RGB values.

To predict the required concentration of chromoproteins for any given RGB input, several methods can be considered. One potential approach is Linear Regression, which we decided against because it relies on the assumption of linearity in the data (Schmidt and Finan, 2018). Another option was Random Forest, but with our relatively small dataset of only 100 samples it would not have been reliable, as Random Forest typically performs better on larger datasets (Couronné et al., 2018). Ultimately, we chose K-Nearest Neighbour (KNN), a widely used supervised machine learning method (Uddin et al., 2022). Essentially, KNN works by comparing the input RGB values to the closest matches in our dataset and averaging the corresponding CMY chromoprotein volumes of those nearest neighbours. This approach is ideal due to its simplicity and its ability to handle non-linear relationships, making it well suited for data with potentially complex patterns (Kaliappan et al., 2021).
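The nearest-neighbour averaging described above can be sketched as follows. This is a minimal illustration using scikit-learn, with a tiny set of hypothetical RGB measurements and chromoprotein volumes (all values invented for demonstration, not taken from our dataset):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical example data: each RGB measurement (features) is paired with
# the chromoprotein volumes (aeCP597, eforCP, fuGFP) that produced it.
rgb_measured = np.array([[200, 40, 40],
                         [40, 200, 40],
                         [40, 40, 200],
                         [120, 120, 40]], dtype=float)
cmy_volumes = np.array([[5.0, 80.0, 70.0],
                        [70.0, 10.0, 75.0],
                        [10.0, 70.0, 5.0],
                        [30.0, 30.0, 60.0]])

# k=2 here only because the toy set is tiny; the final model uses k=5.
knn = KNeighborsRegressor(n_neighbors=2)
knn.fit(rgb_measured, cmy_volumes)

# The prediction is the average of the volumes of the k nearest colours.
predicted = knn.predict([[100, 110, 50]])
```

For this toy input, the two nearest colours are (120, 120, 40) and (40, 200, 40), so the prediction is simply the mean of their volume vectors.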

To ensure that KNN works optimally with our dataset, several parameters needed to be set. One of the most critical parameters is k, which specifies how many neighbours the algorithm considers when making a prediction (Zhang, 2016). If k is set too low, the model risks overfitting, becoming overly sensitive to small variations in the data. On the other hand, if k is too large, the model may oversimplify and lose accuracy. The challenge, therefore, is to find the optimal k value that balances accuracy without overfitting the data (Zhang, 2016). Additionally, scaling the RGB input values was essential to improve the model’s accuracy. Since KNN is sensitive to the scale of input data, we had to standardise the RGB values to ensure that one colour channel, for example Red, does not dominate the predictions. We explored two scaling methods to determine the most suitable one. Z-score normalisation scales the data based on its mean and standard deviation, making it effective when the dataset contains outliers (Andrade, 2021). Min-max normalisation, on the other hand, scales all data points between 0 and 1, which is useful when the dataset has a limited range of values (Lambert et al., 2024).
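The two scaling options can be compared side by side. A brief sketch, again with invented RGB values, using scikit-learn's standard implementations of both transforms:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Hypothetical RGB measurements (one row per mixture).
rgb = np.array([[200.0, 40.0, 40.0],
                [40.0, 200.0, 40.0],
                [120.0, 120.0, 40.0]])

# Z-score normalisation: each channel rescaled to mean 0, std 1.
z_scaled = StandardScaler().fit_transform(rgb)

# Min-max normalisation: each channel mapped into the range [0, 1].
mm_scaled = MinMaxScaler().fit_transform(rgb)
```

After either transform, no single channel dominates the Euclidean distances that KNN computes, which is the property we needed.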

To determine the best combination of scaling method and k value, we used Root Mean Squared Error (RMSE) as our performance indicator (Ajala et al., 2022). RMSE evaluates how closely the predicted chromoprotein volumes align with the actual values. To ensure that this measure is reliable, RMSE must be calculated on unseen data that the model has not been trained on (Ajala et al., 2022). Therefore, we split the dataset into training and testing sets using an 80/20 ratio, where 80% of the data trains the model and 20% is reserved for testing. This approach helps achieve a balance between overfitting, where the model is too closely tuned to the training data, and underfitting, where it fails to capture important patterns (Rácz et al., 2021). By computing RMSE on the test set, we can assess how well the model generalises to unseen data. Ultimately, the lower the RMSE value, the more accurate the model (Forrest et al., 2023). For our dataset, we compared the RMSE values for different scaling methods and k values and, though the values are quite similar, decided that Z-score normalisation with k=5 gives us the most accurate prediction.
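The selection procedure described above can be sketched as a small grid search. The data here is randomly generated purely for illustration (our real dataset consisted of roughly 100 measured mixtures), and the specific k values tried are an assumption:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.metrics import mean_squared_error

# Stand-in data: 100 RGB inputs and corresponding CMY volumes.
rng = np.random.default_rng(0)
X = rng.uniform(0, 255, size=(100, 3))
y = rng.uniform(0, 100, size=(100, 3))

# 80/20 train/test split, as described in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

results = {}
for name, scaler in [("z-score", StandardScaler()),
                     ("min-max", MinMaxScaler())]:
    Xtr = scaler.fit_transform(X_train)   # fit scaler on training data only
    Xte = scaler.transform(X_test)        # apply the same transform to test data
    for k in [1, 3, 5, 7, 9]:
        model = KNeighborsRegressor(n_neighbors=k).fit(Xtr, y_train)
        rmse = np.sqrt(mean_squared_error(y_test, model.predict(Xte)))
        results[(name, k)] = rmse

# The combination with the lowest held-out RMSE wins.
best = min(results, key=results.get)
```

Note that the scaler is fitted on the training set only and then applied to the test set, so no information from the held-out data leaks into the model.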

Final Code

After optimising the KNN model, we applied it to predict the volumes of chromoproteins needed for any given colour. Firstly, the input HEX code is converted into its corresponding RGB values. These RGB values are then scaled using Z-score normalisation, as it prevents any single colour channel from disproportionately affecting the predictions, a factor we addressed during the optimisation phase. Once the RGB values are appropriately scaled, the KNN algorithm searches the dataset for the nearest matches. Using the optimised value of k (in this case, 5), the model identifies the five closest RGB values in the dataset based on their similarity to the input. The corresponding proportions of the chromoproteins aeCP597 for Cyan, eforCP for Magenta, and fuGFP for Yellow from these neighbouring values are then averaged. This averaging provides an estimate of the volumes of chromoproteins required to reproduce the input colour as accurately as possible.
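The full pipeline (HEX to RGB, Z-score scaling, then KNN with k=5) can be summarised in a short sketch. The training data below is a random placeholder standing in for our measured dataset, and the helper names are our own:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

def hex_to_rgb(hex_code):
    """Convert a '#RRGGBB' HEX code into an (R, G, B) tuple of integers."""
    h = hex_code.lstrip("#")
    return tuple(int(h[i:i + 2], 16) for i in (0, 2, 4))

# Placeholder training data: measured RGB colours and the volumes of
# aeCP597 (C), eforCP (M), and fuGFP (Y) that produced them.
rng = np.random.default_rng(1)
rgb_data = rng.uniform(0, 255, size=(100, 3))
volumes = rng.uniform(0, 100, size=(100, 3))

# Z-score scaling and the optimised k=5 model, as described above.
scaler = StandardScaler().fit(rgb_data)
knn = KNeighborsRegressor(n_neighbors=5).fit(
    scaler.transform(rgb_data), volumes)

def predict_volumes(hex_code):
    """Predict (aeCP597, eforCP, fuGFP) volumes for a HEX colour."""
    rgb = np.array([hex_to_rgb(hex_code)], dtype=float)
    return knn.predict(scaler.transform(rgb))[0]

cyan_vol, magenta_vol, yellow_vol = predict_volumes("#8A2BE2")
```

The same scaler fitted on the dataset is applied to each new input, so the query colour is compared to the training colours in the same standardised space.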

Future Directions & Outlook

Our software has the potential to be used by various laboratories and industries for colour prediction. However, one limitation we currently face is the relatively small size of our dataset. While the KNN model has shown promising results, having a larger dataset would enhance its performance. With more data, we could also explore the use of Random Forest, which is well-suited for larger datasets due to its ability to capture non-linear relationships and patterns, ultimately leading to more accurate predictions (Couronné et al., 2018).

In addition to expanding the dataset, we could implement a feedback loop that allows users to provide insights on the accuracy of the predictions. This feature would enable the model to learn from user input and refine its performance over time, creating an adaptive system that evolves with new data points through an active learning approach.

software link

References

Ajala, Sunday, et al. “Comparing Machine Learning and Deep Learning Regression Frameworks for Accurate Prediction of Dielectrophoretic Force.” Scientific Reports, vol. 12, no. 1, July 2022, https://doi.org/10.1038/s41598-022-16114-5.

Alieva, Naila O., et al. “Diversity and Evolution of Coral Fluorescent Proteins.” PLoS ONE, edited by Hany A. El-Shemy, vol. 3, no. 7, July 2008, p. e2680, https://doi.org/10.1371/journal.pone.0002680.

Andrade, Chittaranjan. “Z Scores, Standard Scores, and Composite Test Scores Explained.” Indian Journal of Psychological Medicine, vol. 43, no. 6, Oct. 2021, https://doi.org/10.1177/02537176211046525.

Couronné, Raphael, et al. “Random Forest versus Logistic Regression: A Large-Scale Benchmark Experiment.” BMC Bioinformatics, vol. 19, no. 1, July 2018, https://doi.org/10.1186/s12859-018-2264-5.

Forrest, Lauren N., et al. “Machine Learning v. Traditional Regression Models Predicting Treatment Outcomes for Binge-Eating Disorder from a Randomized Controlled Trial.” Psychological Medicine, vol. 53, no. 7, 2023, pp. 2777–88.

Kaliappan, Jayakumar, et al. “Performance Evaluation of Regression Models for the Prediction of the COVID-19 Reproduction Rate.” Frontiers in Public Health, vol. 9, Sept. 2021, https://doi.org/10.3389/fpubh.2021.729795.

Lambert, Tamara P., et al. “A Comparison of Normalization Techniques for Individual Baseline-Free Estimation of Absolute Hypovolemic Status Using a Porcine Model.” Biosensors, vol. 14, no. 2, Jan. 2024, p. 61, https://doi.org/10.3390/bios14020061.

Rácz, Anita, et al. “Effect of Dataset Size and Train/Test Split Ratios in QSAR/QSPR Multiclass Classification.” Molecules, vol. 26, no. 4, Feb. 2021, p. 1111, https://doi.org/10.3390/molecules26041111.

Schmidt, Amand F., and Chris Finan. “Linear Regression and the Normality Assumption.” Journal of Clinical Epidemiology, vol. 98, June 2018, pp. 146–51, https://doi.org/10.1016/j.jclinepi.2017.12.006.

Shkrob, Maria A., et al. “Far-Red Fluorescent Proteins Evolved from a Blue Chromoprotein from Actinia equina.” Biochemical Journal, vol. 392, no. 3, Dec. 2005, pp. 649–54, https://doi.org/10.1042/bj20051314.

Uddin, Shahadat, et al. “Comparative Performance Analysis of K-Nearest Neighbour (KNN) Algorithm and Its Different Variants for Disease Prediction.” Scientific Reports, vol. 12, no. 1, Apr. 2022, https://doi.org/10.1038/s41598-022-10358-x.

Westland, Stephen, and Vien Cheung. “CMYK Systems.” Springer eBooks, Springer Nature, Jan. 2023, pp. 1–7, https://doi.org/10.1007/978-3-642-35947-7_13-3.

Zhang, Zhongheng. “Introduction to Machine Learning: K-Nearest Neighbors.” Annals of Translational Medicine, vol. 4, no. 11, June 2016, p. 218, https://doi.org/10.21037/atm.2016.03.37.

Property Testing Machine Software

We wrote a program in the block-based programming language Scratch, with modifications enabling implementation on the LEGO Mindstorms EV3 set, which served as the base for our property testing device (see Hardware). The code makes it possible to control the property testing machine’s winding mechanism with the EV3 Brick’s control panel: the cable is wound up or down for as long as the associated button is pressed. To program this, we used the freely available software “EV3 Classroom” [1]. The file we used, “property testing machine 02.lmsp”, is provided below; as an alternative, a picture of the code is attached (Picture 1).
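The control logic of the Scratch program can be expressed as a simple mapping from button state to motor command. The sketch below is a plain Python rendering of that logic only (no EV3 hardware involved); the button names and the speed value are illustrative assumptions, not taken from the actual .lmsp file:

```python
def winding_speed(up_pressed, down_pressed, speed=30):
    """Return the motor speed command for the winding mechanism:
    positive winds the cable up, negative winds it down, 0 stops.
    If neither button (or both) is pressed, the motor stops."""
    if up_pressed and not down_pressed:
        return speed
    if down_pressed and not up_pressed:
        return -speed
    return 0
```

In the actual program, this check runs in a continuous loop on the EV3 Brick, so releasing the button stops the winding immediately.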


Picture 1: Code for property testing machine “DAS MASCHIN” created with “EV3 Classroom”.

software link

References

  1. https://education.lego.com/de-de/downloads/mindstorms-ev3/software/, last accessed 29 Sept. 2024, 19:42