Inspiration

Coffee, one of the most consumed beverages in the world, is also a major source of employment for many. But this drink isn’t as easy to grow as it is to brew. The world coffee industry is worth about $484 billion and is forecast to reach around $690 billion by 2030, but it does not come without its challenges.

The majority of coffee production takes place in low- and middle-income countries, where the average land holding is about 2 ha (hectares). Farmers tend to pool resources such as labour and equipment for the harvest season. Planning harvest schedules, procuring harvesting equipment and arranging labour in advance all require financial planning and budgeting, often based on the revenue expected from an estimated coffee yield. A misjudged yield or the destruction of crops leads to financial miscalculation, which wastes financial resources and pushes farmers into vicious cycles of debt.

In addition, ever-changing climatic conditions affect coffee-related agricultural practices. They disrupt the growth and harvest cycle, leading to uncertainty about the potential coffee yield in a crop year.

The team met Maurice D’Moss, a former iGEM participant from 2021 and a fellow coffee enthusiast, to discuss the challenges encountered by the coffee industry. He highlighted issues such as yield uncertainty, adulteration, lack of price regulation and farmer debt.

Issues such as weather-related uncertainty and the challenges faced by agriculturists further validated the problem at hand. Drawing inspiration from them, we decided to develop a machine learning model called BREW (Bean Research and Estimate Workflow), a regression model that predicts the coffee yield for a particular year. The software is a simple tool that takes 15 parameters and returns the potential crop output for that year.

Dataset

The publicly available coffee data provided by government websites was found to be scarce and, for the most part, unconsolidated. As a result, we took it upon ourselves to consolidate relevant data from various reputable and verified websites, as well as government organizations. These sources provided location-specific historical data for the required parameters. The resulting dataset is useful not only for training our model but also for future projects in similar domains.

We used the data available from the following websites:

1. Coffee Board of India: data on the parameters “Area under plantation”, “Bearing area”, “Coffee Yield” and “Productivity” for coffee-producing states. The major coffee-producing states in India, namely Karnataka, Kerala and Tamil Nadu, have been divided into 10 districts in total. Other non-traditional coffee-producing regions include Andhra Pradesh, Orissa and the north-eastern region of India.

2. India Water-Resource Information System: provided data on “district-wise monthly rainfall” from 1990 to 2024.

3. NASA POWER DAV: data for “Surface Pressure”, “Temperature”, “Specific Humidity”, “Surface Soil Wetness”, “Temperature Maximum” and “Temperature Minimum” was obtained for the following locations for the years 1990 to 2022.

Location: latitude, longitude (decimal degrees)
Karnataka_Chikmagalur: 13.3392, 75.7785
Karnataka_Kodagu: 12.426, 75.7266
Karnataka_Hassan: 13.0028, 76.1089
Kerala_Wayanad: 11.6104, 75.8024
Kerala_Travancore: 8.5059, 76.6554
Kerala_Nelliampathy: 10.5081, 76.643
Tamil Nadu_Pulneys: 10.24, 77.47
Tamil Nadu_Nilgiris: 11.423, 76.693
Tamil Nadu_Salem: 11.6643, 78.146
Tamil Nadu_Coimbatore: 11.0168, 76.9558

The data was directly queried from the API, hence it hasn't been included in the compiled spreadsheet.
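As a rough illustration of how such a query could be assembled, the sketch below only builds a request URL for one location and makes no network call. The endpoint path, the query keys and the "AG" community value are assumptions about the NASA POWER monthly point API and should be checked against the current POWER documentation; the function name `build_power_url` is hypothetical and is not taken from our repository.

```python
from urllib.parse import urlencode

# Assumed endpoint for the NASA POWER monthly point API (verify before use).
POWER_MONTHLY_URL = "https://power.larc.nasa.gov/api/temporal/monthly/point"

# The six climate parameters named in the text, by their POWER short names.
PARAMETERS = ["PS", "T2M", "QV2M", "GWETTOP", "T2M_MAX", "T2M_MIN"]


def build_power_url(lat, lon, start_year=1990, end_year=2022):
    """Return a request URL for one location; no request is sent here."""
    query = {
        "parameters": ",".join(PARAMETERS),
        "community": "AG",  # assumed value for agricultural use
        "latitude": lat,
        "longitude": lon,
        "start": start_year,
        "end": end_year,
        "format": "JSON",
    }
    # safe="," keeps the comma-separated parameter list readable in the URL.
    return f"{POWER_MONTHLY_URL}?{urlencode(query, safe=',')}"


url = build_power_url(13.3392, 75.7785)  # Karnataka_Chikmagalur
print(url)
```

The same function could be called once per location in the list above, with the responses stored alongside the location name and year.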

4. National Atlas and Thematic Mapping Organisation: provided the soil nitrogen, phosphorus and potassium values in the range “1-3”.

Preview of Dataset:

For more information about the datasets and to access them, please visit the contributions page.

Important parameters

The parameters for the model were finalized after a meeting with the Karnataka Planters’ Association, which provided valuable insights into coffee plantation and the importance of the various parameters relevant to coffee yield.

The 15 parameters we used in training our model are:

• Location: Coffee-growing location (must be one of the specified options)
• PS: Pressure at the surface
• T2M: Temperature at 2 meters above ground
• QV2M: Specific humidity at 2 meters
• GWETTOP: Surface soil moisture
• T2M_MAX: Maximum temperature at 2 meters
• T2M_MIN: Minimum temperature at 2 meters
• Area under Plantation (Ha): Size of the plantation area
• blossom_rainfall: Rainfall during the blossom season
• monsoon_rainfall: Rainfall during the monsoon season
• post_monsoon_rainfall: Rainfall after the monsoon
• N: Nitrogen content in soil
• P: Phosphorus content in soil
• K: Potassium content in soil
• Species: Type of coffee (e.g., Arabica or Robusta)
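The three rainfall parameters are seasonal totals, whereas the rainfall source provides monthly values. A minimal sketch of that aggregation is shown below. The month windows (March to April for blossom, June to September for monsoon, October to December for post-monsoon) are illustrative assumptions, not the windows used in BREW, and the sample figures are made up.

```python
# Month windows are illustrative assumptions; the actual windows may differ.
SEASONS = {
    "blossom_rainfall": (3, 4),
    "monsoon_rainfall": (6, 7, 8, 9),
    "post_monsoon_rainfall": (10, 11, 12),
}


def seasonal_rainfall(monthly_mm):
    """Sum monthly rainfall (dict: month number -> mm) into seasonal features."""
    return {
        name: sum(monthly_mm.get(month, 0.0) for month in months)
        for name, months in SEASONS.items()
    }


# Made-up monthly totals for one district-year, in millimetres.
example = {1: 2.0, 2: 5.0, 3: 20.0, 4: 60.0, 5: 110.0, 6: 450.0,
           7: 700.0, 8: 420.0, 9: 180.0, 10: 160.0, 11: 70.0, 12: 15.0}
features = seasonal_rainfall(example)
print(features)
```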

The need for coffee yield prediction

Predicting the yield of coffee is nothing short of a blessing as far as farmers are concerned. Having accurate information about the yield before the harvest can benefit them in the following ways:

1. Optimised resource management: Knowing the expected yield enables farmers to use resources such as pesticides and water efficiently.

2. Tackling adverse climatic conditions: Farmers can adapt better to changing climatic conditions by gaining insight into how the weather is affecting the future yield.

3. Resisting oppression: Prior knowledge of the potential yield helps farmers negotiate better prices with buyers and relieves them of the stress of an uncertain yield.

4. Improved supply chain management: In a follow-up meeting, the Karnataka Planters’ Association emphasized how prior knowledge of yield could ensure that processing units, warehouses, exporters and farmers are better prepared to handle the crop.

Model

The model is trained on 15 parameters, covering land area, climatic conditions and soil quality, to predict coffee yield.

Key Steps:

1. Feature Engineering: Key features include land area, climatic data (e.g., rainfall, temperature), soil NPK values, and other factors. Bearing area may also be used as required.

2. Dataset Acquisition: The datasets have been obtained from multiple open-source websites. They are joined on the location and year values into a single dataframe.

3. Preprocessing: Categorical variables such as Species and Location are numerically encoded so that machine learning algorithms can use them. The numerical values are scaled and normalized to appropriate ranges. The target value is converted into production per unit area by dividing it by Area under Plantation.

4. Model Selection: The following models were tested with multiple hyperparameter settings using GridSearch:

• Linear Regression
• Ridge Regression
• Lasso Regression
• Decision Tree
• Random Forest Regressor
• Gradient Boosting
• XGBoost

5. Model Training and Evaluation: Training used a K-fold cross-validation approach, which gives more reliable metrics when data availability is limited. In each round the model is trained on (k-1) folds and evaluated on the remaining fold. This is repeated for every train-test split, which gives more generalizable metrics for the final output. The model output is multiplied by Area under Plantation to bring it back to the original scale.
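The per-area target and the K-fold loop described in the steps above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the data rows are made up, `fit_mean_model` is a hypothetical placeholder standing in for the regressors listed in step 4, and the fold-splitting helper is our own simplification rather than the code used in BREW.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def fit_mean_model(per_ha_targets):
    """Placeholder model: always predicts the mean training yield per hectare."""
    mean = sum(per_ha_targets) / len(per_ha_targets)
    return lambda row: mean


# Made-up rows: (area under plantation in ha, total yield in tonnes).
rows = [(100.0, 80.0), (250.0, 210.0), (50.0, 35.0),
        (400.0, 360.0), (150.0, 120.0), (300.0, 255.0)]

abs_errors = []
for train_idx, test_idx in k_fold_indices(len(rows), k=3):
    # Train on yield per hectare (total yield divided by area).
    model = fit_mean_model([rows[i][1] / rows[i][0] for i in train_idx])
    for i in test_idx:
        area, actual = rows[i]
        predicted = model(rows[i]) * area  # multiply back to the original scale
        abs_errors.append(abs(predicted - actual))

mean_abs_error = sum(abs_errors) / len(abs_errors)
print(f"Mean absolute error across folds: {mean_abs_error:.2f}")
```

Because every row is held out exactly once, the error is averaged over the whole dataset, which is what makes the metric more stable when only a small number of location-year rows are available.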

Figure(1) Random Forest Regressor Architecture

Results

The tree-based and gradient-boosting models achieved strong performance, with an R² score of 0.986. XGBoost and Random Forest delivered better results with lower error variance. Key features influencing the prediction include Area under Plantation, Species, and Location. Temperature, rainfall, and pressure were secondary contributors, while other factors showed minimal impact.

Figure(2) R² metric of various models in K-fold validation
Figure(3) Learned relative importance of input parameters by Random Forest
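For readers unfamiliar with the metric, R² compares the squared prediction error with the variance of the observed values: a score of 1 means perfect predictions, and 0 means the model does no better than always predicting the mean. The sketch below computes it by hand on made-up numbers; these values are not results from BREW.

```python
def r2_score(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Made-up yields (tonnes) and predictions, for illustration only.
actual = [80.0, 210.0, 35.0, 360.0]
predicted = [90.0, 200.0, 40.0, 350.0]
score = r2_score(actual, predicted)
print(f"R² = {score:.3f}")
```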

Workflow for Model Use

Users can run the model by following the steps outlined in the README file of the GitHub repository here. The model weights can be loaded and used with appropriate input data. Where input values are missing, the historical mean for that location can be substituted, though using actual data is recommended wherever possible.

Figure(4) End-to-End Model Pipeline
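The fallback for missing inputs, substituting the historical mean for the same location, could look like the sketch below. The record layout and the helper names `location_means` and `fill_missing` are hypothetical and the values are made up; the README remains the reference for the actual input format.

```python
from collections import defaultdict


def location_means(history, field):
    """Mean of `field` per location, ignoring records where it is missing."""
    totals = defaultdict(lambda: [0.0, 0])
    for record in history:
        value = record.get(field)
        if value is not None:
            totals[record["Location"]][0] += value
            totals[record["Location"]][1] += 1
    return {loc: total / count for loc, (total, count) in totals.items()}


def fill_missing(record, field, means):
    """Return a copy of `record` with a missing `field` set to the location mean."""
    filled = dict(record)
    if filled.get(field) is None:
        filled[field] = means[filled["Location"]]
    return filled


# Made-up historical records for illustration.
history = [
    {"Location": "Karnataka_Kodagu", "T2M": 22.0},
    {"Location": "Karnataka_Kodagu", "T2M": 24.0},
    {"Location": "Kerala_Wayanad", "T2M": 25.0},
]
means = location_means(history, "T2M")
new_input = fill_missing({"Location": "Karnataka_Kodagu", "T2M": None}, "T2M", means)
print(new_input)
```

Values that are already present are left untouched, so actual measurements always take priority over the historical mean.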

Contribution

To learn more about the Software team's contributions, visit the Contribution page.

Conclusion

Predicting coffee yield is of great importance to both farmers and the end consumers of coffee. The need for a machine learning model that predicts coffee yield accurately inspired us to build one for the drink we all know and love. From overcoming the challenges of procuring datasets for training, to tuning the R² score and comparing various ML models, this work is a significant step towards improving the quality of life of both farmers and end consumers. It also opens the door for other research groups and organizations to access organized, open-source data for further research and innovation in agriculture.

Future Direction

1. A good dataset is a crucial component of any machine learning application. Coffee-related datasets are scarce because India has very few large-scale coffee-producing states, so our next step would be to collect additional data such as yield, rainfall, temperature and soil fertility levels by contacting coffee estates and associations in the other coffee-growing states.

This would not only make the data more robust and comprehensive but also help train our model to fit data from the coffee-producing estates that contribute most to coffee production in the country.

2. This approach lays the foundation for applying the model to predict the yield of other crops as well, since most of the parameters remain the same.

3. In a meeting, the Coffee Board of India pointed out that any model used to predict coffee yield faces complications when deployed on a large scale, since coffee production also involves various unmeasurable factors. To overcome this, we aim to collaborate with them towards this common goal.

4. We aim to develop a graphical user interface (GUI) for the model and implement it as a website, so that the resource is accessible to farmers, researchers and future iGEM teams. They can use the model to support further innovation or research.
