Chili Quality Classification

Python, XGBoost, Scikit-learn, Pandas, EDA

Built a two-stage XGBoost classification pipeline to automate agricultural quality control. Conducted thorough exploratory data analysis (EDA) and compared decision tree, random forest (RF), SVM, and MLP models before selecting XGBoost for its ~83% accuracy and ~82% macro F1 score. The project emphasizes an end-to-end data science workflow rather than calls to model APIs alone.

Precision Agriculture & Machine Learning

Two-Stage XGBoost Classification of Chili Fruit Grades

Predicting chili fruit quality before harvest using vegetative plant features. A hierarchical machine learning approach to reduce subjective manual grading and support data-driven agriculture.

Accuracy: 83%
Macro F1: 82%
Macro AUC: 86%
Published at: ICoDSA 2025

The Context

Why Pre-Harvest Prediction Matters

In traditional chili farming, grading is performed manually post-harvest. This process is inherently inefficient, highly subjective, and prevents farmers from making proactive interventions.

By leveraging weekly vegetative features—such as plant height, leaf count, and stem diameter—this project aimed to predict final fruit quality *before* harvest. This shift enables early interventions, reduces waste, and standardizes quality control through data.

The Challenge

Overlapping Characteristics

Class Imbalance: Significant disproportion between top-tier grades and lower grades in natural agricultural datasets.

Feature Overlap: Grade A and Grade B plants exhibit almost identical vegetative characteristics during early weeks, confusing flat classifiers.

Interpretability: The solution needed to be transparent enough for agricultural stakeholders to trust the predictions.

Architecture

Hierarchical Two-Stage XGBoost

To combat the overlapping features between high-quality grades, we abandoned flat multi-class models in favor of a robust hierarchical decision structure.

Hierarchical Two-Stage Classification Pipeline

Input Features: Height, Stem, Leaf, Wk
Preprocessing: RobustScaler + Interact
Stage 1 Classifier: XGBoost (C vs Non-C)
    Grade C: Final Output
    Non-C Candidates: Intermediate
Stage 2 Classifier: XGBoost (A vs B)
    Grade A: Final Output
    Grade B: Final Output

Stage 1: Grade C Separation

The first classifier isolates low-quality plants to simplify downstream classification. Because Grade C plants exhibit significantly stunted vegetative growth early on, this stage achieves very high precision and prevents them from skewing the A vs B decision boundary.

Stage 2: Grade A vs B Nuance

The second classifier focuses on distinguishing visually similar higher-quality grades. Once Grade C is filtered out, this specialized model leverages complex engineered interaction features to differentiate the subtle overlaps between premium Grade A and standard Grade B plants.
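The routing described in the two stages above can be sketched in a few lines. This is a minimal illustration, not the project's code: the class name `TwoStageClassifier`, the 0/1 label encodings, and the toy `ThresholdModel` stand-in are all assumptions made for the example. In practice each stage would be a trained estimator with scikit-learn style `fit` and `predict` methods (for example an `xgboost.XGBClassifier`), and inputs might need converting to arrays first.

```python
class TwoStageClassifier:
    """Route samples through a C-vs-non-C model, then an A-vs-B model."""

    def __init__(self, stage1, stage2):
        # Assumed encodings: stage1 predicts 1 for Grade C, 0 for non-C;
        # stage2 predicts 1 for Grade A, 0 for Grade B.
        self.stage1 = stage1
        self.stage2 = stage2

    def fit(self, X, y):
        # Stage 1 sees every sample with a binary "is Grade C" target.
        self.stage1.fit(X, [1 if label == "C" else 0 for label in y])
        # Stage 2 is trained only on the non-C samples.
        non_c = [(row, label) for row, label in zip(X, y) if label != "C"]
        self.stage2.fit(
            [row for row, _ in non_c],
            [1 if label == "A" else 0 for _, label in non_c],
        )
        return self

    def predict(self, X):
        is_c = list(self.stage1.predict(X))
        # Only non-C candidates are passed on to Stage 2.
        candidates = [row for row, flag in zip(X, is_c) if not flag]
        stage2_out = iter(self.stage2.predict(candidates)) if candidates else iter(())
        return ["C" if flag else ("A" if next(stage2_out) else "B") for flag in is_c]


class ThresholdModel:
    """Hypothetical stand-in for a trained classifier: thresholds one feature."""

    def __init__(self, index, threshold, positive_if_below):
        self.index = index
        self.threshold = threshold
        self.positive_if_below = positive_if_below

    def fit(self, X, y):
        return self  # nothing to learn in this stand-in

    def predict(self, X):
        return [
            int((row[self.index] < self.threshold) == self.positive_if_below)
            for row in X
        ]


# Illustrative rows of [height, stem diameter]; not project data.
X = [[20, 0.5], [50, 0.9], [45, 0.6]]
y = ["C", "A", "B"]
model = TwoStageClassifier(
    ThresholdModel(index=0, threshold=30, positive_if_below=True),
    ThresholdModel(index=1, threshold=0.8, positive_if_below=False),
).fit(X, y)
predictions = model.predict(X)
```

Keeping the two estimators separate means each can have its own features, resampling, and hyperparameters, which is the main practical reason to prefer this structure over a single three-class model.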

Data Pipeline

Feature Engineering & Processing

Vegetative Features

Raw data collected weekly during the vegetative phase.

Plant Height, Stem Diameter, Leaf Count, Observation Week

Preprocessing Logic

Normalization: MinMax scaling applied to continuous variables to ensure stable boosting gradients.
Imbalance Handling: SMOTETomek used to synthesize minority classes while removing noisy Tomek links.
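As a small illustration of the normalization step, min-max scaling maps each value x to (x - min) / (max - min). The sketch below is written in plain Python for clarity; the function name `min_max_scale` and the sample heights are assumptions, and a real pipeline would more likely use a library scaler fitted on the training split only.

```python
def min_max_scale(values):
    """Rescale a list of numbers to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # Avoid division by zero when every value is identical.
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


heights_cm = [12.0, 18.0, 30.0]  # illustrative values, not project data
scaled = min_max_scale(heights_cm)
```

For the imbalance step, the imbalanced-learn library provides `imblearn.combine.SMOTETomek`. Resampling of this kind is normally applied to the training data only, so that synthetic samples do not leak into evaluation.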

Feature Correlation Analysis

Strong correlation between plant height, leaf count, stem diameter, and observation week guided the creation of interaction features like Height_x_Diameter and Growth_Interaction.
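A rough sketch of how such interaction features might be derived from one plant's weekly records follows. Only the name Height_x_Diameter comes from the text above, and its definition as a simple product is inferred from that name. The `growth_rate` column, the record keys, and the sample numbers are assumptions for illustration; the exact definition of Growth_Interaction is not given here, so it is not reproduced.

```python
def add_interaction_features(records):
    """Return copies of one plant's weekly records with derived features.

    Assumes each record has 'week', 'height', and 'stem_diameter' keys and
    that week numbers are unique for the plant.
    """
    ordered = sorted(records, key=lambda r: r["week"])
    out = []
    prev = None
    for rec in ordered:
        row = dict(rec)
        # Product of two correlated measurements (inferred from the name).
        row["Height_x_Diameter"] = rec["height"] * rec["stem_diameter"]
        if prev is None:
            # No earlier observation to compare against.
            row["growth_rate"] = 0.0
        else:
            weeks_elapsed = rec["week"] - prev["week"]
            row["growth_rate"] = (rec["height"] - prev["height"]) / weeks_elapsed
        out.append(row)
        prev = rec
    return out


# Illustrative measurements for a single plant; not project data.
plant = [
    {"week": 1, "height": 10.0, "stem_diameter": 0.5},
    {"week": 2, "height": 15.0, "stem_diameter": 0.5},
    {"week": 4, "height": 25.0, "stem_diameter": 1.0},
]
features = add_interaction_features(plant)
```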

Experimentation

Benchmarking Performance

Extensive experimentation was conducted comparing flat baseline models against the hierarchical approach.

Flat XGBoost Baseline
Accuracy: 67%
Macro F1: 67%
Macro AUC: 83%

MLP Baseline
Accuracy: 67%
Macro F1: 67%
Macro AUC: 85%

Two-Stage XGBoost (Final Selection)
Accuracy: 83%
Macro F1: 82%
Macro AUC: 86%

Performance Metric Comparison

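The two-stage hierarchical approach outperforms both flat baselines, raising accuracy from 67% to 83% and macro F1 from 67% to 82%, by isolating Grade C first and then untangling the feature overlap between Grade A and Grade B. Macro AUC improves more modestly, from 83% (flat XGBoost) and 85% (MLP) to 86%.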
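Because the grades are imbalanced, macro F1 is reported alongside accuracy: it averages the per-class F1 scores with equal weight, so a weak minority class is not hidden by a strong majority class. The helper below is a plain-Python sketch of that calculation; the function name and the toy labels are assumptions and do not reproduce the project's results.

```python
def accuracy_and_macro_f1(y_true, y_pred):
    """Return (accuracy, macro F1) for two equal-length label sequences."""
    pairs = list(zip(y_true, y_pred))
    labels = sorted(set(y_true) | set(y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)

    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0.0 when the class never appears.
        f1_scores.append(2 * tp / denom if denom else 0.0)

    # Macro average: every class counts equally, regardless of its size.
    return accuracy, sum(f1_scores) / len(f1_scores)


# Toy labels for illustration only.
y_true = ["A", "A", "B", "B", "C", "C"]
y_pred = ["A", "B", "B", "B", "C", "C"]
accuracy, macro_f1 = accuracy_and_macro_f1(y_true, y_pred)
```

scikit-learn's `f1_score` with `average="macro"` computes the same quantity and would be the usual choice in a real pipeline.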

Evaluation

Final Model Analytics

Detailed breakdown of the final Two-Stage XGBoost model's discriminatory power across all three agricultural quality grades.

Confusion Matrix Analysis

Grade A was classified most consistently, while Grade B remained the most challenging due to persistent feature overlap. Grade C was strongly isolated by Stage 1.

Class Discrimination (ROC-AUC)

The final model achieved strong class discrimination across all tiers, providing reliable probabilistic outputs rather than just hard classifications.

Grade A AUC: 0.91

Grade B AUC: 0.86

Grade C AUC: 0.82
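The headline Macro AUC of 86% is consistent with an unweighted mean of the three per-class AUCs above. The short check below assumes that definition (one-vs-rest AUC per grade, averaged with equal weight); the variable names are illustrative.

```python
# Per-class AUC values as reported above.
per_class_auc = {"A": 0.91, "B": 0.86, "C": 0.82}

# Macro average: each grade contributes equally, regardless of class size.
macro_auc = sum(per_class_auc.values()) / len(per_class_auc)

print(f"Macro AUC: {macro_auc:.3f}")  # prints "Macro AUC: 0.863"
```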

XGBoost Feature Importance (Gain)

Growth-related engineered features (Growth Rate and Daun per Minggu, i.e. leaves per week) contributed strongly to the model's decisions, supporting the hypothesis that temporal vegetative changes are the strongest indicators of eventual fruit quality.

Deployment

From Research to Application

The final Two-Stage XGBoost model was exported and wrapped into a production-ready Flask web application, allowing end-users to input weekly measurements and receive real-time grading predictions.

localhost:5000/predict

Peer Reviewed

ICoDSA 2025 Publication

This research was formally published and presented at the International Conference on Data Science and Its Applications.

Impact & Reflection

Translating agricultural complexity into explainable logic.

This project demonstrated the value of breaking a complex classification problem into hierarchical, logical steps. While the MLP baseline struggled to untangle overlapping features inside a black box, the tree-based hierarchy raised accuracy from 67% to 83% and did so in a way that could be explained to agricultural stakeholders. Balancing rigorous statistical experimentation with the practical work of building a usable deployment interface was the core success of this project.

Python

XGBoost

Scikit-learn

Flask

Pandas

NumPy

Matplotlib

SMOTETomek
