Chili Quality Classification
Built a two-stage XGBoost classification pipeline to automate agricultural quality control. Performed exploratory data analysis (EDA) and compared decision tree, random forest (RF), SVM, and MLP models before selecting XGBoost for its higher accuracy (~83%) and macro F1 score (~82%). Emphasized the end-to-end data science workflow over simple model API calls.
Two-Stage XGBoost Classification of Chili Fruit Grades
Predicting chili fruit quality before harvest using vegetative plant features. A hierarchical machine learning approach to reduce subjective manual grading and support data-driven agriculture.
Accuracy: 83%
Macro F1: 82%
Macro AUC: 86%
Published At: ICoDSA 2025
Why Pre-Harvest Prediction Matters
In traditional chili farming, grading is performed manually post-harvest. This process is inherently inefficient, highly subjective, and prevents farmers from making proactive interventions.
By leveraging weekly vegetative features—such as plant height, leaf count, and stem diameter—this project aimed to predict final fruit quality *before* harvest. This shift enables early interventions, reduces waste, and standardizes quality control through data.
Overlapping Characteristics
Class Imbalance
Significant disproportion between top-tier grades and lower grades in natural agricultural datasets.
Feature Overlap
Grade A and Grade B plants exhibit almost identical vegetative characteristics during early weeks, confusing flat classifiers.
Interpretability
The solution needed to be transparent enough for agricultural stakeholders to trust the predictions.
Hierarchical Two-Stage XGBoost
To combat the overlapping features between high-quality grades, we abandoned flat multi-class models in favor of a robust hierarchical decision structure.
Hierarchical Two-Stage Classification Pipeline
Input Features: Height, Stem, Leaf, Wk
Preprocessing: RobustScaler + Interact
Stage 1 Classifier: XGBoost (C vs Non-C)
  Grade C (Final Output)
  Non-C Candidates (Intermediate), passed to Stage 2
Stage 2 Classifier: XGBoost (A vs B)
  Grade A (Final Output)
  Grade B (Final Output)
Stage 1: Grade C Separation
The first classifier isolates low-quality plants to simplify downstream classification. Because Grade C plants exhibit significantly stunted vegetative growth early on, this stage achieves very high precision and prevents them from skewing the A vs B decision boundary.
Stage 2: Grade A vs B Nuance
The second classifier focuses on distinguishing visually similar higher-quality grades. Once Grade C is filtered out, this specialized model leverages complex engineered interaction features to differentiate the subtle overlaps between premium Grade A and standard Grade B plants.
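The routing between the two stages can be sketched as follows. This is a minimal illustration, not the project's code: `TwoStageGrader` and `MeanThresholdStub` are hypothetical names, the stub stands in for an XGBoost model only to keep the sketch free of dependencies, and the heights are made-up values.

```python
from typing import Callable, Protocol, Sequence


class BinaryClassifier(Protocol):
    def fit(self, X: Sequence[Sequence[float]], y: Sequence[int]) -> object: ...
    def predict(self, X: Sequence[Sequence[float]]) -> Sequence[int]: ...


class TwoStageGrader:
    """Stage 1: C vs non-C. Stage 2: A vs B on the rows stage 1 passes through."""

    def __init__(self, make_model: Callable[[], BinaryClassifier]):
        self.stage1 = make_model()
        self.stage2 = make_model()

    def fit(self, X, grades):
        # Stage 1 sees every plant; label 1 means Grade C.
        self.stage1.fit(X, [1 if g == "C" else 0 for g in grades])
        # Stage 2 is trained only on true A/B plants; label 1 means Grade A.
        ab_rows = [(x, g) for x, g in zip(X, grades) if g != "C"]
        self.stage2.fit([x for x, _ in ab_rows],
                        [1 if g == "A" else 0 for _, g in ab_rows])
        return self

    def predict(self, X):
        is_c = list(self.stage1.predict(X))
        passed = [i for i, c in enumerate(is_c) if c == 0]
        out = ["C"] * len(X)
        if passed:
            # Only non-C candidates ever reach the A vs B model.
            ab = self.stage2.predict([X[i] for i in passed])
            for i, a in zip(passed, ab):
                out[i] = "A" if a == 1 else "B"
        return out


class MeanThresholdStub:
    """Stand-in model: splits feature 0 at the midpoint of the two class means."""

    def fit(self, X, y):
        pos = [x[0] for x, t in zip(X, y) if t == 1]
        neg = [x[0] for x, t in zip(X, y) if t == 0]
        self.pos_mean = sum(pos) / len(pos)
        self.neg_mean = sum(neg) / len(neg)
        return self

    def predict(self, X):
        mid = (self.pos_mean + self.neg_mean) / 2
        pos_is_high = self.pos_mean > self.neg_mean
        return [int((x[0] > mid) == pos_is_high) for x in X]


# Made-up plant heights (cm): short plants are C, tall ones A.
X = [[10.0], [12.0], [30.0], [32.0], [40.0], [42.0]]
grades = ["C", "C", "B", "B", "A", "A"]
model = TwoStageGrader(MeanThresholdStub).fit(X, grades)
print(model.predict([[11.0], [31.0], [41.0]]))  # ['C', 'B', 'A']
```

In a real pipeline the factory passed to `TwoStageGrader` would presumably construct an XGBoost classifier instead of the stub; inputs may then need converting to arrays.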
Feature Engineering & Processing
Vegetative Features
Raw data collected weekly during the vegetative phase.
Preprocessing Logic
Normalization: MinMax scaling applied to continuous variables to ensure stable boosting gradients.
Imbalance Handling: SMOTETomek used to synthesize minority-class samples (SMOTE) and then remove Tomek links, which are noisy borderline pairs of opposite-class neighbors.
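As a rough illustration of the normalization step, min-max scaling maps each column to [0, 1] using bounds learned on the training split only. The helper names and sample values below are assumptions for the sketch; a library scaler would normally do this job, and any resampling would typically be fitted on the training split as well.

```python
def fit_min_max(rows):
    """Return per-column (min, max) bounds learned from the training rows."""
    columns = list(zip(*rows))
    return [(min(c), max(c)) for c in columns]


def apply_min_max(rows, bounds):
    """Scale each column to [0, 1] with previously learned bounds."""
    scaled = []
    for row in rows:
        scaled.append([
            (v - lo) / (hi - lo) if hi > lo else 0.0  # constant column maps to 0
            for v, (lo, hi) in zip(row, bounds)
        ])
    return scaled


# Made-up training rows: [height_cm, stem_diameter_mm]
train = [[20.0, 4.0], [40.0, 6.0], [30.0, 5.0]]
bounds = fit_min_max(train)
print(apply_min_max(train, bounds))  # [[0.0, 0.0], [1.0, 1.0], [0.5, 0.5]]
```

Reusing `bounds` on validation or production rows keeps information from unseen data out of the scaler.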
Feature Correlation Analysis

Strong correlation between plant height, leaf count, stem diameter, and observation week guided the creation of interaction features like Height_x_Diameter and Growth_Interaction.
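A sketch of how such interaction features might be derived is shown below. Height_x_Diameter follows directly from its name; the formula used for Growth_Interaction is purely an assumption, since only the feature name is given here, and the input keys and units are hypothetical.

```python
def add_interactions(row):
    """Return a copy of `row` with two interaction features added.

    `row` uses hypothetical keys: height_cm, stem_diameter_mm, leaf_count, week.
    """
    out = dict(row)
    # Product of two correlated size measurements.
    out["Height_x_Diameter"] = row["height_cm"] * row["stem_diameter_mm"]
    # ASSUMED definition: size-times-foliage signal normalized by plant age.
    out["Growth_Interaction"] = row["height_cm"] * row["leaf_count"] / row["week"]
    return out


sample = {"height_cm": 30.0, "stem_diameter_mm": 5.0, "leaf_count": 12, "week": 4}
features = add_interactions(sample)
print(features["Height_x_Diameter"], features["Growth_Interaction"])  # 150.0 90.0
```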
Benchmarking Performance
Extensive experimentation was conducted comparing flat baseline models against the hierarchical approach.
Performance Metric Comparison (figure): Flat XGBoost Baseline, MLP Baseline, Two-Stage XGBoost

The two-stage hierarchical approach outperforms the flat baselines: isolating Grade C first lets the second stage concentrate on the feature overlap between Grade A and Grade B.
Final Model Analytics
Detailed breakdown of the final Two-Stage XGBoost model's discriminatory power across all three agricultural quality grades.
Confusion Matrix Analysis

Grade A was classified most consistently, while Grade B remained the most challenging due to persistent feature overlap. Grade C was strongly isolated by Stage 1.
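For readers less familiar with the metric, macro F1 is the unweighted mean of the per-class F1 scores, so a weak class pulls the average down regardless of its size. The sketch below computes it from a confusion matrix; the counts are invented for illustration and are not the project's results.

```python
def macro_f1(confusion, labels):
    """confusion[i][j] = number of samples of true class i predicted as class j."""
    per_class = {}
    for k, label in enumerate(labels):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp                  # rest of the true-class row
        fp = sum(row[k] for row in confusion) - tp   # rest of the predicted column
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        per_class[label] = (2 * precision * recall / (precision + recall)
                            if precision + recall else 0.0)
    return sum(per_class.values()) / len(per_class), per_class


# Invented counts; rows are true grades A, B, C and columns are predictions.
toy = [[8, 2, 0],
       [2, 6, 2],
       [0, 0, 10]]
score, per_class = macro_f1(toy, ["A", "B", "C"])
print(round(score, 3), {k: round(v, 3) for k, v in per_class.items()})
```

In this toy matrix the middle class has the lowest F1 because it loses samples to both neighbors, which is the kind of pattern feature overlap tends to produce.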
Class Discrimination (ROC-AUC)

The final model achieved strong class discrimination across all tiers, providing reliable probabilistic outputs rather than just hard classifications.
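One plausible way to obtain three-class probabilities from two binary stages is the chain rule: Stage 1 gives P(C), and Stage 2 gives P(A | not C). How the stage outputs were actually merged is not stated here, so the function below is an assumption, and `combine_stage_probs` is a hypothetical name.

```python
def combine_stage_probs(p_c, p_a_given_not_c):
    """Chain rule: P(A) = P(not C) * P(A | not C), and likewise for B."""
    p_not_c = 1.0 - p_c
    return {
        "A": p_not_c * p_a_given_not_c,
        "B": p_not_c * (1.0 - p_a_given_not_c),
        "C": p_c,
    }


# Stage 1 says 20% chance of Grade C; Stage 2 says 75% chance of A given non-C.
probs = combine_stage_probs(0.2, 0.75)
print({grade: round(p, 3) for grade, p in probs.items()})
```

The three values always sum to 1, so they can be fed to a one-vs-rest ROC-AUC computation per grade.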
XGBoost Feature Importance (Gain)

Growth-related engineered features (Growth Rate, Daun per Minggu, i.e. leaves per week) contributed strongly to the model's decisions, supporting the hypothesis that temporal vegetative changes are the strongest indicators of eventual fruit quality.
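Temporal features of this kind can be derived from consecutive weekly measurements. The sketch below assumes simple per-week differences; the exact definitions of Growth Rate and Daun per Minggu are not given here, and the function name and sample values are hypothetical.

```python
def weekly_rates(measurements):
    """measurements: list of (week, height_cm, leaf_count), sorted by week."""
    rates = []
    for (w0, h0, l0), (w1, h1, l1) in zip(measurements, measurements[1:]):
        weeks = w1 - w0  # handles skipped observation weeks
        rates.append({
            "week": w1,
            "growth_rate": (h1 - h0) / weeks,      # assumed: cm gained per week
            "daun_per_minggu": (l1 - l0) / weeks,  # assumed: leaves gained per week
        })
    return rates


# Made-up plant observed in weeks 1, 2 and 4.
plant = [(1, 10.0, 4), (2, 14.0, 6), (4, 24.0, 12)]
rates = weekly_rates(plant)
print(rates)
```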
From Research to Application
The final Two-Stage XGBoost model was exported and wrapped into a production-ready Flask web application, allowing end-users to input weekly measurements and receive real-time grading predictions.

ICoDSA 2025 Publication
This research was formally published and presented at the International Conference on Data Science and Its Applications.


Translating agricultural complexity into explainable logic.
This project demonstrated the value of breaking a complex classification problem into hierarchical, logical steps. While the MLP baseline struggled to separate overlapping features and offered little interpretability, the tree-based hierarchy improved accuracy and did so in a way that could be explained to agricultural stakeholders. Balancing rigorous statistical experimentation with the practical work of building a usable deployment interface was the core success of this project.