Chili Quality Classification

Python, XGBoost, Scikit-learn, Pandas, EDA

Built a two-stage XGBoost classification pipeline to automate agricultural quality control. Conducted thorough exploratory data analysis (EDA) and compared decision tree, random forest (RF), SVM, and MLP models before selecting the two-stage XGBoost for its superior ~83% accuracy and ~82% macro F1 score. Emphasized end-to-end data science workflows over simple model API calls.

Precision Agriculture & Machine Learning

Two-Stage XGBoost Classification of Chili Fruit Grades

Predicting chili fruit quality before harvest using vegetative plant features. A hierarchical machine learning approach to reduce subjective manual grading and support data-driven agriculture.

Accuracy

83%

Macro F1

82%

Macro AUC

86%

Published At

ICoDSA 2025

The Context

Why Pre-Harvest Prediction Matters

In traditional chili farming, grading is performed manually post-harvest. This process is inherently inefficient, highly subjective, and prevents farmers from making proactive interventions.

By leveraging weekly vegetative features—such as plant height, leaf count, and stem diameter—this project aimed to predict final fruit quality *before* harvest. This shift enables early interventions, reduces waste, and standardizes quality control through data.

The Challenge

Overlapping Characteristics

  • Class Imbalance

    Significant disproportion between top-tier grades and lower grades in natural agricultural datasets.

  • Feature Overlap

    Grade A and Grade B plants exhibit almost identical vegetative characteristics during early weeks, confusing flat classifiers.

  • Interpretability

    The solution needed to be transparent enough for agricultural stakeholders to trust the predictions.

Architecture

Hierarchical Two-Stage XGBoost

To combat the overlapping features between high-quality grades, we abandoned flat multi-class models in favor of a robust hierarchical decision structure.

Hierarchical Two-Stage Classification Pipeline

Input Features (Height, Stem, Leaf, Wk) -> Preprocessing (RobustScaler + interaction features) -> Stage 1 Classifier: XGBoost (C vs Non-C)

  • Grade C: Final Output

  • Non-C (Intermediate) -> Stage 2 Classifier: XGBoost (A vs B)

    Grade A: Final Output

    Grade B: Final Output

Stage 1: Grade C Separation

The first classifier isolates low-quality plants to simplify downstream classification. Because Grade C plants exhibit significantly stunted vegetative growth early on, this stage achieves very high precision and prevents them from skewing the A vs B decision boundary.

Stage 2: Grade A vs B Nuance

The second classifier focuses on distinguishing visually similar higher-quality grades. Once Grade C is filtered out, this specialized model leverages complex engineered interaction features to differentiate the subtle overlaps between premium Grade A and standard Grade B plants.
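The routing between the two stages can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual code: the class name TwoStageClassifier, the ThresholdStub stand-in model, the label encoding, and the toy data are all hypothetical. Any estimator with fit(X, y) and predict(X), such as xgboost.XGBClassifier, could be passed in place of the stubs.

```python
import numpy as np


class TwoStageClassifier:
    """Route samples through a C-vs-non-C model, then an A-vs-B model.

    Both stages can be any estimator exposing fit(X, y) and predict(X),
    for example xgboost.XGBClassifier instances.
    """

    def __init__(self, stage1, stage2):
        self.stage1 = stage1  # binary target: 1 = Grade C, 0 = non-C
        self.stage2 = stage2  # binary target: 1 = Grade A, 0 = Grade B

    def fit(self, X, y):
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        # Stage 1 sees every plant and learns C vs. everything else.
        self.stage1.fit(X, (y == "C").astype(int))
        # Stage 2 is trained only on plants whose true grade is A or B.
        non_c = y != "C"
        self.stage2.fit(X[non_c], (y[non_c] == "A").astype(int))
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        grades = np.full(len(X), "C", dtype=object)
        # Only samples that stage 1 labels non-C are passed to stage 2.
        non_c = np.asarray(self.stage1.predict(X)) == 0
        if non_c.any():
            is_a = np.asarray(self.stage2.predict(X[non_c])) == 1
            grades[non_c] = np.where(is_a, "A", "B")
        return grades


class ThresholdStub:
    """Hypothetical stand-in model: predicts 1 on one side of a fixed cut."""

    def __init__(self, column, cut, positive_above):
        self.column, self.cut, self.positive_above = column, cut, positive_above

    def fit(self, X, y):
        return self  # nothing to learn; the cut is fixed for the demo

    def predict(self, X):
        above = X[:, self.column] > self.cut
        return (above if self.positive_above else ~above).astype(int)


# Toy rows of [plant_height_cm, leaf_count]; values are made up for illustration.
X_toy = np.array([[20.0, 8], [55.0, 30], [42.0, 18], [18.0, 6]])
y_toy = np.array(["C", "A", "B", "C"])

model = TwoStageClassifier(
    stage1=ThresholdStub(column=0, cut=30.0, positive_above=False),  # short => C
    stage2=ThresholdStub(column=0, cut=50.0, positive_above=True),   # tall => A
).fit(X_toy, y_toy)
predicted = model.predict(X_toy)
```

One consequence of this design is that a Stage 1 mistake cannot be corrected later: a plant wrongly labeled Grade C never reaches Stage 2, so Stage 1 quality bounds the whole pipeline.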

Data Pipeline

Feature Engineering & Processing

Vegetative Features

Raw data collected weekly during the vegetative phase.

Plant Height, Stem Diameter, Leaf Count, Observation Week

Preprocessing Logic

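  • Normalization: MinMax scaling applied to continuous variables so that features share a comparable range. This matters mainly for the scale-sensitive SVM and MLP baselines and for the distance-based SMOTETomek resampling; tree-based boosting is largely insensitive to monotonic feature rescaling.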

  • Imbalance Handling: SMOTETomek used to synthesize minority-class samples (SMOTE) and then remove Tomek links, i.e. nearest-neighbor pairs from opposite classes that blur the class boundary.
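To make the oversampling idea concrete, the sketch below shows only the interpolation step at the heart of SMOTE: a synthetic point is placed on the segment between a minority sample and one of its nearest minority neighbors. The function name smote_like_samples and the toy points are hypothetical, the Tomek-link cleaning step is omitted, and in practice a library implementation such as imblearn.combine.SMOTETomek would normally be used instead.

```python
import numpy as np


def smote_like_samples(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    minority = np.asarray(minority, dtype=float)
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        # Distances from the chosen sample to every minority sample.
        dists = np.linalg.norm(minority - minority[i], axis=1)
        # Position 0 of the sort is the sample itself (distance 0); keep the next k.
        neighbours = np.argsort(dists)[1:k + 1]
        j = rng.choice(neighbours)
        lam = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(minority[i] + lam * (minority[j] - minority[i]))
    return np.array(synthetic)


# Toy minority-class points (already scaled to [0, 1]); values are made up.
minority_points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_points = smote_like_samples(minority_points, n_new=5)
```

Because every synthetic point lies between two existing minority samples, the new data stays inside the region the minority class already occupies. Resampling of this kind is normally applied to the training split only, so that the evaluation data keeps its natural class balance.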

Feature Correlation Analysis

Feature Correlation Heatmap

Strong correlation between plant height, leaf count, stem diameter, and observation week guided the creation of interaction features like Height_x_Diameter and Growth_Interaction.
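Features of this kind could be derived roughly as shown below. This is a sketch under assumptions: Height_x_Diameter is taken to be the plain product of height and stem diameter (the name suggests this, but the source does not define it), the two growth-rate columns are hypothetical illustrations of week-over-week change, and Growth_Interaction is left out because its definition is not given. The dictionary keys and sample values are made up.

```python
def add_engineered_features(rows):
    """Add an interaction term and week-over-week growth rates to weekly rows.

    Assumes one row per plant per week; each row is a dict with the
    hypothetical keys plant_id, week, height, diameter, and leaves.
    """
    last_seen = {}  # most recent earlier row for each plant
    enriched = []
    for row in sorted(rows, key=lambda r: (r["plant_id"], r["week"])):
        new_row = dict(row)
        # Interaction term: plants that are both tall and thick get a larger value.
        new_row["Height_x_Diameter"] = row["height"] * row["diameter"]
        prev = last_seen.get(row["plant_id"])
        if prev is None:
            # The first observation of a plant has no earlier week to compare to.
            new_row["height_growth_per_week"] = 0.0
            new_row["leaves_per_week"] = 0.0
        else:
            weeks = row["week"] - prev["week"]
            new_row["height_growth_per_week"] = (row["height"] - prev["height"]) / weeks
            new_row["leaves_per_week"] = (row["leaves"] - prev["leaves"]) / weeks
        last_seen[row["plant_id"]] = row
        enriched.append(new_row)
    return enriched


# Two made-up observations of one plant, deliberately given out of order.
observations = [
    {"plant_id": "P1", "week": 3, "height": 16.0, "diameter": 0.5, "leaves": 10},
    {"plant_id": "P1", "week": 1, "height": 10.0, "diameter": 0.5, "leaves": 4},
]
features = add_engineered_features(observations)
```

Sorting by plant and week before differencing keeps the growth rates from mixing measurements of different plants or being computed in the wrong temporal order.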

Experimentation

Benchmarking Performance

Extensive experimentation was conducted comparing flat baseline models against the hierarchical approach.

Model                                 Accuracy   Macro F1   Macro AUC
Flat XGBoost Baseline                 67%        67%        83%
MLP Baseline                          67%        67%        85%
Two-Stage XGBoost (FINAL SELECTION)   83%        82%        86%

Performance Metric Comparison

Model Performance Comparison Chart

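The two-stage hierarchical approach clearly outperforms the flat baselines on accuracy and macro F1 (83% and 82% versus 67%), while macro AUC improves only slightly (86% versus 83% to 85%). The gain comes from untangling the feature overlap between Grade A and Grade B after isolating Grade C.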
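For readers unfamiliar with the metrics, the sketch below shows how accuracy and macro F1 can be computed from a confusion matrix. The function names are hypothetical and the matrix is made-up illustrative data, not the project's results. Macro F1 averages the per-class F1 scores without weighting by class size, which is why it is a common choice when classes are imbalanced.

```python
def accuracy(confusion):
    """Share of samples on the diagonal of a square confusion matrix."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total


def macro_f1(confusion):
    """Unweighted mean of per-class F1 scores.

    confusion[i][j] counts samples of true class i predicted as class j.
    """
    n = len(confusion)
    scores = []
    for c in range(n):
        tp = confusion[c][c]
        fp = sum(confusion[r][c] for r in range(n)) - tp  # predicted c, truly other
        fn = sum(confusion[c]) - tp                       # truly c, predicted other
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n


# Made-up counts for grades A, B, C (rows = true grade, columns = predicted grade).
example_matrix = [
    [8, 2, 0],
    [2, 6, 2],
    [0, 1, 9],
]
example_accuracy = accuracy(example_matrix)
example_macro_f1 = macro_f1(example_matrix)
```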

Evaluation

Final Model Analytics

Detailed breakdown of the final Two-Stage XGBoost model's discriminatory power across all three agricultural quality grades.

Confusion Matrix Analysis

Final Model Confusion Matrix

Grade A was classified most consistently, while Grade B remained the most challenging due to persistent feature overlap. Grade C was strongly isolated by Stage 1.

Class Discrimination (ROC-AUC)

ROC-AUC Analysis

The final model achieved strong class discrimination across all tiers, providing reliable probabilistic outputs rather than just hard classifications.

Grade A AUC: 0.91
Grade B AUC: 0.86
Grade C AUC: 0.82
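As a quick consistency check, the reported macro AUC can be reproduced from the per-class values above, assuming it is the unweighted mean of the three one-vs-rest AUCs (the source does not state the averaging method):

```python
# Per-class AUC values as reported above.
per_class_auc = {"A": 0.91, "B": 0.86, "C": 0.82}

# Macro average: every grade counts equally, regardless of class size.
macro_auc = sum(per_class_auc.values()) / len(per_class_auc)
print(f"Macro AUC: {macro_auc:.3f}")  # prints "Macro AUC: 0.863"
```

Under that assumption the result rounds to the 86% macro AUC reported for the final model.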

XGBoost Feature Importance (Gain)

Feature Importance Analysis

Growth-related engineered features (Growth Rate, Daun per Minggu [leaves per week]) contributed strongly to the model’s decisions, supporting the hypothesis that temporal vegetative changes are among the strongest indicators of eventual fruit quality.

Deployment

From Research to Application

The final Two-Stage XGBoost model was exported and wrapped into a production-ready Flask web application, allowing end-users to input weekly measurements and receive real-time grading predictions.
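The request handling behind such a prediction endpoint might look roughly like the sketch below. It is written as plain functions so it stays independent of the web framework, and it is an illustration only: the field names, the handle_predict and parse_measurements helpers, and the AlwaysB stand-in model are all hypothetical, and the real application's input format is not described in the source.

```python
# Hypothetical input field names; the real app's form fields are not documented.
REQUIRED_FIELDS = ("week", "plant_height", "stem_diameter", "leaf_count")


def parse_measurements(payload):
    """Validate a JSON-like dict and return the features in a fixed order."""
    missing = [name for name in REQUIRED_FIELDS if name not in payload]
    if missing:
        raise ValueError(f"missing fields: {', '.join(missing)}")
    values = []
    for name in REQUIRED_FIELDS:
        try:
            value = float(payload[name])
        except (TypeError, ValueError):
            raise ValueError(f"{name} must be numeric") from None
        if value < 0:
            raise ValueError(f"{name} must not be negative")
        values.append(value)
    return values


def handle_predict(payload, model):
    """Return (response_body, http_status) for one prediction request."""
    try:
        features = parse_measurements(payload)
    except ValueError as err:
        return {"error": str(err)}, 400
    # The model receives a single-row batch and returns one grade.
    grade = model.predict([features])[0]
    return {"grade": str(grade)}, 200


class AlwaysB:
    """Stand-in for the exported model, used only to exercise the handler."""

    def predict(self, rows):
        return ["B" for _ in rows]


ok_body, ok_status = handle_predict(
    {"week": 4, "plant_height": 35.5, "stem_diameter": 0.6, "leaf_count": 22},
    AlwaysB(),
)
bad_body, bad_status = handle_predict({"week": 4}, AlwaysB())
```

In a Flask application, a route registered for /predict could pass request.get_json() to handle_predict and return jsonify(body) together with the status code. Keeping validation outside the route makes it testable without starting a server.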

Flask Web App Deployment Preview (localhost:5000/predict)

Peer Reviewed

ICoDSA 2025 Publication

This research was formally published and presented at the International Conference on Data Science and Its Applications.

ICoDSA 2025 Presentation (Conference Session)

Impact & Reflection

Translating agricultural complexity into explainable logic.

This project demonstrated the value of breaking a complex classification problem into hierarchical, logical steps. While the neural-network baseline (MLP) struggled to untangle the overlapping features and remained a black box, the tree-based hierarchy raised accuracy from 67% to 83% and did so in a way that could be explained to agricultural stakeholders. Balancing rigorous statistical experimentation with the practical reality of building a usable deployment interface was the core success of this endeavor.

Python
XGBoost
Scikit-learn
Flask
Pandas
NumPy
Matplotlib
SMOTETomek
