AZ
Back to Projects

Retail Inventory Optimization & Demand Analysis

SQLPostgreSQLData AnalysisInventoryForecasting

A SQL-based exploration into how inventory, demand forecasting, and sales interact, with a focus on identifying inefficiencies and turning data into actionable business insights.

Dataset Overview

Going Beyond the Query

I didn't start this project just to practice SQL. After spending time around enterprise data systems, I kept noticing the same pattern: the numbers look fine on the surface, but something doesn't quite add up when you start connecting them to actual business decisions.

I picked retail because it's a domain where the data is close to the money. Every inventory decision affects cash flow, waste, and fulfillment. I used a high-fidelity Kaggle dataset tracking daily operations across 73,000+ records. To keep this case study performant in the browser, I ran a Python Pandas pipeline to pre-aggregate the raw data before feeding it into these charts.

retail_store_inventory.csv — Sample
Pay attention to the ratio between Inventory Level, Units Sold, and Demand Forecast.
73.1k Rows
11 Fields
DateStore IDProduct IDCategoryRegionInventory LevelUnits SoldUnits OrderedDemand ForecastPriceDiscountWeather ConditionHoliday/PromotionCompetitor PricingSeasonality
2022-01-01S001P0001GroceriesNorth23112755135.4733.520Rainy029.69Autumn
2022-01-01S001P0002ToysSouth20415066144.0463.0120Sunny066.16Autumn
2022-01-01S001P0003ToysWest102655174.0227.9910Sunny131.32Summer
2022-01-01S001P0004ToysNorth4696116462.1832.7210Cloudy134.74Autumn
2022-01-01S001P0005ElectronicsEast166141359.2673.640Sunny068.95Summer
2022-01-01S001P0006GroceriesSouth138128102139.8276.8310Sunny179.35Winter
2022-01-01S001P0007FurnitureEast35997167108.9234.1610Rainy136.55Winter
2022-01-01S001P0008ClothingNorth38031254329.7397.995Cloudy0100.09Spring
2022-01-01S001P0009ElectronicsWest183175135174.1520.7410Cloudy017.66Autumn
2022-01-01S001P0010ToysSouth1082819624.4759.990Rainy161.21Winter
2022-01-01S001P0011FurnitureSouth258150153152.7458.5310Sunny161.42Spring
2022-01-01S001P0012ClothingWest66247026.7558.2520Snowy062.21Spring
2022-01-01S001P0013ToysSouth96428541.4643.60Cloudy046.31Spring
2022-01-01S001P0014ClothingWest193121876.878.110Sunny080.06Spring
2022-01-01S001P0015ClothingNorth379369154363.4692.9915Snowy095.8Winter
Problem Framing

The Business Problem

When I started exploring the data, one thing stood out right away. Inventory levels were consistently sitting well above actual sales volumes. That gap was too wide and too stable to be coincidence.

Questions I wanted to answer

  • 1Was the business ordering too much to begin with?
  • 2Were the demand forecasts consistently too high?
  • 3Was this a planning model problem, not just an operations one?

Why Groceries Are Different

Overstocking electronics means capital sits idle. Overstocking groceries means product spoils. The stakes here are higher because the loss is total, not just delayed.

Data Narrative

The Data Story

From first exploration to a clear diagnosis of what was actually wrong.

Step 1: Exploration

Understanding the Landscape

Before looking for anything specific, I wanted to see how sales volume distributed across categories and regions. I was checking whether the problem had an obvious concentration point.

Sales Distribution by Region and Category

What I expected: I thought I'd find obvious imbalances, maybe one region lagging behind or one category dominating the rest.

What I found: Sales volume was almost perfectly uniform across all regions and categories. No outlier stood out.

Why it matters: That uniformity told me the issue wasn't localized. Whatever was driving the inefficiency was happening everywhere at once, which pointed to the planning model, not the operations.

Step 2: Pattern Finding

Tracking the Forecast Over Time

With no obvious concentration, I shifted my focus to time. I wanted to see how the demand forecast tracked against actual sold units month by month. A healthy model should sometimes overshoot, sometimes undershoot.

Monthly Forecast vs. Actual Sold

What I expected: A reasonable forecast should cross the actual line in both directions over time.

What I found: The forecast stayed above actual sold almost every single month. It never corrected downward in any meaningful way.

Why it matters: When a model is consistently biased in one direction, it stops being a forecast and starts being a standing instruction to over-order. Purchasing teams following this model had no reason to ever pull back.

Step 3: Deeper Diagnosis

Where the Error Was Worst

Knowing the bias was systemic, I wanted to find out which category-season combinations had the highest forecast error. Not all overestimates carry the same risk.

Average Forecast Error by Category and Season (%)

What I expected: I expected Electronics or Clothing to be the hardest to forecast, since trends in those categories shift faster.

What I found: Groceries had one of the highest and most consistent error rates, around 16%, across all seasons. It didn't spike; it was just reliably wrong.

Why it matters: A 16% overestimate on electronics means you bought too many units. A 16% overestimate on produce means it rots. The financial impact is in a different category altogether.

Step 4: Conclusion

Seeing the Full Picture

For the final check, I compared resting inventory levels directly against actual sales volume by region. I wanted to see the size of the buffer the business was maintaining.

Resting Inventory vs. Actual Sales by Region

What I expected: Inventory should sit reasonably close to sales volume, with just enough buffer for delivery delays and demand spikes.

What I found: The inventory buffer was significantly larger than needed in every single region. The gap wasn't slight; it was structural.

Why it matters: When I put this next to the forecast bias finding, the picture became clear. Because the forecasts were always high, procurement was always told to order more. The business was compensating for bad predictions with excess stock instead of fixing the predictions.

Execution

Structured SQL Analysis

These are the queries I used to work through the problem. Each one was written with a specific question in mind.

Part A

Descriptive Queries

query_01__sales_by_category.sql
SELECT
    "Category",
    SUM("Units Sold")     AS total_units_sold,
    SUM("Units Ordered")  AS total_units_ordered
FROM  retail_store_inventory
GROUP BY 1
ORDER BY total_units_sold DESC;
Output preview
Categorytotal_units_soldtotal_units_ordered
Furniture2,025,0171,617,699
Groceries2,000,4821,606,188
Clothing1,999,1661,614,885
... (2 more rows)

Why I wrote this

I wanted a quick baseline to check whether any category was dramatically different from the others before digging deeper.

What I found

All five categories came in close to each other. No single one was carrying the business or dragging it down. That ruled out a category-level root cause early.

query_02__seasonal_rankings.sql
SELECT
    "Seasonality",
    "Category",
    SUM("Units Sold") AS total_units_sold,
    RANK() OVER (
        PARTITION BY "Seasonality"
        ORDER BY SUM("Units Sold") DESC
    ) AS sales_rank
FROM  retail_store_inventory
GROUP BY 1, 2;
Output preview
SeasonalityCategorytotal_units_soldsales_rank
WinterFurniture508,7301
WinterGroceries505,0272
WinterClothing504,9723

Why I wrote this

I wanted to see whether seasonal shifts changed the category rankings, since inventory planning often follows those patterns.

What I found

Furniture topped Winter sales, but Groceries and Clothing were right behind — all tightly clustered. That uniformity confirmed this wasn't a category-level anomaly; it was a planning-level one.

Part B

Diagnostic Queries

Once I had the baseline, I wanted to quantify the gap. These queries were designed to measure the actual distance between what was forecasted, what was stocked, and what was actually sold.

query_03__monthly_inventory_pressure.sql
SELECT
    date_trunc('month', "Date"::date) AS month,
    SUM("Inventory Level")  AS total_inventory,
    SUM("Units Sold")      AS total_sold,
    SUM("Demand Forecast") AS total_forecast,
    SUM("Inventory Level") - SUM("Units Sold") AS inventory_gap,
    SUM("Demand Forecast") - SUM("Units Sold") AS forecast_error,
    ROUND(
        SUM("Inventory Level")::numeric
         / NULLIF(SUM("Units Sold"), 0),
        2
    ) AS inv_to_sales_ratio
FROM  retail_store_inventory
GROUP BY 1
ORDER BY 1;
Output preview
monthtotal_invtotal_soldinv_gapfc_errorratio
2022-01843,236419,938423,298+15,6522.01
2022-02772,235391,052381,183+14,4341.97
2022-03851,921426,073425,848+15,9922.00

Why I wrote this

I wanted to check whether the inventory pressure was a consistent monthly pattern or just a seasonal peak. The ratio column was the key metric.

What I found

The ratio hovered close to 2.0 every single month — meaning for every unit sold, roughly two units were sitting in stock. That wasn't a seasonal spike; that was structural.

query_04__category_season_risk_view.sql
SELECT
    "Category",
    "Seasonality",
    AVG("Inventory Level")                  AS avg_inventory,
    AVG("Units Sold")                      AS avg_sold,
    AVG("Inventory Level" - "Units Sold") AS avg_gap,
    ROUND(
        AVG("Inventory Level")::numeric
         / NULLIF(AVG("Units Sold"), 0),
        2
    ) AS inv_to_sales_ratio
FROM  retail_store_inventory
GROUP BY 1, 2
ORDER BY inv_to_sales_ratio DESC;
Output preview
CategorySeasonalityavg_invavg_soldavg_gapratio
GroceriesSpring275.2134.4140.82.05
GroceriesSummer274.8134.6140.22.04
ElectronicsSpring272.4134.2138.22.03

Why I wrote this

I needed to find which specific category-season pairs had the worst ratios, to help prioritize where to focus improvement efforts.

What I found

Groceries came out on top in Spring and Summer, but the margins were tight across all categories. The ratios were close to 2.0 everywhere — this wasn't a category outlier, it was a universal planning problem with Groceries slightly worse.

Synthesized Findings

What the Data Pointed To

Four patterns that showed up consistently across every angle of analysis.

Systemic, Not Isolated

The inefficiency showed up in every region and every month. There was no single warehouse to blame. The pattern was everywhere.

Forecast Always Too High

The demand forecast almost never came in below actual sales. It was a one-directional bias, which meant purchasing teams were consistently told to order more than needed.

Seasonal Shifts Overlooked

Category rankings shifted with the seasons, but the inventory buffers didn't adjust accordingly. The planning model didn't respond to seasonality the way it should have.

Groceries Carry Real Waste Risk

Overstocking perishables isn't just inefficient; it creates direct write-off losses. Groceries had both a high ratio and a high error rate, making them the most urgent segment to address.

Recommendations

What I Would Do Next

Based on what the data showed, here are the changes I'd prioritize.

01

Recalibrate the Forecast Model

The model has a consistent upward bias. The first step is adding a historical correction factor that pulls estimates back toward realized demand before they reach purchasing teams.

02

Differentiate Safety Stock by Category

Not every product needs the same buffer. Staple items with predictable demand can run leaner. Groceries, given the spoilage risk, need tighter weekly review rather than a bigger cushion.

03

Treat Perishables as a Priority Fix

Groceries showed up as the highest-ratio, highest-risk segment. Any improvement plan should start there, since the cost of getting it wrong is a write-off, not just a delay.

04

Track the Gap Weekly, Not Monthly

The forecast error was visible in monthly data, but catching it weekly would give teams the ability to course-correct before the numbers compound into a problem.

Reflection

The hardest shift wasn't learning SQL. It was learning when to stop describing and start diagnosing.

When I started this project, I assumed the interesting part would be writing complex queries. It turned out the more valuable skill was knowing what question each query was actually trying to answer. Running the seasonal ranking felt straightforward until I realized the rankings alone didn't explain the risk. I had to layer forecast error on top before the picture made sense. That's the part that surprised me most: useful analysis isn't just technically correct, it has to be pointed at the right problem.

Explore More Projects

Continue exploring selected data, analytics, and machine learning projects.

Chili Quality Classification
Data Science (Thesis)

Chili Quality Classification

Built a Two-Stage XGBoost classification pipeline to automate agricultural quality control. Explored thorough Exploratory Data Analysis (EDA) and compared decision trees, RF, SVM, and MLP models before isolating XGBoost due to a superior ~83% accuracy and ~82% macro F1 score. Emphasized end-to-end data science workflows over simple model APIs.

PythonXGBoostScikit-learnPandas+1 more
View Project
View Chili Quality Classification
Telkom Indonesia – Enterprise Dashboard Visualization System
Internship / Data Analytics

Telkom Indonesia – Enterprise Dashboard Visualization System

Developed enterprise-grade operational dashboards for the Regional Enterprise and Government Service (REGS) division during an internship at Telkom Indonesia. Replaced manual spreadsheet workflows with a Looker Studio LOP dashboard and an interactive web-based LOB dashboard integrated with Google Sheets API, Firebase, and Google Cloud — significantly improving data visibility, collaboration, and monitoring efficiency.

Looker StudioGoogle Sheets APIFirebaseNode.js+4 more
View Project
View Telkom Indonesia – Enterprise Dashboard Visualization System
AI Assistant for Data Issue Troubleshooting
Exploratory / Concept

AI Assistant for Data Issue Troubleshooting

Mapped historical issue occurrences spanning SAP BW/BPC environments directly to logical root causes. This exploratory project explores how AI logic can improve analyst efficiency and institutional knowledge reuse within complex ETL layers.

ConceptAI LogicSAP BW/BPCRoot Cause Analysis
View Project
View AI Assistant for Data Issue Troubleshooting