We trained the model. It worked perfectly. It solved the wrong problem.
After the Tokyo incident, the ABC Pizza data science team got to work. The problem statement: predict optimal driver assignment for incoming orders. The team looked at three years of historical dispatch data — 60 million orders — and decided to start with clustering.
The logic seemed sound: group stores by behavior, group drivers by performance pattern, match clusters. Unsupervised learning. No labels needed. Two weeks of work, clean clusters, a beautiful dendrogram.
The operations team tried it in a shadow environment. Cluster assignments had zero correlation with on-time delivery rates.
The model was technically correct. It found real structure in the data. It clustered stores by order volume and geographic zone — not by the factors that actually drive dispatch success. We had answered a question nobody asked.
The right algorithm was hiding in plain sight the whole time. The lesson: choosing the learning paradigm is the first architectural decision in any ML project, and getting it wrong wastes weeks.
The Four Paradigms — One Decision Framework
Supervised Learning — Learning from Labeled Outcomes
The paradigm: You have historical examples where you know the correct answer. The model learns the mapping from inputs to outputs.
The ABC Pizza application:
The dispatch problem is fundamentally a prediction problem: given current conditions, what will the delivery time be? We have 60 million historical dispatches. Each one has features (weather, driver count, order queue, time of day, zone) and an outcome (actual delivery time in minutes). That's a labeled dataset. That's supervised learning.
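To make that concrete, here's a minimal sketch of the framing, with synthetic stand-in data in place of the real 60 million dispatches (the column meanings and the model choice are illustrative assumptions, not our production pipeline):

```python
# Supervised regression sketch: features in, delivery minutes out.
# Synthetic stand-in data; in reality this is the historical dispatch table.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 10_000                                  # stand-in for 60M real dispatches
X = rng.random((n, 5))                      # weather, drivers, queue, hour, zone (hypothetical)
y = 20 + 30 * X[:, 0] + 10 * X[:, 2] + rng.normal(0, 3, n)  # label: actual minutes

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
model = GradientBoostingRegressor().fit(X_train, y_train)   # learn: conditions -> minutes
print(f"held-out R^2: {model.score(X_val, y_val):.2f}")
```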
Two variants:
Regression — the output is a continuous number.
- Predict delivery time in minutes → regression
- Predict store revenue next week → regression
- Predict mortgage default probability → a trap: the score (0.0 to 1.0) looks continuous, but the label is binary (defaulted or not), so in standard terminology this is classification, not regression
Classification — the output is a category.
- On-time or late? → binary classification
- Which document type? (W-2, pay stub, bank statement) → multi-class classification
- Fraudulent transaction? → binary classification
At MortgageIQ, document classification is supervised learning: the model is trained on thousands of labeled document images (W-2: label 0, pay stub: label 1, bank statement: label 2). Given a new document image, it predicts the class. The label is the human-provided ground truth.
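A sketch of the mechanics, with random vectors standing in for real document-image features (the point is the label mapping, not the model):

```python
# Multi-class classification sketch: human-labeled documents -> predicted class.
import numpy as np
from sklearn.linear_model import LogisticRegression

LABELS = {0: "W-2", 1: "pay stub", 2: "bank statement"}

rng = np.random.default_rng(1)
X = rng.random((600, 64))                  # stand-in features, one row per labeled document
y = rng.integers(0, 3, 600)                # human-assigned ground-truth class per document

clf = LogisticRegression(max_iter=1000).fit(X, y)
pred = int(clf.predict(X[:1])[0])          # predict the class of one document
print(LABELS[pred])
```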
When supervised learning works:
- You have historical data with known outcomes
- The outcome you want to predict is the same type as the outcomes in your history
- You have enough examples — rule of thumb: 10× more examples than features
When it fails:
- Your historical data doesn't represent the future (Super Bowl Sunday, typhoon in Tokyo)
- The outcome you care about isn't in your historical data
- Labels are expensive or slow to obtain (radiology images need a doctor to label)
Unsupervised Learning — Finding Hidden Structure
The paradigm: No labels. The model finds patterns, groups, or structure in the data on its own.
The ABC Pizza mistake — and the correct use:
Our clustering mistake: we used unsupervised learning to find store groups, then used those groups to drive dispatch decisions. The clusters were real — but they didn't connect to the outcome we cared about (on-time delivery). Clustering is a discovery tool, not a prediction tool.
The correct use of clustering at ABC Pizza: store segmentation for operational insights. Which stores share similar failure patterns? Cluster by cancellation rate, peak hour timing, and driver churn. The resulting segments helped operations identify stores at risk before they degraded — not to drive individual dispatch decisions, but to trigger proactive staffing reviews.
Three unsupervised patterns worth knowing:
Clustering — group similar things together without labels.
- K-Means: you specify the number of clusters; fast, works at scale (sketch after this list)
- DBSCAN: discovers cluster count from data density; handles irregular shapes
- Hierarchical: builds a tree of clusters; good for exploring at different granularities
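A minimal K-Means sketch, assuming scikit-learn; the store metrics are synthetic stand-ins:

```python
# Cluster stores by operational behavior: no labels, just structure.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
stores = rng.random((500, 3))                # cancellation rate, peak-hour index, driver churn

X = StandardScaler().fit_transform(stores)   # scale first: K-Means is distance-based
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
print(np.bincount(kmeans.labels_))           # how many stores landed in each segment
```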
Anomaly Detection — find the things that don't fit.
- At ABC Pizza: flag driver routes that are statistically abnormal (possible GPS spoofing)
- At MortgageIQ: flag loan applications with feature combinations that don't appear in historical data (possible fraud)
- Isolation Forest is the go-to for tabular data; Autoencoders for high-dimensional or sequential data
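A hedged sketch of the route-flagging idea (synthetic routes; the contamination rate is an assumption about how rare anomalies are):

```python
# Isolation Forest: flag routes that don't fit the bulk of the data.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(3)
routes = rng.normal(0, 1, (2000, 4))       # normal routes: distance, duration, stops, speed
spoofed = rng.normal(6, 1, (20, 4))        # a handful of wildly different routes
X = np.vstack([routes, spoofed])

detector = IsolationForest(contamination=0.01, random_state=0).fit(X)
flags = detector.predict(X)                # -1 = anomaly, 1 = normal
print((flags == -1).sum(), "routes flagged for review")
```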
Dimensionality Reduction — compress many features into fewer without losing the signal.
- 200 features in your dataset → the model trains slowly and overfits
- PCA reduces to the dimensions that explain the most variance (sketch after this list)
- t-SNE and UMAP are used for visualization: plot 200-dimensional data in 2D to see if clusters exist
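A minimal PCA sketch; the data is synthetic, built so that 200 features really carry only 10 underlying factors:

```python
# PCA: keep the components that explain 95% of the variance.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
latent = rng.random((1000, 10))            # 10 true underlying factors
X = latent @ rng.random((10, 200))         # ...spread redundantly across 200 features

pca = PCA(n_components=0.95)               # keep 95% of the variance
X_small = pca.fit_transform(X)
print(X.shape, "->", X_small.shape)        # far fewer columns, signal retained
```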
When unsupervised learning works:
- You want to explore data before building a predictive model
- You don't have labels and can't get them
- You're looking for anomalies in a system where "normal" is abundant in the data but "abnormal" is rare
When it fails:
- When you confuse discovery with prediction (our mistake)
- When the clusters don't map to a business-meaningful grouping
- When the data has too much noise — clusters become meaningless
Reinforcement Learning — Learning from Rewards
The paradigm: An agent takes actions in an environment, receives rewards or penalties, and learns a policy that maximizes cumulative reward over time.
This is fundamentally different from supervised and unsupervised learning. There are no labeled examples. There's no dataset to cluster. There's a sequence of decisions, and feedback on how good those decisions were.
The ABC Pizza long-game application:
Dispatch is a sequential decision problem. Assigning Driver A to Order X isn't evaluated in isolation — it affects whether Driver A is available for Order Y thirty minutes later. A greedy model that optimizes each assignment independently misses the global picture.
Reinforcement learning treats the entire dispatch system as an environment:
- State: current orders, driver locations, estimated completion times, weather
- Action: assign driver D to order O
- Reward: +1 if delivered on time, -1 if late, -2 if cancelled
The RL agent learns a policy — a mapping from state to action — that maximizes total reward across all orders in a shift, not just the next one.
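To show the mechanics without a real simulator, here's a tabular Q-learning sketch on a toy stand-in environment. Nothing about it resembles production dispatch; the state, action, and reward definitions are placeholders, but the update rule is the standard one:

```python
# Tabular Q-learning on a toy environment: learn a state -> action policy from rewards.
import numpy as np

rng = np.random.default_rng(5)
n_states, n_actions = 10, 3               # toy: 10 system states, 3 candidate drivers
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.1, 0.9, 0.1         # learning rate, discount, exploration

def step(state, action):
    """Stand-in environment: a reward, then a random next state."""
    reward = 1.0 if action == state % n_actions else -1.0   # arbitrary 'right driver' rule
    return reward, int(rng.integers(n_states))

state = 0
for _ in range(50_000):
    action = int(rng.integers(n_actions)) if rng.random() < eps else int(Q[state].argmax())
    reward, next_state = step(state, action)
    # Nudge Q toward: immediate reward + discounted best future value.
    Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
    state = next_state

print(Q.argmax(axis=1))                   # learned policy: best action per state
```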
Why ABC Pizza hasn't shipped this yet:
RL requires a simulator. You can't train an RL agent on live production traffic — the early policy will make terrible decisions, and real customers will experience them. Building a high-fidelity dispatch simulator that matches Tokyo's typhoon conditions is a significant engineering project. It's on ABC Pizza's roadmap. It's not in production.
When reinforcement learning works:
- Sequential decision-making where each action affects future options
- A simulator is available (games, robotics, recommendation systems)
- The reward signal is clear and frequent
When it fails:
- When you need a simulator but don't have one
- When the reward is sparse or delayed (rare feedback makes learning slow)
- When the state space is too large without careful abstraction
Deep Learning — The Layer That Wraps Everything
Deep learning is not a fourth paradigm — it's a technique that can implement any of the three.
A deep learning model is a neural network: layers of mathematical transformations that learn complex representations from raw data. What makes it "deep" is the number of layers — modern models can run to dozens or even hundreds.
Why deep learning?
Traditional ML algorithms require feature engineering — you tell the model what to look at (delivery distance, weather severity, time of day). Deep learning does its own feature learning — given raw pixels, it learns to detect edges, then shapes, then objects. Given raw text, it learns word relationships, then sentence meaning, then document context.
At ABC Pizza, deep learning appears in two places:
Document OCR — reading handwritten franchise partner receipts. A convolutional neural network (CNN) takes raw receipt images as input. No feature engineering needed. The network learns what "7" looks like across 10,000 variations of handwriting.
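A minimal PyTorch sketch of the shape of such a network; the layer sizes are illustrative assumptions, not ABC Pizza's actual model:

```python
# Tiny CNN: raw pixel crops in, digit-class scores out. No hand-built features.
import torch
import torch.nn as nn

class ReceiptDigitNet(nn.Module):
    def __init__(self, n_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(     # learns edges -> strokes -> digit shapes
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 7 * 7, n_classes)

    def forward(self, x):                  # x: (batch, 1, 28, 28) grayscale crops
        return self.classifier(self.features(x).flatten(1))

logits = ReceiptDigitNet()(torch.randn(4, 1, 28, 28))   # 4 fake digit crops
print(logits.shape)                                     # (4, 10): one score per class
```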
Demand forecasting — predicting order volume 4 hours ahead using weather, local events, historical patterns, and social media signals. A transformer model processes these as a sequence and produces a probabilistic forecast. Traditional regression couldn't handle the irregular temporal patterns; the transformer learns them.
At MortgageIQ, GPT-4o is a deep learning model — specifically a transformer trained on internet-scale text. The RAG architecture wraps it: retrieval is the feature engineering layer that tells the transformer what domain knowledge to use.
Deep learning tradeoffs:
| | Traditional ML | Deep Learning |
|---|---|---|
| Data needed | Hundreds to thousands | Millions |
| Training time | Minutes to hours | Hours to days |
| Interpretability | High | Low |
| Feature engineering | Required | Optional |
| Compute cost | CPU sufficient | GPU required |
| Best for | Tabular data | Images, audio, text |
The rule at ABC Pizza: tabular structured data → traditional ML first. If the model doesn't reach the accuracy threshold after feature engineering, then consider deep learning. The complexity and compute cost of deep learning must be justified.
The Full Comparison
| Paradigm | Has Labels? | Output | ABC Pizza Example | MortgageIQ Example |
|---|---|---|---|---|
| Supervised — Regression | Yes | Continuous | Delivery time prediction | — |
| Supervised — Classification | Yes | Category | On-time vs late | Document type classifier · Default risk |
| Unsupervised — Clustering | No | Groups | Store segmentation | Borrower behavior cohorts |
| Unsupervised — Anomaly | No | Outliers | Driver GPS fraud | Unusual loan application |
| Reinforcement Learning | No (rewards) | Policy | Dispatch optimizer (roadmap) | — |
| Deep Learning | Either | Any | Receipt OCR · Demand forecast | GPT-4o (the LLM itself) |
How to Pick
The decision is simpler than the terminology suggests:
- Do you have labeled historical outcomes? → Start with supervised learning
- No labels, want to find groups or anomalies? → Unsupervised
- Sequential decisions with a simulator? → Reinforcement learning
- Raw images, audio, or text? → Deep learning (wraps whichever paradigm fits)
- Tabular structured data? → Traditional ML before deep learning
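The same checklist, restated as code (illustrative only; real projects mix these):

```python
# The decision framework as a function. Inputs mirror the questions above.
def pick_paradigm(has_labels: bool, sequential_with_simulator: bool,
                  raw_perceptual_data: bool) -> str:
    if sequential_with_simulator:
        return "reinforcement learning"
    base = "supervised" if has_labels else "unsupervised"
    if raw_perceptual_data:
        return f"deep learning, wrapping {base} learning"
    return f"traditional ML, {base} learning"

# ABC Pizza's dispatch problem: labeled outcomes, tabular data.
print(pick_paradigm(has_labels=True, sequential_with_simulator=False,
                    raw_perceptual_data=False))   # -> traditional ML, supervised learning
```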
The ABC Pizza team skipped step 1. We had labeled outcomes (60M historical dispatches with actual delivery times). We should have started with supervised regression. Instead we built a clustering model that answered a question nobody asked.
We lost two weeks. The lesson cost nothing except time.
Key Takeaways
The ABC Pizza problem: We used unsupervised clustering when we had 60M labeled examples that called for supervised regression. The clusters were technically correct — they just didn't predict dispatch performance. Two weeks of work, zero production value.
The transferable principle: Match the algorithm to the question, not to what's trendy or what the team knows best. The single most important question is: do you have labeled historical outcomes? If yes, supervised learning. Start there.
What I'd do differently: Before any modeling, write down the prediction question in one sentence. "Given X inputs, predict Y outcome." If Y is a number — regression. If Y is a category — classification. If there's no Y — unsupervised. This sentence forces clarity before the team commits to a direction.
Watch out for: Using unsupervised learning as a stepping stone to supervised learning without validating that the clusters are meaningful for the prediction task. Clustering finds structure. It doesn't guarantee that structure is predictive of what you care about.
What's Next
- Post 3 — Feature Engineering: why the first dispatch model was 60% accurate and the second was 91% — and why the difference had nothing to do with the algorithm
- Post 4 — Train/Validate/Test: why the model that aced the holdout set failed on Super Bowl Sunday