We trained the model. It worked perfectly. It solved the wrong problem.
After the Tokyo incident, the ABC Pizza data science team got to work. The problem statement: predict optimal driver assignment for incoming orders. The team looked at three years of historical dispatch data — 60 million orders — and decided to start with clustering.
The logic seemed sound: group stores by behavior, group drivers by performance pattern, match clusters. Unsupervised learning. No labels needed. Two weeks of work, clean clusters, a beautiful dendrogram.
The operations team tried it in a shadow environment. Cluster assignments had zero correlation with on-time delivery rates.
The model was technically correct. It found real structure in the data. It clustered stores by order volume and geographic zone — not by the factors that actually drive dispatch success. We had answered a question nobody asked.
The right algorithm was hiding in plain sight the whole time. The lesson: choosing the learning paradigm is the first architectural decision in any ML project, and getting it wrong wastes weeks.
The Four Paradigms — One Decision Framework
Supervised Learning — Learning from Labeled Outcomes
The paradigm: You have historical examples where you know the correct answer. The model learns the mapping from inputs to outputs.
The ABC Pizza application:
The dispatch problem is fundamentally a prediction problem: given current conditions, what will the delivery time be? We have 60 million historical dispatches. Each one has features (weather, driver count, order queue, time of day, zone) and an outcome (actual delivery time in minutes). That's a labeled dataset. That's supervised learning.
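To make that concrete, here's a minimal sketch of the framing, with synthetic stand-in data in place of the real 60 million dispatches (the column meanings and the model choice are illustrative assumptions, not our production pipeline):

```python
# Supervised regression sketch: features in, delivery minutes out.
# Synthetic stand-in data; in reality this is the historical dispatch table.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 10_000                                  # stand-in for 60M real dispatches
X = rng.random((n, 5))                      # weather, drivers, queue, hour, zone (hypothetical)
y = 20 + 30 * X[:, 0] + 10 * X[:, 2] + rng.normal(0, 3, n)  # label: actual minutes

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
model = GradientBoostingRegressor().fit(X_train, y_train)   # learn: conditions -> minutes
print(f"held-out R^2: {model.score(X_val, y_val):.2f}")
```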
Two variants:
Regression — the output is a continuous number.
- Predict delivery time in minutes → regression
- Predict store revenue next week → regression
- Predict mortgage default probability → a trap: the score (0.0 to 1.0) looks continuous, but the label is binary (defaulted or not), so in standard terminology this is classification, not regression
Classification — the output is a category.
- On-time or late? → binary classification
- Which document type? (W-2, pay stub, bank statement) → multi-class classification
- Fraudulent transaction? → binary classification
At MortgageIQ, document classification is supervised learning: the model is trained on thousands of labeled document images (W-2: label 0, pay stub: label 1, bank statement: label 2). Given a new document image, it predicts the class. The label is the human-provided ground truth.
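A sketch of the mechanics, with random vectors standing in for real document-image features (the point is the label mapping, not the model):

```python
# Multi-class classification sketch: human-labeled documents -> predicted class.
import numpy as np
from sklearn.linear_model import LogisticRegression

LABELS = {0: "W-2", 1: "pay stub", 2: "bank statement"}

rng = np.random.default_rng(1)
X = rng.random((600, 64))                  # stand-in features, one row per labeled document
y = rng.integers(0, 3, 600)                # human-assigned ground-truth class per document

clf = LogisticRegression(max_iter=1000).fit(X, y)
pred = int(clf.predict(X[:1])[0])          # predict the class of one document
print(LABELS[pred])
```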
When supervised learning works:
- You have historical data with known outcomes
- The outcome you want to predict is the same type as the outcomes in your history
- You have enough examples — rule of thumb: 10× more examples than features
When it fails:
- Your historical data doesn't represent the future (Super Bowl Sunday, typhoon in Tokyo)
- The outcome you care about isn't in your historical data
- Labels are expensive or slow to obtain (radiology images need a doctor to label)
Unsupervised Learning — Finding Hidden Structure
The paradigm: No labels. The model finds patterns, groups, or structure in the data on its own.
The ABC Pizza mistake — and the correct use:
Our clustering mistake: we used unsupervised learning to find store groups, then used those groups to drive dispatch decisions. The clusters were real — but they didn't connect to the outcome we cared about (on-time delivery). Clustering is a discovery tool, not a prediction tool.
The correct use of clustering at ABC Pizza: store segmentation for operational insights. Which stores share similar failure patterns? Cluster by cancellation rate, peak hour timing, and driver churn. The resulting segments helped operations identify stores at risk before they degraded — not to drive individual dispatch decisions, but to trigger proactive staffing reviews.
Three unsupervised patterns worth knowing:
Clustering — group similar things together without labels.
- K-Means: you specify the number of clusters; fast, works at scale (sketch after this list)
- DBSCAN: discovers cluster count from data density; handles irregular shapes
- Hierarchical: builds a tree of clusters; good for exploring at different granularities
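A minimal K-Means sketch, assuming scikit-learn; the store metrics are synthetic stand-ins:

```python
# Cluster stores by operational behavior: no labels, just structure.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
stores = rng.random((500, 3))                # cancellation rate, peak-hour index, driver churn

X = StandardScaler().fit_transform(stores)   # scale first: K-Means is distance-based
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
print(np.bincount(kmeans.labels_))           # how many stores landed in each segment
```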
Anomaly Detection — find the things that don't fit.
- At ABC Pizza: flag driver routes that are statistically abnormal (possible GPS spoofing)
- At MortgageIQ: flag loan applications with feature combinations that don't appear in historical data (possible fraud)
- Isolation Forest is the go-to for tabular data; Autoencoders for high-dimensional or sequential data
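A hedged sketch of the route-flagging idea (synthetic routes; the contamination rate is an assumption about how rare anomalies are):

```python
# Isolation Forest: flag routes that don't fit the bulk of the data.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(3)
routes = rng.normal(0, 1, (2000, 4))       # normal routes: distance, duration, stops, speed
spoofed = rng.normal(6, 1, (20, 4))        # a handful of wildly different routes
X = np.vstack([routes, spoofed])

detector = IsolationForest(contamination=0.01, random_state=0).fit(X)
flags = detector.predict(X)                # -1 = anomaly, 1 = normal
print((flags == -1).sum(), "routes flagged for review")
```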
Dimensionality Reduction — compress many features into fewer without losing the signal.
- 200 features in your dataset → the model trains slowly and overfits
- PCA reduces to the dimensions that explain the most variance (sketch after this list)
- t-SNE and UMAP are used for visualization: plot 200-dimensional data in 2D to see if clusters exist
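A minimal PCA sketch; the data is synthetic, built so that 200 features really carry only 10 underlying factors:

```python
# PCA: keep the components that explain 95% of the variance.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
latent = rng.random((1000, 10))            # 10 true underlying factors
X = latent @ rng.random((10, 200))         # ...spread redundantly across 200 features

pca = PCA(n_components=0.95)               # keep 95% of the variance
X_small = pca.fit_transform(X)
print(X.shape, "->", X_small.shape)        # far fewer columns, signal retained
```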
When unsupervised learning works:
- You want to explore data before building a predictive model
- You don't have labels and can't get them
- You're looking for anomalies in a system where "normal" is abundant in the data but "abnormal" is rare
When it fails:
- When you confuse discovery with prediction (our mistake)
- When the clusters don't map to a business-meaningful grouping
- When the data has too much noise — clusters become meaningless
Reinforcement Learning — Learning from Rewards
The paradigm: An agent takes actions in an environment, receives rewards or penalties, and learns a policy that maximizes cumulative reward over time.
This is fundamentally different from supervised and unsupervised learning. There are no labeled examples. There's no dataset to cluster. There's a sequence of decisions, and feedback on how good those decisions were.
The ABC Pizza long-game application:
Dispatch is a sequential decision problem. Assigning Driver A to Order X isn't evaluated in isolation — it affects whether Driver A is available for Order Y thirty minutes later. A greedy model that optimizes each assignment independently misses the global picture.
Reinforcement learning treats the entire dispatch system as an environment:
- State: current orders, driver locations, estimated completion times, weather
- Action: assign driver D to order O
- Reward: +1 if delivered on time, -1 if late, -2 if cancelled
The RL agent learns a policy — a mapping from state to action — that maximizes total reward across all orders in a shift, not just the next one.
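To show the mechanics without a real simulator, here's a tabular Q-learning sketch on a toy stand-in environment. Nothing about it resembles production dispatch; the state, action, and reward definitions are placeholders, but the update rule is the standard one:

```python
# Tabular Q-learning on a toy environment: learn a state -> action policy from rewards.
import numpy as np

rng = np.random.default_rng(5)
n_states, n_actions = 10, 3               # toy: 10 system states, 3 candidate drivers
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.1, 0.9, 0.1         # learning rate, discount, exploration

def step(state, action):
    """Stand-in environment: a reward, then a random next state."""
    reward = 1.0 if action == state % n_actions else -1.0   # arbitrary 'right driver' rule
    return reward, int(rng.integers(n_states))

state = 0
for _ in range(50_000):
    action = int(rng.integers(n_actions)) if rng.random() < eps else int(Q[state].argmax())
    reward, next_state = step(state, action)
    # Nudge Q toward: immediate reward + discounted best future value.
    Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
    state = next_state

print(Q.argmax(axis=1))                   # learned policy: best action per state
```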
Why ABC Pizza hasn't shipped this yet:
RL requires a simulator. You can't train an RL agent on live production traffic — the early policy will make terrible decisions, and real customers will experience them. Building a high-fidelity dispatch simulator that matches Tokyo's typhoon conditions is a significant engineering project. It's on ABC Pizza's roadmap. It's not in production.
When reinforcement learning works:
- Sequential decision-making where each action affects future options
- A simulator is available (games, robotics, recommendation systems)
- The reward signal is clear and frequent
When it fails:
- When you need a simulator but don't have one
- When the reward is sparse or delayed (rare feedback makes learning slow)
- When the state space is too large without careful abstraction
Deep Learning — The Layer That Wraps Everything
Deep learning is not a fourth paradigm — it's a technique that can implement any of the three.
A deep learning model is a neural network: layers of mathematical transformations that learn complex representations from raw data. What makes it "deep" is the number of layers — modern models can run to dozens or even hundreds.
Why deep learning?
Traditional ML algorithms require feature engineering — you tell the model what to look at (delivery distance, weather severity, time of day). Deep learning does its own feature learning — given raw pixels, it learns to detect edges, then shapes, then objects. Given raw text, it learns word relationships, then sentence meaning, then document context.
At ABC Pizza, deep learning appears in two places:
Document OCR — reading handwritten franchise partner receipts. A convolutional neural network (CNN) takes raw receipt images as input. No feature engineering needed. The network learns what "7" looks like across 10,000 variations of handwriting.
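A minimal PyTorch sketch of the shape of such a network; the layer sizes are illustrative assumptions, not ABC Pizza's actual model:

```python
# Tiny CNN: raw pixel crops in, digit-class scores out. No hand-built features.
import torch
import torch.nn as nn

class ReceiptDigitNet(nn.Module):
    def __init__(self, n_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(     # learns edges -> strokes -> digit shapes
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 7 * 7, n_classes)

    def forward(self, x):                  # x: (batch, 1, 28, 28) grayscale crops
        return self.classifier(self.features(x).flatten(1))

logits = ReceiptDigitNet()(torch.randn(4, 1, 28, 28))   # 4 fake digit crops
print(logits.shape)                                     # (4, 10): one score per class
```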
Demand forecasting — predicting order volume 4 hours ahead using weather, local events, historical patterns, and social media signals. A transformer model processes these as a sequence and produces a probabilistic forecast. Traditional regression couldn't handle the irregular temporal patterns; the transformer learns them.
At MortgageIQ, GPT-4o is a deep learning model — specifically a transformer trained on internet-scale text. The RAG architecture wraps it: retrieval is the feature engineering layer that tells the transformer what domain knowledge to use.
Deep learning tradeoffs:
| | Traditional ML | Deep Learning |
|---|---|---|
| Data needed | Hundreds to thousands | Millions |
| Training time | Minutes to hours | Hours to days |
| Interpretability | High | Low |
| Feature engineering | Required | Optional |
| Compute cost | CPU sufficient | GPU required |
| Best for | Tabular data | Images, audio, text |
The rule at ABC Pizza: tabular structured data → traditional ML first. If the model doesn't reach the accuracy threshold after feature engineering, then consider deep learning. The complexity and compute cost of deep learning must be justified.
The Full Comparison
| Paradigm | Has Labels? | Output | ABC Pizza Example | MortgageIQ Example |
|---|---|---|---|---|
| Supervised — Regression | Yes | Continuous | Delivery time prediction | — |
| Supervised — Classification | Yes | Category | On-time vs late | Document type classifier · Default risk |
| Unsupervised — Clustering | No | Groups | Store segmentation | Borrower behavior cohorts |
| Unsupervised — Anomaly | No | Outliers | Driver GPS fraud | Unusual loan application |
| Reinforcement Learning | No (rewards) | Policy | Dispatch optimizer (roadmap) | — |
| Deep Learning | Either | Any | Receipt OCR · Demand forecast | GPT-4o (the LLM itself) |
How to Pick
The decision is simpler than the terminology suggests:
- Do you have labeled historical outcomes? → Start with supervised learning
- No labels, want to find groups or anomalies? → Unsupervised
- Sequential decisions with a simulator? → Reinforcement learning
- Raw images, audio, or text? → Deep learning (wraps whichever paradigm fits)
- Tabular structured data? → Traditional ML before deep learning
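The same checklist, restated as code (illustrative only; real projects mix these):

```python
# The decision framework as a function. Inputs mirror the questions above.
def pick_paradigm(has_labels: bool, sequential_with_simulator: bool,
                  raw_perceptual_data: bool) -> str:
    if sequential_with_simulator:
        return "reinforcement learning"
    base = "supervised" if has_labels else "unsupervised"
    if raw_perceptual_data:
        return f"deep learning, wrapping {base} learning"
    return f"traditional ML, {base} learning"

# ABC Pizza's dispatch problem: labeled outcomes, tabular data.
print(pick_paradigm(has_labels=True, sequential_with_simulator=False,
                    raw_perceptual_data=False))   # -> traditional ML, supervised learning
```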
The ABC Pizza team skipped step 1. We had labeled outcomes (60M historical dispatches with actual delivery times). We should have started with supervised regression. Instead we built a clustering model that answered a question nobody asked.
We lost two weeks. The lesson cost nothing except time.
Key Takeaways
The ABC Pizza problem: We used unsupervised clustering when we had 60M labeled examples that called for supervised regression. The clusters were technically correct — they just didn't predict dispatch performance. Two weeks of work, zero production value.
The transferable principle: Match the algorithm to the question, not to what's trendy or what the team knows best. The single most important question is: do you have labeled historical outcomes? If yes, supervised learning. Start there.
What I'd do differently: Before any modeling, write down the prediction question in one sentence. "Given X inputs, predict Y outcome." If Y is a number — regression. If Y is a category — classification. If there's no Y — unsupervised. This sentence forces clarity before the team commits to a direction.
Watch out for: Using unsupervised learning as a stepping stone to supervised learning without validating that the clusters are meaningful for the prediction task. Clustering finds structure. It doesn't guarantee that structure is predictive of what you care about.
What's Next
- Post 3 — Feature Engineering: why the first dispatch model was 60% accurate and the second was 91% — and why the difference had nothing to do with the algorithm
- Post 4 — Train/Validate/Test: why the model that aced the holdout set failed on Super Bowl Sunday