The rules engine broke at 3AM in Tokyo.
It was a typhoon. The ABC Pizza dispatch system — the engine that routes every order from store kitchen to delivery driver across 20,000 locations worldwide — had never seen a weather event of that severity in the Tokyo region. The rules were written in 2011 by engineers who had never been to Tokyo. They encoded what they knew: rain increases delivery time, traffic reduces driver speed, peak hours affect capacity.
None of the rules covered the interaction between a Category 3 typhoon, a city-wide transit shutdown, and 40% of drivers calling out simultaneously.
The system returned null. Orders stacked. Customers waited. Nobody knew what was wrong until a support ticket came in from the Tokyo operations team at 6AM Chicago time.
That incident started a conversation that ended with a machine learning platform. But before we could build anything, we had to answer a question that sounds simple and isn't: what is machine learning, actually?
This post is the answer. Not the textbook answer — the production engineer's answer. The map of the whole landscape, so that every subsequent post in this series has a home.
The Core Idea in One Sentence
Machine learning is the practice of building systems that improve their behavior from data rather than from explicitly written rules.
The dispatch rules were written by humans. When reality exceeded what the humans had imagined, the rules failed. A machine learning model learns what "good dispatch" looks like from thousands of historical examples — and generalizes to scenarios the engineers never anticipated.
That's the shift. From code that encodes human knowledge to code that learns from human outcomes.
The ML Landscape — Every Component
This is the full map. Every term you will encounter in ML has a home in one of the five layers below.
Five layers. Every ML project touches all five — the ones that fail usually skip Layers 4 and 5.
Layer 1: Data
The most important layer. The one most engineers underinvest in.
The dispatch model at ABC Pizza is only as good as the historical dispatch data it learns from. If the training data only covers normal weather conditions, the model will be confidently wrong about typhoons — exactly like the rules engine was, but harder to debug because the failure is statistical, not explicit.
Raw Data is everything the system can observe: order timestamps, store locations, driver GPS coordinates, weather API data, historical delivery times, cancellation rates.
Feature Engineering is the craft of transforming raw data into signals the model can learn from. Raw GPS coordinates are not useful. Distance from store to customer is. Time since last driver assignment is. Whether it's currently raining is. Feature engineering is where domain knowledge — knowing what actually matters for dispatch — gets encoded into the model's inputs.
The dataset is the result: structured rows of (features → outcome) pairs. For dispatch: (weather, driver_count, order_queue_depth, store_zone, time_of_day) → delivery_time_minutes.
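To make that concrete, here is a minimal feature engineering sketch in pandas. The schema (store_lat, order_ts, precip_mm, and so on) and the haversine helper are illustrative assumptions, not ABC Pizza's actual pipeline:

```python
import math
import pandas as pd

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two GPS points, in kilometers.
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def build_features(orders: pd.DataFrame) -> pd.DataFrame:
    # Raw columns in, learnable signals out. Column names are assumptions.
    feats = pd.DataFrame(index=orders.index)
    feats["store_to_customer_km"] = orders.apply(
        lambda o: haversine_km(o.store_lat, o.store_lon, o.cust_lat, o.cust_lon),
        axis=1,
    )
    feats["minutes_since_last_assignment"] = (
        orders.order_ts - orders.last_assignment_ts
    ).dt.total_seconds() / 60
    feats["is_raining"] = (orders.precip_mm > 0).astype(int)
    feats["hour_of_day"] = orders.order_ts.dt.hour
    return feats
```

The model never sees the raw GPS pairs; it sees the distance, the recency, and the rain flag.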
At MortgageIQ, the equivalent is the loan knowledge base: raw markdown documents transformed into chunks → embedded into vectors → retrieved by semantic similarity. Feature engineering for LLMs is chunking strategy, embedding model choice, and retrieval scoring. The principle is identical.
Layer 2: Learning Types
There are three fundamental learning paradigms: supervised, unsupervised, and reinforcement. Deep learning, covered last, is less a fourth paradigm than a family of models that can power any of the three. Every ML algorithm fits somewhere in this picture.
Supervised Learning
The model learns from labeled historical examples — cases where you know the correct answer.
ABC Pizza dispatch: given these conditions (features), a human-optimal dispatch took 28 minutes (label). Train on 2 million historical dispatches. The model learns to predict dispatch time for new conditions it's never seen.
When to use it: You have historical data with known outcomes. Prediction, classification, regression.
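A minimal sketch of that pattern, with synthetic data standing in for the 2 million historical dispatches. GradientBoostingRegressor is one reasonable algorithm choice, not necessarily what ABC Pizza shipped:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the dispatch dataset:
# 5 feature columns ~ (weather, driver_count, queue_depth, zone, hour).
rng = np.random.default_rng(42)
X = rng.normal(size=(2000, 5))
y = 25 + 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=2, size=2000)  # delivery minutes

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = GradientBoostingRegressor().fit(X_train, y_train)  # learn from labeled history
predicted_minutes = model.predict(X_test)                  # predict for unseen conditions
```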
Unsupervised Learning
The model finds hidden structure in data without labels.
ABC Pizza operations: which stores behave similarly to each other? No label needed — the model clusters stores by order pattern, peak hours, and cancellation rate. Stores in cluster 3 might share a staffing pattern that predicts Saturday failure.
When to use it: You want to discover patterns, segment data, detect anomalies.
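A minimal clustering sketch, with random numbers standing in for the per-store metrics. KMeans is one of several algorithms that would work here:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# One row per store: (avg_daily_orders, peak_hour, cancellation_rate) -- stand-in values.
rng = np.random.default_rng(1)
stores = rng.normal(size=(500, 3))

scaled = StandardScaler().fit_transform(stores)  # cluster on comparable scales
labels = KMeans(n_clusters=4, n_init=10, random_state=1).fit_predict(scaled)
# labels[i] is the behavioral cluster of store i -- no ground-truth label required
```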
Reinforcement Learning
The model learns by taking actions and receiving rewards — like training a dog, but for software.
ABC Pizza optimization: assign Driver A to Order X, observe the outcome (on-time: +1, late: -1), update the policy. Over millions of dispatches, the model learns an assignment policy that maximizes on-time delivery globally — not just per order.
When to use it: Sequential decision-making, optimization problems where outcomes are delayed.
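Real dispatch RL is contextual and sequential, but a multi-armed bandit sketch shows the core action-reward-update loop. The driver on-time probabilities below are invented for illustration:

```python
import random

random.seed(7)
n_drivers = 5
q = [0.0] * n_drivers                    # estimated value of assigning each driver
counts = [0] * n_drivers
p_on_time = [0.6, 0.7, 0.9, 0.5, 0.8]    # hidden environment, unknown to the learner
epsilon = 0.1

for _ in range(10_000):
    # Explore occasionally; otherwise exploit the best-known driver.
    a = random.randrange(n_drivers) if random.random() < epsilon else q.index(max(q))
    reward = 1 if random.random() < p_on_time[a] else -1   # on-time: +1, late: -1
    counts[a] += 1
    q[a] += (reward - q[a]) / counts[a]                    # incremental mean update

print("learned best driver:", q.index(max(q)))  # converges to driver 2 (p = 0.9)
```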
Deep Learning
Neural networks — layers of mathematical transformations that learn complex representations.
ABC Pizza document OCR: reading handwritten receipts from franchise partners requires deep learning — a convolutional neural network trained on millions of labeled receipt images. Rules cannot parse handwriting. Deep learning can.
When to use it: Images, audio, text, video — any unstructured data where traditional feature engineering fails.
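For scale, here is a toy convolutional network in PyTorch, shaped for 28x28 grayscale character crops. A production receipt-OCR model would be far larger and trained on real labeled images, and the 36-class output (digits plus letters) is an assumption:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3), nn.ReLU(), nn.MaxPool2d(2),   # 28x28 -> 13x13
    nn.Conv2d(16, 32, kernel_size=3), nn.ReLU(), nn.MaxPool2d(2),  # 13x13 -> 5x5
    nn.Flatten(),
    nn.Linear(32 * 5 * 5, 36),   # 36 classes: digits + letters (assumed)
)

batch = torch.randn(8, 1, 28, 28)   # 8 fake character crops
logits = model(batch)               # shape: (8, 36)
```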
Layer 3: Training
Training is the process of exposing the model to data and adjusting its internal parameters until it produces accurate predictions.
Experiment Tracking — at ABC Pizza, the data science team ran 47 dispatch model experiments: different features, different algorithms, different hyperparameters. With no tracking tool like MLflow in place, they had no record of which experiment produced which model. The model that shipped to production was identified by a filename: dispatch_model_final_v2_FINAL.pkl. Nobody knew what was in it.
This is the story behind Post 7. Experiment tracking is not a nice-to-have — it is the audit trail for model decisions.
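A minimal MLflow tracking sketch, assuming model is the trained regressor from the supervised sketch earlier; the run name, parameters, and metric value are illustrative:

```python
import mlflow
import mlflow.sklearn

mlflow.set_experiment("dispatch-model")

with mlflow.start_run(run_name="exp-48-weather-features"):
    mlflow.log_param("algorithm", "GradientBoostingRegressor")
    mlflow.log_param("features", "weather,driver_count,queue_depth,zone,hour")
    mlflow.log_metric("mae_minutes", 3.7)      # illustrative metric value
    mlflow.sklearn.log_model(model, "model")   # a versioned artifact, not FINAL_v2.pkl
```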
Evaluation — a model is only useful if its predictions are accurate enough to act on. Accuracy is the starting point: what percentage of predictions are correct? But in regulated or high-stakes domains, accuracy isn't enough. Precision (when you predict X, how often is it actually X?) and Recall (of all the actual X cases, how many did you catch?) matter differently depending on the cost of each error type. A fraud detection model that misses 20% of actual fraud is usually far more costly than one that wrongly flags 5% of legitimate transactions.
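The worked numbers, as a sketch: ten cases, four of them fraud. The labels below are invented to show how the two metrics diverge:

```python
from sklearn.metrics import precision_score, recall_score

# y_true: actual fraud labels; y_pred: model predictions (1 = fraud)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

precision = precision_score(y_true, y_pred)  # of 3 flagged, 2 were fraud -> 0.67
recall = recall_score(y_true, y_pred)        # of 4 actual frauds, caught 2 -> 0.50
```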
Model Registry — the versioned record of trained models. dispatch-model-v3.2 trained on data from 2024-01 to 2025-12, evaluated on the December holdout set, promoted to production on 2026-01-15. This is Post 9's story.
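With MLflow as the registry (matching the tracking sketch above; the run id below is a placeholder, not a real run):

```python
import mlflow

# Register the logged model under a name; consumers reference name + version.
run_id = "1a2b3c4d"  # illustrative -- in practice this comes from the tracking run
mv = mlflow.register_model(f"runs:/{run_id}/model", "dispatch-model")
print(mv.name, mv.version)   # "dispatch-model 3" beats dispatch_model_final_v2_FINAL.pkl
```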
Layer 4: Deployment
A trained model that isn't deployed is a research artifact. Deployment is where ML creates business value — and where most production failures happen.
Real-time endpoints serve predictions in milliseconds. The dispatch system needs a response in under 200ms — a driver assignment recommendation must appear before the driver's phone locks. Real-time endpoints (Azure Managed Online Endpoints, AKS-hosted services) are the deployment pattern.
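The serving pattern, sketched as a generic FastAPI service rather than the Azure-specific endpoint configuration. The artifact path and request schema mirror the dispatch features and are assumptions:

```python
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("dispatch_model.pkl")   # loaded once at startup, not per request

class DispatchRequest(BaseModel):
    weather: float
    driver_count: int
    order_queue_depth: int
    store_zone: int
    time_of_day: int

@app.post("/predict")
def predict(req: DispatchRequest):
    features = [[req.weather, req.driver_count, req.order_queue_depth,
                 req.store_zone, req.time_of_day]]
    return {"predicted_minutes": float(model.predict(features)[0])}
```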
Batch inference runs predictions on large datasets on a schedule. Nightly store performance scoring, weekly customer churn prediction, monthly demand forecasting — these don't need millisecond latency. Batch pipelines are cheaper and simpler.
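A batch scoring sketch; the file paths, model artifact, and column names are all illustrative:

```python
import joblib
import pandas as pd

# Nightly batch job: score every store, write results for downstream dashboards.
FEATURE_COLUMNS = ["avg_daily_orders", "peak_hour", "cancellation_rate"]

model = joblib.load("store_performance_model.pkl")
stores = pd.read_parquet("store_features.parquet")

stores["performance_score"] = model.predict(stores[FEATURE_COLUMNS])
stores[["store_id", "performance_score"]].to_parquet("nightly_scores.parquet")
```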
A/B testing — you never replace a working model with a new one blindly. Champion-challenger: 90% of traffic goes to the current model, 10% to the new candidate. Measure both. Promote when the challenger is statistically better. This is how ABC Pizza promoted dispatch model v3 over v2 without a service disruption.
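The routing half of champion-challenger can be a few lines. A deterministic hash split is one common implementation choice; the function below is a sketch, not ABC Pizza's production router:

```python
import hashlib

def route(order_id: str, challenger_share: float = 0.10) -> str:
    # Deterministic hash split: the same order always hits the same model,
    # which keeps the comparison clean and the behavior reproducible.
    bucket = int(hashlib.sha256(order_id.encode()).hexdigest(), 16) % 100
    return "challenger" if bucket < challenger_share * 100 else "champion"

print(route("order-42"))   # ~90% of ids land on "champion"
```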
Layer 5: Operations
The layer that makes ML sustainable in production. The layer most teams skip until something breaks.
Model monitoring — model accuracy degrades over time as the world changes. The dispatch model trained in 2024 learned patterns from 2024 data. By 2026, EV delivery vehicles have different range characteristics, a new housing development changed route times, and remote work shifted peak hours. The model doesn't know. It keeps predicting from 2024 patterns.
This is called drift. Data drift: the inputs to the model change distribution. Concept drift: the relationship between inputs and outputs changes. Post 13 covers this in full.
Retraining pipelines automatically detect drift and trigger retraining on fresh data. Without this, someone has to notice that accuracy is declining — and in a production system, that means a business metric (late deliveries, customer complaints) degrades before the engineering team acts.
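A drift check can start small. This sketch runs a two-sample Kolmogorov-Smirnov test from scipy on one feature; the distributions are synthetic and the significance threshold is an assumption:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(2)
training_dist = rng.normal(30, 5, size=10_000)   # delivery distances seen in training
live_dist = rng.normal(38, 6, size=2_000)        # what production sees this week

# Has the input distribution shifted since training?
stat, p_value = ks_2samp(training_dist, live_dist)
if p_value < 0.01:
    # In a real pipeline this branch kicks off the retraining job.
    print(f"data drift detected (KS={stat:.3f}) -- triggering retraining")
```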
Governance — in regulated domains, every model decision must be explainable and auditable. MortgageIQ's AI governance post covers this for LLM-based systems. The same principles apply to traditional ML: which model version made this prediction? What features drove it? Is the model producing disparate outcomes across demographic groups?
The ABC Pizza Platform — Where We Started
After the Tokyo incident, here is what ABC Pizza had:
| Layer | Status |
| --- | --- |
| Data | ✅ 3 years of dispatch history in Azure SQL |
|  | ✅ Weather API integration |
|  | ❌ No feature engineering pipeline |
|  | ❌ Data in 4 different formats across regions |
| Learning | ❌ No models — rules engine only |
|  | ❌ No data science team |
| Training | ❌ No experiment tracking |
|  | ❌ No evaluation framework |
|  | ❌ No model registry |
| Deployment | ❌ No ML endpoints |
|  | ❌ Dispatch logic embedded in monolith |
| Operations | ❌ No monitoring |
|  | ❌ No retraining |
|  | ❌ No governance |
By the end of the 20-post series, every cell in that table will be checked. The series documents the journey from the Tokyo incident to a production ML platform that handles 60M+ orders per year.
Why the Rules Engine Was Never Going to Be Enough
Rules are explicit knowledge. They encode what engineers know at the time they're written. They are:
- Brittle — they break on inputs outside their design range
- Static — they don't improve as conditions change
- Incomplete — nobody can enumerate every combination of weather, driver availability, order volume, and route condition that a 20,000-store global platform will encounter
- Expensive to maintain — every new scenario requires an engineer to write a new rule
Machine learning is implicit knowledge. The model encodes what historical outcomes teach it. It:
- Generalizes — it handles scenarios it hasn't seen before, by interpolating from similar cases
- Improves — retraining on fresh data keeps it current
- Scales — the same model handles Tokyo and Toronto and São Paulo without regional rule sets
- Discovers — it finds patterns humans didn't know to look for
The typhoon in Tokyo was the rules engine's last stand. The ML platform that replaced it has not returned null since.
Key Takeaways
The ABC Pizza problem: A rules engine that encoded 2011 knowledge failed catastrophically on a 2026 scenario nobody had anticipated — a typhoon, a city-wide shutdown, and 40% driver absence simultaneously. The system returned null. The business failed silently for hours.
The transferable principle: Rules encode what you know. ML learns from what happened. Every system that operates at scale will eventually encounter a scenario outside the rules' design range. The question is whether you find out from a monitoring alert or from a support ticket.
What I'd do differently: Start feature engineering before the model. The Tokyo incident was partly a data quality problem — weather data was regional, driver GPS was sampled at 5-minute intervals, and order cancellations weren't logged consistently. A better dataset would have caught the anomaly earlier even with a simple model.
Watch out for: Skipping Layers 4 and 5. Every team is excited about training a model. Almost no team plans for deployment infrastructure and drift monitoring before the first model ships. By the time drift becomes a problem, the team has moved on to the next project and nobody owns the degrading model.
What's Next in This Series
This post is the map. Every subsequent post is one component of this map, told through the story of building ABC Pizza's ML platform:
- Post 2 — The three learning types: how we tried clustering before regression and got it backwards
- Post 3 — Feature engineering: why the Tokyo model's first version was 60% accurate and the second was 91%
- Post 4 — Train/validate/test: why the model that aced the test set failed on Super Bowl Sunday
- Post 5 — From notebook to production: the 6-week gap between "it works on my machine" and deployed