ai-ml · March 24, 2026 · Tags: fine-tuning, azure, openai, machine-learning, xgboost, azure-ml, ai-foundry, abc-pizza, tabular-ml

Two Models, One Platform: Fine-Tuning XGBoost and GPT-4o in Azure for ABC Pizza

ABC Pizza needed two custom models — one to predict dispatch time, one to explain it. XGBoost on Azure ML for the number. GPT-4o on Azure AI Foundry for the language. Same platform, two completely different fine-tuning paradigms. Here's how both work and when to use each.

We almost fine-tuned the wrong model.

After the Tokyo incident, the ABC Pizza engineering team had a clear goal: build a custom dispatch model trained on 3 years of our own history. The initial plan was to fine-tune GPT-4o on Azure AI Foundry. The data team was prepping JSONL files. The ML engineer had the Python SDK open.

Then someone asked: "Why are we using a language model to predict a number?"

Dispatch time prediction is a regression problem. The output is a number — minutes. The inputs are structured tabular data: weather severity, driver count, queue depth, zone, time of day, event flag. GPT-4o is a language model trained to predict the next token in a sequence of text. It is extraordinarily good at language. It is not the right architecture for structured tabular regression.

We stopped. We reassessed. We ended up building two custom models — not one:

  • XGBoost on Azure ML — predicts dispatch time in minutes. Tabular regression. Trained in 20 minutes. Costs $8 per run. 94% accurate.
  • GPT-4o fine-tuned on Azure AI Foundry — explains dispatch decisions to franchise managers in plain English. Language task. Trained in 16 hours. Costs $900 per run.

Same Azure platform. Two completely different paradigms. Each solving a problem the other cannot.

This post covers both — what each fine-tuning approach involves, how they differ, and the architectural judgment that determines which one your problem needs.


The Core Question Before Any Fine-Tuning Decision

The rule: If you can describe the output as a number, a category, or a probability — reach for tabular ML. If the output is language — reach for an LLM. This single question would have saved the ABC Pizza team two weeks of misdirected data preparation.
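The rule is simple enough to write down. Here is a toy router, purely illustrative (the function name and return labels are inventions for this sketch, not part of any SDK):

```python
def choose_model_family(output_type: str) -> str:
    """Route a problem to a model family by asking one question:
    what type is the output?"""
    if output_type in {"number", "category", "probability"}:
        return "tabular-ml"   # e.g. XGBoost on Azure ML
    if output_type == "language":
        return "llm"          # e.g. GPT-4o on Azure AI Foundry
    raise ValueError(f"describe the output first: got {output_type!r}")
```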


Model 1: XGBoost on Azure ML — Predicting the Number

Why XGBoost for Dispatch Prediction

XGBoost (Extreme Gradient Boosting) is the dominant algorithm for structured tabular data. It wins Kaggle competitions. It powers Uber's ETA model, Airbnb's pricing engine, and Spotify's churn prediction. For any task where:

  • The data is structured rows and columns
  • The output is a number or category
  • You need millisecond inference
  • Explainability matters (regulators, operations teams, franchise partners)

XGBoost is the starting point. Not GPT-4o. Not a neural network. XGBoost.

Why not a neural network? Neural networks need large datasets and GPU compute to outperform gradient boosting on tabular data. For datasets under ~1M rows, XGBoost almost always wins on accuracy, training speed, and inference latency, and it remains competitive well beyond that size. ABC Pizza's 60M dispatch records train comfortably with Azure ML's distributed training on CPU clusters.

Feature Engineering for Tabular ML

This is the work that determines 80% of model quality — and it has nothing to do with the algorithm.

Raw dispatch data:

order_id: 84732
timestamp: 2026-01-17 18:45:23
store_id: CHI-LP-007
weather_code: SNOW_HEAVY
drivers_online: 4
queue_depth: 14

Engineered features the model actually trains on:

features = {
    # Time features
    "hour_of_day": 18,
    "day_of_week": 4,          # Friday
    "is_weekend": 0,
    "minutes_to_peak": 15,     # derived: peak is 19:00

    # Weather features
    "snowfall_cmph": 8.2,
    "wind_speed_kph": 34,
    "visibility_km": 0.8,
    "weather_severity_score": 0.87,  # composite

    # Operational features
    "drivers_tier1": 2,
    "drivers_tier2": 2,
    "queue_depth": 14,
    "queue_per_driver": 3.5,   # derived
    "avg_order_distance_km": 2.3,

    # Zone features
    "zone_id": "CHI-LP-NORTH",
    "zone_surge_multiplier": 3.4,
    "zone_event_flag": 1,      # Bears home game
    "zone_historical_delay_p90": 28,  # 90th percentile delay this zone

    # Interaction features
    "snow_x_queue": 8.2 * 14,      # interaction term
    "event_x_surge": 1 * 3.4,      # interaction term
}

The interaction features — snow_x_queue, event_x_surge — are where domain expertise turns into model performance. A data scientist without dispatch domain knowledge would not create these. A senior dispatcher would immediately recognize them as what actually drives cascading failures.

This is why feature engineering is a separate post (Post 3). It deserves the full treatment.
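The derived and interaction features above are plain arithmetic on the raw record. A minimal sketch of that derivation, using the field names from the example (not the team's actual pipeline):

```python
# Raw record fields from the example above
raw = {
    "snowfall_cmph": 8.2,
    "queue_depth": 14,
    "drivers_online": 4,
    "zone_surge_multiplier": 3.4,
    "zone_event_flag": 1,
}

features = {
    # Derived ratio: load per available driver
    "queue_per_driver": raw["queue_depth"] / raw["drivers_online"],
    # Interaction terms: the compounding effects a senior dispatcher
    # recognizes on sight, made explicit for the model
    "snow_x_queue": raw["snowfall_cmph"] * raw["queue_depth"],
    "event_x_surge": raw["zone_event_flag"] * raw["zone_surge_multiplier"],
}
```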

Azure ML Training Pipeline

from azure.ai.ml import MLClient, command, Input, Output

# Connect to the workspace and fetch the registered feature dataset
ml_client = MLClient.from_config()
dispatch_data = ml_client.data.get("abc-pizza-dispatch-features", version="3.1")

# Define the training job — note the ${{...}} binding syntax required by SDK v2
job = command(
    code="./src/train",
    command="python train_xgboost.py --data ${{inputs.dispatch_data}} --output ${{outputs.model}}",
    inputs={"dispatch_data": Input(type=dispatch_data.type, path=dispatch_data.id)},
    outputs={"model": Output(type="mlflow_model")},
    compute="cpu-cluster-8core",
    environment="azureml:sklearn-xgboost-env:1.0",
    experiment_name="abc-pizza-dispatch-v3",
    display_name="xgboost-dispatch-baseline",
)
ml_client.jobs.create_or_update(job)

# train_xgboost.py
import matplotlib.pyplot as plt
import mlflow
import shap
import xgboost as xgb
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

def train(data_path, output_path):
    df = load_features(data_path)  # project helper: loads the engineered feature set
    X, y = df.drop("dispatch_time_minutes", axis=1), df["dispatch_time_minutes"]
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

    with mlflow.start_run():
        model = xgb.XGBRegressor(
            n_estimators=500,
            max_depth=7,
            learning_rate=0.05,
            subsample=0.8,
            colsample_bytree=0.8,
            random_state=42,
            early_stopping_rounds=20,  # constructor argument since XGBoost 2.0
        )
        model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=100)

        preds = model.predict(X_val)
        mae = mean_absolute_error(y_val, preds)

        # Log to MLflow
        mlflow.log_params(model.get_params())
        mlflow.log_metric("val_mae_minutes", mae)
        mlflow.log_metric("val_accuracy_5min", accuracy_within_n(y_val, preds, n=5))

        # SHAP explainability — not available for LLMs
        explainer = shap.TreeExplainer(model)
        shap_values = explainer.shap_values(X_val[:1000])
        shap.summary_plot(shap_values, X_val[:1000], show=False)
        mlflow.log_figure(plt.gcf(), "shap_summary.png")  # summary_plot draws on the current figure

        mlflow.xgboost.log_model(model, "model")
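The script logs a `val_accuracy_5min` metric via an `accuracy_within_n` helper that isn't shown. A minimal version, assuming the straightforward definition (fraction of predictions within n minutes of actual):

```python
import numpy as np

def accuracy_within_n(y_true, y_pred, n=5):
    """Fraction of predictions landing within n minutes of the actual
    dispatch time — the business-facing accuracy number."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred) <= n))
```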

HyperDrive — Automated Hyperparameter Tuning

Manually picking XGBoost hyperparameters is guesswork. HyperDrive searches the hyperparameter space systematically using Bayesian optimization — each trial informs the next:

from azure.ai.ml.sweep import BayesianSamplingAlgorithm, Choice, Uniform

# Bind the search space to the command job's hyperparameter inputs
# (assumes the command declares matching n_estimators / max_depth /
#  learning_rate / subsample inputs)
job_for_sweep = job(
    n_estimators=Choice([200, 300, 500, 700]),
    max_depth=Choice([4, 5, 6, 7, 8]),
    learning_rate=Uniform(0.01, 0.15),
    subsample=Uniform(0.6, 1.0),
)

sweep_job = job_for_sweep.sweep(
    sampling_algorithm=BayesianSamplingAlgorithm(),
    primary_metric="val_mae_minutes",
    goal="Minimize",
)
sweep_job.set_limits(max_total_trials=40, max_concurrent_trials=8)
ml_client.jobs.create_or_update(sweep_job)

40 trials, 8 running in parallel, Bayesian sampling. Best trial: n_estimators=500, max_depth=7, learning_rate=0.05 — MAE of 3.2 minutes.

SHAP Explainability — The Feature That LLMs Can't Match

The XGBoost model can explain every prediction. A franchise partner in Tokyo asks: "Why did the model predict 47 minutes for this order?"

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(order_features)

# Output:
# Base prediction: 18.4 min
# + zone_event_flag (Bears game):     +12.3 min
# + snowfall_cmph (8.2 cm/hr):        +8.7 min
# + queue_per_driver (3.5):           +6.1 min
# + snow_x_queue interaction:         +4.2 min
# - drivers_tier1 (2 available):      -2.8 min
# = Final prediction: 46.9 min ≈ 47 min

This level of transparency is required for franchise operations. When a prediction drives a driver assignment decision that costs a partner $200 in overtime, they need to understand why. SHAP makes that possible.

GPT-4o fine-tuned on JSONL data cannot produce this. Token attribution (attention weights) gives only a rough approximation; SHAP gives exact feature contributions. For regulated, high-stakes operational decisions, explainability is a non-negotiable requirement, and tabular ML delivers it.

Results: XGBoost on Azure ML

| Metric | Rules Engine | General Model | Fine-Tuned XGBoost |
|---|---|---|---|
| Dispatch accuracy (±5 min) | 61% | 71% | 94% |
| Surge scenario accuracy | 22% | 43% | 91% |
| Storm scenario accuracy | 18% | 38% | 89% |
| Inference latency (p99) | <1ms | 1,200ms | 8ms |
| Cost per 1M predictions | ~$0 | ~$300 | ~$0.40 |
| Explainability | Rules only | None | Full SHAP |
| Training time | N/A | N/A | 22 minutes |
| Training cost | N/A | N/A | ~$8 |

The latency and cost columns tell the real story. At 60M orders per year, GPT-4o inference at $300/1M predictions = $18,000/year just for dispatch time prediction. XGBoost at $0.40/1M = $24/year. For a number prediction task, the LLM is 750× more expensive with no accuracy advantage.
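The arithmetic behind those figures is worth making explicit. A back-of-envelope check, using the per-1M unit costs from the table above:

```python
orders_per_year = 60_000_000

# Unit costs from the results table above
llm_cost = orders_per_year / 1_000_000 * 300.00   # GPT-4o: ~$300 per 1M predictions
xgb_cost = orders_per_year / 1_000_000 * 0.40     # XGBoost: ~$0.40 per 1M predictions

print(f"LLM ${llm_cost:,.0f}/yr vs XGBoost ${xgb_cost:,.0f}/yr "
      f"({llm_cost / xgb_cost:,.0f}x more expensive)")
```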


Model 2: GPT-4o Fine-Tuned on Azure AI Foundry — Explaining the Decision

XGBoost predicts 47 minutes. That number needs to reach a franchise manager as an explanation — not a raw output. "47 minutes" means nothing without context. "Your delivery will take 47 minutes — there's a Bears game causing a 340% surge in Lincoln Park and heavy snow is limiting driver range" is actionable.

This is a language task. GPT-4o is the right model.

The full fine-tuning story for language tasks — PII handling, labeling rubric, compliance constraints, Azure AI Foundry walkthrough, and production evaluation — is documented in detail in the MortgageIQ post. The pattern is identical for ABC Pizza's franchise communication use case:

The training data structure:

{"messages": [
  {"role": "system", "content": "You are ABC Pizza's dispatch advisor. Explain delivery predictions clearly to franchise managers. Use plain English. Include the key factors driving the estimate."},
  {"role": "user", "content": "Prediction: 47 min | Zone: Lincoln-Park-North | Surge: 3.4x | Weather: Heavy snow | Event: Bears home game | Queue: 14 orders | Tier-1 drivers: 2"},
  {"role": "assistant", "content": "Estimated delivery time is 47 minutes — about 25 minutes longer than your zone average. Two factors are driving this: there's a Bears home game causing a 340% order surge in your zone, and heavy snow is limiting your Tier-1 drivers to nearby orders only. Recommend activating your backup driver pool now. Surge typically peaks at kickoff (19:00) and clears within 90 minutes."}
]}

Azure AI Foundry fine-tuning call:

import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-10-21",
)

# Upload the JSONL files, then reference them by file ID
train_file = client.files.create(
    file=open("dispatch-explanations-train.jsonl", "rb"), purpose="fine-tune")
val_file = client.files.create(
    file=open("dispatch-explanations-val.jsonl", "rb"), purpose="fine-tune")

job = client.fine_tuning.jobs.create(
    model="gpt-4o-2024-08-06",  # the GPT-4o snapshot that supports fine-tuning
    training_file=train_file.id,
    validation_file=val_file.id,
    hyperparameters={"n_epochs": 3, "batch_size": 16, "learning_rate_multiplier": 0.05},
    suffix="abc-pizza-advisor-v1",
)

Side-by-Side: Two Fine-Tuning Paradigms

| Dimension | XGBoost on Azure ML | GPT-4o on Azure AI Foundry |
|---|---|---|
| Task | Regression — predict minutes | Language — explain the decision |
| Output | A number | A paragraph |
| Training data format | CSV / DataFrame | JSONL message pairs |
| Training time | 22 minutes | 16 hours |
| Compute type | CPU cluster | GPU cluster (A100) |
| Cost per training run | ~$8 | ~$900 |
| Hyperparameter tuning | HyperDrive (Bayesian) | Manual epochs/lr |
| Explainability | Full SHAP values | Limited token attribution |
| Inference latency | 8ms | 800ms–1.5s |
| Cost per 1M inferences | ~$0.40 | ~$300 |
| Retraining trigger | Accuracy drift >3% | Tone/quality degradation |
| Evaluation metric | MAE, accuracy ±5 min | Readability, manager satisfaction |
| Azure service | Azure ML + MLflow | Azure AI Foundry + PromptFlow |
| Right for | Numbers, predictions, classification | Language, tone, communication |

Pre-Training Philosophy: What Each Model Already Knows

XGBoost starts from scratch. There is no pre-trained XGBoost. Every tree is grown from your data. This means the model knows only what your data teaches it — nothing about the world outside your training set. It is entirely dependent on feature quality and data coverage.

GPT-4o starts from the internet. It already understands language, context, tone, domain vocabulary, and how operational decisions are communicated in logistics and food service. Fine-tuning adds your specific style and operational context on top of a foundation that would take years to rebuild from scratch.

This is why XGBoost needs millions of rows to be reliable, and GPT-4o fine-tuning can produce meaningful results with thousands of examples. The foundation model carries the weight of general knowledge. You only need to teach it the delta.


Post-Training: Both Models Degrade — Differently

XGBoost degrades when the data distribution shifts. New zones, new event types, seasonal pattern changes — the model has never seen them. Accuracy drops on those segments first. Azure ML's data drift monitoring catches this: when the distribution of incoming features diverges from the training distribution by more than a threshold, retraining is triggered.
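As a crude stand-in for that drift trigger, here is a mean-shift heuristic in plain Python. The real Azure ML monitor uses proper distribution-distance tests; this only illustrates the idea, and `feature_drifted` is an invented name:

```python
from statistics import mean, pstdev

def feature_drifted(train_col, live_col, threshold=0.5):
    """Flag a feature for retraining when its live mean has moved more
    than `threshold` training standard deviations from the training mean."""
    spread = pstdev(train_col) or 1e-9  # guard against zero-variance features
    return abs(mean(live_col) - mean(train_col)) / spread > threshold
```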

GPT-4o degrades when the communication requirements change. New operational terminology, new escalation procedures, regulatory language updates — the model's tone drifts from current standards. This is harder to detect automatically. It requires periodic human evaluation — a panel of senior franchise managers scoring model outputs against the current rubric.

The retraining loop for each:

  • XGBoost: drift monitor fires → refresh the feature dataset → retrain (~22 minutes, ~$8) → validate MAE against the holdout → redeploy.
  • GPT-4o: rubric scores dip in periodic review → curate new JSONL examples against current standards → fine-tune (~16 hours, ~$900) → human panel sign-off → redeploy.


What I've Seen Fail

1. Using GPT-4o for tabular prediction. Almost happened at ABC Pizza. The symptoms: the team was building JSONL files from structured dispatch records — converting a CSV problem into a text problem. When you find yourself serializing a DataFrame to JSON to feed an LLM, stop. You are using the wrong tool.

2. Not running HyperDrive before declaring a model "good enough." The first XGBoost run with manually chosen hyperparameters achieved 87% accuracy. The team wanted to ship it. HyperDrive found a configuration that hit 94% in 40 trials. The 7-point gain was entirely in hyperparameter selection — same data, same algorithm. Always tune before you declare done.

3. Treating SHAP as optional. Franchise partners are business owners. When a model-driven decision costs them money, they want an explanation. "The AI decided" is not an acceptable answer. SHAP feature importance is the audit trail for every XGBoost prediction. Build it in from the start, not as an afterthought when the first angry call comes in.

4. Fine-tuning GPT-4o before exhausting RAG. For the manager explanation task, the team initially tried fine-tuning before building a RAG layer. The fine-tuned model explained decisions well for common scenarios but hallucinated context for edge cases. Adding RAG first — retrieving the actual prediction factors (SHAP values, zone data, event context) as grounding — and then fine-tuning on tone produced a better result at lower cost.

5. One retraining strategy for both models. XGBoost needs data-driven retraining triggers. GPT-4o needs quality-driven retraining triggers. Teams that apply the same monitoring strategy to both end up either retraining the LLM too frequently (expensive) or missing tone drift in the LLM because they're only watching accuracy metrics.
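The RAG-first pattern from point 4 comes down to a grounding step: serialize the actual prediction factors (SHAP contributions, zone data, event context) into the prompt so the model explains real numbers instead of inventing context. A sketch, where `build_grounded_prompt` is a hypothetical helper:

```python
def build_grounded_prompt(prediction_min, factors):
    """Serialize the real prediction drivers into the user message so the
    fine-tuned model explains grounded facts, largest contributors first."""
    lines = [f"Prediction: {prediction_min} min"]
    for name, contribution in sorted(factors.items(), key=lambda kv: -abs(kv[1])):
        lines.append(f"{name}: {contribution:+.1f} min")
    return " | ".join(lines)
```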


Key Takeaways

The ABC Pizza problem: We almost used a language model to predict a number. The right tool was XGBoost — trained in 22 minutes at $8, delivering 94% accuracy with full SHAP explainability and 8ms inference. GPT-4o fine-tuning was the right tool for a different sub-problem: explaining decisions to franchise managers in plain English.

The transferable principle: The output type determines the model family. Numbers and categories → tabular ML (XGBoost, Azure ML). Language and explanation → LLMs (GPT-4o, Azure AI Foundry). This decision is architectural, not technical preference. Getting it wrong wastes weeks and produces a model that's expensive, slow, and unexplainable.

What I'd do differently: Draw the "output type" decision before writing a line of code. One sentence: "The output is [a number / a category / a sentence]." That sentence determines your entire model family, training pipeline, evaluation strategy, and infrastructure. Write it on the whiteboard before the first meeting ends.

Watch out for: The LLM default. In 2026, every ML conversation gravitates toward LLMs because they're powerful, accessible, and exciting. For 80% of enterprise ML problems — structured data prediction, classification, anomaly detection — XGBoost with HyperDrive on Azure ML is faster, cheaper, more accurate, and more explainable. Match the tool to the problem, not to the trend.

