GPT-4o costs $30 per million input tokens. At 5,000 queries per day with 2,000 input tokens each (a modest production RAG system), that's roughly $9,000 per month in input costs alone, before output tokens, before embeddings, before re-indexing. Most engineering teams do not model this before they ship.
The CFO will ask. The question is whether you have an answer before or after the first invoice.
Why AI Cost Is Different From Every Other Cloud Cost
Server cost scales with compute. Storage cost scales with data. AI cost scales with usage patterns — how complex the query is, how much context was retrieved, which model handled it, whether the response was cached, and how many tokens the model generated.
The difference matters because:
- A 2,000-token input query and a 200-token input query can cost 10x differently
- A GPT-4o call and a GPT-4o-mini call for the same task can cost 200x differently
- A cached response and a fresh response can cost 0 vs. $0.03
None of these variations are visible to a standard cloud cost alert. By the time Azure Cost Management sends a budget alert, the month is half over and the damage is done.
FinOps for AI is the discipline of designing cost governance into the system architecture before the first token is spent, not as a remediation after the first spike.
The Numbers Every Engineer Should Know
| Model | Input cost | Output cost | Best for |
|---|---|---|---|
| GPT-4o | $30 / 1M tokens | $60 / 1M tokens | Complex reasoning, tool calling, structured extraction |
| GPT-4o-mini | $0.15 / 1M tokens | $0.60 / 1M tokens | Classification, summarization, simple Q&A |
| text-embedding-3-large | $0.13 / 1M tokens | — | High-quality document + query embedding |
| text-embedding-3-small | $0.02 / 1M tokens | — | Acceptable quality, 6.5x cheaper |
The GPT-4o to GPT-4o-mini input cost ratio is 200:1. The same million input tokens that cost $30 on GPT-4o cost $0.15 on GPT-4o-mini. For every task that GPT-4o-mini can handle adequately, routing to the cheaper model is a 99.5% cost reduction.
Worked example — MortgageIQ at production scale:
Scenario: 5,000 queries/day, average 2,000 input tokens + 500 output tokens per query
All GPT-4o:
Input: 5,000 × 2,000 tokens × $30/1M = $300/day = $9,000/month
Output: 5,000 × 500 tokens × $60/1M = $150/day = $4,500/month
Total: ~$13,500/month
Routed — 70% GPT-4o-mini, 30% GPT-4o:
GPT-4o (30%): 1,500 × 2,000 × $30/1M = $90/day
GPT-4o-mini (70%): 3,500 × 2,000 × $0.15/1M = $1.05/day
Output (blended): ~$50/day
Total: ~$141/day = ~$4,230/month
Savings: ~$9,270/month — 69% reduction
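A few lines of Python make this what-if analysis repeatable as volumes and routing splits change. The sketch below rederives the figures above from this section's own prices and volumes; it lands within rounding of the numbers shown, and nothing in it comes from an SDK.

```python
# Sketch: blended-cost model for a routed RAG workload.
# Prices ($ per 1M tokens) and volumes are the assumptions used in this section.
PRICES = {
    "gpt-4o":      {"input": 30.00, "output": 60.00},
    "gpt-4o-mini": {"input": 0.15,  "output": 0.60},
}

def daily_cost(queries: int, input_tokens: int, output_tokens: int,
               mini_share: float) -> float:
    """Cost per day for a given GPT-4o-mini routing share (0.0-1.0)."""
    cost = 0.0
    for model, share in (("gpt-4o-mini", mini_share), ("gpt-4o", 1 - mini_share)):
        n = queries * share
        cost += n * input_tokens  * PRICES[model]["input"]  / 1_000_000
        cost += n * output_tokens * PRICES[model]["output"] / 1_000_000
    return cost

all_4o = daily_cost(5_000, 2_000, 500, mini_share=0.0)  # ~$450/day, ~$13,500/month
routed = daily_cost(5_000, 2_000, 500, mini_share=0.7)  # ~$137/day (the text rounds output up, giving ~$141)
print(f"all GPT-4o: ${all_4o:,.0f}/day   routed: ${routed:,.0f}/day")
```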
This is why model routing is an architecture decision, not an optimization pass.
The Four Levers of AI Cost Governance
Lever 1: Model Routing
Not every query needs GPT-4o. The canonical routing strategy puts a cheap classifier in front of the model choice: classify each query, send simple tasks to GPT-4o-mini, and reserve GPT-4o for the tasks that need it (a code sketch of the decision logic follows at the end of this lever).
The classifier itself is a GPT-4o-mini call — ~50 tokens, ~$0.000008 per query. The routing decision pays for itself if it correctly routes even 1 in 10,000 queries away from GPT-4o.
Tasks that route to GPT-4o-mini:
- FAQ retrieval with simple context (single chunk, factual answer)
- Topic classification ("is this question about credit scores or closing costs?")
- Response summarization for display
- Input intent detection
Tasks that stay on GPT-4o:
- Multi-step reasoning ("Can I afford this house given my income and existing debt?")
- Tool-calling with multiple tools in a single turn
- Structured data extraction from complex documents
- Any task where GPT-4o-mini was evaluated and failed quality thresholds
The routing decision lives in Azure API Management as a policy — not in application code. This means routing rules can change without a deployment.
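As a rough illustration of the decision logic (the real rule, as noted above, belongs in an APIM policy rather than application code), the sketch below classifies each query with GPT-4o-mini and picks a deployment. The endpoint, deployment names, label set, and classifier prompt are all assumptions for the example.

```python
# Sketch: route each query with a cheap GPT-4o-mini classifier.
# Deployment names, labels, and the prompt are illustrative assumptions.
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",  # placeholder
    api_key="<key>",                                            # placeholder
    api_version="2024-06-01",
)

SIMPLE = {"faq", "classification", "summarization", "intent"}

def classify(question: str) -> str:
    """~50-token GPT-4o-mini call (~$0.000008) that labels the query type."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # deployment name: assumption
        max_tokens=5,
        messages=[
            {"role": "system", "content":
             "Label the user question with one word: faq, classification, "
             "summarization, intent, or complex."},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content.strip().lower()

def pick_deployment(question: str) -> str:
    """Cheap tasks go to GPT-4o-mini; everything else stays on GPT-4o."""
    return "gpt-4o-mini" if classify(question) in SIMPLE else "gpt-4o"
```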
Lever 2: Context Budget Enforcement
Every retrieved chunk added to a prompt costs tokens. A system that injects 10 chunks per query costs more than a system that injects 3 — and often produces worse answers, because the model's attention dilutes across irrelevant context.
The 2,000-token retrieval budget (from MortgageIQ):
- 3 chunks of ~500 tokens each = 1,500 tokens of context
- 500 tokens for base system prompt
- 2,000 tokens total before the user's question
- User question typically 10–50 tokens
This keeps the prompt well within GPT-4o's optimal attention range and bounds cost precisely.
The token budget calculation that matters:
Monthly context cost = queries/day × days × retrieved_tokens × model_input_price
= 5,000 × 30 × 2,000 × ($30 / 1,000,000)
= $9,000/month just from retrieved context
Cut retrieved_tokens from 2,000 to 1,000:
= $4,500/month — saves $4,500 without changing the model
Context budget is the highest-leverage cost lever after model routing. It requires no infrastructure change — just a hard cap in the retrieval layer.
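A hard cap in the retrieval layer can be as small as the sketch below, which keeps top-ranked chunks until the 1,500-token context budget is spent. It assumes the tiktoken o200k_base encoding is a fair proxy for GPT-4o tokenization; the chunk shape and budget split come from this section.

```python
# Sketch: hard 2,000-token prompt budget enforced in the retrieval layer.
import tiktoken

ENC = tiktoken.get_encoding("o200k_base")  # assumption: GPT-4o-compatible tokenizer
SYSTEM_PROMPT_TOKENS = 500
CONTEXT_BUDGET = 2_000 - SYSTEM_PROMPT_TOKENS  # 1,500 tokens for retrieved chunks

def enforce_budget(ranked_chunks: list[str]) -> list[str]:
    """Keep top-ranked chunks until the token budget is exhausted (hard cap)."""
    kept, used = [], 0
    for chunk in ranked_chunks:
        n = len(ENC.encode(chunk))
        if used + n > CONTEXT_BUDGET:
            break  # hard cap: drop the rest rather than stretch the budget
        kept.append(chunk)
        used += n
    return kept
```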
Lever 3: Prompt Caching
Azure OpenAI supports prompt caching on inputs longer than 1,024 tokens. When the prefix of a prompt matches a recently cached prefix, the cached portion costs 50% less.
For a RAG system with a static system prompt:
System prompt: 500 tokens — always the same
Retrieved context: 1,500 tokens — varies per query
User question: 20–50 tokens — varies per query
Cacheable prefix: the 500-token system prompt
If the system prompt is pinned (not dynamically generated), Azure OpenAI caches it automatically. On a 5,000-query/day system, that's 500 tokens × 5,000 queries = 2.5M tokens of cacheable prefix per day, roughly 75M tokens/month billed at half price.
This is free — it requires only that the system prompt not be regenerated on every request. A surprising number of systems regenerate the system prompt dynamically, defeating caching.
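The discipline is mechanical: keep the static prefix byte-identical and put everything that varies after it. A minimal sketch, with illustrative prompt text:

```python
# Sketch: keep the cacheable prefix byte-identical on every request.
SYSTEM_PROMPT = (
    "You are MortgageIQ, a mortgage lending assistant. Answer only from the "
    "provided context and cite the source document for every claim."
)  # module-level constant: never rebuilt, never given timestamps or request IDs

def build_messages(retrieved_context: str, question: str) -> list[dict]:
    return [
        {"role": "system", "content": SYSTEM_PROMPT},  # static, cacheable prefix
        {"role": "user", "content":
            f"Context:\n{retrieved_context}\n\nQuestion: {question}"},  # dynamic
    ]

# Anti-pattern that defeats caching: a prefix that changes on every call.
# system = f"[{datetime.utcnow()}] You are MortgageIQ..."   # don't do this
```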
Lever 4: Semantic Caching
A semantic cache stores previous answers and returns a cached answer when the new query is a near-duplicate. Unlike exact-match HTTP caching, semantic caching matches by embedding similarity — "What credit score for FHA?" and "FHA loan minimum credit score?" are different strings but semantically equivalent queries.
Cache hit rates in production: 20–40% for FAQ-style applications, lower for open-ended queries. At 30% hit rate on 5,000 queries/day, that's 1,500 LLM calls eliminated daily.
Implementation on Azure: Redis or Cosmos DB with a vector index. Each entry stores the query embedding, the answer, and a TTL (time-to-live to handle knowledge base updates). A similarity threshold of 0.92 is a reasonable starting point — high enough to avoid returning a cached answer to a semantically different question.
The TTL matters: if the knowledge base updates and an answer in the cache is now incorrect, the old answer must expire. Cache TTL should match the knowledge base update frequency — daily for frequently changing content, weekly for stable policy documents.
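A minimal sketch of the lookup path, assuming an in-process store rather than Redis or Cosmos DB, with the 0.92 threshold and a TTL matched to a daily knowledge base refresh; the helper names are hypothetical.

```python
# Sketch: semantic cache keyed by embedding similarity, with a TTL.
import time
import numpy as np

SIMILARITY_THRESHOLD = 0.92
TTL_SECONDS = 24 * 3600          # match the knowledge base update cadence
_cache: list[dict] = []          # each entry: {"vec", "answer", "expires_at"}

def _cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def cache_lookup(query_vec: np.ndarray) -> str | None:
    now = time.time()
    _cache[:] = [e for e in _cache if e["expires_at"] > now]  # drop expired answers
    best = max(_cache, key=lambda e: _cosine(query_vec, e["vec"]), default=None)
    if best and _cosine(query_vec, best["vec"]) >= SIMILARITY_THRESHOLD:
        return best["answer"]    # cache hit: no LLM call
    return None

def cache_store(query_vec: np.ndarray, answer: str) -> None:
    _cache.append({"vec": query_vec, "answer": answer,
                   "expires_at": time.time() + TTL_SECONDS})
```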
Building a Cost Attribution Dashboard
Cost governance requires visibility. Azure Monitor provides the raw data; the engineering decision is how to structure it.
What to instrument:
| Signal | Source | Use |
|---|---|---|
| Token usage (input + output) | APIM response headers | Per-query cost calculation |
| Model used | APIM routing policy log | GPT-4o vs. mini split |
| Cache hit/miss | Application telemetry | Semantic cache effectiveness |
| Retrieval token count | RetrievalService metrics | Context budget compliance |
| Query classification | Classifier output | Routing decision audit |
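One way to capture these signals is a single structured event per query, emitted wherever your stack already ships logs to Azure Monitor (Application Insights, container diagnostics, and so on). A sketch with assumed field names:

```python
# Sketch: one structured event per query carrying the signals from the table.
import json
import logging
import time

logger = logging.getLogger("mortgageiq.cost")

def log_query_cost(model: str, input_tokens: int, output_tokens: int,
                   cache_hit: bool, retrieval_tokens: int, label: str) -> None:
    logger.info(json.dumps({
        "event": "query_cost",
        "ts": time.time(),
        "model": model,                        # GPT-4o vs. mini split
        "input_tokens": input_tokens,          # from the completion usage block
        "output_tokens": output_tokens,
        "cache_hit": cache_hit,                # semantic cache effectiveness
        "retrieval_tokens": retrieval_tokens,  # context budget compliance
        "classification": label,               # routing decision audit
    }))
```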
The cost dashboard equation:
Daily cost =
(GPT-4o queries × avg_input_tokens × $0.000030) +
(GPT-4o queries × avg_output_tokens × $0.000060) +
(GPT-4o-mini queries × avg_input_tokens × $0.00000015) +
(GPT-4o-mini queries × avg_output_tokens × $0.00000060) +
(embedding calls × avg_tokens × $0.00000013)
where the per-model query counts include only cache misses; cache hits skip the model call and cost effectively nothing.
Build this as an Azure Monitor workbook. Set a budget alert at 80% of monthly allocation — not at 100%, which gives you no time to react.
The metric that matters most: cost per query, not total cost. Total cost grows with usage; that's expected. Cost per query growing means your retrieval is getting noisier, your prompts are getting longer, or your cache hit rate is dropping. Each has a different fix.
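Cost per query falls straight out of those per-query events. A sketch using this section's prices, assuming the event shape from the telemetry sketch above:

```python
# Sketch: cost per query from instrumented events, using this section's prices.
PRICE_PER_TOKEN = {                       # $/token, from the pricing table above
    "gpt-4o":      {"in": 30e-6,   "out": 60e-6},
    "gpt-4o-mini": {"in": 0.15e-6, "out": 0.60e-6},
}

def cost_per_query(events: list[dict]) -> float:
    """events: the query_cost records from the telemetry sketch above."""
    total = 0.0
    for e in events:
        if e["cache_hit"]:
            continue                      # cache hits skip the model call
        p = PRICE_PER_TOKEN[e["model"]]
        total += e["input_tokens"] * p["in"] + e["output_tokens"] * p["out"]
    return total / max(len(events), 1)    # spread across all queries served
```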
The Model Update Tax
Azure OpenAI model deployments can be configured to auto-update to the latest minor version. This is convenient. It is also a hidden cost risk.
A model update can change:
- Output verbosity — the model generates more tokens per answer
- Format adherence — structured output prompts may require retuning
- Retrieval behavior — the model may cite sources differently, changing the citation validation rate
Each of these changes has a cost implication. A 20% increase in average output token count is a 20% increase in output cost — invisible unless you're tracking cost per query, not just total cost.
Fix: pin model versions in deployment config. Run cost benchmarks on model upgrades before switching production traffic. Treat a model version change like a dependency version change — evaluate before you upgrade.
The FinOps Checklist Before You Ship
- Token budget enforced in retrieval layer — hard cap, not a soft limit
- Model routing defined — which tasks go to GPT-4o-mini vs. GPT-4o
- Prompt caching enabled — static system prompt, not dynamically regenerated
- Cost per query baseline established — know your starting point before optimizing
- Azure Cost Management alert set at 80% of monthly budget
- Semantic cache planned — even if not built in Phase 1, design the TTL and similarity threshold
- Model version pinned — no auto-updates without evaluation and cost benchmark
- Token usage attributed — APIM logs by user/team/endpoint
These are architecture decisions, not operational afterthoughts. The best time to build them is before the first query goes through the system.
What I've Seen Fail
No routing strategy until the bill arrives. A team ships with GPT-4o for all tasks. Three months later, usage grows, the bill arrives, and they need an emergency routing implementation. The routing policy should have been in APIM from day one; adding it retroactively means regression-testing every live query path, work a day-one design never has to do.
Static system prompt regenerated per request. The prompt caching benefit requires the cacheable prefix to be identical across requests. A system that inserts a timestamp or a request ID into the system prompt defeats caching on every call, paying full price for a prefix that could have been billed at half price.
Embedding cost ignored. Teams model GPT-4o costs carefully and forget that re-indexing the knowledge base triggers batch embedding calls. 1,000 documents × 2,000 tokens average × text-embedding-3-large pricing = $0.26 per full re-index. At daily re-indexing, that's $8/month — negligible. At hourly re-indexing triggered by document changes, it adds up. Design the indexing trigger frequency with cost in mind.
Cache TTL too long. A mortgage rate FAQ has stale closing cost estimates cached for 30 days after the knowledge base was updated. Users receive outdated information. Fix: match TTL to content update frequency, not to "set it and forget it."
MortgageIQ source code (Phase 4A cost baseline): github.com/shivojha/azure-ai-loan-copilot
Project page: MortgageIQ — Azure AI Loan Copilot