March 22, 2026 · Tags: azure, openai, genai, rag, semantic-kernel, ai-foundry, architecture

The Enterprise GenAI Stack on Azure: What Actually Works in Production

The Azure GenAI platform is five services working together — and most teams assemble them wrong. Here's what the integrated stack looks like and why each component exists.

Most enterprises with an Azure OpenAI subscription do not have an Azure GenAI architecture. They have a React app calling POST /chat/completions. That's a starting point, not a system.

The difference between a GPT-4o wrapper and a production GenAI system is five services working together: Azure OpenAI, Azure AI Foundry, Azure AI Search, Semantic Kernel, and Azure API Management. Each exists for a specific reason. Most teams use one and wonder why the others are there.


The Stack and Why Each Component Exists

Azure OpenAI — the model layer. GPT-4o for complex reasoning, GPT-4o-mini for high-volume classification and summarization, text-embedding-3-large for document and query vectorization. Data never leaves your Azure tenant — this is why enterprises choose Azure OpenAI over the OpenAI API directly: data processing agreements, SOC 2, HIPAA, and GLBA compliance are handled by Microsoft.
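
A minimal sketch of that split in C#, using the Azure.AI.OpenAI client library. The endpoint, key variable, and deployment names are assumptions; deployments are whatever you named them in your resource.

```csharp
using System;
using Azure;
using Azure.AI.OpenAI;
using OpenAI.Chat;
using OpenAI.Embeddings;

var client = new AzureOpenAIClient(
    new Uri("https://my-aoai.openai.azure.com/"),   // assumed endpoint
    new AzureKeyCredential(Environment.GetEnvironmentVariable("AOAI_KEY")!));

// High-volume classification and summarization go to the cheap deployment...
ChatCompletion completion = client.GetChatClient("gpt-4o-mini").CompleteChat(
    new SystemChatMessage("Classify the intent of the user message in one word."),
    new UserChatMessage("How much cash do I need upfront?"));

// ...and document/query vectorization uses the embedding deployment.
OpenAIEmbedding embedding = client.GetEmbeddingClient("text-embedding-3-large")
    .GenerateEmbedding("How much cash do I need upfront?");

Console.WriteLine($"{completion.Content[0].Text} / {embedding.ToFloats().Length} dims");
```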

Azure AI Foundry — the management plane. Model deployments, fine-tuning pipelines, evaluation runs, and Prompt Flow for visual orchestration. If Azure OpenAI is the engine, Foundry is the dashboard. Most teams skip Foundry in early stages; they regret it when they need to evaluate a prompt change across 100 test queries and have no framework to do it.

Azure AI Search — the retrieval layer. Hybrid search combining BM25 keyword matching with vector similarity, plus a built-in semantic ranker. This is the service that makes RAG work at scale — persistent indexes, CI-triggered re-indexing, and sub-200ms query latency. Most teams start with local file search (fast to build, limited to exact keyword matching) and hit the vocabulary gap problem at week 3.
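
Here's roughly what a hybrid query looks like with the Azure.Search.Documents SDK. The index name, vector field, semantic configuration, and result field are assumptions; passing search text alongside a vector query is what makes the request hybrid (BM25 + kNN), with the semantic ranker re-scoring the fused results.

```csharp
using System;
using Azure;
using Azure.Search.Documents;
using Azure.Search.Documents.Models;

// Placeholder vector; in practice, from text-embedding-3-large (3072 dims).
ReadOnlyMemory<float> queryEmbedding = new float[3072];

var search = new SearchClient(
    new Uri("https://my-search.search.windows.net"),  // assumed service endpoint
    "kb-index",                                       // assumed index name
    new AzureKeyCredential(Environment.GetEnvironmentVariable("SEARCH_KEY")!));

var options = new SearchOptions
{
    // Vector half of the hybrid query.
    VectorSearch = new()
    {
        Queries = { new VectorizedQuery(queryEmbedding)
            { KNearestNeighborsCount = 3, Fields = { "contentVector" } } }
    },
    // Semantic ranker re-scores the merged keyword + vector results.
    QueryType = SearchQueryType.Semantic,
    SemanticSearch = new() { SemanticConfigurationName = "default" },
    Size = 3,
};

// Search text + vector query together = hybrid retrieval.
SearchResults<SearchDocument> results = search.Search<SearchDocument>(
    "How much cash do I need upfront?", options).Value;

foreach (SearchResult<SearchDocument> hit in results.GetResults())
    Console.WriteLine(hit.Document["sourceFile"]);  // assumed field name
```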

Semantic Kernel — the orchestration layer. Microsoft's open-source SDK (C# and Python) for building AI applications: managing prompts, calling tools, maintaining memory across turns, and orchestrating multi-step agent workflows. The alternative is to wire all of this manually. Some teams do; they end up rebuilding a subset of Semantic Kernel, worse.
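
A minimal Semantic Kernel sketch: one templated prompt, with variable substitution and the model call handled by the kernel. The deployment name and endpoint are assumptions.

```csharp
using System;
using Microsoft.SemanticKernel;

var kernel = Kernel.CreateBuilder()
    .AddAzureOpenAIChatCompletion(
        deploymentName: "gpt-4o",                       // assumed deployment name
        endpoint: "https://my-aoai.openai.azure.com/",  // assumed endpoint
        apiKey: Environment.GetEnvironmentVariable("AOAI_KEY")!)
    .Build();

// Templated prompt: the kernel substitutes the variables and calls the model.
var answer = await kernel.InvokePromptAsync(
    "Answer using only this context:\n{{$context}}\n\nQuestion: {{$question}}",
    new KernelArguments
    {
        ["context"] = "Closing costs are typically 2-5% of the loan amount.",
        ["question"] = "How much cash do I need upfront?",
    });

Console.WriteLine(answer);
```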

Azure API Management — the governance layer. Rate limiting by user or team, token metering for cost allocation, authentication, model routing (GPT-4o vs. GPT-4o-mini based on request classification), and a single audit trail for all AI calls. Most teams add this last, after the CFO asks about the Azure bill.


The Request Path — End to End

Assembled, the request path looks like this: a user query hits the application; Semantic Kernel embeds it and pulls grounding chunks from Azure AI Search; the assembled prompt goes through APIM to the right Azure OpenAI deployment; the completion flows back along the same path. On the way back, the APIM layer adds token usage to the response headers. This is how you build a cost attribution dashboard without modifying application code: APIM logs every request with token counts, and Azure Monitor aggregates them by user, team, or endpoint.


The Three Mistakes Teams Make When Assembling This Stack

Mistake 1: Skipping Azure AI Foundry because it "adds complexity"

The first thing teams skip is Foundry. The reasoning: "We're just calling Azure OpenAI — why do we need another service?"

Three months later: they have 12 different system prompts across 3 environments, no evaluation history, and no idea which prompt version is deployed in production.

Foundry's Prompt Flow is the answer to "what was the prompt when this answer was generated?" Without that answer, debugging a production regression is archaeology. Every prompt change should go through a Foundry evaluation run before deployment.

Mistake 2: Local file retrieval in production

Local keyword search is valid for prototyping — zero infrastructure, fast to build, good enough to prove the RAG pattern. It fails in production when users ask questions in their own vocabulary rather than the document's vocabulary.

"How much cash do I need upfront?" doesn't match "closing-costs.md" on keyword overlap. "Cash" and "upfront" don't appear in the document. Azure AI Search's vector retrieval understands that "cash upfront" and "closing costs" are semantically equivalent. The fix is a one-line dependency injection change if the retrieval layer was properly abstracted from day one.

Mistake 3: No model routing strategy

GPT-4o costs $2.50/1M input tokens. GPT-4o-mini costs $0.15/1M input tokens, roughly 17x cheaper. For a system processing 50,000 queries/day, routing 70% of simple classification and summarization tasks to GPT-4o-mini and reserving GPT-4o for complex reasoning reduces the monthly LLM cost by 60–70%.

APIM is where the routing decision lives. Classify the request by complexity (a fast GPT-4o-mini call) and route accordingly. This is the FinOps pattern that matters before volume grows, not after.
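
A sketch of the routing logic. In production it belongs in an APIM policy, as described above; it's shown here as application code only to make the decision concrete, and the deployment names are assumptions.

```csharp
using System;
using Azure;
using Azure.AI.OpenAI;
using OpenAI.Chat;

string userQuery = "Summarize this disclosure document in two sentences.";

var client = new AzureOpenAIClient(
    new Uri("https://my-aoai.openai.azure.com/"),   // assumed endpoint
    new AzureKeyCredential(Environment.GetEnvironmentVariable("AOAI_KEY")!));

// Step 1: a fast, cheap triage call classifies the request.
ChatCompletion triage = client.GetChatClient("gpt-4o-mini").CompleteChat(
    new SystemChatMessage("Reply with exactly SIMPLE or COMPLEX for the request below."),
    new UserChatMessage(userQuery));

// Step 2: route. Mini absorbs high-volume simple traffic; 4o is reserved for reasoning.
bool complex = triage.Content[0].Text.Trim()
    .Equals("COMPLEX", StringComparison.OrdinalIgnoreCase);
string deployment = complex ? "gpt-4o" : "gpt-4o-mini";

ChatCompletion answer = client.GetChatClient(deployment)
    .CompleteChat(new UserChatMessage(userQuery));
Console.WriteLine($"[{deployment}] {answer.Content[0].Text}");
```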


What the Stack Looks Like at Each Phase

Teams don't build this all at once. The right approach is incremental — add each component when a specific need justifies it.

Phase | Components | What you get | What you're missing
Prototype | Azure OpenAI only | GPT-4o answers | No retrieval, no governance, no traceability
Demo | + Local file retrieval + basic prompt | Grounded answers, source citations | Vocabulary gap in retrieval, no model management
Pilot | + Azure AI Search + Semantic Kernel | Hybrid retrieval, orchestration | No model routing, no evaluation, no cost governance
Production | + Azure AI Foundry + APIM + Cosmos DB | Full platform: evaluation, routing, audit | (none)

The architecture of MortgageIQ maps to the Demo phase: Azure OpenAI + local file retrieval + a custom retrieval abstraction that makes the Azure AI Search upgrade a single DI swap. Phase 4B adds Azure AI Search. The production path adds Foundry and APIM.


The Compliance Advantage of Azure OpenAI

Every enterprise AI conversation eventually hits this question: where does the data go?

Azure OpenAI's answer is explicit: prompt data, completions, and fine-tuning data are processed within your Azure tenant. Microsoft's data processing agreements for Azure OpenAI satisfy GLBA (financial services), HIPAA (healthcare), and SOC 2 (general enterprise) without requiring you to build a custom compliance posture.

The alternative — calling the OpenAI API directly — means data leaves your tenant and is processed by OpenAI Inc. under their terms. For a mortgage company handling borrower PII, for a healthcare system with patient data, for any enterprise under GDPR: this is not a technical decision. It's a compliance decision, and the answer is Azure OpenAI.

This is the single most important reason the Azure GenAI stack exists as a distinct thing from "call the OpenAI API."


The Integration That Most Teams Overlook: Semantic Kernel + APIM

Semantic Kernel handles AI orchestration — prompt management, tool calling, memory, multi-agent workflows. APIM handles governance — auth, rate limiting, routing, audit. Teams tend to treat these as separate concerns that never talk to each other.

The integration point is token metering. APIM intercepts every Azure OpenAI call, reads the x-ms-region and token usage from the response, and logs it to Azure Monitor. Without this, cost attribution requires parsing application logs — a manual process that doesn't scale.

With APIM as the single entry point for all AI calls:

  • Token usage is attributed to the authenticated user or team automatically
  • Rate limits prevent a single team from consuming the entire quota
  • Model routing (GPT-4o vs. GPT-4o-mini) is a policy, not code
  • Every AI call has an audit trail by default — required in regulated industries

The architectural principle: APIM is the firewall for AI calls, not just for REST calls. Every call to Azure OpenAI — from Semantic Kernel, from direct SDK usage, from Prompt Flow — should pass through APIM. Teams that bypass APIM for "internal" AI calls lose visibility into 40% of their token spend on average.
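
In practice, that means the SDK's endpoint is the APIM gateway, not the Azure OpenAI resource. A hedged sketch with Semantic Kernel; the gateway URL, path, and subscription-key header name all depend on how your APIM instance is configured, so every value below is an assumption:

```csharp
using System;
using System.Net.Http;
using Microsoft.SemanticKernel;

// Attach the APIM subscription key to every outbound call.
var http = new HttpClient();
http.DefaultRequestHeaders.Add("Ocp-Apim-Subscription-Key",
    Environment.GetEnvironmentVariable("APIM_KEY")!);

var kernel = Kernel.CreateBuilder()
    .AddAzureOpenAIChatCompletion(
        deploymentName: "gpt-4o",
        endpoint: "https://contoso-apim.azure-api.net/openai",  // APIM facade, not the AOAI endpoint
        apiKey: "placeholder",        // assumption: APIM injects the real key upstream
        httpClient: http)
    .Build();
```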


What I've Seen Fail

Prompt management in application configuration. The system prompt is in appsettings.json. Someone updates it in production via an environment variable override. The change isn't tracked. Three weeks later, answer quality drops and the team can't identify when or why. Fix: prompts in git, deployment via Foundry Prompt Flow, evaluation before merge.

Vector index with stale content. The knowledge base was indexed in January. It's now March. Three policy documents were updated. The index wasn't re-triggered. The model is confidently answering from outdated guidelines. Fix: CI-triggered re-indexing on every PR that changes content in the knowledge base directory.

Single model for all tasks. GPT-4o handles every query: classification, summarization, complex reasoning. The monthly bill is 5x what it needs to be. Fix: model routing at the APIM layer; GPT-4o-mini for anything that doesn't require multi-step reasoning.

No token budget in the retrieval layer. The system retrieves 10 chunks per query and injects all of them. The prompt grows to 8,000 tokens. GPT-4o's attention dilutes. Answer quality drops. Cost increases. Fix: hard token cap in the retrieval layer; 2,000 tokens covers 3 substantial chunks with budget to spare.
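
A minimal sketch of that cap, assuming chunks arrive already ranked by relevance. The ~4 characters/token estimate is a rough heuristic; a real implementation would use a proper tokenizer. The Chunk record is the same illustrative type from the retrieval sketch above.

```csharp
using System.Collections.Generic;

public record Chunk(string Source, string Text);

public static class ContextBudget
{
    // Rough heuristic: ~4 characters per token for English prose.
    static int EstimateTokens(string text) => text.Length / 4;

    public static List<Chunk> SelectWithinBudget(
        IEnumerable<Chunk> rankedChunks, int tokenBudget = 2000)
    {
        var selected = new List<Chunk>();
        var used = 0;
        foreach (var chunk in rankedChunks)   // assumed ranked by relevance
        {
            var cost = EstimateTokens(chunk.Text);
            if (used + cost > tokenBudget) break;  // hard cap: stop before attention dilutes
            selected.Add(chunk);
            used += cost;
        }
        return selected;
    }
}
```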


The Platform, Not the Model

The enterprise GenAI gap isn't access to GPT-4o. Every organization has that. The gap is the platform around it: the retrieval layer that grounds answers in real knowledge, the governance layer that controls cost and access, the evaluation framework that detects quality regression, and the orchestration layer that coordinates multi-step AI workflows.

The companies winning with GenAI in 2025–2026 are not the ones with the best prompts. They're the ones that built the platform first.


MortgageIQ — a working implementation of the RAG layer of this stack: github.com/shivojha/azure-ai-loan-copilot
Project page: MortgageIQ — Azure AI Loan Copilot