Most teams jump straight to Layer 4 (OpenAI) and wonder why they have a demo, not a platform.
A production AI platform is not a model. It is a 9-layer system where the model is the least interesting part.
User Channels → who talks to your AI
API Gateway → who gets in and what's allowed
Agent Orchestration → how your AI thinks and coordinates
AI Services → what cognitive capabilities power it
Data & Memory → what your AI knows and remembers
Messaging → how components stay decoupled
Compute → where it runs
MLOps & Observability → whether it's improving or degrading
Governance → whether it's compliant and secure
Overall Architecture Diagram
Layer 01 · User Channels
Every surface where humans or systems reach your AI platform
What
User channels are the entry points — the frontends, apps, and APIs through which users interact with your AI system. They are not AI themselves; they are the surfaces that route requests down to the AI stack.
Why
Your AI platform is worthless if nobody can reach it. But more importantly: different users need different surfaces. A claims processor uses Teams. A developer uses a REST SDK. A branch manager uses Power Apps. One platform must serve all of them without building five separate AI backends.
How
Each channel connects to the API Gateway (Layer 2), not directly to AI services. This is critical: the channel never has direct access to a model. It fires an HTTPS call to Azure APIM, which enforces auth and routes to the orchestration layer.
User → Channel (React/Teams/Power Apps) → HTTPS → Azure APIM
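A minimal channel-side sketch of that call, assuming a hypothetical APIM hostname (miq-apim.azure-api.net), gateway app scope (api://miq-ai-gateway), and per-channel subscription key — the point is that the channel authenticates to APIM, never to a model endpoint:

import os

import requests
from azure.identity import DefaultAzureCredential

# Entra ID token for the gateway's app scope (scope URI is a placeholder)
credential = DefaultAzureCredential()
token = credential.get_token("api://miq-ai-gateway/.default").token

resp = requests.post(
    "https://miq-apim.azure-api.net/ai/chat",  # APIM front door — never the model endpoint
    headers={
        "Authorization": f"Bearer {token}",
        "Ocp-Apim-Subscription-Key": os.environ["APIM_SUB_KEY"],  # per-channel metering key
    },
    json={"message": "What is the status of loan #ML-2847?"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())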
When to use each channel
| Channel | Use When |
|---|---|
| Web / Mobile App | Custom UX required — React SPA, iOS/Android app with embedded AI chat |
| Teams / Copilot | M365 enterprise — embed AI where employees already work |
| REST / SDK Clients | Developer-to-agent access — Python, .NET, JS SDKs calling AI APIs directly |
| Bot Service | Omnichannel voice + chat — customer support, IVR, email automation |
| Power Platform | Business user self-service — no-code AI flows in Power Apps / Power Automate |
Who owns it
- Web/Mobile: Application team (frontend engineers)
- Teams/Copilot: M365 admin + developer team
- REST/SDK: Platform engineering team — publishes the client SDK
- Bot Service: Conversational AI team
- Power Platform: Business analysts + low-code developers with IT governance
Key Notes
- Never connect a channel directly to an AI service. Always route through APIM.
- Microsoft Copilot Studio is the fastest path to Teams/M365 embedding — no custom code needed for basic scenarios.
- Bot Service supports DirectLine, Microsoft Teams, Slack, email, SMS in one registration.
- Power Platform uses connectors; AI Builder wraps Azure OpenAI for no-code scenarios.
- The channel owns the UX. The orchestration layer owns the AI logic. Keep these concerns separate.
Layer 02 · API Gateway & Security
Single front door — auth, throttling, and prompt safety before any model is touched
What
The API gateway is the first Azure service that processes every AI request. It handles:
- Authentication — who is this caller?
- Authorization — are they allowed to do this?
- Rate limiting — how many tokens/requests per minute?
- Prompt filtering — is this request safe to pass to a model?
On Azure, this layer is: Azure APIM + Entra ID + Azure AI Content Safety.
Why
Without a gateway, every AI service is independently exposed. You get:
- No unified token metering (impossible to control spend)
- No central audit log (can't prove who called what)
- Credentials scattered across apps (security nightmare)
- No jailbreak protection (models are directly exploitable)
At MortgageIQ, the gateway was the first thing built. Every downstream service is on a private VNet. APIM is the only public surface.
How
Azure API Management (APIM):
Inbound policy pipeline:
validate-jwt (Entra ID token)
→ rate-limit-by-key (subscription key or user OID)
→ call Content Safety API
→ route to backend (Foundry endpoint or Azure OpenAI)
Outbound policy pipeline:
→ log to Event Hub (audit trail)
→ return response with usage headers
Entra ID:
- App registrations for each channel (Web, Bot, Teams bot)
- API permissions scoped to minimum required roles
- Managed Identity for service-to-service — no client secrets
- RBAC roles: `Cognitive Services OpenAI User` for inference, `Cognitive Services OpenAI Contributor` for management
Azure AI Content Safety:
- Deployed as a separate Cognitive Services resource
- Called from APIM inbound policy — blocks before model receives request
- Categories: Hate, Violence, Sexual, Self-harm (each rated 0–6)
- Jailbreak detection (separate classifier — enable explicitly)
- Threshold configuration: severity ≥ 2 → block in regulated industries
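The same check can also run in application code before any model call. A minimal sketch with the azure-ai-contentsafety SDK (endpoint/key variable names are placeholders; the ≥ 2 block threshold matches the regulated-industry setting above):

import os

from azure.ai.contentsafety import ContentSafetyClient
from azure.ai.contentsafety.models import AnalyzeTextOptions
from azure.core.credentials import AzureKeyCredential

client = ContentSafetyClient(
    endpoint=os.environ["CONTENT_SAFETY_ENDPOINT"],
    credential=AzureKeyCredential(os.environ["CONTENT_SAFETY_KEY"]),
)

def is_safe(prompt: str, block_at: int = 2) -> bool:
    """True if no category (Hate, Violence, Sexual, Self-harm) reaches the block threshold."""
    result = client.analyze_text(AnalyzeTextOptions(text=prompt))
    return all(c.severity < block_at for c in result.categories_analysis if c.severity is not None)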
When to use
- Always for production AI systems. No exceptions.
- APIM is also your developer portal — publish AI API products with OpenAPI docs for internal consumers.
- Content Safety is especially critical in consumer-facing and regulated (fintech, healthcare) contexts.
Who owns it
- Platform engineering / cloud ops team — APIM policies, rate limits, routing
- Security team — Entra ID app registrations, RBAC assignments, Content Safety thresholds
- Compliance team — audit log retention policy, alert thresholds
Key Notes
- APIM token metering: `azure-openai-token-limit` built-in policy — limits tokens per minute per subscription.
- Entra ID Managed Identity: `az identity create` + assign to APIM. No secrets. No rotation.
- Content Safety thresholds differ by industry: severity ≥ 4 for consumer apps, ≥ 2 for healthcare/fintech.
- APIM has a built-in AI Gateway feature set (2024): semantic caching, PTU load balancing, token tracking.
- Don't use API keys for production. Use OAuth 2.0 client credentials with Managed Identity everywhere.
- APIM SKU: Developer (non-prod) → Standard v2 (prod, $0.30/million calls) → Premium (multi-region, VNet).
Layer 03 · Agent Orchestration
Where your agent logic lives — the brains that coordinate models, tools, and memory
What
Orchestration is the code and platform that decides:
- Which model to call (and when to escalate to a more capable one)
- What tools to invoke (search, database, external APIs)
- How to manage conversation state and memory
- How to chain multiple AI steps into a coherent workflow
- How to handle failures, retries, and fallbacks
On Azure, the orchestration platform is Azure AI Foundry + Semantic Kernel + Prompt Flow.
Why
Models don't think for themselves. A raw call to GPT-4o is autocomplete with good vocabulary. Intelligence emerges from the orchestration layer: the system prompt, the retrieved context, the tool calls, the multi-step reasoning chain.
This is where the architectural leverage is. Most teams ship 90% of the business value here, not in fine-tuning.
How
Azure AI Foundry (control plane):
Foundry Project
├── Connected Resources (Azure OpenAI, AI Search, Storage)
├── Model Deployments (GPT-4o, GPT-4o-mini, o1)
├── Prompt Management (system prompts versioned)
├── Evaluation runs (groundedness, coherence, relevance)
└── Tracing (end-to-end request visibility)
Semantic Kernel (agent SDK):
import os

from semantic_kernel import Kernel
from semantic_kernel.connectors.ai.open_ai import AzureChatCompletion
from semantic_kernel.planners import FunctionCallingStepwisePlanner

kernel = Kernel()
kernel.add_service(AzureChatCompletion(
    service_id="gpt-4o",
    deployment_name="gpt-4o",
    endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"]
))
# Plugins = tools the agent can call
kernel.add_plugin(LoanStatusPlugin(), "LoanStatus")
kernel.add_plugin(DocumentRetrieverPlugin(), "Documents")
# Planner: given a service_id, the agent decides which tools to call
planner = FunctionCallingStepwisePlanner(service_id="gpt-4o")
result = await planner.invoke(kernel, "What is the status of loan #ML-2847?")
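The two plugins above are assumed; a hypothetical LoanStatusPlugin might look like this, using Semantic Kernel's kernel_function decorator:

from typing import Annotated

from semantic_kernel.functions import kernel_function

class LoanStatusPlugin:
    @kernel_function(name="get_loan_status",
                     description="Look up the current processing stage of a loan by its ID")
    def get_loan_status(
        self,
        loan_id: Annotated[str, "Loan identifier, e.g. ML-2847"],
    ) -> str:
        # A real implementation would query the loan system of record
        return f"Loan {loan_id} is in underwriting."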
Prompt Flow (pipeline builder):
- Visual DAG: input → retrieval → prompt composition → model call → output validation
- YAML-defined flows deployable as managed endpoints
- Built-in evaluation nodes (groundedness score, citation check)
- CI/CD integration: flows promoted through dev → staging → prod with evaluation gates
LangChain / AutoGen (OSS frameworks):
- Use when Semantic Kernel doesn't support the pattern (e.g., AutoGen multi-agent debates)
- LangGraph for stateful agent graphs with conditional branching
- Always wrap with Foundry tracing for observability
When to use which
| Tool | Use When |
|---|---|
| Semantic Kernel | Production .NET or Python agents — Microsoft-supported, Foundry-native |
| Prompt Flow | RAG pipelines, batch evaluation, CI/CD-gated deployment |
| LangChain | Rapid prototyping, rich OSS ecosystem needed |
| AutoGen | Multi-agent debate/review patterns, agent-to-agent communication |
Who owns it
- AI engineering team — builds agent logic, prompt templates, tool plugins
- Platform engineering — manages Foundry projects, deployment endpoints, access control
- QA / ML team — runs Foundry evaluations against golden datasets
Key Notes
- Foundry = Microsoft's replacement for Azure OpenAI Studio. All new LLM work goes here.
- Semantic Kernel Planner = auto-routing to tools based on user intent. No hardcoded if/else.
- Prompt Flow evaluation: groundedness = is the answer supported by retrieved context? Crucial for RAG.
- Tool calling (function calling) is how agents interact with external systems. Always define a JSON schema for each tool (see the sketch after this list).
- Multi-agent pattern: orchestrator agent + specialist agents (retrieval agent, reasoning agent, compliance agent).
- Cost lever: GPT-4o-mini for simple routing decisions, GPT-4o for complex reasoning. o1 only for multi-step math/logic.
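The JSON schema mentioned in the tool-calling note might look like this in the OpenAI function-calling format (tool name and fields are illustrative):

loan_status_tool = {
    "type": "function",
    "function": {
        "name": "get_loan_status",
        "description": "Look up the current processing stage of a mortgage loan",
        "parameters": {
            "type": "object",
            "properties": {
                "loan_id": {
                    "type": "string",
                    "description": "Loan identifier, e.g. 'ML-2847'",
                }
            },
            "required": ["loan_id"],
        },
    },
}
# Passed via tools=[loan_status_tool] on a chat.completions.create(...) call.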
Layer 04 · Azure AI Services
The cognitive muscle — every model and API that gives your platform intelligence
What
Azure AI Services are the actual AI capabilities — the trained models and APIs that do the cognitive work. This is the layer most people think of as "AI." It is Layer 4 of 9.
The key services:
- Azure OpenAI — LLMs (GPT-4o, o1, GPT-4o-mini), embeddings, image generation
- Document Intelligence — structured extraction from unstructured documents
- AI Search — hybrid retrieval (keyword + vector) — the RAG backbone
- Speech — STT/TTS, real-time translation
- Language — NER, sentiment, PII detection, summarization
- Vision — image analysis, OCR, spatial understanding
- Model Catalog — Llama, Mistral, Phi — OSS models alongside OpenAI
Why — and How to Choose the Right Capability
The most common mistake: using GPT-4o for everything.
Task → Right Service
─────────────────────────────────────────────────────
Generate/explain language → Azure OpenAI (GPT-4o)
Complex multi-step reasoning → Azure OpenAI (o1)
Fast/cheap generation → Azure OpenAI (GPT-4o-mini)
Extract fields from a PDF → Document Intelligence
Find relevant docs in corpus → AI Search (hybrid)
Transcribe audio → Speech (STT)
Detect PII in text → Language (PII detection)
Read text from image → Vision (OCR) or Doc Intel
Predict a number from data → Azure ML (XGBoost, tabular)
Generate image → Azure OpenAI (DALL-E 3)
How — Azure OpenAI in Detail
Model families and when to use:
| Model | Use Case | Cost (input/output per 1M tokens) |
|---|---|---|
| GPT-4o | Complex reasoning, multi-modal, document understanding | ~$2.50 / $10 |
| GPT-4o-mini | High-volume simple tasks, routing, summarization | ~$0.15 / $0.60 |
| o1 | Multi-step math, code, legal/compliance reasoning | ~$15 / $60 |
| text-embedding-3-large | Vector embeddings for RAG | ~$0.13/M tokens |
Deployment types:
- Standard (pay-per-token): Variable throughput. Use for dev, low-volume prod.
- Provisioned Throughput Units (PTU): Reserved capacity, predictable latency, lower per-token cost at volume. Break-even at ~100M tokens/month.
How — AI Search in Detail
AI Search is the backbone of every RAG pipeline:
1. Indexing pipeline:
Document → chunk (512 tokens, 50 overlap) → embed (text-embedding-3-large)
→ index with vector field + keyword fields
2. Retrieval at query time:
User query → embed → vector search (cosine similarity top-K)
+ keyword search (BM25)
→ hybrid merge (RRF — Reciprocal Rank Fusion)
→ semantic re-ranker (cross-encoder)
→ top-3 chunks returned
3. Prompt composition:
System prompt + retrieved chunks + user query → LLM
SKUs: Basic ($0.101/hr) → Standard S1 ($0.300/hr) → Storage Optimized L1 ($2.699/hr). Use S1 for most production RAG.
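Step 3 rarely gets spelled out in code, so here is a minimal composition sketch (prompt wording and field names are illustrative):

def compose_prompt(system_prompt: str, chunks: list[dict], user_query: str) -> list[dict]:
    # Label each chunk with its source so the model can cite it
    context = "\n\n".join(f"[{c['source']}]\n{c['content']}" for c in chunks)
    return [
        {"role": "system",
         "content": f"{system_prompt}\n\nAnswer ONLY from the context below.\n\n{context}"},
        {"role": "user", "content": user_query},
    ]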
How — Document Intelligence in Detail
- Prebuilt models: Invoice, Receipt, W-2, Health Insurance Card, US Mortgage (1003, HUD-1)
- Custom models: train on your documents (min 5 labeled samples, 50 recommended)
- At MortgageIQ: used the `prebuilt-mortgage` model to extract loan fields from 1003 applications — 94% field accuracy, replacing a manual data entry step that took 3 hours per loan.
- Output: JSON key-value pairs with confidence scores. Confidence < 0.7 → route to human review.
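A hedged extraction-plus-routing sketch using the azure-ai-formrecognizer SDK — the US mortgage prebuilts may require the newer Document Intelligence API version, and the exact model ID shown is an assumption:

import os

from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential

client = DocumentAnalysisClient(
    endpoint=os.environ["DOC_INTEL_ENDPOINT"],
    credential=AzureKeyCredential(os.environ["DOC_INTEL_KEY"]),
)

with open("loan-1003.pdf", "rb") as f:
    poller = client.begin_analyze_document("prebuilt-mortgage.us.1003", document=f)
result = poller.result()

for name, field in result.documents[0].fields.items():
    if field.confidence is not None and field.confidence < 0.7:
        print(f"{name}: {field.content} → human review queue")  # below the 0.7 STP threshold
    else:
        print(f"{name}: {field.content} → straight-through")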
When
Always use the most specialized service for the job:
- Don't use GPT-4o to extract invoice totals → use Document Intelligence
- Don't use GPT-4o to search documents → use AI Search hybrid retrieval + GPT-4o for synthesis
- Don't use GPT-4o to detect PII → use Language PII detection (faster, cheaper, more accurate for NER tasks)
Who owns it
- AI engineering — model selection, prompt design, fine-tuning
- Platform engineering — endpoint deployment, quota management, PTU reservations
- Finance/FinOps — token budget governance, PTU vs pay-per-token decision
Key Notes
- GPT-4o is multi-modal (text + image + audio). GPT-4o-mini handles text and image input, but not audio.
- o1 uses internal chain-of-thought — you pay for reasoning tokens. Don't use for simple tasks.
- AI Search RRF: combines vector + keyword scores. Consistently beats vector-only retrieval on recall.
- Semantic ranker (L2 re-ranking) in AI Search adds ~50-100ms but significantly improves precision.
- PTU break-even: if you're spending >$15K/month on standard, PTU is likely cheaper.
- Model Catalog: Phi-3-mini is 3.8B parameters — fits on a 4GB GPU, great for on-device/edge scenarios.
- Document Intelligence confidence threshold: 0.7 is industry standard for STP (straight-through processing). Below that = human review queue.
Layer 05 · Data & Memory
Four memory tiers — working state, conversation history, long-term RAG knowledge, and analytical history
What
AI systems have four distinct memory needs that map to four different Azure services:
| Memory Type | Scope | Service | Latency |
|---|---|---|---|
| Working memory | Current session / request | Redis Cache | <1ms |
| Episodic memory | Conversation history, agent state | Cosmos DB | 2-5ms |
| Semantic memory | Domain knowledge (RAG corpus) | AI Search (vector) | 20-50ms |
| Analytical memory | Historical patterns, training data | Data Lake + Fabric | seconds-minutes |
Why
Without the right memory architecture:
- Agents lose context between turns (no episodic memory) → frustrating user experience
- Agents hallucinate facts not in their training (no semantic memory) → RAG is the fix
- You can't improve the model over time (no analytical memory) → drift goes undetected
- Repeated identical queries cost full token spend (no working memory/cache) → 30-40% avoidable token cost on common queries
How — Cosmos DB (Episodic Memory)
{
"id": "session-{userId}-{timestamp}",
"userId": "user-12345",
"agentId": "loan-advisor",
"turns": [
{"role": "user", "content": "What is my rate?", "timestamp": "2026-03-24T09:00:00Z"},
{"role": "assistant", "content": "Your 30yr fixed rate is 6.875%...", "timestamp": "2026-03-24T09:00:02Z"}
],
"state": {"loanId": "ML-2847", "stage": "underwriting"},
"ttl": 86400
}
- Container: `agent-sessions`, partition key: `/userId`
- SKU: Serverless for dev/test. Provisioned throughput (400-4000 RU/s) for production.
- TTL (time-to-live) on session documents: 24h for transient conversations, indefinite for audit logs.
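A minimal read/write sketch for this session store with the azure-cosmos SDK (account and database names are placeholders):

import os

from azure.cosmos import CosmosClient

container = (
    CosmosClient(os.environ["COSMOS_ENDPOINT"], os.environ["COSMOS_KEY"])
    .get_database_client("agents")
    .get_container_client("agent-sessions")
)

def append_turn(session: dict, role: str, content: str, timestamp: str) -> None:
    session["turns"].append({"role": role, "content": content, "timestamp": timestamp})
    container.upsert_item(session)  # partition key /userId is resolved from the document

def load_session(session_id: str, user_id: str) -> dict:
    return container.read_item(item=session_id, partition_key=user_id)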
How — Redis Cache (Working Memory)
cache_key = f"prompt:{hash(system_prompt + user_query)}"
cached_response = redis_client.get(cache_key)
if cached_response:
return cached_response # 0 tokens spent
response = await openai_client.chat.completions.create(...)
redis_client.setex(cache_key, 3600, response.choices[0].message.content)
- Azure Cache for Redis: C2 (6GB) → P1 (6GB, premium, persistence) → P5 (26GB, cluster)
- Cache hit rate of 30-40% on common queries = 30-40% token cost reduction
- Also used for: rate limit counters, distributed locks for agent coordination, feature flag values
How — AI Search Vector (Semantic Memory)
from azure.search.documents.models import QueryType, VectorizedQuery

# Indexing (runs as pipeline on document upload)
chunks = chunk_document(doc, size=512, overlap=50)
embeddings = openai.embeddings.create(input=chunks, model="text-embedding-3-large")
search_client.upload_documents([
    {
        "id": f"{doc_id}-chunk-{i}",
        "content": chunk,
        "embedding": embedding.embedding,  # the vector is on .embedding of each response item
        "source": doc_metadata["filename"],
        "category": doc_metadata["category"]
    }
    for i, (chunk, embedding) in enumerate(zip(chunks, embeddings.data))
])
# Retrieval (at query time)
results = search_client.search(
    search_text=user_query,
    vector_queries=[VectorizedQuery(vector=query_embedding, k_nearest_neighbors=5, fields="embedding")],
    query_type=QueryType.SEMANTIC,
    semantic_configuration_name="my-semantic-config",
    top=3
)
How — Data Lake Gen2 + Microsoft Fabric (Analytical Memory)
Bronze/Silver/Gold medallion architecture:
Bronze: raw inputs (PDFs, API payloads, LLM request/response logs) → append-only
Silver: cleaned, parsed, PII-scrubbed, schema-validated data
Gold: model training features, evaluation datasets, business metrics
Microsoft Fabric unifies:
- Data Factory — scheduled pipelines (bronze → silver → gold nightly ETL)
- Synapse Analytics — SQL and Spark for large-scale feature engineering
- Power BI — dashboards for AI platform metrics (token usage, accuracy trends)
When
- Redis: Cache repeated prompts, store rate limit counters, agent coordination locks.
- Cosmos DB: Any stateful agent that needs to remember context across turns or days.
- AI Search vector: All RAG knowledge bases. The only real alternatives are pgvector (niche) and Pinecone (external).
- Data Lake: Everything that needs to be kept long-term for training, compliance, or audit.
- Fabric: When your data team needs SQL/BI on top of the data lake. Otherwise ADF + Synapse alone is fine.
Who owns it
- AI engineering — Redis cache strategy, Cosmos DB schema, AI Search index design
- Data engineering — Data Lake zones, ADF pipelines, Fabric workspaces
- Platform engineering — Redis SKU sizing, Cosmos DB RU provisioning, DR configuration
Key Notes
- Cosmos DB partition key is critical for performance. `/userId` for session store, `/tenantId` for multi-tenant.
- AI Search: `text-embedding-3-large` (3072 dimensions) outperforms `ada-002` (1536) on most benchmarks.
- Redis prompt caching: TTL should match the volatility of your knowledge base. RAG knowledge = 1hr. Static prompts = 24hr.
- Data Lake Gen2 = ADLS Gen2. Hierarchical namespace enabled. RBAC + ACLs at folder level.
- Fabric licensing: F2 SKU (~$262/month) is the entry point. Includes Power BI Premium features.
- Cosmos DB serverless: max 5000 RU/s burst. Fine for dev, risky for high-traffic production (use provisioned).
- Remember: AI Search is in both Layer 4 (as an AI service) and Layer 5 (as vector memory). It's one service doing two jobs — retrieval API and vector store simultaneously.
Layer 06 · Messaging & Integration
The nervous system — decouples agents, streams events, and routes human approval workflows
What
Messaging services decouple the components of your AI system so they can scale, fail, and evolve independently. This layer handles:
- Async agent communication — one agent triggers another without waiting
- Event streaming — high-throughput audit logs, telemetry, and real-time AI context
- Reactive triggers — events that fire AI pipelines automatically
- Human-in-the-loop — routing decisions to humans when confidence is low
Services: Service Bus · Event Hub · Event Grid · Logic Apps
Why
Synchronous AI pipelines fail at scale:
- A 5-second LLM call blocks downstream services
- A model timeout takes down the entire request chain
- No retry = permanent data loss
- Tight coupling = impossible to upgrade individual components
At MortgageIQ: loan document upload triggers a Document Intelligence extraction, which triggers PII scrubbing, which triggers embedding, which triggers the RAG index update — all async via Event Grid + Service Bus. No component knows the others exist.
How
Service Bus — Durable Async Queues
Queue pattern: one sender, one receiver
Topic/subscription pattern: one sender, many receivers (fan-out)
AI use cases:
- LLM inference request queue (buffer spikes, prevent token limit overruns)
- Human review queue (confidence < threshold → route to agent for human approval)
- Retraining trigger queue (drift alert → enqueue retraining job)
- Dead letter queue: failed LLM calls with full payload for manual review
Session-enabled queues: guarantee FIFO ordering for multi-turn conversation processing
Message lock: 5 min default. Extend for long-running LLM chains.
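A sketch of the low-confidence handoff into the human review queue with the azure-servicebus SDK (queue name reused from the examples in this article; the connection-string variable is a placeholder):

import json
import os

from azure.servicebus import ServiceBusClient, ServiceBusMessage

def enqueue_for_review(request_id: str, response: str, confidence: float) -> None:
    if confidence >= 0.7:
        return  # above threshold → release directly, no human needed
    with ServiceBusClient.from_connection_string(os.environ["SB_CONN_STR"]) as client:
        with client.get_queue_sender("loan-review-queue") as sender:
            sender.send_messages(ServiceBusMessage(
                json.dumps({"requestId": request_id,
                            "response": response,
                            "confidence": confidence}),
                session_id=request_id,  # session-enabled queue → FIFO per conversation
            ))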
Event Hub — High-Throughput Streaming
import json
from datetime import datetime, timezone

from azure.eventhub import EventData

# producer is an EventHubProducerClient from azure.eventhub.aio, created at startup
event_data = EventData(json.dumps({
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "requestId": request_id,
    "model": "gpt-4o",
    "promptTokens": usage.prompt_tokens,
    "completionTokens": usage.completion_tokens,
    "latencyMs": latency_ms,
    "groundednessScore": groundedness_score,
    "userId": user_id,
    "sessionId": session_id
}))
await producer.send_batch([event_data])
- Retention: 1-7 days standard, 90 days with dedicated cluster
- Throughput: 1 MB/s ingress per throughput unit (Standard; up to 40 TUs per namespace). 32 partitions = massive parallelism
- Downstream: Event Hub → Azure Monitor (alerts), → Synapse (analytics), → SIEM (Sentinel)
- SKU: Standard (1 TU = 1 MB/s, $0.028/hr) → Premium for mission-critical
Event Grid — Reactive Triggers
Blob upload → Event Grid → Azure Function → Document Intelligence extraction
Model registry update → Event Grid → Logic App → Slack notification to ML team
Cosmos DB change feed → Event Grid → reindex in AI Search
Event Grid is push-based (vs Event Hub which is pull-based). Use Event Grid for reactive triggers, Event Hub for streaming data.
Logic Apps — Human-in-the-Loop
AI agent flags low-confidence response (score < 0.7)
→ Service Bus message → Logic App trigger
→ Teams Adaptive Card sent to reviewer
→ Reviewer approves/edits
→ Response released to user
→ Decision logged to Cosmos DB for future training
When
| Service | Use When |
|---|---|
| Service Bus | Agent-to-agent async, reliable delivery, dead-letter, FIFO ordering needed |
| Event Hub | High-volume streaming: audit logs, telemetry, SIEM ingestion (>1M events/day) |
| Event Grid | Reactive event routing: blob triggers, change events, webhook fan-out |
| Logic Apps | Human approval workflows, low-code integration with external SaaS |
Who owns it
- Platform engineering — Service Bus topology, Event Hub partitioning, retention config
- AI engineering — message schemas, dead-letter handling, consumer group design
- Operations — Logic App approval workflows, on-call alert routing
Key Notes
- Service Bus dead-letter queue: process these daily. They represent failed AI calls — critical for debugging.
- Event Hub consumer groups: one per downstream consumer (analytics, SIEM, retraining). Never share a consumer group.
- Event Grid vs Service Bus: Grid is push/reactive (for triggers), Bus is pull/durable (for reliable delivery). Different tools, different jobs.
- Logic Apps Standard = Azure Functions runtime under the hood. Better for complex workflows.
- At MortgageIQ: Service Bus queue absorbs document upload spikes. Average 5K loans/month, but month-end spikes to 800/day. Queue smooths it.
- Remember: Kafka vs Event Hub. Event Hub has a Kafka-compatible endpoint. You can point Kafka producers/consumers at Event Hub without code changes — useful for migrating from self-managed Kafka.
Layer 07 · Compute & Infrastructure
Where your agent microservices actually run — from serverless functions to GPU training clusters
What
Compute is the infrastructure that executes your agent code. Five options on Azure, each optimized for a different workload profile:
| Service | Best For |
|---|---|
| Azure Functions | Lightweight event-driven triggers, adapters, simple agent handlers |
| AKS | Production agent microservices requiring HA, scaling, full Kubernetes control |
| Container Apps | Serverless containers — simpler than AKS, KEDA-based autoscaling |
| Azure ML Compute | GPU clusters for fine-tuning, training jobs, batch inference |
| Blob Storage | Source documents, model artifacts, pipeline checkpoints |
Why
Compute choice determines:
- Cost — Functions scale to zero; AKS has minimum node costs
- Latency — cold starts on Functions (100-3000ms) vs always-on AKS pods
- Operational complexity — AKS > Container Apps > Functions in ops burden
- GPU access — only Azure ML Compute and AKS with GPU node pools provide this
How
Azure Functions — Serverless Triggers
import azure.functions as func

app = func.FunctionApp()

@app.function_name("DocumentIndexer")
@app.blob_trigger(arg_name="blob", path="uploads/{name}", connection="StorageConn")
async def index_document(blob: func.InputStream):
    content = blob.read()
    chunks = chunk_text(content)
    embeddings = await embed(chunks)
    await search_client.upload_documents(embeddings)
- SKU: Consumption (pay per invocation, cold starts) vs Premium (always warm, VNet, $0.173/hr)
- Premium plan for AI agents: eliminates cold starts, enables VNet integration for private endpoint access
- Max execution time: 10 min (Consumption), unlimited (Premium/Dedicated)
AKS — Production Agent Containers
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: loan-agent-hpa
spec:
  scaleTargetRef:          # the agent Deployment to scale (name assumed)
    apiVersion: apps/v1
    kind: Deployment
    name: loan-agent
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
- Node SKUs: `Standard_D4s_v5` (4 vCPU, 16GB) for CPU agents; `Standard_NC6s_v3` (GPU) for inference.
- AKS with Azure CNI Overlay for private networking — all pods on VNet.
- Workload Identity (replaces pod identity) gives pods Managed Identity access to Key Vault.
Container Apps — Simpler than AKS
rules:
- name: queue-based-scaling
type: azure-servicebus
metadata:
queueName: loan-review-queue
messageCount: "10"
- No node management. Microsoft manages the underlying infrastructure.
- KEDA built-in: scale to zero on Service Bus queue depth, HTTP traffic, or custom metrics.
- Good for: background processors, API adapters, document intelligence pipelines.
- Not good for: stateful workloads, custom networking requirements, GPU workloads.
Azure ML Compute — Training and Fine-Tuning
from azure.ai.ml import command
from azure.ai.ml.entities import AmlCompute

compute = AmlCompute(
    name="gpu-cluster",
    size="Standard_NC6s_v3",  # 1x V100 GPU, $3.06/hr
    min_instances=0,
    max_instances=4,
    idle_time_before_scale_down=120
)
ml_client.compute.begin_create_or_update(compute)  # provision; scales to zero when idle

job = command(
    code="./src",
    command="python train.py --epochs 3 --lr 2e-5",
    environment="azureml:pytorch-gpu:1",
    compute="gpu-cluster",
    experiment_name="gpt4o-mortgage-finetune"
)
ml_client.jobs.create_or_update(job)
- Scale to zero when idle = no cost between training runs.
- GPU options: `NC6s_v3` (V100, $3.06/hr) → `NC24ads_A100_v4` (A100, $9.07/hr)
- For fine-tuning in Foundry: GPU cluster provisioned automatically — you only specify the hyperparameters.
When to use which
Lightweight trigger (<30s, event-driven) → Azure Functions (Consumption)
API with VNet requirements, always warm → Azure Functions Premium
Production microservice, full control → AKS
Async worker, scale-to-zero, simple ops → Container Apps
ML training, fine-tuning, batch inference → Azure ML Compute
Who owns it
- Platform engineering / SRE — AKS cluster management, node pools, upgrade cycles
- AI engineering — Container Apps and Functions for agent workloads
- ML engineering — Azure ML Compute clusters, training job configs
Key Notes
- Container Apps environment = shared infrastructure (VNet, Log Analytics). Multiple apps per environment.
- AKS Workload Identity: eliminates pod-level service principals. Federated identity to Entra ID.
- Azure ML Compute auto-scales to zero. Always set `min_instances=0` for training clusters to avoid idle GPU spend.
- Functions Premium plan is required for VNet integration — critical for accessing private endpoints.
- Blob Storage redundancy: ZRS (zone-redundant) for HA, GRS for DR. For model artifacts: ZRS is sufficient.
- Cost trap: NC-series GPU VMs never scale below 1 node if `min_instances=1`. Always set to 0 for batch workloads.
Layer 08 · MLOps & Observability
Keeps models accurate, reliable, and improving over time — version everything, measure everything
What
MLOps is the set of practices and tools that treat AI systems like production software:
- Version control for models and data (not just code)
- CI/CD for ML pipelines and model deployments
- Monitoring for model accuracy, drift, latency, and token spend
- Evaluation for quality gates before any model change goes to production
- Explainability for regulatory compliance and debugging
Services: Azure ML · Application Insights + Monitor · Foundry Evaluations · Responsible AI Dashboard
Why
Without MLOps:
- You don't know when your model started degrading (no drift detection)
- You can't reproduce a past model (no versioning)
- You can't prove a model change improved things (no evaluation gates)
- You can't explain a decision to an auditor (no explainability)
At MortgageIQ: RESPA requires documented AI decision trails. Responsible AI Dashboard + Cosmos DB audit logs are the compliance answer.
How
Azure Machine Learning — Model Registry
model = ml_client.models.create_or_update(
Model(
path="./outputs/model",
name="mortgage-risk-xgboost",
version="2.1.0",
description="XGBoost trained on 2025 loan data, MLflow metrics attached",
tags={"accuracy": "0.912", "dataset_version": "2025-Q4", "approved_by": "ml-lead"}
)
)
MLflow on Azure ML — Experiment Tracking
with mlflow.start_run(run_name="xgboost-v2-hyperdrive"):
mlflow.log_params({"n_estimators": 200, "max_depth": 6, "learning_rate": 0.1})
mlflow.log_metrics({
"train_accuracy": 0.942,
"val_accuracy": 0.912,
"val_auc": 0.961
})
mlflow.xgboost.log_model(model, "model")
mlflow.log_artifact("shap_summary.png")
Every experiment automatically versioned. Compare runs side-by-side in Azure ML Studio UI.
Application Insights — AI Agent Observability
from opentelemetry import trace

# configure_azure_monitor() from the azure-monitor-opentelemetry package
# wires these spans to Application Insights.
tracer = trace.get_tracer("loan-agent")
with tracer.start_as_current_span("llm-call") as span:
span.set_attribute("model", "gpt-4o")
span.set_attribute("prompt_tokens", 847)
span.set_attribute("session_id", session_id)
response = await openai_client.chat.completions.create(...)
span.set_attribute("completion_tokens", response.usage.completion_tokens)
span.set_attribute("latency_ms", latency)
span.set_attribute("groundedness_score", groundedness)
Key metrics to track:
- Token usage per request (cost attribution)
- Latency p50/p95/p99 per model call
- Groundedness score per response
- Error rate by error type (content filter hit, timeout, model error)
Foundry Evaluations — Quality Gates
evaluation = Evaluation(
display_name="gpt4o-mortgage-eval-2026-03",
evaluators={
"groundedness": GroundednessEvaluator(model_config=model_config),
"coherence": CoherenceEvaluator(model_config=model_config),
"relevance": RelevanceEvaluator(model_config=model_config)
},
evaluator_configurations={
"groundedness": EvaluatorConfiguration(threshold={"groundedness": 4.0})
}
)
# Gate: block deployment if groundedness < 4.0/5.0
Responsible AI Dashboard — Explainability
- Fairness analysis: model performance disaggregated by protected attributes (loan type, geography)
- Error analysis: which data cohorts have highest error rates?
- SHAP values: feature importance per prediction (see the sketch after this list)
- Counterfactual analysis: what would change this decision?
- At MortgageIQ: required by ECOA (Equal Credit Opportunity Act). Dashboard output attached to model card.
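For the SHAP bullet above, a minimal sketch — assuming a trained XGBoost model `model` and a validation feature frame `X_val` (both hypothetical here):

import shap

explainer = shap.TreeExplainer(model)              # works for XGBoost/LightGBM tree models
shap_values = explainer.shap_values(X_val)         # per-prediction feature attributions
shap.summary_plot(shap_values, X_val, show=False)  # the shap_summary.png logged in the MLflow run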
Drift Detection (automated)
monitor = MonitorDefinition(
compute=ServerlessSparkCompute(instance_type="standard_e4s_v3"),
monitoring_signals={
"data_drift": DataDriftSignal(
reference_data=ReferenceData(
type=MonitorInputDataType.FIXED,
input_data=Input(type=AssetTypes.MLTABLE, path="azureml:ref-dataset:1")
),
target_dataset=TargetDataset(
dataset=MonitoringTarget(
endpoint_deployment_id="azureml:loan-endpoint:loan-scorer"
)
),
metric_thresholds=DataDriftMetricThreshold(
numerical=NumericalDriftMetrics(wasserstein_distance=0.2),
categorical=CategoricalDriftMetrics(chi_squared_test=0.1)
)
)
},
alert_notification=AlertNotification(emails=["ml-ops@mortgageiq.com"])
)
When
- Azure ML: Every model training run, fine-tuning, and deployment. Non-negotiable.
- App Insights: Every production agent invocation. Configure from day one, not as an afterthought.
- Foundry Evals: Every model version change and every prompt change in production.
- Responsible AI: Every model that influences a regulated decision (credit, healthcare, hiring).
- Drift detection: Weekly for stable models, daily for high-velocity production models.
Who owns it
- ML engineering — experiment tracking, model registry, drift detection
- Platform / SRE — Application Insights dashboards, alert routing, on-call runbooks
- Compliance / Legal — Responsible AI Dashboard approval for regulated models
Key Notes
- MLflow is the default tracking SDK in Azure ML. All runs auto-logged to the workspace.
- Champion/challenger: deploy new model to 10% traffic, compare metrics for 48h, promote or rollback.
- Groundedness ≥ 4.0/5.0 and relevance ≥ 4.0/5.0 are typical quality gate thresholds for enterprise RAG.
- Wasserstein distance > 0.2 = significant data drift. Chi-squared p-value < 0.05 = categorical drift.
- App Insights + Log Analytics = KQL. Learn this: `traces | where customDimensions["model"] == "gpt-4o" | summarize avg(todouble(customDimensions["latency_ms"])) by bin(timestamp, 1h)`.
- Remember: MLOps ≠ DevOps. Code versioning is git. Model versioning is Azure ML registry. Data versioning is Data Lake + Azure ML datasets. Three separate versioning concerns.
Layer 09 · Governance, Security & Compliance
The horizontal wrapper — no service is public, no secret is hardcoded, no action is unaudited
What
Governance is not a layer you add at the end. It is a horizontal concern that wraps every other layer. It enforces:
- Zero secrets in code — Key Vault
- Data visibility and lineage — Purview
- Continuous threat detection — Defender for Cloud
- Compliance guardrails at infrastructure level — Azure Policy
- No AI service on public internet — Private Endpoints
Why
AI systems introduce new governance risks that traditional systems don't have:
- Model outputs can leak PII from training data
- Prompt injection can exfiltrate data from RAG corpora
- Token spend can spiral uncontrolled without governance
- Regulators (OCC, CFPB, HIPAA, GDPR) require audit trails for AI decisions
- A single publicly-exposed AI endpoint is a direct attack surface
At MortgageIQ: Fannie Mae and RESPA compliance required every AI decision logged, PII scrubbed before the model sees it, all services on private VNet, and model outputs auditable for 7 years.
How
Key Vault — Zero Secrets in Code
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient
credential = DefaultAzureCredential() # Uses Managed Identity in Azure
client = SecretClient(vault_url="https://miq-prod-kv.vault.azure.net/", credential=credential)
openai_key = client.get_secret("azure-openai-key").value
- All AI service keys, connection strings, and certificates stored in Key Vault
- Managed Identity assigned to each compute resource (Functions, AKS pods, Container Apps)
- Key Vault access policies → replaced by Azure RBAC: `Key Vault Secrets User` role
- Soft delete + purge protection: deleted secrets recoverable for 90 days
Purview — Data Lineage and PII Classification
Purview scans:
Azure SQL → classifies columns: SSN, DOB, Account Number
Data Lake → classifies files: contains PII
Azure OpenAI → tracks: what data was used in fine-tuning?
Lineage graph:
1003 Form PDF → Doc Intelligence → Cosmos DB → AI Search → GPT-4o prompt
Purview captures every hop. Regulator can see exactly where borrower data traveled.
- Data Map: catalog of all data assets with automatic sensitivity classification
- Sensitivity labels: Public, Internal, Confidential, Highly Confidential — synced from Microsoft 365
- Purview + Azure Policy: prevent data labeled "Highly Confidential" from being sent to external endpoints
Defender for Cloud — Continuous Security Posture
- Secure Score: quantified security posture (0-100). Target: 80+ for regulated workloads.
- Defender for APIs: monitors APIM endpoints for anomalous patterns
- Defender for Storage: scans Blob Storage for malware on upload — critical for document ingestion pipelines
- Recommendations: auto-generated list of security gaps with remediation steps
Azure Policy — Infrastructure Compliance Guardrails
{
"mode": "All",
"policyRule": {
"if": {
"allOf": [
{"field": "type", "equals": "Microsoft.CognitiveServices/accounts"},
{"field": "location", "notIn": ["eastus", "eastus2", "westus2"]}
]
},
"then": {"effect": "Deny"}
}
}
Common AI governance policies:
- Require private endpoints on all Cognitive Services
- Deny public network access on Azure OpenAI
- Require CMK (Customer Managed Keys) for AI Search
- Enforce tagging: `CostCenter`, `Environment`, `DataClassification` on all resources
Private Endpoints — Zero Public Internet Exposure
Every AI service should have:
publicNetworkAccess: Disabled
privateEndpointConnections: [one per VNet]
Topology:
AKS (agent pods) → Private Endpoint → Azure OpenAI [no internet hop]
AKS (agent pods) → Private Endpoint → AI Search [no internet hop]
AKS (agent pods) → Private Endpoint → Cosmos DB [no internet hop]
APIM → Public Internet (this is the only intended public surface)
- Azure Private DNS Zones: resolve `openai.azure.com` to a private IP within the VNet
- Hub-spoke VNet: private endpoints in hub VNet, peered to spoke VNets for each environment
When
- Key Vault: Day one. If you deployed a Cognitive Services resource with its key in an env var, fix it before moving on.
- Purview: When you have PII in your data pipeline or face regulatory audit requirements.
- Defender: Enable at subscription level immediately. Cost is negligible vs the risk.
- Azure Policy: Set guardrails at the Management Group level so every subscription inherits them.
- Private Endpoints: Any production AI service. ~$0.01/hr per endpoint — negligible.
Who owns it
- Security / InfoSec team — Key Vault, Defender, private endpoints, network topology
- Compliance / Legal — Purview data catalog, sensitivity labels, retention policies
- Platform engineering — Azure Policy definitions, Management Group hierarchy, subscription governance
- AI engineering — consumes Key Vault (uses Managed Identity), implements PII scrubbing before LLM calls
Key Notes
- Managed Identity eliminates credential rotation. No expiry, no rotation, no secret sprawl.
- Private Endpoint + DNS: without Private DNS Zone, your VNet resources still resolve to public IP. Both required.
- Purview sensitivity labels flow from M365 compliance center. One classification framework across email, SharePoint, and Azure data.
- Defender for Cloud Secure Score deductions: public AI endpoints, missing MFA, unused privileged accounts.
- Azure Policy `DeployIfNotExists` vs `Deny`: Deny blocks immediately. DINE auto-remediates after deployment.
- Remember: "Zero Trust" is not a product. It's a principle: verify every request (Entra ID), minimize access (RBAC), assume breach (private endpoints + Defender). Key Vault + Private Endpoints + Managed Identity is the Azure Zero Trust implementation for AI.
- Customer Managed Keys (CMK): required for Highly Confidential data at rest. Azure OpenAI, AI Search, Cosmos DB all support CMK. Performance cost: <5% for most workloads.
Cross-Cutting: Architecture Questions
What would you do differently if building this from scratch?
- Start with APIM and Entra ID — not with the OpenAI endpoint. Every team skips this and regrets it at compliance audit time.
- Private endpoints from day one — retrofitting private networking is a 2-week project. It's a 2-hour project up front.
- Event Hub for all AI call logging — you will need this for drift detection, cost attribution, and debugging. If you didn't log it at call time, it's gone.
- AI Search for RAG, not custom vector DBs — the hybrid retrieval and semantic ranker in AI Search outperform most custom setups. Don't build what Azure already ships.
How do you handle cost governance for AI?
Token spend visibility: APIM → Event Hub → Log Analytics → Power BI dashboard
Budget alerts: Azure Cost Management budgets at subscription level
Throttling: APIM token-limit policy per subscription key
PTU evaluation: monthly Azure Cost Analysis — if GPT-4o standard spend > $15K/month, model out PTU pricing
Right-sizing: route simple queries to GPT-4o-mini (~20x cheaper than GPT-4o) — see the router sketch below
Cache: Redis prompt cache reduces token spend by 30-40% on common queries
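A right-sizing router is often just a cheap pre-classification call. A sketch, assuming the openai Python SDK against Azure OpenAI (routing prompt and deployment names are illustrative):

import os

from openai import AsyncAzureOpenAI

client = AsyncAzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-06-01",
)

ROUTER_PROMPT = "Classify the user query as SIMPLE or COMPLEX. Reply with exactly one word."

async def route_and_answer(query: str) -> str:
    # Cheap pass: GPT-4o-mini decides whether the expensive model is needed
    verdict = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "system", "content": ROUTER_PROMPT},
                  {"role": "user", "content": query}],
        max_tokens=5,
    )
    model = "gpt-4o" if "COMPLEX" in (verdict.choices[0].message.content or "").upper() else "gpt-4o-mini"
    answer = await client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": query}]
    )
    return answer.choices[0].message.content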
How do you ensure AI outputs are compliant?
Pre-model: PII scrubbing (Azure Language PII detection) before prompt composition — see the sketch below
In-model: system prompt engineering — explicit RESPA/compliance constraints
Post-model: output validation (regex + classifier for prohibited content)
Audit trail: every request/response logged to Cosmos DB (immutable) + Event Hub
Explainability: SHAP values for tabular ML, groundedness scores for RAG
Human review: confidence < threshold → Service Bus queue → Logic App → human
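The pre-model scrub is a one-call operation in the azure-ai-textanalytics SDK. A minimal sketch (endpoint/key variable names are placeholders):

import os

from azure.ai.textanalytics import TextAnalyticsClient
from azure.core.credentials import AzureKeyCredential

ta_client = TextAnalyticsClient(
    endpoint=os.environ["LANGUAGE_ENDPOINT"],
    credential=AzureKeyCredential(os.environ["LANGUAGE_KEY"]),
)

def scrub_pii(text: str) -> str:
    """Return the text with detected PII entities masked, before prompt composition."""
    result = ta_client.recognize_pii_entities([text])[0]
    if result.is_error:
        raise RuntimeError(result.error)
    return result.redacted_text  # e.g. SSNs come back as '***-**-****'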
Kafka vs Azure messaging services?
| | Kafka (self-managed) | Event Hub (Kafka-compatible) | Service Bus |
|---|---|---|---|
| Use case | High-throughput event streaming | Same as Kafka, managed | Reliable message queuing |
| Throughput | Unlimited (scale the cluster) | 1 MB/s per TU, up to 40 TUs/namespace | 80 GB/day (Premium) |
| Operational cost | High (cluster mgmt) | Low (managed) | Low (managed) |
| Ordering | Per partition | Per partition | Per session |
| Migration | N/A | Kafka SDK compatible | Different protocol |
At MortgageIQ (UWM), loan event streaming ran on GCP Pub/Sub. On Azure, Event Hub is the right answer — Kafka SDK compatible, no code changes needed for teams already writing Kafka producers.
Quick Reference — Decision Trees
Which compute for my agent?
Is it event-triggered and < 10 min?
YES → Azure Functions (Consumption or Premium if VNet needed)
NO →
Does it need GPU?
YES → Azure ML Compute (training) or AKS GPU node pool (inference)
NO →
Do I need full Kubernetes control?
YES → AKS
NO → Container Apps (scale to zero, simpler ops)
Which data store for my AI context?
Current request context (short TTL, sub-ms reads) → Redis
Conversation history (hours/days) → Cosmos DB
Domain knowledge for RAG → AI Search (vector)
Long-term analytics / training data → Data Lake Gen2
Reporting and BI on AI metrics → Microsoft Fabric
Which model for my task?
Generate language (explain, summarize, draft) → GPT-4o
High-volume simple generation → GPT-4o-mini
Complex reasoning (math, legal, code) → o1
Predict a number from structured data → XGBoost / Azure ML AutoML
Extract fields from a document → Document Intelligence
Search for relevant documents → AI Search (hybrid + semantic)
Transcribe speech → Azure Speech (STT)
Detect PII in text → Azure Language (PII detection)