Most teams jump straight to Layer 4 (OpenAI) and wonder why they have a demo, not a platform.
A production AI platform is not a model. It is a 9-layer system where the model is the least interesting part.
User Channels → who talks to your AI
API Gateway → who gets in and what's allowed
Agent Orchestration → how your AI thinks and coordinates
AI Services → what cognitive capabilities power it
Data & Memory → what your AI knows and remembers
Messaging → how components stay decoupled
Compute → where it runs
MLOps & Observability → whether it's improving or degrading
Governance → whether it's compliant and secure
Overall Architecture Diagram
Layer 01 · User Channels
Every surface where humans or systems reach your AI platform
What
User channels are the entry points — the frontends, apps, and APIs through which users interact with your AI system. They are not AI themselves; they are the surfaces that route requests down to the AI stack.
Why
Your AI platform is worthless if nobody can reach it. But more importantly: different users need different surfaces. A claims processor uses Teams. A developer uses a REST SDK. A branch manager uses Power Apps. One platform must serve all of them without building five separate AI backends.
How
Each channel connects to the API Gateway (Layer 2), not directly to AI services. This is critical: the channel never has direct access to a model. It fires an HTTPS call to Azure APIM, which enforces auth and routes to the orchestration layer.
User → Channel (React/Teams/Power Apps) → HTTPS → Azure APIM
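A minimal channel-side sketch of that call, assuming a hypothetical APIM hostname (miq-apim.azure-api.net), gateway app scope (api://miq-ai-gateway), and per-channel subscription key — the point is that the channel authenticates to APIM, never to a model endpoint:

import os

import requests
from azure.identity import DefaultAzureCredential

# Entra ID token for the gateway's app scope (scope URI is a placeholder)
credential = DefaultAzureCredential()
token = credential.get_token("api://miq-ai-gateway/.default").token

resp = requests.post(
    "https://miq-apim.azure-api.net/ai/chat",  # APIM front door — never the model endpoint
    headers={
        "Authorization": f"Bearer {token}",
        "Ocp-Apim-Subscription-Key": os.environ["APIM_SUB_KEY"],  # per-channel metering key
    },
    json={"message": "What is the status of loan #ML-2847?"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())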
When to use each channel
| Channel | Use When |
|---|---|
| Web / Mobile App | Custom UX required — React SPA, iOS/Android app with embedded AI chat |
| Teams / Copilot | M365 enterprise — embed AI where employees already work |
| REST / SDK Clients | Developer-to-agent access — Python, .NET, JS SDKs calling AI APIs directly |
| Bot Service | Omnichannel voice + chat — customer support, IVR, email automation |
| Power Platform | Business user self-service — no-code AI flows in Power Apps / Power Automate |
Who owns it
- Web/Mobile: Application team (frontend engineers)
- Teams/Copilot: M365 admin + developer team
- REST/SDK: Platform engineering team — publishes the client SDK
- Bot Service: Conversational AI team
- Power Platform: Business analysts + low-code developers with IT governance
Key Notes
- Never connect a channel directly to an AI service. Always route through APIM.
- Microsoft Copilot Studio is the fastest path to Teams/M365 embedding — no custom code needed for basic scenarios.
- Bot Service supports DirectLine, Microsoft Teams, Slack, email, SMS in one registration.
- Power Platform uses connectors; AI Builder wraps Azure OpenAI for no-code scenarios.
- The channel owns the UX. The orchestration layer owns the AI logic. Keep these concerns separate.
Layer 02 · API Gateway & Security
Single front door — auth, throttling, and prompt safety before any model is touched
What
The API gateway is the first Azure service that processes every AI request. It handles:
- Authentication — who is this caller?
- Authorization — are they allowed to do this?
- Rate limiting — how many tokens/requests per minute?
- Prompt filtering — is this request safe to pass to a model?
On Azure, this layer is: Azure APIM + Entra ID + Azure AI Content Safety.
Why
Without a gateway, every AI service is independently exposed. You get:
- No unified token metering (impossible to control spend)
- No central audit log (can't prove who called what)
- Credentials scattered across apps (security nightmare)
- No jailbreak protection (models are directly exploitable)
At MortgageIQ, the gateway was the first thing built. Every downstream service is on a private VNet. APIM is the only public surface.
How
Azure API Management (APIM):
Inbound policy pipeline:
validate-jwt (Entra ID token)
→ rate-limit-by-key (subscription key or user OID)
→ call Content Safety API
→ route to backend (Foundry endpoint or Azure OpenAI)
Outbound policy pipeline:
→ log to Event Hub (audit trail)
→ return response with usage headers
Entra ID:
- App registrations for each channel (Web, Bot, Teams bot)
- API permissions scoped to minimum required roles
- Managed Identity for service-to-service — no client secrets
- RBAC roles: `Cognitive Services OpenAI User` for inference, `Cognitive Services OpenAI Contributor` for management
Azure AI Content Safety:
- Deployed as a separate Cognitive Services resource
- Called from APIM inbound policy — blocks before model receives request
- Categories: Hate, Violence, Sexual, Self-harm (each rated 0–6)
- Jailbreak detection (separate classifier — enable explicitly)
- Threshold configuration: severity ≥ 2 → block in regulated industries
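The same check can also run in application code before any model call. A minimal sketch with the azure-ai-contentsafety SDK (endpoint/key variable names are placeholders; the ≥ 2 block threshold matches the regulated-industry setting above):

import os

from azure.ai.contentsafety import ContentSafetyClient
from azure.ai.contentsafety.models import AnalyzeTextOptions
from azure.core.credentials import AzureKeyCredential

client = ContentSafetyClient(
    endpoint=os.environ["CONTENT_SAFETY_ENDPOINT"],
    credential=AzureKeyCredential(os.environ["CONTENT_SAFETY_KEY"]),
)

def is_safe(prompt: str, block_at: int = 2) -> bool:
    """True if no category (Hate, Violence, Sexual, Self-harm) reaches the block threshold."""
    result = client.analyze_text(AnalyzeTextOptions(text=prompt))
    return all(c.severity < block_at for c in result.categories_analysis if c.severity is not None)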
When to use
- Always for production AI systems. No exceptions.
- APIM is also your developer portal — publish AI API products with OpenAPI docs for internal consumers.
- Content Safety is especially critical in consumer-facing and regulated (fintech, healthcare) contexts.
Who owns it
- Platform engineering / cloud ops team — APIM policies, rate limits, routing
- Security team — Entra ID app registrations, RBAC assignments, Content Safety thresholds
- Compliance team — audit log retention policy, alert thresholds
Key Notes
- APIM token metering: `azure-openai-token-limit` built-in policy — limits tokens per minute per subscription.
- Entra ID Managed Identity: `az identity create` + assign to APIM. No secrets. No rotation.
- Content Safety thresholds differ by industry: severity ≥ 4 for consumer apps, ≥ 2 for healthcare/fintech.
- APIM has a built-in AI Gateway feature set (2024): semantic caching, PTU load balancing, token tracking.
- Don't use API keys for production. Use OAuth 2.0 client credentials with Managed Identity everywhere.
- APIM SKU: Developer (non-prod) → Standard v2 (prod, $0.30/million calls) → Premium (multi-region, VNet).
Layer 03 · Agent Orchestration
Where your agent logic lives — the brains that coordinate models, tools, and memory
What
Orchestration is the code and platform that decides:
- Which model to call (and when to escalate to a more capable one)
- What tools to invoke (search, database, external APIs)
- How to manage conversation state and memory
- How to chain multiple AI steps into a coherent workflow
- How to handle failures, retries, and fallbacks
On Azure, the orchestration platform is Azure AI Foundry + Semantic Kernel + Prompt Flow.
Why
Models don't think for themselves. A raw call to GPT-4o is autocomplete with good vocabulary. Intelligence emerges from the orchestration layer: the system prompt, the retrieved context, the tool calls, the multi-step reasoning chain.
This is where the architectural leverage is. Most teams ship 90% of the business value here, not in fine-tuning.
How
Azure AI Foundry (control plane):
Foundry Project
├── Connected Resources (Azure OpenAI, AI Search, Storage)
├── Model Deployments (GPT-4o, GPT-4o-mini, o1)
├── Prompt Management (system prompts versioned)
├── Evaluation runs (groundedness, coherence, relevance)
└── Tracing (end-to-end request visibility)
Semantic Kernel (agent SDK):
import os

from semantic_kernel import Kernel
from semantic_kernel.connectors.ai.open_ai import AzureChatCompletion
from semantic_kernel.planners import FunctionCallingStepwisePlanner

kernel = Kernel()
kernel.add_service(AzureChatCompletion(
    service_id="gpt-4o",
    deployment_name="gpt-4o",
    endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"]
))
# Plugins = tools the agent can call
kernel.add_plugin(LoanStatusPlugin(), "LoanStatus")
kernel.add_plugin(DocumentRetrieverPlugin(), "Documents")
# Planner: given a service_id, the agent decides which tools to call
planner = FunctionCallingStepwisePlanner(service_id="gpt-4o")
result = await planner.invoke(kernel, "What is the status of loan #ML-2847?")
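The two plugins above are assumed; a hypothetical LoanStatusPlugin might look like this, using Semantic Kernel's kernel_function decorator:

from typing import Annotated

from semantic_kernel.functions import kernel_function

class LoanStatusPlugin:
    @kernel_function(name="get_loan_status",
                     description="Look up the current processing stage of a loan by its ID")
    def get_loan_status(
        self,
        loan_id: Annotated[str, "Loan identifier, e.g. ML-2847"],
    ) -> str:
        # A real implementation would query the loan system of record
        return f"Loan {loan_id} is in underwriting."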
Prompt Flow (pipeline builder):
- Visual DAG: input → retrieval → prompt composition → model call → output validation
- YAML-defined flows deployable as managed endpoints
- Built-in evaluation nodes (groundedness score, citation check)
- CI/CD integration: flows promoted through dev → staging → prod with evaluation gates
LangChain / AutoGen (OSS frameworks):
- Use when Semantic Kernel doesn't support the pattern (e.g., AutoGen multi-agent debates)
- LangGraph for stateful agent graphs with conditional branching
- Always wrap with Foundry tracing for observability
When to use which
| Tool | Use When |
|---|---|
| Semantic Kernel | Production .NET or Python agents — Microsoft-supported, Foundry-native |
| Prompt Flow | RAG pipelines, batch evaluation, CI/CD-gated deployment |
| LangChain | Rapid prototyping, rich OSS ecosystem needed |
| AutoGen | Multi-agent debate/review patterns, agent-to-agent communication |
Who owns it
- AI engineering team — builds agent logic, prompt templates, tool plugins
- Platform engineering — manages Foundry projects, deployment endpoints, access control
- QA / ML team — runs Foundry evaluations against golden datasets
Key Notes
- Foundry = Microsoft's replacement for Azure OpenAI Studio. All new LLM work goes here.
- Semantic Kernel Planner = auto-routing to tools based on user intent. No hardcoded if/else.
- Prompt Flow evaluation: groundedness = is the answer supported by retrieved context? Crucial for RAG.
- Tool calling (function calling) is how agents interact with external systems. Always define a JSON schema for each tool (see the sketch after this list).
- Multi-agent pattern: orchestrator agent + specialist agents (retrieval agent, reasoning agent, compliance agent).
- Cost lever: GPT-4o-mini for simple routing decisions, GPT-4o for complex reasoning. o1 only for multi-step math/logic.
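The JSON schema mentioned in the tool-calling note might look like this in the OpenAI function-calling format (tool name and fields are illustrative):

loan_status_tool = {
    "type": "function",
    "function": {
        "name": "get_loan_status",
        "description": "Look up the current processing stage of a mortgage loan",
        "parameters": {
            "type": "object",
            "properties": {
                "loan_id": {
                    "type": "string",
                    "description": "Loan identifier, e.g. 'ML-2847'",
                }
            },
            "required": ["loan_id"],
        },
    },
}
# Passed via tools=[loan_status_tool] on a chat.completions.create(...) call.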
Layer 04 · Azure AI Services
The cognitive muscle — every model and API that gives your platform intelligence
What
Azure AI Services are the actual AI capabilities — the trained models and APIs that do the cognitive work. This is the layer most people think of as "AI." It is Layer 4 of 9.
The key services:
- Azure OpenAI — LLMs (GPT-4o, o1, GPT-4o-mini), embeddings, image generation
- Document Intelligence — structured extraction from unstructured documents
- AI Search — hybrid retrieval (keyword + vector) — the RAG backbone
- Speech — STT/TTS, real-time translation
- Language — NER, sentiment, PII detection, summarization
- Vision — image analysis, OCR, spatial understanding
- Model Catalog — Llama, Mistral, Phi — OSS models alongside OpenAI
Why — and How to Choose the Right Capability
The most common mistake: using GPT-4o for everything.
Task → Right Service
─────────────────────────────────────────────────────
Generate/explain language → Azure OpenAI (GPT-4o)
Complex multi-step reasoning → Azure OpenAI (o1)
Fast/cheap generation → Azure OpenAI (GPT-4o-mini)
Extract fields from a PDF → Document Intelligence
Find relevant docs in corpus → AI Search (hybrid)
Transcribe audio → Speech (STT)
Detect PII in text → Language (PII detection)
Read text from image → Vision (OCR) or Doc Intel
Predict a number from data → Azure ML (XGBoost, tabular)
Generate image → Azure OpenAI (DALL-E 3)
How — Azure OpenAI in Detail
Model families and when to use:
| Model | Use Case | Cost (input/output per 1M tokens) |
|---|---|---|
| GPT-4o | Complex reasoning, multi-modal, document understanding | ~$2.50 / $10 |
| GPT-4o-mini | High-volume simple tasks, routing, summarization | ~$0.15 / $0.60 |
| o1 | Multi-step math, code, legal/compliance reasoning | ~$15 / $60 |
| text-embedding-3-large | Vector embeddings for RAG | ~$0.13/M tokens |
Deployment types:
- Standard (pay-per-token): Variable throughput. Use for dev, low-volume prod.
- Provisioned Throughput Units (PTU): Reserved capacity, predictable latency, lower per-token cost at volume. Break-even at ~100M tokens/month.
How — AI Search in Detail
AI Search is the backbone of every RAG pipeline:
1. Indexing pipeline:
Document → chunk (512 tokens, 50 overlap) → embed (text-embedding-3-large)
→ index with vector field + keyword fields
2. Retrieval at query time:
User query → embed → vector search (cosine similarity top-K)
+ keyword search (BM25)
→ hybrid merge (RRF — Reciprocal Rank Fusion)
→ semantic re-ranker (cross-encoder)
→ top-3 chunks returned
3. Prompt composition:
System prompt + retrieved chunks + user query → LLM
SKUs: Basic ($0.101/hr) → Standard S1 ($0.300/hr) → Storage Optimized L1 ($2.699/hr). Use S1 for most production RAG.
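Step 3 rarely gets spelled out in code, so here is a minimal composition sketch (prompt wording and field names are illustrative):

def compose_prompt(system_prompt: str, chunks: list[dict], user_query: str) -> list[dict]:
    # Label each chunk with its source so the model can cite it
    context = "\n\n".join(f"[{c['source']}]\n{c['content']}" for c in chunks)
    return [
        {"role": "system",
         "content": f"{system_prompt}\n\nAnswer ONLY from the context below.\n\n{context}"},
        {"role": "user", "content": user_query},
    ]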
How — Document Intelligence in Detail
- Prebuilt models: Invoice, Receipt, W-2, Health Insurance Card, US Mortgage (1003, HUD-1)
- Custom models: train on your documents (min 5 labeled samples, 50 recommended)
- At MortgageIQ: used the `prebuilt-mortgage` model to extract loan fields from 1003 applications — 94% field accuracy, replacing a manual data entry step that took 3 hours per loan.
- Output: JSON key-value pairs with confidence scores. Confidence < 0.7 → route to human review.
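A hedged extraction-plus-routing sketch using the azure-ai-formrecognizer SDK — the US mortgage prebuilts may require the newer Document Intelligence API version, and the exact model ID shown is an assumption:

import os

from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential

client = DocumentAnalysisClient(
    endpoint=os.environ["DOC_INTEL_ENDPOINT"],
    credential=AzureKeyCredential(os.environ["DOC_INTEL_KEY"]),
)

with open("loan-1003.pdf", "rb") as f:
    poller = client.begin_analyze_document("prebuilt-mortgage.us.1003", document=f)
result = poller.result()

for name, field in result.documents[0].fields.items():
    if field.confidence is not None and field.confidence < 0.7:
        print(f"{name}: {field.content} → human review queue")  # below the 0.7 STP threshold
    else:
        print(f"{name}: {field.content} → straight-through")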
When
Always use the most specialized service for the job:
- Don't use GPT-4o to extract invoice totals → use Document Intelligence
- Don't use GPT-4o to search documents → use AI Search hybrid retrieval + GPT-4o for synthesis
- Don't use GPT-4o to detect PII → use Language PII detection (faster, cheaper, more accurate for NER tasks)
Who owns it
- AI engineering — model selection, prompt design, fine-tuning
- Platform engineering — endpoint deployment, quota management, PTU reservations
- Finance/FinOps — token budget governance, PTU vs pay-per-token decision
Key Notes
- GPT-4o is multi-modal (text + image + audio). GPT-4o-mini handles text and image input, but not audio.
- o1 uses internal chain-of-thought — you pay for reasoning tokens. Don't use for simple tasks.
- AI Search RRF: combines vector + keyword scores. Consistently beats vector-only retrieval on recall.
- Semantic ranker (L2 re-ranking) in AI Search adds ~50-100ms but significantly improves precision.
- PTU break-even: if you're spending >$15K/month on standard, PTU is likely cheaper.
- Model Catalog: Phi-3-mini is 3.8B parameters — fits on a 4GB GPU, great for on-device/edge scenarios.
- Document Intelligence confidence threshold: 0.7 is industry standard for STP (straight-through processing). Below that = human review queue.
Layer 05 · Data & Memory
Four memory tiers — working state, conversation history, long-term RAG knowledge, and analytical history
What
AI systems have four distinct memory needs that map to four different Azure services:
| Memory Type | Scope | Service | Latency |
|---|---|---|---|
| Working memory | Current session / request | Redis Cache | <1ms |
| Episodic memory | Conversation history, agent state | Cosmos DB | 2-5ms |
| Semantic memory | Domain knowledge (RAG corpus) | AI Search (vector) | 20-50ms |
| Analytical memory | Historical patterns, training data | Data Lake + Fabric | seconds-minutes |
Why
Without the right memory architecture:
- Agents lose context between turns (no episodic memory) → frustrating user experience
- Agents hallucinate facts not in their training (no semantic memory) → RAG is the fix
- You can't improve the model over time (no analytical memory) → drift goes undetected
- Repeated identical queries cost full token spend (no working memory/cache) → 30-40% avoidable token cost on common queries
How — Cosmos DB (Episodic Memory)
{
"id": "session-{userId}-{timestamp}",
"userId": "user-12345",
"agentId": "loan-advisor",
"turns": [
{"role": "user", "content": "What is my rate?", "timestamp": "2026-03-24T09:00:00Z"},
{"role": "assistant", "content": "Your 30yr fixed rate is 6.875%...", "timestamp": "2026-03-24T09:00:02Z"}
],
"state": {"loanId": "ML-2847", "stage": "underwriting"},
"ttl": 86400
}
- Container: `agent-sessions`, partition key: `/userId`
- SKU: Serverless for dev/test. Provisioned throughput (400-4000 RU/s) for production.
- TTL (time-to-live) on session documents: 24h for transient conversations, indefinite for audit logs.
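A minimal read/write sketch for this session store with the azure-cosmos SDK (account and database names are placeholders):

import os

from azure.cosmos import CosmosClient

container = (
    CosmosClient(os.environ["COSMOS_ENDPOINT"], os.environ["COSMOS_KEY"])
    .get_database_client("agents")
    .get_container_client("agent-sessions")
)

def append_turn(session: dict, role: str, content: str, timestamp: str) -> None:
    session["turns"].append({"role": role, "content": content, "timestamp": timestamp})
    container.upsert_item(session)  # partition key /userId is resolved from the document

def load_session(session_id: str, user_id: str) -> dict:
    return container.read_item(item=session_id, partition_key=user_id)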
How — Redis Cache (Working Memory)
cache_key = f"prompt:{hash(system_prompt + user_query)}"
cached_response = redis_client.get(cache_key)
if cached_response:
return cached_response # 0 tokens spent
response = await openai_client.chat.completions.create(...)
redis_client.setex(cache_key, 3600, response.choices[0].message.content)
- Azure Cache for Redis: C2 (6GB) → P1 (6GB, premium, persistence) → P5 (26GB, cluster)
- Cache hit rate of 30-40% on common queries = 30-40% token cost reduction
- Also used for: rate limit counters, distributed locks for agent coordination, feature flag values
How — AI Search Vector (Semantic Memory)
from azure.search.documents.models import QueryType, VectorizedQuery

# Indexing (runs as pipeline on document upload)
chunks = chunk_document(doc, size=512, overlap=50)
embeddings = openai.embeddings.create(input=chunks, model="text-embedding-3-large")
search_client.upload_documents([
    {
        "id": f"{doc_id}-chunk-{i}",
        "content": chunk,
        "embedding": embedding.embedding,  # the vector is on .embedding of each response item
        "source": doc_metadata["filename"],
        "category": doc_metadata["category"]
    }
    for i, (chunk, embedding) in enumerate(zip(chunks, embeddings.data))
])
# Retrieval (at query time)
results = search_client.search(
    search_text=user_query,
    vector_queries=[VectorizedQuery(vector=query_embedding, k_nearest_neighbors=5, fields="embedding")],
    query_type=QueryType.SEMANTIC,
    semantic_configuration_name="my-semantic-config",
    top=3
)
How — Data Lake Gen2 + Microsoft Fabric (Analytical Memory)
Bronze/Silver/Gold medallion architecture:
Bronze: raw inputs (PDFs, API payloads, LLM request/response logs) → append-only
Silver: cleaned, parsed, PII-scrubbed, schema-validated data
Gold: model training features, evaluation datasets, business metrics
Microsoft Fabric unifies:
- Data Factory — scheduled pipelines (bronze → silver → gold nightly ETL)
- Synapse Analytics — SQL and Spark for large-scale feature engineering
- Power BI — dashboards for AI platform metrics (token usage, accuracy trends)
When
- Redis: Cache repeated prompts, store rate limit counters, agent coordination locks.
- Cosmos DB: Any stateful agent that needs to remember context across turns or days.
- AI Search vector: All RAG knowledge bases. The only real alternatives are pgvector (niche) and Pinecone (external).
- Data Lake: Everything that needs to be kept long-term for training, compliance, or audit.
- Fabric: When your data team needs SQL/BI on top of the data lake. Otherwise ADF + Synapse alone is fine.
Who owns it
- AI engineering — Redis cache strategy, Cosmos DB schema, AI Search index design
- Data engineering — Data Lake zones, ADF pipelines, Fabric workspaces
- Platform engineering — Redis SKU sizing, Cosmos DB RU provisioning, DR configuration
Key Notes
- Cosmos DB partition key is critical for performance. `/userId` for session store, `/tenantId` for multi-tenant.
- AI Search: `text-embedding-3-large` (3072 dimensions) outperforms `ada-002` (1536) on most benchmarks.
- Redis prompt caching: TTL should match the volatility of your knowledge base. RAG knowledge = 1hr. Static prompts = 24hr.
- Data Lake Gen2 = ADLS Gen2. Hierarchical namespace enabled. RBAC + ACLs at folder level.
- Fabric licensing: F2 SKU (~$262/month) is the entry point. Includes Power BI Premium features.
- Cosmos DB serverless: max 5000 RU/s burst. Fine for dev, risky for high-traffic production (use provisioned).
- Remember: AI Search is in both Layer 4 (as an AI service) and Layer 5 (as vector memory). It's one service doing two jobs — retrieval API and vector store simultaneously.
Layer 06 · Messaging & Integration
The nervous system — decouples agents, streams events, and routes human approval workflows
What
Messaging services decouple the components of your AI system so they can scale, fail, and evolve independently. This layer handles:
- Async agent communication — one agent triggers another without waiting
- Event streaming — high-throughput audit logs, telemetry, and real-time AI context
- Reactive triggers — events that fire AI pipelines automatically
- Human-in-the-loop — routing decisions to humans when confidence is low
Services: Service Bus · Event Hub · Event Grid · Logic Apps
Why
Synchronous AI pipelines fail at scale:
- A 5-second LLM call blocks downstream services
- A model timeout takes down the entire request chain
- No retry = permanent data loss
- Tight coupling = impossible to upgrade individual components
At MortgageIQ: loan document upload triggers a Document Intelligence extraction, which triggers PII scrubbing, which triggers embedding, which triggers the RAG index update — all async via Event Grid + Service Bus. No component knows the others exist.
How
Service Bus — Durable Async Queues
Queue pattern: one sender, one receiver
Topic/subscription pattern: one sender, many receivers (fan-out)
AI use cases:
- LLM inference request queue (buffer spikes, prevent token limit overruns)
- Human review queue (confidence < threshold → route to agent for human approval)
- Retraining trigger queue (drift alert → enqueue retraining job)
- Dead letter queue: failed LLM calls with full payload for manual review
Session-enabled queues: guarantee FIFO ordering for multi-turn conversation processing
Message lock: 5 min default. Extend for long-running LLM chains.
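A sketch of the low-confidence handoff into the human review queue with the azure-servicebus SDK (queue name reused from the examples in this article; the connection-string variable is a placeholder):

import json
import os

from azure.servicebus import ServiceBusClient, ServiceBusMessage

def enqueue_for_review(request_id: str, response: str, confidence: float) -> None:
    if confidence >= 0.7:
        return  # above threshold → release directly, no human needed
    with ServiceBusClient.from_connection_string(os.environ["SB_CONN_STR"]) as client:
        with client.get_queue_sender("loan-review-queue") as sender:
            sender.send_messages(ServiceBusMessage(
                json.dumps({"requestId": request_id,
                            "response": response,
                            "confidence": confidence}),
                session_id=request_id,  # session-enabled queue → FIFO per conversation
            ))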
Event Hub — High-Throughput Streaming
import json
from datetime import datetime, timezone

from azure.eventhub import EventData

# producer is an EventHubProducerClient from azure.eventhub.aio, created at startup
event_data = EventData(json.dumps({
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "requestId": request_id,
    "model": "gpt-4o",
    "promptTokens": usage.prompt_tokens,
    "completionTokens": usage.completion_tokens,
    "latencyMs": latency_ms,
    "groundednessScore": groundedness_score,
    "userId": user_id,
    "sessionId": session_id
}))
await producer.send_batch([event_data])
- Retention: 1-7 days standard, 90 days with dedicated cluster
- Throughput: 1 MB/s ingress per throughput unit (Standard; up to 40 TUs per namespace). 32 partitions = massive parallelism
- Downstream: Event Hub → Azure Monitor (alerts), → Synapse (analytics), → SIEM (Sentinel)
- SKU: Standard (1 TU = 1 MB/s, $0.028/hr) → Premium for mission-critical
Event Grid — Reactive Triggers
Blob upload → Event Grid → Azure Function → Document Intelligence extraction
Model registry update → Event Grid → Logic App → Slack notification to ML team
Cosmos DB change feed → Event Grid → reindex in AI Search
Event Grid is push-based (vs Event Hub which is pull-based). Use Event Grid for reactive triggers, Event Hub for streaming data.
Logic Apps — Human-in-the-Loop
AI agent flags low-confidence response (score < 0.7)
→ Service Bus message → Logic App trigger
→ Teams Adaptive Card sent to reviewer
→ Reviewer approves/edits
→ Response released to user
→ Decision logged to Cosmos DB for future training
When
| Service | Use When |
|---|---|
| Service Bus | Agent-to-agent async, reliable delivery, dead-letter, FIFO ordering needed |
| Event Hub | High-volume streaming: audit logs, telemetry, SIEM ingestion (>1M events/day) |
| Event Grid | Reactive event routing: blob triggers, change events, webhook fan-out |
| Logic Apps | Human approval workflows, low-code integration with external SaaS |
Who owns it
- Platform engineering — Service Bus topology, Event Hub partitioning, retention config
- AI engineering — message schemas, dead-letter handling, consumer group design
- Operations — Logic App approval workflows, on-call alert routing
Key Notes
- Service Bus dead-letter queue: process these daily. They represent failed AI calls — critical for debugging.
- Event Hub consumer groups: one per downstream consumer (analytics, SIEM, retraining). Never share a consumer group.
- Event Grid vs Service Bus: Grid is push/reactive (for triggers), Bus is pull/durable (for reliable delivery). Different tools, different jobs.
- Logic Apps Standard = Azure Functions runtime under the hood. Better for complex workflows.
- At MortgageIQ: Service Bus queue absorbs document upload spikes. Average 5K loans/month, but month-end spikes to 800/day. Queue smooths it.
- Remember: Kafka vs Event Hub. Event Hub has a Kafka-compatible endpoint. You can point Kafka producers/consumers at Event Hub without code changes — useful for migrating from self-managed Kafka.
Layer 07 · Compute & Infrastructure
Where your agent microservices actually run — from serverless functions to GPU training clusters
What
Compute is the infrastructure that executes your agent code. Five options on Azure, each optimized for a different workload profile:
| Service | Best For |
|---|---|
| Azure Functions | Lightweight event-driven triggers, adapters, simple agent handlers |
| AKS | Production agent microservices requiring HA, scaling, full Kubernetes control |
| Container Apps | Serverless containers — simpler than AKS, KEDA-based autoscaling |
| Azure ML Compute | GPU clusters for fine-tuning, training jobs, batch inference |
| Blob Storage | Source documents, model artifacts, pipeline checkpoints |
Why
Compute choice determines:
- Cost — Functions scale to zero; AKS has minimum node costs
- Latency — cold starts on Functions (100-3000ms) vs always-on AKS pods
- Operational complexity — AKS > Container Apps > Functions in ops burden
- GPU access — only Azure ML Compute and AKS with GPU node pools provide this
How
Azure Functions — Serverless Triggers
import azure.functions as func

app = func.FunctionApp()

@app.function_name("DocumentIndexer")
@app.blob_trigger(arg_name="blob", path="uploads/{name}", connection="StorageConn")
async def index_document(blob: func.InputStream):
    content = blob.read()
    chunks = chunk_text(content)
    embeddings = await embed(chunks)
    await search_client.upload_documents(embeddings)
- SKU: Consumption (pay per invocation, cold starts) vs Premium (always warm, VNet, $0.173/hr)
- Premium plan for AI agents: eliminates cold starts, enables VNet integration for private endpoint access
- Max execution time: 10 min (Consumption), unlimited (Premium/Dedicated)
AKS — Production Agent Containers
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: loan-agent-hpa
spec:
  scaleTargetRef:          # the agent Deployment to scale (name assumed)
    apiVersion: apps/v1
    kind: Deployment
    name: loan-agent
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
- Node SKUs: `Standard_D4s_v5` (4 vCPU, 16GB) for CPU agents; `Standard_NC6s_v3` (GPU) for inference.
- AKS with Azure CNI Overlay for private networking — all pods on VNet.
- Workload Identity (replaces pod identity) gives pods Managed Identity access to Key Vault.
Container Apps — Simpler than AKS
rules:
- name: queue-based-scaling
type: azure-servicebus
metadata:
queueName: loan-review-queue
messageCount: "10"
- No node management. Microsoft manages the underlying infrastructure.
- KEDA built-in: scale to zero on Service Bus queue depth, HTTP traffic, or custom metrics.
- Good for: background processors, API adapters, document intelligence pipelines.
- Not good for: stateful workloads, custom networking requirements, GPU workloads.
Azure ML Compute — Training and Fine-Tuning
from azure.ai.ml import command
from azure.ai.ml.entities import AmlCompute

compute = AmlCompute(
    name="gpu-cluster",
    size="Standard_NC6s_v3",  # 1x V100 GPU, $3.06/hr
    min_instances=0,
    max_instances=4,
    idle_time_before_scale_down=120
)
ml_client.compute.begin_create_or_update(compute)  # provision; scales to zero when idle

job = command(
    code="./src",
    command="python train.py --epochs 3 --lr 2e-5",
    environment="azureml:pytorch-gpu:1",
    compute="gpu-cluster",
    experiment_name="gpt4o-mortgage-finetune"
)
ml_client.jobs.create_or_update(job)
- Scale to zero when idle = no cost between training runs.
- GPU options: `NC6s_v3` (V100, $3.06/hr) → `NC24ads_A100_v4` (A100, $9.07/hr)
- For fine-tuning in Foundry: GPU cluster provisioned automatically — you only specify the hyperparameters.
When to use which
Lightweight trigger (<30s, event-driven) → Azure Functions (Consumption)
API with VNet requirements, always warm → Azure Functions Premium
Production microservice, full control → AKS
Async worker, scale-to-zero, simple ops → Container Apps
ML training, fine-tuning, batch inference → Azure ML Compute
Who owns it
- Platform engineering / SRE — AKS cluster management, node pools, upgrade cycles
- AI engineering — Container Apps and Functions for agent workloads
- ML engineering — Azure ML Compute clusters, training job configs
Key Notes
- Container Apps environment = shared infrastructure (VNet, Log Analytics). Multiple apps per environment.
- AKS Workload Identity: eliminates pod-level service principals. Federated identity to Entra ID.
- Azure ML Compute auto-scales to zero. Always set `min_instances=0` for training clusters to avoid idle GPU spend.
- Functions Premium plan is required for VNet integration — critical for accessing private endpoints.
- Blob Storage redundancy: ZRS (zone-redundant) for HA, GRS for DR. For model artifacts: ZRS is sufficient.
- Cost trap: NC-series GPU VMs never scale below 1 node if `min_instances=1`. Always set to 0 for batch workloads.
Layer 08 · MLOps & Observability
Keeps models accurate, reliable, and improving over time — version everything, measure everything
What
MLOps is the set of practices and tools that treat AI systems like production software:
- Version control for models and data (not just code)
- CI/CD for ML pipelines and model deployments
- Monitoring for model accuracy, drift, latency, and token spend
- Evaluation for quality gates before any model change goes to production
- Explainability for regulatory compliance and debugging
Services: Azure ML · Application Insights + Monitor · Foundry Evaluations · Responsible AI Dashboard
Why
Without MLOps:
- You don't know when your model started degrading (no drift detection)
- You can't reproduce a past model (no versioning)
- You can't prove a model change improved things (no evaluation gates)
- You can't explain a decision to an auditor (no explainability)
At MortgageIQ: RESPA requires documented AI decision trails. Responsible AI Dashboard + Cosmos DB audit logs are the compliance answer.
How
Azure Machine Learning — Model Registry
model = ml_client.models.create_or_update(
Model(
path="./outputs/model",
name="mortgage-risk-xgboost",
version="2.1.0",
description="XGBoost trained on 2025 loan data, MLflow metrics attached",
tags={"accuracy": "0.912", "dataset_version": "2025-Q4", "approved_by": "ml-lead"}
)
)
MLflow on Azure ML — Experiment Tracking
with mlflow.start_run(run_name="xgboost-v2-hyperdrive"):
mlflow.log_params({"n_estimators": 200, "max_depth": 6, "learning_rate": 0.1})
mlflow.log_metrics({
"train_accuracy": 0.942,
"val_accuracy": 0.912,
"val_auc": 0.961
})
mlflow.xgboost.log_model(model, "model")
mlflow.log_artifact("shap_summary.png")
Every experiment automatically versioned. Compare runs side-by-side in Azure ML Studio UI.
Application Insights — AI Agent Observability
from opentelemetry import trace

# configure_azure_monitor() from the azure-monitor-opentelemetry package
# wires these spans to Application Insights.
tracer = trace.get_tracer("loan-agent")
with tracer.start_as_current_span("llm-call") as span:
span.set_attribute("model", "gpt-4o")
span.set_attribute("prompt_tokens", 847)
span.set_attribute("session_id", session_id)
response = await openai_client.chat.completions.create(...)
span.set_attribute("completion_tokens", response.usage.completion_tokens)
span.set_attribute("latency_ms", latency)
span.set_attribute("groundedness_score", groundedness)
Key metrics to track:
- Token usage per request (cost attribution)
- Latency p50/p95/p99 per model call
- Groundedness score per response
- Error rate by error type (content filter hit, timeout, model error)
Foundry Evaluations — Quality Gates
evaluation = Evaluation(
display_name="gpt4o-mortgage-eval-2026-03",
evaluators={
"groundedness": GroundednessEvaluator(model_config=model_config),
"coherence": CoherenceEvaluator(model_config=model_config),
"relevance": RelevanceEvaluator(model_config=model_config)
},
evaluator_configurations={
"groundedness": EvaluatorConfiguration(threshold={"groundedness": 4.0})
}
)
# Gate: block deployment if groundedness < 4.0/5.0
Responsible AI Dashboard — Explainability
- Fairness analysis: model performance disaggregated by protected attributes (loan type, geography)
- Error analysis: which data cohorts have highest error rates?
- SHAP values: feature importance per prediction (see the sketch after this list)
- Counterfactual analysis: what would change this decision?
- At MortgageIQ: required by ECOA (Equal Credit Opportunity Act). Dashboard output attached to model card.
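For the SHAP bullet above, a minimal sketch — assuming a trained XGBoost model `model` and a validation feature frame `X_val` (both hypothetical here):

import shap

explainer = shap.TreeExplainer(model)              # works for XGBoost/LightGBM tree models
shap_values = explainer.shap_values(X_val)         # per-prediction feature attributions
shap.summary_plot(shap_values, X_val, show=False)  # the shap_summary.png logged in the MLflow run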
Drift Detection (automated)
monitor = MonitorDefinition(
compute=ServerlessSparkCompute(instance_type="standard_e4s_v3"),
monitoring_signals={
"data_drift": DataDriftSignal(
reference_data=ReferenceData(
type=MonitorInputDataType.FIXED,
input_data=Input(type=AssetTypes.MLTABLE, path="azureml:ref-dataset:1")
),
target_dataset=TargetDataset(
dataset=MonitoringTarget(
endpoint_deployment_id="azureml:loan-endpoint:loan-scorer"
)
),
metric_thresholds=DataDriftMetricThreshold(
numerical=NumericalDriftMetrics(wasserstein_distance=0.2),
categorical=CategoricalDriftMetrics(chi_squared_test=0.1)
)
)
},
alert_notification=AlertNotification(emails=["ml-ops@mortgageiq.com"])
)
When
- Azure ML: Every model training run, fine-tuning, and deployment. Non-negotiable.
- App Insights: Every production agent invocation. Configure from day one, not as an afterthought.
- Foundry Evals: Every model version change and every prompt change in production.
- Responsible AI: Every model that influences a regulated decision (credit, healthcare, hiring).
- Drift detection: Weekly for stable models, daily for high-velocity production models.
Who owns it
- ML engineering — experiment tracking, model registry, drift detection
- Platform / SRE — Application Insights dashboards, alert routing, on-call runbooks
- Compliance / Legal — Responsible AI Dashboard approval for regulated models
Key Notes
- MLflow is the default tracking SDK in Azure ML. All runs auto-logged to the workspace.
- Champion/challenger: deploy new model to 10% traffic, compare metrics for 48h, promote or rollback.
- Groundedness ≥ 4.0/5.0 and relevance ≥ 4.0/5.0 are typical quality gate thresholds for enterprise RAG.
- Wasserstein distance > 0.2 = significant data drift. Chi-squared p-value < 0.05 = categorical drift.
- App Insights + Log Analytics = KQL. Learn this: `traces | where customDimensions["model"] == "gpt-4o" | summarize avg(todouble(customDimensions["latency_ms"])) by bin(timestamp, 1h)`.
- Remember: MLOps ≠ DevOps. Code versioning is git. Model versioning is Azure ML registry. Data versioning is Data Lake + Azure ML datasets. Three separate versioning concerns.
Layer 09 · Governance, Security & Compliance
The horizontal wrapper — no service is public, no secret is hardcoded, no action is unaudited
What
Governance is not a layer you add at the end. It is a horizontal concern that wraps every other layer. It enforces:
- Zero secrets in code — Key Vault
- Data visibility and lineage — Purview
- Continuous threat detection — Defender for Cloud
- Compliance guardrails at infrastructure level — Azure Policy
- No AI service on public internet — Private Endpoints
Why
AI systems introduce new governance risks that traditional systems don't have:
- Model outputs can leak PII from training data
- Prompt injection can exfiltrate data from RAG corpora
- Token spend can spiral uncontrolled without governance
- Regulators (OCC, CFPB, HIPAA, GDPR) require audit trails for AI decisions
- A single publicly-exposed AI endpoint is a direct attack surface
At MortgageIQ: Fannie Mae and RESPA compliance required every AI decision logged, PII scrubbed before the model sees it, all services on private VNet, and model outputs auditable for 7 years.
How
Key Vault — Zero Secrets in Code
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient
credential = DefaultAzureCredential() # Uses Managed Identity in Azure
client = SecretClient(vault_url="https://miq-prod-kv.vault.azure.net/", credential=credential)
openai_key = client.get_secret("azure-openai-key").value
- All AI service keys, connection strings, and certificates stored in Key Vault
- Managed Identity assigned to each compute resource (Functions, AKS pods, Container Apps)
- Key Vault access policies → replaced by Azure RBAC: `Key Vault Secrets User` role
- Soft delete + purge protection: deleted secrets recoverable for 90 days
Purview — Data Lineage and PII Classification
Purview scans:
Azure SQL → classifies columns: SSN, DOB, Account Number
Data Lake → classifies files: contains PII
Azure OpenAI → tracks: what data was used in fine-tuning?
Lineage graph:
1003 Form PDF → Doc Intelligence → Cosmos DB → AI Search → GPT-4o prompt
Purview captures every hop. Regulator can see exactly where borrower data traveled.
- Data Map: catalog of all data assets with automatic sensitivity classification
- Sensitivity labels: Public, Internal, Confidential, Highly Confidential — synced from Microsoft 365
- Purview + Azure Policy: prevent data labeled "Highly Confidential" from being sent to external endpoints
Defender for Cloud — Continuous Security Posture
- Secure Score: quantified security posture (0-100). Target: 80+ for regulated workloads.
- Defender for APIs: monitors APIM endpoints for anomalous patterns
- Defender for Storage: scans Blob Storage for malware on upload — critical for document ingestion pipelines
- Recommendations: auto-generated list of security gaps with remediation steps
Azure Policy — Infrastructure Compliance Guardrails
{
"mode": "All",
"policyRule": {
"if": {
"allOf": [
{"field": "type", "equals": "Microsoft.CognitiveServices/accounts"},
{"field": "location", "notIn": ["eastus", "eastus2", "westus2"]}
]
},
"then": {"effect": "Deny"}
}
}
Common AI governance policies:
- Require private endpoints on all Cognitive Services
- Deny public network access on Azure OpenAI
- Require CMK (Customer Managed Keys) for AI Search
- Enforce tagging: `CostCenter`, `Environment`, `DataClassification` on all resources
Private Endpoints — Zero Public Internet Exposure
Every AI service should have:
publicNetworkAccess: Disabled
privateEndpointConnections: [one per VNet]
Topology:
AKS (agent pods) → Private Endpoint → Azure OpenAI [no internet hop]
AKS (agent pods) → Private Endpoint → AI Search [no internet hop]
AKS (agent pods) → Private Endpoint → Cosmos DB [no internet hop]
APIM → Public Internet (this is the only intended public surface)
- Azure Private DNS Zones: resolve `openai.azure.com` to a private IP within the VNet
- Hub-spoke VNet: private endpoints in hub VNet, peered to spoke VNets for each environment
When
- Key Vault: Day one. If you deployed a Cognitive Services resource with its key in an env var, fix it before moving on.
- Purview: When you have PII in your data pipeline or face regulatory audit requirements.
- Defender: Enable at subscription level immediately. Cost is negligible vs the risk.
- Azure Policy: Set guardrails at the Management Group level so every subscription inherits them.
- Private Endpoints: Any production AI service. ~$0.01/hr per endpoint — negligible.
Who owns it
- Security / InfoSec team — Key Vault, Defender, private endpoints, network topology
- Compliance / Legal — Purview data catalog, sensitivity labels, retention policies
- Platform engineering — Azure Policy definitions, Management Group hierarchy, subscription governance
- AI engineering — consumes Key Vault (uses Managed Identity), implements PII scrubbing before LLM calls
Key Notes
- Managed Identity eliminates credential rotation. No expiry, no rotation, no secret sprawl.
- Private Endpoint + DNS: without Private DNS Zone, your VNet resources still resolve to public IP. Both required.
- Purview sensitivity labels flow from M365 compliance center. One classification framework across email, SharePoint, and Azure data.
- Defender for Cloud Secure Score deductions: public AI endpoints, missing MFA, unused privileged accounts.
- Azure Policy `DeployIfNotExists` vs `Deny`: Deny blocks immediately. DINE auto-remediates after deployment.
- Remember: "Zero Trust" is not a product. It's a principle: verify every request (Entra ID), minimize access (RBAC), assume breach (private endpoints + Defender). Key Vault + Private Endpoints + Managed Identity is the Azure Zero Trust implementation for AI.
- Customer Managed Keys (CMK): required for Highly Confidential data at rest. Azure OpenAI, AI Search, Cosmos DB all support CMK. Performance cost: <5% for most workloads.
Cross-Cutting: Architecture Questions
What would you do differently if building this from scratch?
- Start with APIM and Entra ID — not with the OpenAI endpoint. Every team skips this and regrets it at compliance audit time.
- Private endpoints from day one — retrofitting private networking is a 2-week project. It's a 2-hour project up front.
- Event Hub for all AI call logging — you will need this for drift detection, cost attribution, and debugging. If you didn't log it at call time, it's gone.
- AI Search for RAG, not custom vector DBs — the hybrid retrieval and semantic ranker in AI Search outperform most custom setups. Don't build what Azure already ships.
How do you handle cost governance for AI?
Token spend visibility: APIM → Event Hub → Log Analytics → Power BI dashboard
Budget alerts: Azure Cost Management budgets at subscription level
Throttling: APIM token-limit policy per subscription key
PTU evaluation: monthly Azure Cost Analysis — if GPT-4o standard spend > $15K/month, model out PTU pricing
Right-sizing: route simple queries to GPT-4o-mini (~20x cheaper than GPT-4o) — see the router sketch below
Cache: Redis prompt cache reduces token spend by 30-40% on common queries
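A right-sizing router is often just a cheap pre-classification call. A sketch, assuming the openai Python SDK against Azure OpenAI (routing prompt and deployment names are illustrative):

import os

from openai import AsyncAzureOpenAI

client = AsyncAzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-06-01",
)

ROUTER_PROMPT = "Classify the user query as SIMPLE or COMPLEX. Reply with exactly one word."

async def route_and_answer(query: str) -> str:
    # Cheap pass: GPT-4o-mini decides whether the expensive model is needed
    verdict = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "system", "content": ROUTER_PROMPT},
                  {"role": "user", "content": query}],
        max_tokens=5,
    )
    model = "gpt-4o" if "COMPLEX" in (verdict.choices[0].message.content or "").upper() else "gpt-4o-mini"
    answer = await client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": query}]
    )
    return answer.choices[0].message.content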
How do you ensure AI outputs are compliant?
Pre-model: PII scrubbing (Azure Language PII detection) before prompt composition — see the sketch below
In-model: system prompt engineering — explicit RESPA/compliance constraints
Post-model: output validation (regex + classifier for prohibited content)
Audit trail: every request/response logged to Cosmos DB (immutable) + Event Hub
Explainability: SHAP values for tabular ML, groundedness scores for RAG
Human review: confidence < threshold → Service Bus queue → Logic App → human
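The pre-model scrub is a one-call operation in the azure-ai-textanalytics SDK. A minimal sketch (endpoint/key variable names are placeholders):

import os

from azure.ai.textanalytics import TextAnalyticsClient
from azure.core.credentials import AzureKeyCredential

ta_client = TextAnalyticsClient(
    endpoint=os.environ["LANGUAGE_ENDPOINT"],
    credential=AzureKeyCredential(os.environ["LANGUAGE_KEY"]),
)

def scrub_pii(text: str) -> str:
    """Return the text with detected PII entities masked, before prompt composition."""
    result = ta_client.recognize_pii_entities([text])[0]
    if result.is_error:
        raise RuntimeError(result.error)
    return result.redacted_text  # e.g. SSNs come back as '***-**-****'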
Kafka vs Azure messaging services?
| | Kafka (self-managed) | Event Hub (Kafka-compatible) | Service Bus |
|---|---|---|---|
| Use case | High-throughput event streaming | Same as Kafka, managed | Reliable message queuing |
| Throughput | Unlimited (scale the cluster) | 1 MB/s per TU, up to 40 TUs/namespace | 80 GB/day (Premium) |
| Operational cost | High (cluster mgmt) | Low (managed) | Low (managed) |
| Ordering | Per partition | Per partition | Per session |
| Migration | N/A | Kafka SDK compatible | Different protocol |
At MortgageIQ (UWM), loan event streaming ran on GCP Pub/Sub. On Azure, Event Hub is the right answer — Kafka SDK compatible, no code changes needed for teams already writing Kafka producers.
Quick Reference — Decision Trees
Which compute for my agent?
Is it event-triggered and < 10 min?
YES → Azure Functions (Consumption or Premium if VNet needed)
NO →
Does it need GPU?
YES → Azure ML Compute (training) or AKS GPU node pool (inference)
NO →
Do I need full Kubernetes control?
YES → AKS
NO → Container Apps (scale to zero, simpler ops)
Which data store for my AI context?
Current request context (short TTL, sub-ms reads) → Redis
Conversation history (hours/days) → Cosmos DB
Domain knowledge for RAG → AI Search (vector)
Long-term analytics / training data → Data Lake Gen2
Reporting and BI on AI metrics → Microsoft Fabric
Which model for my task?
Generate language (explain, summarize, draft) → GPT-4o
High-volume simple generation → GPT-4o-mini
Complex reasoning (math, legal, code) → o1
Predict a number from structured data → XGBoost / Azure ML AutoML
Extract fields from a document → Document Intelligence
Search for relevant documents → AI Search (hybrid + semantic)
Transcribe speech → Azure Speech (STT)
Detect PII in text → Azure Language (PII detection)