Your LLM is in production. Can you answer these questions?
- What is the model risk tier assigned to this system?
- Where is the validation report documenting conceptual soundness?
- What does your ongoing monitoring cadence look like — and who reviews it?
- If a regulator audits a specific response from six months ago, can you reproduce the full reasoning chain?
- How do you ensure PHI never appears in a prompt sent to an external model API?
Most enterprise AI teams cannot answer these questions. Not because they don't care about compliance — because compliance is almost always added after the system is built, at the worst possible time.
This post covers AI compliance end to end: what the regulations actually require, how they apply specifically to LLMs, and how to architect compliance into the platform from day one — on both Azure and open-source stacks, with real examples from financial services and healthcare.
What AI Compliance Actually Means
Compliance is not governance. Governance is the internal framework of policies and accountability. Compliance is the demonstrable, auditable proof that your AI system meets specific regulatory requirements — requirements that carry legal, financial, and reputational consequences if violated.
For enterprise AI in 2026, the three regulatory frameworks that matter most are SR 11-7 (model risk management for U.S. financial services), HIPAA (PHI protection in healthcare), and the EU AI Act (risk-tiered obligations for AI placed on the EU market). Each is covered in depth below.
These frameworks are not optional for regulated enterprises. SR 11-7 violations can result in supervisory action from the Federal Reserve. HIPAA violations carry civil and criminal penalties, with annual civil caps of approximately $2M per violation category. EU AI Act non-compliance carries fines of up to €35M or 7% of global annual turnover for prohibited practices, and up to €15M or 3% for most high-risk obligations.
SR 11-7 — Model Risk Management for Financial Services LLMs
What SR 11-7 Is
SR 11-7 is the Federal Reserve and OCC's supervisory guidance on model risk management, issued in 2011 and still the primary framework governing AI/ML models in U.S. financial services. It defines a model as any quantitative method, system, or approach that applies statistical, economic, financial, or mathematical theories to transform input data into output that informs decisions.
LLMs are models under SR 11-7. A mortgage loan assistant that influences underwriting decisions, a fraud detection agent that scores transactions, or a customer service AI that affects loan officer behavior — all fall under SR 11-7's scope.
The Three SR 11-7 Requirements Applied to LLMs
Pillar 1 — Model Development and Documentation
For a traditional ML model, this means documenting training data, feature selection, algorithm choice, and hyperparameters. For an LLM, the documentation requirements translate differently:
| Traditional ML | LLM Equivalent |
|---|---|
| Training data description | Foundation model card (GPT-4o, Claude) + RAG knowledge base description |
| Feature engineering | Prompt design, few-shot examples, retrieval configuration |
| Algorithm selection rationale | Model selection rationale (capability vs cost vs risk) |
| Assumptions and limitations | Context window constraints, hallucination risks, knowledge cutoff |
| Intended use scope | Use case definition, out-of-scope queries, escalation triggers |
| Performance metrics | Groundedness, faithfulness, citation coverage, compliance failure rate |
Conceptual soundness for LLMs means documenting why the design decisions produce reliable outputs for the stated use case. For the MortgageIQ SO agent:
# Model Documentation — SO Mortgage Assistant
## System Classification
Type: Retrieval-Augmented Generation (RAG) system
Foundation model: Azure OpenAI GPT-4o (gpt-4o-2024-11-20)
Risk tier: Tier 2 — Informational AI (does not make credit decisions)
## Intended Use
Provides loan officers with grounded answers about FHA/VA/Conventional
guidelines, loan status, and document requirements. Does NOT approve,
deny, or score loan applications.
## Conceptual Soundness Rationale
RAG architecture selected over fine-tuning because:
1. Guideline documents update quarterly — RAG allows index refresh
without model retraining, reducing knowledge staleness risk
2. Citation-required output format provides traceability to source
documents, satisfying audit requirements
3. Retrieval confidence thresholds (score > 0.78) prevent low-confidence
context from influencing responses
## Known Limitations
- Knowledge cutoff: RAG index reflects guidelines as of last refresh date
- Non-English queries: supported via Azure AI Search language analyzers,
but evaluation dataset is English-only — expanded eval required
- Maximum context window: 128K tokens — queries requiring full loan file
review may require chunking
## Out-of-Scope Queries
- Final credit decisions (routed to human underwriter)
- Rate quotes (routed to pricing engine)
- Legal advice (routed to legal department)
## Escalation Triggers
- Groundedness score < 0.80 → human review flag
- Compliance agent failure after 2 attempts → licensed advisor escalation
- Query classified as credit decision → immediate human routing
Pillar 2 — Model Validation
SR 11-7 requires independent validation — the validation team must be separate from the development team. For LLMs, validation includes:
Benchmarking against a challenger: run the LLM against a simpler baseline (keyword search, rule-based responses) on the same query set. Document where the LLM adds value and where it underperforms.
Outcome analysis: for deployed systems, compare LLM-assisted decisions against outcomes. Did loan officers who used SO make better decisions? Did the agent's citations accurately reflect guideline content?
Challenge process: an independent reviewer attempts to identify failure modes — edge cases, adversarial inputs, boundary conditions — that the development team did not anticipate.
# validation/sr117_validation_suite.py
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision
from datasets import Dataset
import json
from datetime import datetime
class SR117ValidationSuite:
"""
Structured validation suite meeting SR 11-7 requirements.
Produces a signed validation report with findings and recommendations.
"""
def __init__(self, model_name: str, model_version: str, validator_id: str):
self.model_name = model_name
self.model_version = model_version
self.validator_id = validator_id # independent validator, not dev team
self.findings = []
self.validation_date = datetime.utcnow().isoformat()
def run_benchmark_evaluation(
self,
test_dataset: Dataset,
baseline_fn, # challenger: simpler keyword search baseline
model_fn # the LLM under validation
) -> dict:
"""Compare LLM against baseline challenger on held-out test set."""
# Run both
model_results = evaluate(test_dataset, metrics=[
faithfulness, answer_relevancy, context_precision
])
baseline_results = self._run_baseline(test_dataset, baseline_fn)
delta = {
metric: model_results[metric] - baseline_results.get(metric, 0)
for metric in ["faithfulness", "answer_relevancy", "context_precision"]
}
finding = {
"type": "benchmark_comparison",
"model_scores": dict(model_results),
"baseline_scores": baseline_results,
"delta": delta,
"assessment": "PASS" if all(v >= 0 for v in delta.values()) else "CONCERN",
"notes": "LLM must outperform baseline on all metrics to justify adoption"
}
self.findings.append(finding)
return finding
def run_adversarial_challenge(self, challenge_queries: list[dict]) -> dict:
"""
Independent challenge process — test edge cases and failure modes.
challenge_queries: [{query, expected_behavior, risk_category}]
"""
results = []
for cq in challenge_queries:
response = self._invoke_model(cq["query"])
passed = self._evaluate_challenge(response, cq["expected_behavior"])
results.append({
"query": cq["query"],
"risk_category": cq["risk_category"],
"expected": cq["expected_behavior"],
"actual": response,
"passed": passed
})
failure_rate = sum(1 for r in results if not r["passed"]) / len(results)
finding = {
"type": "adversarial_challenge",
"total_queries": len(results),
"failure_rate": failure_rate,
"failures": [r for r in results if not r["passed"]],
"assessment": "PASS" if failure_rate < 0.05 else "FAIL",
"sr117_requirement": "Challenge process must identify material weaknesses"
}
self.findings.append(finding)
return finding
def run_outcome_analysis(self, historical_runs: list[dict]) -> dict:
"""
Analyze production outcomes — were LLM-assisted decisions better?
historical_runs: [{query, response, outcome, ground_truth}]
"""
accuracy = sum(
1 for r in historical_runs
if r["outcome"] == r["ground_truth"]
) / len(historical_runs)
citation_accuracy = sum(
1 for r in historical_runs
if self._citations_verified(r["response"])
) / len(historical_runs)
finding = {
"type": "outcome_analysis",
"sample_size": len(historical_runs),
"outcome_accuracy": accuracy,
"citation_accuracy": citation_accuracy,
"assessment": "PASS" if accuracy >= 0.90 and citation_accuracy >= 0.85 else "CONCERN",
"sr117_requirement": "Outcome analysis required for ongoing monitoring"
}
self.findings.append(finding)
return finding
def generate_validation_report(self) -> dict:
"""
Produce signed validation report — required artifact under SR 11-7.
Must be retained per model risk record-keeping requirements.
"""
overall = "PASS" if all(
f["assessment"] in ("PASS",) for f in self.findings
) else "CONDITIONAL" if all(
f["assessment"] != "FAIL" for f in self.findings
) else "FAIL"
report = {
"report_type": "SR11-7 Model Validation Report",
"model_name": self.model_name,
"model_version": self.model_version,
"validation_date": self.validation_date,
"validator_id": self.validator_id, # independent of dev team
"overall_assessment": overall,
"findings": self.findings,
"recommendations": self._generate_recommendations(),
"next_validation_date": self._compute_next_validation(),
"signature_required": True # wet or digital signature required
}
# Write to immutable audit store
self._write_to_audit_store(report)
return report
def _compute_next_validation(self) -> str:
"""SR 11-7: revalidation on material change or minimum annually."""
from dateutil.relativedelta import relativedelta
next_date = datetime.utcnow() + relativedelta(months=12)
return next_date.isoformat()
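The rollup rule in `generate_validation_report` is worth unit-testing on its own. A minimal standalone sketch of the same rule (PASS only if every finding passes, FAIL if any finding fails, CONDITIONAL otherwise):

```python
def overall_assessment(findings: list[dict]) -> str:
    """Roll up per-finding grades: any FAIL -> FAIL, all PASS -> PASS,
    otherwise (e.g. a CONCERN present) -> CONDITIONAL."""
    grades = [f["assessment"] for f in findings]
    if any(g == "FAIL" for g in grades):
        return "FAIL"
    if all(g == "PASS" for g in grades):
        return "PASS"
    return "CONDITIONAL"

print(overall_assessment([{"assessment": "PASS"}, {"assessment": "CONCERN"}]))  # CONDITIONAL
```

A CONDITIONAL report typically means the model may stay in production while the flagged concerns are remediated on a documented timeline.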
Pillar 3 — Ongoing Monitoring
SR 11-7 requires continuous performance monitoring after deployment. For LLMs, this translates to:
- Quality drift detection: groundedness score trending down over weeks
- Usage pattern monitoring: query types shifting outside validated scope
- Model version changes: when the foundation model provider updates the model, revalidation is triggered
- Knowledge base staleness: RAG index age exceeds refresh SLA
# monitoring/sr117_monitor.py
from dataclasses import dataclass
from datetime import datetime, timedelta
import statistics
@dataclass
class SR117MonitoringAlert:
alert_type: str
severity: str # CRITICAL | WARNING | INFO
metric: str
current_value: float
threshold: float
sr117_requirement: str
action_required: str
timestamp: str
class SR117OngoingMonitor:
"""
Ongoing monitoring implementation satisfying SR 11-7 Pillar 3.
Thresholds are documented in the model's validation report.
"""
THRESHOLDS = {
"groundedness": {"warning": 0.85, "critical": 0.80},
"compliance_failure_rate": {"warning": 0.05, "critical": 0.10},
"escalation_rate": {"warning": 0.08, "critical": 0.15},
"out_of_scope_rate": {"warning": 0.10, "critical": 0.20},
"citation_coverage": {"warning": 0.85, "critical": 0.75},
"index_staleness_days": {"warning": 30, "critical": 45}
}
    def evaluate_weekly_metrics(self, metrics: dict) -> list[SR117MonitoringAlert]:
        # Rates and staleness breach when they RISE; groundedness and
        # citation coverage breach when they FALL
        higher_is_worse = {
            "compliance_failure_rate", "escalation_rate",
            "out_of_scope_rate", "index_staleness_days"
        }
        alerts = []
        for metric, value in metrics.items():
            thresholds = self.THRESHOLDS.get(metric)
            if not thresholds:
                continue
            breached = (
                (lambda limit: value >= limit) if metric in higher_is_worse
                else (lambda limit: value <= limit)
            )
            if breached(thresholds["critical"]):
                is_staleness = metric == "index_staleness_days"
                alerts.append(SR117MonitoringAlert(
                    alert_type="KNOWLEDGE_STALENESS" if is_staleness else "QUALITY_DEGRADATION",
                    severity="CRITICAL",
                    metric=metric,
                    current_value=value,
                    threshold=thresholds["critical"],
                    sr117_requirement=(
                        "SR 11-7: Model inputs must remain current and appropriate"
                        if is_staleness else
                        "SR 11-7: Ongoing monitoring must detect performance deterioration"
                    ),
                    action_required=(
                        "Immediate RAG index refresh required. Escalate to model owner."
                        if is_staleness else
                        "Suspend model pending investigation. Notify Model Risk Committee."
                    ),
                    timestamp=datetime.utcnow().isoformat()
                ))
            elif breached(thresholds["warning"]):
                alerts.append(SR117MonitoringAlert(
                    alert_type="QUALITY_WARNING",
                    severity="WARNING",
                    metric=metric,
                    current_value=value,
                    threshold=thresholds["warning"],
                    sr117_requirement="SR 11-7: Early warning indicators must trigger investigation",
                    action_required="Schedule prompt review within 5 business days.",
                    timestamp=datetime.utcnow().isoformat()
                ))
        return alerts
def check_revalidation_trigger(self, events: list[dict]) -> list[str]:
"""SR 11-7: revalidation required on material change."""
triggers = []
for event in events:
if event["type"] == "foundation_model_update":
triggers.append(
f"Foundation model updated to {event['new_version']} — "
f"revalidation required within 30 days per SR 11-7"
)
if event["type"] == "use_case_expansion":
triggers.append(
f"New use case added: {event['use_case']} — "
f"full validation required before go-live"
)
if event["type"] == "volume_threshold_breach":
triggers.append(
f"Decision volume exceeded validated threshold ({event['volume']}) — "
f"model tier review required"
)
return triggers
SR 11-7 Risk Tiering for LLMs
Not all LLMs carry the same risk. SR 11-7 requires risk tiering — higher-risk models get more rigorous validation requirements and more frequent monitoring.
| Tier | Description | LLM Example | Validation Cadence | Monitoring |
|---|---|---|---|---|
| Tier 1 — High | Directly influences credit decisions, fraud scoring, or regulatory reporting | Automated underwriting assistant | Annual full + quarterly review | Real-time + weekly report |
| Tier 2 — Medium | Informs decisions but does not make them | SO mortgage assistant — answers guideline questions | Annual full validation | Weekly quality metrics |
| Tier 3 — Low | Administrative, informational, no decision influence | Internal HR policy chatbot | Annual simplified validation | Monthly spot check |
The MortgageIQ SO agent is Tier 2. It informs loan officers but does not approve or deny loans. The compliance agent loop in the LangGraph architecture enforces this boundary — any response that crosses into credit decision territory is intercepted and escalated.
HIPAA Compliance — PHI Handling for Healthcare AI
What HIPAA Requires for AI Systems
HIPAA's Privacy and Security Rules apply to any system that creates, receives, maintains, or transmits Protected Health Information (PHI). For AI systems, the critical requirements are de-identification before data leaves the trust boundary, access bounded by the minimum necessary standard, and a tamper-evident audit trail. Each is covered below.
PHI De-identification Before LLM Calls
The most critical architectural decision for healthcare AI: PHI must be de-identified or pseudonymized before it enters any LLM prompt. Azure OpenAI is covered under Microsoft's HIPAA Business Associate Agreement (BAA) — but defense-in-depth means you do not rely on the BAA alone. PHI is stripped before the prompt is assembled.
# compliance/phi_sanitizer.py
import re
import hashlib
from azure.ai.textanalytics import TextAnalyticsClient, HealthcareEntityCategory
from azure.core.credentials import AzureKeyCredential
class PHISanitizer:
"""
De-identifies PHI from text before LLM prompt assembly.
Replaces identifiers with pseudonyms — preserves clinical context
while removing patient identity.
HIPAA Safe Harbor method: removes all 18 identifiers.
"""
# Rule-based patterns for common PHI
PATTERNS = {
"SSN": (r'\b\d{3}-\d{2}-\d{4}\b', "[SSN-REDACTED]"),
"MRN": (r'\bMRN[:\s#]*\d{6,10}\b', "[MRN-REDACTED]"),
"DOB": (r'\b(0[1-9]|1[0-2])/(0[1-9]|[12]\d|3[01])/\d{4}\b', "[DOB-REDACTED]"),
"PHONE": (r'\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b', "[PHONE-REDACTED]"),
        "EMAIL": (r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', "[EMAIL-REDACTED]"),
        "IP": (r'\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b', "[IP-REDACTED]"),
        "ZIP_FULL": (r'\b\d{5}-\d{4}\b', "[ZIP-REDACTED]"),  # Safe Harbor permits at most the first 3 ZIP digits
}
def __init__(self, text_analytics_endpoint: str, text_analytics_key: str):
self.ta_client = TextAnalyticsClient(
endpoint=text_analytics_endpoint,
credential=AzureKeyCredential(text_analytics_key)
)
self._pseudonym_map = {} # session-scoped: consistent replacement within a session
def sanitize(self, text: str, session_id: str) -> tuple[str, dict]:
"""
Returns (sanitized_text, phi_audit_record).
phi_audit_record is written to audit log — never the actual PHI.
"""
audit_record = {
"session_id": session_id,
"phi_detected": [],
"sanitization_method": "azure_text_analytics + rule_based"
}
        # Step 1: Azure Language service — NLP-based PHI detection
sanitized = self._azure_health_sanitize(text, audit_record)
# Step 2: Rule-based patterns for structured PHI (SSN, MRN, dates)
for phi_type, (pattern, replacement) in self.PATTERNS.items():
matches = re.findall(pattern, sanitized)
if matches:
audit_record["phi_detected"].append({
"type": phi_type,
"count": len(matches),
# NEVER log actual PHI — log a hash for correlation only
"hash": hashlib.sha256(str(matches).encode()).hexdigest()[:16]
})
sanitized = re.sub(pattern, replacement, sanitized)
return sanitized, audit_record
    def _azure_health_sanitize(self, text: str, audit_record: dict) -> str:
        """Detect PHI identifiers with the Azure Language PII API (PHI domain).

        Note: names, phone numbers, emails, and addresses are PII-detection
        categories; the healthcare-entities API extracts clinical concepts,
        not patient identifiers.
        """
        result = self.ta_client.recognize_pii_entities([text], domain_filter="phi")
        sanitized = text
        for doc in result:
            if doc.is_error:
                continue
            for entity in doc.entities:
                category = str(entity.category)
                audit_record["phi_detected"].append({
                    "type": category,
                    "confidence": entity.confidence_score,
                    "hash": hashlib.sha256(entity.text.encode()).hexdigest()[:16]
                })
                # Replace with consistent pseudonym within session
                pseudonym = self._get_pseudonym(entity.text, category)
                sanitized = sanitized.replace(entity.text, pseudonym)
        return sanitized
def _get_pseudonym(self, phi_value: str, category: str) -> str:
"""Generate consistent pseudonym — same PHI value → same pseudonym within session."""
key = hashlib.sha256(f"{phi_value}:{category}".encode()).hexdigest()[:8]
pseudonym = f"[{category.upper()}-{key}]"
self._pseudonym_map[pseudonym] = phi_value # for authorized re-identification only
return pseudonym
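The rule-based layer can be exercised without any Azure dependency. A minimal sketch using two of the patterns above:

```python
import re

# Same structure as PHISanitizer.PATTERNS, trimmed to two rules
PATTERNS = {
    "SSN": (r'\b\d{3}-\d{2}-\d{4}\b', "[SSN-REDACTED]"),
    "EMAIL": (r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', "[EMAIL-REDACTED]"),
}

def rule_based_redact(text: str) -> str:
    """Apply each redaction pattern in turn; order matters for overlapping rules."""
    for pattern, replacement in PATTERNS.values():
        text = re.sub(pattern, replacement, text)
    return text

sample = "Patient SSN 123-45-6789, contact jane.doe@example.com"
print(rule_based_redact(sample))
# Patient SSN [SSN-REDACTED], contact [EMAIL-REDACTED]
```

Regex rules catch structured identifiers the NLP pass may miss; neither layer alone is sufficient, which is why the sanitizer runs both.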
HIPAA Audit Trail Architecture
Every PHI access — including AI system access — must be logged for 6 years, with who accessed what, when, and from where.
# compliance/hipaa_audit_logger.py
import hashlib
import hmac
import json
from datetime import datetime
from azure.cosmos import CosmosClient
from azure.identity import DefaultAzureCredential
class HIPAAAuditLogger:
"""
Immutable audit log for PHI access events.
Tamper-evident: each record includes HMAC of previous record.
Satisfies HIPAA §164.312(b) — Audit Controls.
Retention: 6 years per HIPAA §164.530(j).
"""
def __init__(self, cosmos_url: str, signing_key: str):
self.container = (
CosmosClient(url=cosmos_url, credential=DefaultAzureCredential())
.get_database_client("hipaa-audit")
.get_container_client("phi-access-log")
)
self.signing_key = signing_key.encode()
self._last_hash = self._get_last_record_hash()
def log_phi_access(
self,
user_id: str,
user_role: str,
patient_id_hash: str, # NEVER log actual patient ID
access_purpose: str, # Treatment | Payment | Operations | Other
ai_system: str, # "SO-HealthAssistant-v1.2"
phi_types_accessed: list[str],
sanitization_applied: bool,
request_id: str,
ip_address_hash: str
) -> str:
record = {
"id": request_id,
"partitionKey": patient_id_hash,
"timestamp": datetime.utcnow().isoformat(),
"event_type": "PHI_ACCESS",
"user_id": user_id,
"user_role": user_role,
"patient_id_hash": patient_id_hash,
"access_purpose": access_purpose,
"ai_system": ai_system,
"phi_types_accessed": phi_types_accessed,
"sanitization_applied": sanitization_applied,
"ip_address_hash": ip_address_hash,
"ttl": -1 # never expire — 6-year retention enforced by policy
}
# Tamper-evident chain: HMAC of this record + previous record hash
record_bytes = json.dumps(record, sort_keys=True).encode()
chain_input = record_bytes + self._last_hash.encode()
record["chain_hash"] = hmac.new(
self.signing_key, chain_input, hashlib.sha256
).hexdigest()
self.container.upsert_item(record)
self._last_hash = record["chain_hash"]
return record["chain_hash"]
def verify_audit_chain(self, start_date: str, end_date: str) -> bool:
"""Verify tamper-evidence of audit chain — called during compliance audit."""
        records = list(self.container.query_items(
            "SELECT * FROM c WHERE c.timestamp >= @start AND c.timestamp <= @end ORDER BY c.timestamp",
            parameters=[
                {"name": "@start", "value": start_date},
                {"name": "@end", "value": end_date}
            ],
            enable_cross_partition_query=True  # audit spans all patient partitions
        ))
        prev_hash = "genesis"
        for record in records:
            # Drop Cosmos system properties (_rid, _etag, _ts, ...) so the
            # re-computed HMAC covers exactly what was originally signed
            record = {k: v for k, v in record.items() if not k.startswith("_")}
            stored_hash = record.pop("chain_hash")
record_bytes = json.dumps(record, sort_keys=True).encode()
chain_input = record_bytes + prev_hash.encode()
expected_hash = hmac.new(
self.signing_key, chain_input, hashlib.sha256
).hexdigest()
if stored_hash != expected_hash:
return False # tamper detected
prev_hash = stored_hash
return True
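The tamper-evidence property is easy to demonstrate in isolation. A self-contained sketch of the same HMAC chaining (the key is hard-coded only for the demo; in practice it comes from Key Vault):

```python
import hashlib, hmac, json

KEY = b"demo-signing-key"  # demo only — production key lives in Key Vault

def sign(record: dict, prev_hash: str) -> str:
    """HMAC over the canonical record plus the previous record's hash."""
    body = json.dumps(record, sort_keys=True).encode()
    return hmac.new(KEY, body + prev_hash.encode(), hashlib.sha256).hexdigest()

def verify_chain(chain: list[dict]) -> bool:
    prev = "genesis"
    for rec in chain:
        rec = dict(rec)                      # don't mutate the caller's record
        stored = rec.pop("chain_hash")
        if stored != sign(rec, prev):
            return False                     # tamper detected
        prev = stored
    return True

# Build a two-record chain, then tamper with the first record
chain, prev = [], "genesis"
for i in range(2):
    rec = {"id": i, "event_type": "PHI_ACCESS"}
    rec["chain_hash"] = sign(rec, prev)
    prev = rec["chain_hash"]
    chain.append(rec)

print(verify_chain(chain))                   # True
chain[0]["event_type"] = "EDITED"
print(verify_chain(chain))                   # False
```

Because each hash covers the previous hash, editing any record invalidates every record after it: an attacker cannot silently rewrite history without the signing key.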
Zero-Trust Architecture for Healthcare AI
PHI must be protected at every layer — not just at the database. Zero-trust means every component authenticates every request, regardless of network location.
The minimum necessary standard: a nurse asking about medication dosing should not trigger a full patient record retrieval. The retrieval scope is bounded by the user's role — encoded in the Entra ID JWT claim, enforced at the application layer, logged in the audit trail.
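The role-bounded retrieval scope can be expressed as a small policy map keyed off the JWT role claim. A sketch; the role names, claim shape, and source labels here are illustrative assumptions, not a fixed Entra ID schema:

```python
# Illustrative role-to-scope policy (assumed names, not a real schema)
RETRIEVAL_SCOPE = {
    "nurse": {"medication_guides", "dosing_tables"},
    "physician": {"medication_guides", "dosing_tables", "patient_chart"},
    "billing_specialist": {"claims_history"},
}

def allowed_sources(jwt_claims: dict) -> set[str]:
    """Bound retrieval to the caller's role — the minimum necessary standard.
    Unknown roles get an empty scope (deny by default)."""
    return RETRIEVAL_SCOPE.get(jwt_claims.get("role", ""), set())

print(allowed_sources({"role": "nurse"}))  # dosing questions never pull the full chart
```

The retriever then filters its index query to `allowed_sources(claims)`, and the audit logger records which scope was applied to each request.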
EU AI Act — High-Risk Classification and Conformity
When Does the EU AI Act Apply?
The EU AI Act applies to any AI system placed on the EU market or used in the EU — regardless of where the system is built. For U.S. enterprises serving EU customers, this is not optional.
The Act classifies AI systems into four risk tiers:
| Risk Level | Examples | Requirements |
|---|---|---|
| Unacceptable | Social scoring, biometric mass surveillance | Prohibited — cannot deploy |
| High-Risk | Credit scoring, medical diagnosis, employment screening, critical infrastructure | Full conformity assessment, logging, human oversight, registration in EU database |
| Limited Risk | Chatbots, deepfake detection | Transparency obligations — disclose it's AI |
| Minimal Risk | Spam filters, AI in video games | No mandatory requirements |
High-risk classification triggers for enterprise AI:
- AI used in credit, insurance underwriting, or loan origination → High-risk
- AI used in clinical decision support or medical device → High-risk
- AI used in employment screening or performance evaluation → High-risk
- AI that manages critical infrastructure → High-risk
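A first-pass triage of these tiers can be encoded directly. The category labels below are illustrative shorthand, not the Act's legal definitions; real classification always needs legal review:

```python
# Illustrative shorthand for the triage rules above — not the Act's legal text
PROHIBITED = {"social_scoring", "biometric_mass_surveillance"}
HIGH_RISK = {
    "credit_scoring", "insurance_underwriting", "loan_origination",
    "clinical_decision_support", "employment_screening", "critical_infrastructure",
}

def eu_risk_tier(use_case: str, user_facing_chatbot: bool = False) -> str:
    if use_case in PROHIBITED:
        return "unacceptable"   # cannot deploy
    if use_case in HIGH_RISK:
        return "high"           # conformity assessment, logging, human oversight
    if user_facing_chatbot:
        return "limited"        # transparency obligations — disclose it's AI
    return "minimal"

print(eu_risk_tier("credit_scoring"))  # high
```

Running a helper like this at design-review time forces the classification question to be answered before architecture decisions lock in.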
Conformity Assessment Requirements for High-Risk AI
Logging obligations in practice: the EU AI Act requires that high-risk AI systems log events automatically and with sufficient detail to reconstruct what happened post-hoc. For an LLM:
# compliance/eu_ai_act_logger.py
from dataclasses import dataclass, asdict
import json, hashlib
from datetime import datetime
@dataclass
class EUAIActLogEvent:
"""
Logging schema satisfying EU AI Act Article 12 — Logging.
High-risk AI systems must log events enabling post-hoc audit.
"""
# System identification
system_id: str # registered in EU AI database
system_version: str
deployment_environment: str # production | staging
# Request context
event_id: str
timestamp: str
user_role: str # role, not identity — privacy preservation
session_id_hash: str # hash, not raw session ID
# Input (no PII)
input_category: str # query classification — not raw text
input_hash: str # SHA-256 of input for tamper evidence
input_language: str
# Processing
model_used: str
prompt_version: str
retrieval_sources: list[str] # source document IDs, not content
human_override_available: bool # EU AI Act: must be possible
# Output
output_hash: str # SHA-256 of output
confidence_indicators: dict # groundedness, citation coverage
compliance_checks_passed: bool
human_review_triggered: bool
# Audit
log_hash: str # tamper-evident chain
def create_eu_log_event(request: dict, response: dict, system_meta: dict) -> EUAIActLogEvent:
input_text = request.get("query", "")
output_text = response.get("answer", "")
return EUAIActLogEvent(
system_id=system_meta["eu_ai_database_id"],
system_version=system_meta["version"],
deployment_environment=system_meta["env"],
event_id=request["request_id"],
timestamp=datetime.utcnow().isoformat(),
user_role=request["user_role"],
session_id_hash=hashlib.sha256(request["session_id"].encode()).hexdigest(),
input_category=request.get("intent_category", "unknown"),
input_hash=hashlib.sha256(input_text.encode()).hexdigest(),
input_language=request.get("language", "en"),
model_used=response["model"],
prompt_version=response["prompt_version"],
retrieval_sources=response.get("citation_ids", []),
human_override_available=True, # system design guarantees this
output_hash=hashlib.sha256(output_text.encode()).hexdigest(),
confidence_indicators={
"groundedness": response.get("groundedness_score"),
"citation_count": len(response.get("citations", []))
},
compliance_checks_passed=response.get("compliance_pass", False),
human_review_triggered=response.get("requires_human_review", False),
log_hash="" # computed and chained by logger
)
The Compliance Architecture — Azure + Open Source
Azure Services — Compliance Role
| Azure Service | Compliance Role | Regulation |
|---|---|---|
| Entra ID | Identity, role-based access, conditional access (MFA, device compliance) | All |
| Azure APIM | Request logging, rate limiting, JWT validation, input audit | All |
| Azure Text Analytics for Health | PHI detection and de-identification before LLM | HIPAA |
| Azure Content Safety | Input/output harmful content filtering, prompt injection detection | All |
| Azure Prompt Shields | Prompt injection and jailbreak detection at the gateway | SR 11-7, EU AI Act |
| Azure OpenAI | BAA-eligible, DPA-eligible, private endpoint, managed identity | HIPAA, GDPR |
| Cosmos DB | Immutable audit log with TTL control and RBAC | All |
| Microsoft Purview | Data classification, PHI lineage, compliance reporting | HIPAA, EU AI Act |
| Key Vault | Secrets management, audit of key access | All |
| Azure Monitor | SR 11-7 ongoing monitoring, drift alerts, quality dashboards | SR 11-7 |
Open Source — Compliance Role
| OSS Tool | Compliance Role |
|---|---|
| Langfuse | Prompt version audit trail, per-request trace, quality trending, cost attribution |
| Ragas | Automated groundedness/faithfulness evaluation — SR 11-7 validation evidence |
| DeepEval | CI/CD compliance gate — blocks deployment if safety or quality thresholds fail |
| LlamaGuard | Open-source input/output safety classifier — HIPAA safe output validation |
| NeMo Guardrails | Declarative compliance rules — enforce SR 11-7 scope boundaries in code |
| EvidentlyAI | SR 11-7 ongoing monitoring — data drift, output drift, quality degradation reports |
| OpenTelemetry | Standard tracing — EU AI Act logging with vendor-neutral format |
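The CI/CD gate pattern in the DeepEval row is tool-agnostic. A minimal sketch of a deployment gate; the metric names and thresholds are illustrative, and a real pipeline would source the scores from the evaluation harness:

```python
# Illustrative gate thresholds — a real pipeline reads these from config
GATE_THRESHOLDS = {
    "faithfulness": 0.90,             # minimum acceptable
    "citation_coverage": 0.85,        # minimum acceptable
    "compliance_failure_rate": 0.05,  # maximum acceptable
}

def ci_gate(scores: dict) -> tuple[bool, list[str]]:
    """Return (deploy_ok, failures); any breach or missing score blocks promotion."""
    failures = []
    for metric, threshold in GATE_THRESHOLDS.items():
        value = scores.get(metric)
        if value is None:
            failures.append(f"{metric}: no score reported")
        elif metric.endswith("_rate"):
            if value > threshold:
                failures.append(f"{metric}: {value} exceeds max {threshold}")
        elif value < threshold:
            failures.append(f"{metric}: {value} below min {threshold}")
    return (not failures, failures)
```

Wiring this into the pipeline as a required check turns the validation thresholds into an enforced gate rather than a report someone may or may not read.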
MortgageIQ — SR 11-7 Compliance in Practice
The SO mortgage assistant at MortgageIQ operates under a full SR 11-7 compliance framework. Here is how the three pillars are implemented:
Pillar 1 — Documentation:
- Model documentation stored in SharePoint with version control
- Every prompt version in Cosmos DB includes: owner, approver, compliance reviewer, changelog
- System prompt explicitly states decision boundaries ("does not approve or deny loans")
- Risk tier: Tier 2 — documented in Model Risk inventory
Pillar 2 — Validation:
- Independent validation by Model Risk team (separate from AI platform team)
- 200-query evaluation dataset run against every prompt version before promotion
- Challenger benchmark: keyword search baseline — SO must outperform on all metrics
- Adversarial challenge: 50 boundary-test queries (out-of-scope, credit decisions, rate quotes) — must route correctly 100%
Pillar 3 — Ongoing Monitoring:
- Weekly SR 11-7 monitoring report: groundedness, compliance failure rate, escalation rate, out-of-scope rate
- Azure Monitor alert: groundedness below 0.85 → notify model owner within 1 hour
- Quarterly validation review with Model Risk Committee
- Revalidation trigger: any GPT-4o model version update by Microsoft
Compliance agent boundary enforcement:
# The compliance agent prompt — the SR 11-7 boundary enforcer
COMPLIANCE_PROMPT = """
You are a compliance reviewer for a mortgage AI system operating under SR 11-7 Model Risk Management guidelines.
Review the following AI response and check for these compliance violations:
SR 11-7 VIOLATIONS (return FAIL if any present):
1. The response makes a final credit approval or denial decision
2. The response provides a specific rate quote or lock commitment
3. The response states that a borrower "qualifies" or "does not qualify" as a final determination
4. The response contains advice that requires a licensed mortgage originator
5. The response references data that is not cited with a source document and version
REQUIRED ELEMENTS (return FAIL if any missing for factual claims):
1. Source citation for guideline references (document name + section)
2. Data date for any numerical figures
3. Disclaimer when response approaches a credit judgment
If FAIL: provide a revised response that removes the violation and adds required elements.
If PASS: confirm with "COMPLIANCE: PASS".
AI Response to review:
{draft_answer}
"""
This prompt is version-controlled, compliance-reviewed, and logged with every execution — providing the audit trail that SR 11-7 requires.
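The two-attempt escalation path from the model documentation can be sketched as a small loop around the reviewer. `review_fn` stands in for the LLM call to the compliance prompt and is an assumption of this sketch:

```python
def compliance_loop(draft: str, review_fn, max_attempts: int = 2):
    """Review a draft answer: deliver on PASS, otherwise retry with the
    reviewer's revision, and escalate after max_attempts failures
    (mirrors the 'failure after 2 attempts' trigger in the model docs)."""
    for _ in range(max_attempts):
        passed, revised = review_fn(draft)
        if passed:
            return revised, "delivered"
        draft = revised                       # reviewer's remediated draft
    return draft, "escalate_to_licensed_advisor"

# Stub reviewer that always rejects — exercises the escalation path
always_fail = lambda d: (False, d + " [revised]")
print(compliance_loop("You qualify for this loan.", always_fail)[1])
# escalate_to_licensed_advisor
```

The bounded retry count matters: an unbounded rewrite loop can launder a violation through successive paraphrases, while a hard cap guarantees a human sees the hard cases.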
Compliance Checklist — Before Production
- Risk tier assigned and recorded in the model risk inventory
- Model documentation covering conceptual soundness, limitations, and out-of-scope queries
- Independent validation report signed by a validator outside the development team
- Ongoing monitoring thresholds defined, with alerting and an escalation path
- PHI de-identification applied before prompt assembly (healthcare systems)
- Tamper-evident audit trail with required retention (6 years for HIPAA)
- EU AI Act logging and human-override capability (systems in EU scope)
- Revalidation triggers defined for foundation model updates and use case changes
Key Takeaways
- SR 11-7 applies to LLMs — any LLM that informs financial decisions in a U.S. regulated institution is a model under SR 11-7, requiring documentation of conceptual soundness, independent validation, and ongoing monitoring with a defined escalation path
- PHI must be de-identified before the prompt, not at the API boundary — HIPAA compliance for AI requires defense-in-depth: Azure Text Analytics for Health removes PHI before it enters any LLM prompt, even when the LLM provider has a BAA
- The EU AI Act is not future-state for most enterprises — credit scoring, clinical decision support, and employment AI are already high-risk by classification, requiring conformity assessment, logging, and human override capability designed in — not bolted on after deployment
- Compliance agent loops are the architectural pattern for regulated AI — a separate compliance agent that reviews and rewrites responses before they reach the user, with a human escalation path when automated remediation fails, is the SR 11-7-compliant answer to the hallucination risk
- Audit trails must be immutable and tamper-evident — HMAC-chained records in Cosmos DB, verified on a schedule, satisfy both HIPAA (6-year retention) and EU AI Act (6-month minimum) requirements with a single architecture
- Ongoing monitoring is the most consistently underfunded pillar — SR 11-7 Pillar 3 requires weekly quality metrics, revalidation triggers on material change, and an escalation path when metrics breach thresholds; this is where most enterprise AI programs have the largest compliance gap
- Open source and Azure are complementary, not competing — Azure provides the managed control plane (identity, private endpoints, Content Safety, Purview) while Langfuse, Ragas, and DeepEval provide the LLM-specific evaluation and tracing that Azure's managed services do not yet cover at the same depth