← All Posts
ai-mlApril 24, 2026ai-compliancesr-11-7hipaaeu-ai-actazurellmenterprise-aifintechhealthcareauditgovernanceresponsible-ai

AI Compliance in Production — Finance, Healthcare, and the Enterprise Architect's Playbook

AI compliance is not a checkbox. In financial services, SR 11-7 requires model validation, risk tiering, and ongoing monitoring for every LLM in production. In healthcare, HIPAA mandates PHI controls and audit trails. The EU AI Act adds conformity assessments for high-risk AI. This post covers the full compliance stack — regulation, architecture, code, and real examples.

Your LLM is in production. Can you answer these questions?

  • What is the model risk tier assigned to this system?
  • Where is the validation report documenting conceptual soundness?
  • What does your ongoing monitoring cadence look like — and who reviews it?
  • If a regulator audits a specific response from six months ago, can you reproduce the full reasoning chain?
  • How do you ensure PHI never appears in a prompt sent to an external model API?

Most enterprise AI teams cannot answer these questions. Not because they don't care about compliance — because compliance is almost always added after the system is built, at the worst possible time.

This post covers AI compliance end to end: what the regulations actually require, how they apply specifically to LLMs, and how to architect compliance into the platform from day one — on both Azure and open-source stacks, with real examples from financial services and healthcare.


What AI Compliance Actually Means

Compliance is not governance. Governance is the internal framework of policies and accountability. Compliance is the demonstrable, auditable proof that your AI system meets specific regulatory requirements — requirements that carry legal, financial, and reputational consequences if violated.

For enterprise AI in 2026, the three regulatory frameworks that matter most:

These frameworks are not optional for regulated enterprises. SR 11-7 violations can result in supervisory action from the Federal Reserve. HIPAA violations carry civil and criminal penalties up to $1.9M per violation category per year. EU AI Act non-compliance for high-risk AI can result in fines up to €30M or 6% of global annual turnover.


SR 11-7 — Model Risk Management for Financial Services LLMs

What SR 11-7 Is

SR 11-7 is the Federal Reserve and OCC's supervisory guidance on model risk management, issued in 2011 and still the primary framework governing AI/ML models in U.S. financial services. It defines a model as any quantitative method, system, or approach that applies statistical, economic, financial, or mathematical theories to transform input data into output that informs decisions.

LLMs are models under SR 11-7. A mortgage loan assistant that influences underwriting decisions, a fraud detection agent that scores transactions, or a customer service AI that affects loan officer behavior — all fall under SR 11-7's scope.

The Three SR 11-7 Requirements Applied to LLMs

Pillar 1 — Model Development and Documentation

For a traditional ML model, this means documenting training data, feature selection, algorithm choice, and hyperparameters. For an LLM, the documentation requirements translate differently:

Traditional MLLLM Equivalent
Training data descriptionFoundation model card (GPT-4o, Claude) + RAG knowledge base description
Feature engineeringPrompt design, few-shot examples, retrieval configuration
Algorithm selection rationaleModel selection rationale (capability vs cost vs risk)
Assumptions and limitationsContext window constraints, hallucination risks, knowledge cutoff
Intended use scopeUse case definition, out-of-scope queries, escalation triggers
Performance metricsGroundedness, faithfulness, citation coverage, compliance failure rate

Conceptual soundness for LLMs means documenting why the design decisions produce reliable outputs for the stated use case. For the MortgageIQ SO agent:

# Model Documentation — SO Mortgage Assistant
## System Classification
Type: Retrieval-Augmented Generation (RAG) system
Foundation model: Azure OpenAI GPT-4o (gpt-4o-2024-11-20)
Risk tier: Tier 2 — Informational AI (does not make credit decisions)

## Intended Use
Provides loan officers with grounded answers about FHA/VA/Conventional
guidelines, loan status, and document requirements. Does NOT approve,
deny, or score loan applications.

## Conceptual Soundness Rationale
RAG architecture selected over fine-tuning because:
1. Guideline documents update quarterly — RAG allows index refresh
   without model retraining, reducing knowledge staleness risk
2. Citation-required output format provides traceability to source
   documents, satisfying audit requirements
3. Retrieval confidence thresholds (score > 0.78) prevent low-confidence
   context from influencing responses

## Known Limitations
- Knowledge cutoff: RAG index reflects guidelines as of last refresh date
- Non-English queries: supported via Azure AI Search language analyzers,
  but evaluation dataset is English-only — expanded eval required
- Maximum context window: 128K tokens — queries requiring full loan file
  review may require chunking

## Out-of-Scope Queries
- Final credit decisions (routed to human underwriter)
- Rate quotes (routed to pricing engine)
- Legal advice (routed to legal department)

## Escalation Triggers
- Groundedness score < 0.80 → human review flag
- Compliance agent failure after 2 attempts → licensed advisor escalation
- Query classified as credit decision → immediate human routing

Pillar 2 — Model Validation

SR 11-7 requires independent validation — the validation team must be separate from the development team. For LLMs, validation includes:

Benchmarking against a challenger: run the LLM against a simpler baseline (keyword search, rule-based responses) on the same query set. Document where the LLM adds value and where it underperforms.

Outcome analysis: for deployed systems, compare LLM-assisted decisions against outcomes. Did loan officers who used SO make better decisions? Did the agent's citations accurately reflect guideline content?

Challenge process: an independent reviewer attempts to identify failure modes — edge cases, adversarial inputs, boundary conditions — that the development team did not anticipate.

# validation/sr117_validation_suite.py
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision
from datasets import Dataset
import json
from datetime import datetime

class SR117ValidationSuite:
    """
    Structured validation suite meeting SR 11-7 requirements.
    Produces a signed validation report with findings and recommendations.
    """
    def __init__(self, model_name: str, model_version: str, validator_id: str):
        self.model_name = model_name
        self.model_version = model_version
        self.validator_id = validator_id   # independent validator, not dev team
        self.findings = []
        self.validation_date = datetime.utcnow().isoformat()

    def run_benchmark_evaluation(
        self,
        test_dataset: Dataset,
        baseline_fn,       # challenger: simpler keyword search baseline
        model_fn           # the LLM under validation
    ) -> dict:
        """Compare LLM against baseline challenger on held-out test set."""
        # Run both
        model_results = evaluate(test_dataset, metrics=[
            faithfulness, answer_relevancy, context_precision
        ])
        baseline_results = self._run_baseline(test_dataset, baseline_fn)

        delta = {
            metric: model_results[metric] - baseline_results.get(metric, 0)
            for metric in ["faithfulness", "answer_relevancy", "context_precision"]
        }

        finding = {
            "type": "benchmark_comparison",
            "model_scores": dict(model_results),
            "baseline_scores": baseline_results,
            "delta": delta,
            "assessment": "PASS" if all(v >= 0 for v in delta.values()) else "CONCERN",
            "notes": "LLM must outperform baseline on all metrics to justify adoption"
        }
        self.findings.append(finding)
        return finding

    def run_adversarial_challenge(self, challenge_queries: list[dict]) -> dict:
        """
        Independent challenge process — test edge cases and failure modes.
        challenge_queries: [{query, expected_behavior, risk_category}]
        """
        results = []
        for cq in challenge_queries:
            response = self._invoke_model(cq["query"])
            passed = self._evaluate_challenge(response, cq["expected_behavior"])
            results.append({
                "query": cq["query"],
                "risk_category": cq["risk_category"],
                "expected": cq["expected_behavior"],
                "actual": response,
                "passed": passed
            })

        failure_rate = sum(1 for r in results if not r["passed"]) / len(results)
        finding = {
            "type": "adversarial_challenge",
            "total_queries": len(results),
            "failure_rate": failure_rate,
            "failures": [r for r in results if not r["passed"]],
            "assessment": "PASS" if failure_rate < 0.05 else "FAIL",
            "sr117_requirement": "Challenge process must identify material weaknesses"
        }
        self.findings.append(finding)
        return finding

    def run_outcome_analysis(self, historical_runs: list[dict]) -> dict:
        """
        Analyze production outcomes — were LLM-assisted decisions better?
        historical_runs: [{query, response, outcome, ground_truth}]
        """
        accuracy = sum(
            1 for r in historical_runs
            if r["outcome"] == r["ground_truth"]
        ) / len(historical_runs)

        citation_accuracy = sum(
            1 for r in historical_runs
            if self._citations_verified(r["response"])
        ) / len(historical_runs)

        finding = {
            "type": "outcome_analysis",
            "sample_size": len(historical_runs),
            "outcome_accuracy": accuracy,
            "citation_accuracy": citation_accuracy,
            "assessment": "PASS" if accuracy >= 0.90 and citation_accuracy >= 0.85 else "CONCERN",
            "sr117_requirement": "Outcome analysis required for ongoing monitoring"
        }
        self.findings.append(finding)
        return finding

    def generate_validation_report(self) -> dict:
        """
        Produce signed validation report — required artifact under SR 11-7.
        Must be retained per model risk record-keeping requirements.
        """
        overall = "PASS" if all(
            f["assessment"] in ("PASS",) for f in self.findings
        ) else "CONDITIONAL" if all(
            f["assessment"] != "FAIL" for f in self.findings
        ) else "FAIL"

        report = {
            "report_type": "SR11-7 Model Validation Report",
            "model_name": self.model_name,
            "model_version": self.model_version,
            "validation_date": self.validation_date,
            "validator_id": self.validator_id,   # independent of dev team
            "overall_assessment": overall,
            "findings": self.findings,
            "recommendations": self._generate_recommendations(),
            "next_validation_date": self._compute_next_validation(),
            "signature_required": True           # wet or digital signature required
        }

        # Write to immutable audit store
        self._write_to_audit_store(report)
        return report

    def _compute_next_validation(self) -> str:
        """SR 11-7: revalidation on material change or minimum annually."""
        from dateutil.relativedelta import relativedelta
        next_date = datetime.utcnow() + relativedelta(months=12)
        return next_date.isoformat()

Pillar 3 — Ongoing Monitoring

SR 11-7 requires continuous performance monitoring after deployment. For LLMs, this translates to:

  • Quality drift detection: groundedness score trending down over weeks
  • Usage pattern monitoring: query types shifting outside validated scope
  • Model version changes: when the foundation model provider updates the model, revalidation is triggered
  • Knowledge base staleness: RAG index age exceeds refresh SLA
# monitoring/sr117_monitor.py
from dataclasses import dataclass
from datetime import datetime, timedelta
import statistics

@dataclass
class SR117MonitoringAlert:
    alert_type: str
    severity: str          # CRITICAL | WARNING | INFO
    metric: str
    current_value: float
    threshold: float
    sr117_requirement: str
    action_required: str
    timestamp: str

class SR117OngoingMonitor:
    """
    Ongoing monitoring implementation satisfying SR 11-7 Pillar 3.
    Thresholds are documented in the model's validation report.
    """
    THRESHOLDS = {
        "groundedness": {"warning": 0.85, "critical": 0.80},
        "compliance_failure_rate": {"warning": 0.05, "critical": 0.10},
        "escalation_rate": {"warning": 0.08, "critical": 0.15},
        "out_of_scope_rate": {"warning": 0.10, "critical": 0.20},
        "citation_coverage": {"warning": 0.85, "critical": 0.75},
        "index_staleness_days": {"warning": 30, "critical": 45}
    }

    def evaluate_weekly_metrics(self, metrics: dict) -> list[SR117MonitoringAlert]:
        alerts = []

        for metric, value in metrics.items():
            thresholds = self.THRESHOLDS.get(metric)
            if not thresholds:
                continue

            if metric == "index_staleness_days":
                # Higher = worse for staleness
                if value >= thresholds["critical"]:
                    alerts.append(SR117MonitoringAlert(
                        alert_type="KNOWLEDGE_STALENESS",
                        severity="CRITICAL",
                        metric=metric,
                        current_value=value,
                        threshold=thresholds["critical"],
                        sr117_requirement="SR 11-7: Model inputs must remain current and appropriate",
                        action_required="Immediate RAG index refresh required. Escalate to model owner.",
                        timestamp=datetime.utcnow().isoformat()
                    ))
            else:
                # Lower = worse for quality metrics
                if value <= thresholds["critical"]:
                    alerts.append(SR117MonitoringAlert(
                        alert_type="QUALITY_DEGRADATION",
                        severity="CRITICAL",
                        metric=metric,
                        current_value=value,
                        threshold=thresholds["critical"],
                        sr117_requirement="SR 11-7: Ongoing monitoring must detect performance deterioration",
                        action_required="Suspend model pending investigation. Notify Model Risk Committee.",
                        timestamp=datetime.utcnow().isoformat()
                    ))
                elif value <= thresholds["warning"]:
                    alerts.append(SR117MonitoringAlert(
                        alert_type="QUALITY_WARNING",
                        severity="WARNING",
                        metric=metric,
                        current_value=value,
                        threshold=thresholds["warning"],
                        sr117_requirement="SR 11-7: Early warning indicators must trigger investigation",
                        action_required="Schedule prompt review within 5 business days.",
                        timestamp=datetime.utcnow().isoformat()
                    ))

        return alerts

    def check_revalidation_trigger(self, events: list[dict]) -> list[str]:
        """SR 11-7: revalidation required on material change."""
        triggers = []

        for event in events:
            if event["type"] == "foundation_model_update":
                triggers.append(
                    f"Foundation model updated to {event['new_version']} — "
                    f"revalidation required within 30 days per SR 11-7"
                )
            if event["type"] == "use_case_expansion":
                triggers.append(
                    f"New use case added: {event['use_case']} — "
                    f"full validation required before go-live"
                )
            if event["type"] == "volume_threshold_breach":
                triggers.append(
                    f"Decision volume exceeded validated threshold ({event['volume']}) — "
                    f"model tier review required"
                )

        return triggers

SR 11-7 Risk Tiering for LLMs

Not all LLMs carry the same risk. SR 11-7 requires risk tiering — higher-risk models get more rigorous validation requirements and more frequent monitoring.

TierDescriptionLLM ExampleValidation CadenceMonitoring
Tier 1 — HighDirectly influences credit decisions, fraud scoring, or regulatory reportingAutomated underwriting assistantAnnual full + quarterly reviewReal-time + weekly report
Tier 2 — MediumInforms decisions but does not make themSO mortgage assistant — answers guideline questionsAnnual full validationWeekly quality metrics
Tier 3 — LowAdministrative, informational, no decision influenceInternal HR policy chatbotAnnual simplified validationMonthly spot check

MortgageIQ SO agent is Tier 2. It informs loan officers but does not approve or deny loans. The compliance agent loop in the LangGraph architecture enforces this boundary — any response that crosses into credit decision territory is intercepted and escalated.


HIPAA Compliance — PHI Handling for Healthcare AI

What HIPAA Requires for AI Systems

HIPAA's Privacy and Security Rules apply to any system that creates, receives, maintains, or transmits Protected Health Information (PHI). For AI systems, the critical requirements are:

PHI De-identification Before LLM Calls

The most critical architectural decision for healthcare AI: PHI must be de-identified or pseudonymized before it enters any LLM prompt. Azure OpenAI has a HIPAA-eligible BAA — but defense-in-depth means you do not rely solely on the BAA. PHI is stripped before the prompt is assembled.

# compliance/phi_sanitizer.py
import re
import hashlib
from azure.ai.textanalytics import TextAnalyticsClient, HealthcareEntityCategory
from azure.core.credentials import AzureKeyCredential

class PHISanitizer:
    """
    De-identifies PHI from text before LLM prompt assembly.
    Replaces identifiers with pseudonyms — preserves clinical context
    while removing patient identity.
    HIPAA Safe Harbor method: removes all 18 identifiers.
    """

    # Rule-based patterns for common PHI
    PATTERNS = {
        "SSN": (r'\b\d{3}-\d{2}-\d{4}\b', "[SSN-REDACTED]"),
        "MRN": (r'\bMRN[:\s#]*\d{6,10}\b', "[MRN-REDACTED]"),
        "DOB": (r'\b(0[1-9]|1[0-2])/(0[1-9]|[12]\d|3[01])/\d{4}\b', "[DOB-REDACTED]"),
        "PHONE": (r'\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b', "[PHONE-REDACTED]"),
        "EMAIL": (r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b', "[EMAIL-REDACTED]"),
        "IP": (r'\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b', "[IP-REDACTED]"),
        "ZIP_FULL": (r'\b\d{5}-\d{4}\b', "[ZIP-REDACTED]"),   # full ZIP+4 is PHI
    }

    def __init__(self, text_analytics_endpoint: str, text_analytics_key: str):
        self.ta_client = TextAnalyticsClient(
            endpoint=text_analytics_endpoint,
            credential=AzureKeyCredential(text_analytics_key)
        )
        self._pseudonym_map = {}   # session-scoped: consistent replacement within a session

    def sanitize(self, text: str, session_id: str) -> tuple[str, dict]:
        """
        Returns (sanitized_text, phi_audit_record).
        phi_audit_record is written to audit log — never the actual PHI.
        """
        audit_record = {
            "session_id": session_id,
            "phi_detected": [],
            "sanitization_method": "azure_text_analytics + rule_based"
        }

        # Step 1: Azure Text Analytics for Health — NLP-based PHI detection
        sanitized = self._azure_health_sanitize(text, audit_record)

        # Step 2: Rule-based patterns for structured PHI (SSN, MRN, dates)
        for phi_type, (pattern, replacement) in self.PATTERNS.items():
            matches = re.findall(pattern, sanitized)
            if matches:
                audit_record["phi_detected"].append({
                    "type": phi_type,
                    "count": len(matches),
                    # NEVER log actual PHI — log a hash for correlation only
                    "hash": hashlib.sha256(str(matches).encode()).hexdigest()[:16]
                })
                sanitized = re.sub(pattern, replacement, sanitized)

        return sanitized, audit_record

    def _azure_health_sanitize(self, text: str, audit_record: dict) -> str:
        """Use Azure Text Analytics for Health to detect clinical PHI entities."""
        poller = self.ta_client.begin_analyze_healthcare_entities([text])
        result = poller.result()

        sanitized = text
        phi_categories = {
            HealthcareEntityCategory.PATIENT_ID,
            HealthcareEntityCategory.DATE,
            HealthcareEntityCategory.AGE,
            HealthcareEntityCategory.PHONE_NUMBER,
            HealthcareEntityCategory.EMAIL,
            HealthcareEntityCategory.ADDRESS,
            HealthcareEntityCategory.NAME,
        }

        for doc in result:
            if not doc.is_error:
                for entity in reversed(doc.entities):
                    if entity.category in phi_categories:
                        audit_record["phi_detected"].append({
                            "type": entity.category.value,
                            "confidence": entity.confidence_score,
                            "hash": hashlib.sha256(entity.text.encode()).hexdigest()[:16]
                        })
                        # Replace with consistent pseudonym within session
                        pseudonym = self._get_pseudonym(entity.text, entity.category.value)
                        sanitized = sanitized.replace(entity.text, pseudonym)

        return sanitized

    def _get_pseudonym(self, phi_value: str, category: str) -> str:
        """Generate consistent pseudonym — same PHI value → same pseudonym within session."""
        key = hashlib.sha256(f"{phi_value}:{category}".encode()).hexdigest()[:8]
        pseudonym = f"[{category.upper()}-{key}]"
        self._pseudonym_map[pseudonym] = phi_value   # for authorized re-identification only
        return pseudonym

HIPAA Audit Trail Architecture

Every PHI access — including AI system access — must be logged for 6 years, with who accessed what, when, and from where.

# compliance/hipaa_audit_logger.py
import hashlib
import hmac
import json
from datetime import datetime
from azure.cosmos import CosmosClient
from azure.core.credentials import DefaultAzureCredential

class HIPAAAuditLogger:
    """
    Immutable audit log for PHI access events.
    Tamper-evident: each record includes HMAC of previous record.
    Satisfies HIPAA §164.312(b) — Audit Controls.
    Retention: 6 years per HIPAA §164.530(j).
    """
    def __init__(self, cosmos_url: str, signing_key: str):
        self.container = (
            CosmosClient(url=cosmos_url, credential=DefaultAzureCredential())
            .get_database_client("hipaa-audit")
            .get_container_client("phi-access-log")
        )
        self.signing_key = signing_key.encode()
        self._last_hash = self._get_last_record_hash()

    def log_phi_access(
        self,
        user_id: str,
        user_role: str,
        patient_id_hash: str,     # NEVER log actual patient ID
        access_purpose: str,      # Treatment | Payment | Operations | Other
        ai_system: str,           # "SO-HealthAssistant-v1.2"
        phi_types_accessed: list[str],
        sanitization_applied: bool,
        request_id: str,
        ip_address_hash: str
    ) -> str:
        record = {
            "id": request_id,
            "partitionKey": patient_id_hash,
            "timestamp": datetime.utcnow().isoformat(),
            "event_type": "PHI_ACCESS",
            "user_id": user_id,
            "user_role": user_role,
            "patient_id_hash": patient_id_hash,
            "access_purpose": access_purpose,
            "ai_system": ai_system,
            "phi_types_accessed": phi_types_accessed,
            "sanitization_applied": sanitization_applied,
            "ip_address_hash": ip_address_hash,
            "ttl": -1    # never expire — 6-year retention enforced by policy
        }

        # Tamper-evident chain: HMAC of this record + previous record hash
        record_bytes = json.dumps(record, sort_keys=True).encode()
        chain_input = record_bytes + self._last_hash.encode()
        record["chain_hash"] = hmac.new(
            self.signing_key, chain_input, hashlib.sha256
        ).hexdigest()

        self.container.upsert_item(record)
        self._last_hash = record["chain_hash"]
        return record["chain_hash"]

    def verify_audit_chain(self, start_date: str, end_date: str) -> bool:
        """Verify tamper-evidence of audit chain — called during compliance audit."""
        records = list(self.container.query_items(
            "SELECT * FROM c WHERE c.timestamp >= @start AND c.timestamp <= @end ORDER BY c.timestamp",
            parameters=[
                {"name": "@start", "value": start_date},
                {"name": "@end", "value": end_date}
            ]
        ))

        prev_hash = "genesis"
        for record in records:
            stored_hash = record.pop("chain_hash")
            record_bytes = json.dumps(record, sort_keys=True).encode()
            chain_input = record_bytes + prev_hash.encode()
            expected_hash = hmac.new(
                self.signing_key, chain_input, hashlib.sha256
            ).hexdigest()

            if stored_hash != expected_hash:
                return False   # tamper detected
            prev_hash = stored_hash

        return True

Zero-Trust Architecture for Healthcare AI

PHI must be protected at every layer — not just at the database. Zero-trust means every component authenticates every request, regardless of network location.

The minimum necessary standard: a nurse asking about medication dosing should not trigger a full patient record retrieval. The retrieval scope is bounded by the user's role — encoded in the Entra ID JWT claim, enforced at the application layer, logged in the audit trail.


EU AI Act — High-Risk Classification and Conformity

When Does the EU AI Act Apply?

The EU AI Act applies to any AI system placed on the EU market or used in the EU — regardless of where the system is built. For U.S. enterprises serving EU customers, this is not optional.

The Act classifies AI systems into four risk tiers:

Risk LevelExamplesRequirements
UnacceptableSocial scoring, biometric mass surveillanceProhibited — cannot deploy
High-RiskCredit scoring, medical diagnosis, employment screening, critical infrastructureFull conformity assessment, logging, human oversight, registration in EU database
Limited RiskChatbots, deepfake detectionTransparency obligations — disclose it's AI
Minimal RiskSpam filters, AI in video gamesNo mandatory requirements

High-risk classification triggers for enterprise AI:

  • AI used in credit, insurance underwriting, or loan origination → High-risk
  • AI used in clinical decision support or medical device → High-risk
  • AI used in employment screening or performance evaluation → High-risk
  • AI that manages critical infrastructure → High-risk

Conformity Assessment Requirements for High-Risk AI

Logging obligations in practice: the EU AI Act requires that high-risk AI systems log events automatically and with sufficient detail to reconstruct what happened post-hoc. For an LLM:

# compliance/eu_ai_act_logger.py
from dataclasses import dataclass, asdict
import json, hashlib
from datetime import datetime

@dataclass
class EUAIActLogEvent:
    """
    Logging schema satisfying EU AI Act Article 12 — Logging.
    High-risk AI systems must log events enabling post-hoc audit.
    """
    # System identification
    system_id: str            # registered in EU AI database
    system_version: str
    deployment_environment: str   # production | staging

    # Request context
    event_id: str
    timestamp: str
    user_role: str            # role, not identity — privacy preservation
    session_id_hash: str      # hash, not raw session ID

    # Input (no PII)
    input_category: str       # query classification — not raw text
    input_hash: str           # SHA-256 of input for tamper evidence
    input_language: str

    # Processing
    model_used: str
    prompt_version: str
    retrieval_sources: list[str]    # source document IDs, not content
    human_override_available: bool  # EU AI Act: must be possible

    # Output
    output_hash: str          # SHA-256 of output
    confidence_indicators: dict     # groundedness, citation coverage
    compliance_checks_passed: bool
    human_review_triggered: bool

    # Audit
    log_hash: str             # tamper-evident chain

def create_eu_log_event(request: dict, response: dict, system_meta: dict) -> EUAIActLogEvent:
    input_text = request.get("query", "")
    output_text = response.get("answer", "")

    return EUAIActLogEvent(
        system_id=system_meta["eu_ai_database_id"],
        system_version=system_meta["version"],
        deployment_environment=system_meta["env"],
        event_id=request["request_id"],
        timestamp=datetime.utcnow().isoformat(),
        user_role=request["user_role"],
        session_id_hash=hashlib.sha256(request["session_id"].encode()).hexdigest(),
        input_category=request.get("intent_category", "unknown"),
        input_hash=hashlib.sha256(input_text.encode()).hexdigest(),
        input_language=request.get("language", "en"),
        model_used=response["model"],
        prompt_version=response["prompt_version"],
        retrieval_sources=response.get("citation_ids", []),
        human_override_available=True,    # system design guarantees this
        output_hash=hashlib.sha256(output_text.encode()).hexdigest(),
        confidence_indicators={
            "groundedness": response.get("groundedness_score"),
            "citation_count": len(response.get("citations", []))
        },
        compliance_checks_passed=response.get("compliance_pass", False),
        human_review_triggered=response.get("requires_human_review", False),
        log_hash=""    # computed and chained by logger
    )

The Compliance Architecture — Azure + Open Source

Azure Services — Compliance Role

Azure ServiceCompliance RoleRegulation
Entra IDIdentity, role-based access, conditional access (MFA, device compliance)All
Azure APIMRequest logging, rate limiting, JWT validation, input auditAll
Azure Text Analytics for HealthPHI detection and de-identification before LLMHIPAA
Azure Content SafetyInput/output harmful content filtering, prompt injection detectionAll
Azure Prompt ShieldsPrompt injection and jailbreak detection at the gatewaySR 11-7, EU AI Act
Azure OpenAIBAA-eligible, DPA-eligible, private endpoint, managed identityHIPAA, GDPR
Cosmos DBImmutable audit log with TTL control and RBACAll
Microsoft PurviewData classification, PHI lineage, compliance reportingHIPAA, EU AI Act
Key VaultSecrets management, audit of key accessAll
Azure MonitorSR 11-7 ongoing monitoring, drift alerts, quality dashboardsSR 11-7

Open Source — Compliance Role

OSS ToolCompliance Role
LangfusePrompt version audit trail, per-request trace, quality trending, cost attribution
RagasAutomated groundedness/faithfulness evaluation — SR 11-7 validation evidence
DeepEvalCI/CD compliance gate — blocks deployment if safety or quality thresholds fail
LlamaGuardOpen-source input/output safety classifier — HIPAA safe output validation
NeMo GuardrailsDeclarative compliance rules — enforce SR 11-7 scope boundaries in code
EvidentlyAISR 11-7 ongoing monitoring — data drift, output drift, quality degradation reports
OpenTelemetryStandard tracing — EU AI Act logging with vendor-neutral format

MortgageIQ — SR 11-7 Compliance in Practice

The SO mortgage assistant at MortgageIQ operates under a full SR 11-7 compliance framework. Here is how the three pillars are implemented:

Pillar 1 — Documentation:

  • Model documentation stored in SharePoint with version control
  • Every prompt version in Cosmos DB includes: owner, approver, compliance reviewer, changelog
  • System prompt explicitly states decision boundaries ("does not approve or deny loans")
  • Risk tier: Tier 2 — documented in Model Risk inventory

Pillar 2 — Validation:

  • Independent validation by Model Risk team (separate from AI platform team)
  • 200-query evaluation dataset run against every prompt version before promotion
  • Challenger benchmark: keyword search baseline — SO must outperform on all metrics
  • Adversarial challenge: 50 boundary-test queries (out-of-scope, credit decisions, rate quotes) — must route correctly 100%

Pillar 3 — Ongoing Monitoring:

  • Weekly SR 11-7 monitoring report: groundedness, compliance failure rate, escalation rate, out-of-scope rate
  • Azure Monitor alert: groundedness below 0.85 → notify model owner within 1 hour
  • Quarterly validation review with Model Risk Committee
  • Revalidation trigger: any GPT-4o model version update by Microsoft

Compliance agent boundary enforcement:

# The compliance agent prompt — the SR 11-7 boundary enforcer
COMPLIANCE_PROMPT = """
You are a compliance reviewer for a mortgage AI system operating under SR 11-7 Model Risk Management guidelines.

Review the following AI response and check for these compliance violations:

SR 11-7 VIOLATIONS (return FAIL if any present):
1. The response makes a final credit approval or denial decision
2. The response provides a specific rate quote or lock commitment
3. The response states that a borrower "qualifies" or "does not qualify" as a final determination
4. The response contains advice that requires a licensed mortgage originator
5. The response references data that is not cited with a source document and version

REQUIRED ELEMENTS (return FAIL if any missing for factual claims):
1. Source citation for guideline references (document name + section)
2. Data date for any numerical figures
3. Disclaimer when response approaches a credit judgment

If FAIL: provide a revised response that removes the violation and adds required elements.
If PASS: confirm with "COMPLIANCE: PASS".

AI Response to review:
{draft_answer}
"""

This prompt is version-controlled, compliance-reviewed, and logged with every execution — providing the audit trail that SR 11-7 requires.


Compliance Checklist — Before Production


Key Takeaways

  • SR 11-7 applies to LLMs — any LLM that informs financial decisions in a U.S. regulated institution is a model under SR 11-7, requiring documentation of conceptual soundness, independent validation, and ongoing monitoring with a defined escalation path
  • PHI must be de-identified before the prompt, not at the API boundary — HIPAA compliance for AI requires defense-in-depth: Azure Text Analytics for Health removes PHI before it enters any LLM prompt, even when the LLM provider has a BAA
  • The EU AI Act is not future-state for most enterprises — credit scoring, clinical decision support, and employment AI are already high-risk by classification, requiring conformity assessment, logging, and human override capability designed in — not bolted on after deployment
  • Compliance agent loops are the architectural pattern for regulated AI — a separate compliance agent that reviews and rewrites responses before they reach the user, with a human escalation path when automated remediation fails, is the SR 11-7-compliant answer to the hallucination risk
  • Audit trails must be immutable and tamper-evident — HMAC-chained records in Cosmos DB, verified on a schedule, satisfy both HIPAA (6-year retention) and EU AI Act (6-month minimum) requirements with a single architecture
  • Ongoing monitoring is the most consistently underfunded pillar — SR 11-7 Pillar 3 requires weekly quality metrics, revalidation triggers on material change, and an escalation path when metrics breach thresholds; this is where most enterprise AI programs have the largest compliance gap
  • Open source and Azure are complementary, not competing — Azure provides the managed control plane (identity, private endpoints, Content Safety, Purview) while Langfuse, Ragas, and DeepEval provide the LLM-specific evaluation and tracing that Azure's managed services do not yet cover at the same depth