ai-ml · March 23, 2026 · ai-governance, responsible-ai, azure, fintech, compliance, model-risk, audit-trail

AI Governance for Regulated Industries: How to Build Accountability into the Architecture

AI governance in a regulated industry is not a compliance checkbox. It is an architectural property — either the system produces an auditable evidence chain for every AI decision, or it doesn't.

AI governance is not a policy document.

Every enterprise has an AI policy. Most of them live in Confluence and were last reviewed before the first LLM API call was made. They describe principles — transparency, fairness, accountability — without specifying what the system must do architecturally to satisfy them.

In a regulated domain — mortgage lending, healthcare, insurance — the gap between "we have an AI policy" and "our AI system is auditable" is where the liability lives.

This post is about closing that gap. Governance in a regulated AI system is not a review process bolted on after the architecture is designed. It is a set of architectural properties that either exist in the system or don't — and cannot be added retroactively without significant rework.


The Four Governance Properties

A governed AI system in a regulated domain must satisfy four properties — not as documentation, but as verifiable system behavior:

  1. Traceability: every AI-assisted decision has a reconstructable evidence chain
  2. Reproducibility: the same inputs yield the same behavior, or the change is explainable
  3. Accountability: model changes are evaluated before promotion, and the results are retained
  4. Fairness monitoring: output distributions are watched for patterns of disparate treatment

Most systems satisfy none of them by default. GPT-4o is not governed by calling it — it's governed by what you log, version, evaluate, and monitor around it.


Property 1: Traceability — The Evidence Chain

The requirement: For every AI-assisted decision, you must be able to reconstruct exactly what the model was given and what it produced.

In mortgage lending, the question from a regulator is not "does your AI make good decisions?" It is "show me the exact inputs, context, and output for loan application #84732 on March 23, 2026."

The evidence chain for a single MortgageIQ response:

Request ID:    req_7f3a91b2
Loan ID:       84732
Timestamp:     2026-03-23T14:22:07Z
Model:         gpt-4o-2024-11-20
Deployment:    mortgageiq-prod-v3

User Question: "What documents are required for FHA pre-approval?"

Retrieved Sources:
  - [pre-approval-process.md § Required Documents] score: 0.82
  - [fha-loan-requirements.md § Documentation Requirements] score: 0.71

System Prompt Hash: sha256:a3f1b9c2...  (pinned — maps to prompt version v1.4.2)

Completion Tokens: 312
Prompt Tokens:     1,847
Groundedness Score: 0.88
Escalated to o1:    false

Response:      "For FHA pre-approval, you will need..."
Sources[]:     ["pre-approval-process.md", "fha-loan-requirements.md"]

Every field in this log is a governance artifact. The model version proves which behavior version was active. The system prompt hash proves what constraints were in effect. The groundedness score quantifies how much of the response was supported by the retrieved sources. The sources array is the evidence chain a regulator can follow.

Implementation: A structured logging middleware that intercepts every LLM request/response and writes to an immutable log store (Azure Blob Storage with an append-only immutability policy, or Cosmos DB with TTL disabled and delete permissions locked down at the access-control layer).

public class AiAuditMiddleware
{
    private readonly IAuditLogStore _auditLog;     // append-only store (see below)
    private readonly AiConfig _config;             // pinned model version + system prompt
    private readonly IPromptHasher _promptHasher;  // SHA-256 over the prompt text

    public AiAuditMiddleware(IAuditLogStore auditLog, AiConfig config, IPromptHasher promptHasher)
    {
        _auditLog = auditLog;
        _config = config;
        _promptHasher = promptHasher;
    }

    public async Task<ChatResponse> InvokeAsync(ChatRequest request, Func<Task<ChatResponse>> next)
    {
        var response = await next();

        // Every field is a governance artifact. The entry is written before
        // the response is returned, so no answer leaves without a log record.
        await _auditLog.WriteAsync(new AiAuditEntry
        {
            RequestId         = request.RequestId,
            LoanId            = request.LoanId,
            Timestamp         = DateTimeOffset.UtcNow,
            ModelVersion      = _config.ModelVersion,
            SystemPromptHash  = _promptHasher.Hash(_config.SystemPrompt),
            UserQuestion      = request.Question,
            RetrievedSources  = response.Sources,
            GroundednessScore = response.GroundednessScore,
            ResponseText      = response.Answer,
            PromptTokens      = response.Usage.PromptTokens,
            CompletionTokens  = response.Usage.CompletionTokens,
        });

        return response;
    }
}

The audit log is write-once. No update, no delete. If a regulator asks, you run a query. If your compliance team asks, you run a query. The evidence chain exists because the architecture made it exist, not because someone remembered to log it.
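A minimal sketch of that write-once store, assuming the Azure.Storage.Blobs SDK and one append blob per day. IAuditLogStore, AiAuditEntry, and the container name are this post's illustrative names; blob-level delete protection still comes from the container's immutability policy, not from this code:

using System.Text;
using System.Text.Json;
using Azure.Storage.Blobs.Specialized;

public class AppendBlobAuditLogStore : IAuditLogStore
{
    private readonly string _connectionString;

    public AppendBlobAuditLogStore(string connectionString) =>
        _connectionString = connectionString;

    public async Task WriteAsync(AiAuditEntry entry)
    {
        // One append blob per UTC day: date-scoped regulator queries become
        // a single blob read, and no blob grows without bound.
        var blob = new AppendBlobClient(
            _connectionString,
            blobContainerName: "ai-audit-log",
            blobName: $"audit-{DateTime.UtcNow:yyyy-MM-dd}.jsonl");

        await blob.CreateIfNotExistsAsync();

        // Append blobs only accept new blocks at the end; existing content
        // cannot be overwritten or truncated, which makes the log write-once.
        var line = JsonSerializer.Serialize(entry) + "\n";
        using var stream = new MemoryStream(Encoding.UTF8.GetBytes(line));
        await blob.AppendBlockAsync(stream);
    }
}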


Property 2: Reproducibility — Pinned Prompts and Model Versions

The requirement: Given the same inputs, you must be able to reproduce the system's behavior — or at minimum, explain why the behavior changed.

This is why version pinning from the model selection post is a governance requirement, not just an operational best practice. gpt-4o-latest is not reproducible. gpt-4o-2024-11-20 deployed as mortgageiq-prod-v3 is.

The same discipline applies to system prompts. A system prompt stored in appsettings.json and deployed with the application is versioned with every release. A system prompt stored in a database and editable through an admin UI is a governance gap — a prompt can change without a code deployment, without a PR review, without an audit trail.

The governed prompt lifecycle:

  edit prompt file → pull request → review → merge → versioned release → prompt hash logged on every request

Every prompt change is a code change. Every code change goes through review. Every review has an audit trail. This is not bureaucracy — it is the difference between "we believe the prompt was this" and "the prompt was this, at this version, from this date."
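A sketch of what the pin looks like in code, using .NET's built-in SHA-256. The placeholder digest and the fail-at-startup check are illustrative, not a prescribed mechanism:

using System.Security.Cryptography;
using System.Text;

public static class PromptVersionGuard
{
    // Pinned at release time; this is the full digest that "sha256:a3f1b9c2..."
    // in the audit log abbreviates. Placeholder value, set per release.
    private const string PinnedPromptHash = "<full-sha256-digest-for-v1.4.2>";

    public static string Hash(string systemPrompt) =>
        Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(systemPrompt)))
               .ToLowerInvariant();

    // Startup check: if the deployed prompt file does not match the hash
    // pinned in this release, the service refuses to start.
    public static void VerifyOrFail(string systemPrompt)
    {
        if (Hash(systemPrompt) != PinnedPromptHash)
            throw new InvalidOperationException(
                "System prompt does not match the pinned hash for this release.");
    }
}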


Property 3: Accountability — Model Risk Management

The requirement: New model versions must be evaluated before promotion, and the evaluation results must be retained.

Model risk management in financial services (SR 11-7 guidance from the Federal Reserve) requires that models be validated before use and monitored after deployment. Generative AI models are models. The same governance framework applies.

The evaluation pipeline:

Golden Dataset: 150 question-answer pairs curated by compliance + mortgage subject matter experts
Metrics:
  - Faithfulness:         does the answer contradict the retrieved sources?
  - Answer Relevance:     does the answer address the question asked?
  - Context Precision:    are the retrieved chunks actually relevant?
  - Groundedness Score:   token overlap between answer and sources (internal metric)
  - Prohibited Outputs:   zero-tolerance for discriminatory lending language

Promotion threshold:
  - Faithfulness ≥ 0.90
  - Answer Relevance ≥ 0.85
  - Zero prohibited outputs
  - Groundedness Score ≥ 0.75 on ≥ 95% of test cases

When gpt-4o-2025-01-15 is released, it does not replace gpt-4o-2024-11-20 in production automatically. It is deployed to a shadow environment, evaluated against the golden dataset, results are reviewed, and promotion happens as a gated deployment. The evaluation results are retained as governance artifacts alongside the model version.
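A sketch of the promotion gate using the thresholds above. EvalResult and its fields are illustrative names, not a specific evaluation framework:

public record EvalResult(
    double Faithfulness,
    double AnswerRelevance,
    int ProhibitedOutputs,
    double GroundednessPassRate); // share of test cases with groundedness >= 0.75

public static class PromotionGate
{
    // Thresholds from the evaluation pipeline above. A candidate deployment
    // is promoted only if every gate passes; the EvalResult itself is
    // retained as a governance artifact alongside the model version.
    public static bool CanPromote(EvalResult r) =>
        r.Faithfulness         >= 0.90 &&
        r.AnswerRelevance      >= 0.85 &&
        r.ProhibitedOutputs    == 0 &&
        r.GroundednessPassRate >= 0.95;
}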

This is the same process applied to any model change in a financial services context — the LLM is a model, and model governance applies.


Property 4: Fairness Monitoring — Output Distribution

The requirement: AI outputs must be monitored for patterns that could constitute disparate treatment under fair lending laws (ECOA, Fair Housing Act).

In mortgage lending, an AI system that consistently provides less helpful guidance to questions framed around protected class characteristics — even without being given demographic information — can produce a disparate impact. This is not a theoretical concern. It is a regulatory examination risk.

What to monitor:

Signal                                  What it detects
Response length by question category    Model giving shorter answers to certain loan product types
Groundedness score distribution         Model drifting to parametric memory on certain topics
Fallback rate by question type          Retrieval consistently failing for certain borrower scenarios
Source citation patterns                Certain guideline sections never being retrieved
Escalation rate by topic                o1 escalation clustering around specific loan programs

None of these metrics directly measure fairness — the system doesn't receive demographic information. But anomalies in these distributions are early indicators that the system's behavior is not uniform across the question space, which is the starting point for a fair lending analysis.

Implementation: Azure Monitor custom metrics + a weekly distribution report reviewed by the compliance team. Not a dashboard — a scheduled review with documented sign-off.
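A sketch of the metric emission, assuming Application Insights as the ingestion path into Azure Monitor and reusing the ChatResponse shape from the middleware above. The metric and dimension names are illustrative:

using Microsoft.ApplicationInsights;

public class FairnessSignalEmitter
{
    private readonly TelemetryClient _telemetry;

    public FairnessSignalEmitter(TelemetryClient telemetry) => _telemetry = telemetry;

    public void Record(string questionCategory, ChatResponse response)
    {
        // Each signal is a custom metric with the question category as a
        // dimension, so the weekly report can compare distributions by slice.
        _telemetry.GetMetric("ResponseLengthTokens", "QuestionCategory")
                  .TrackValue(response.Usage.CompletionTokens, questionCategory);

        _telemetry.GetMetric("GroundednessScore", "QuestionCategory")
                  .TrackValue(response.GroundednessScore, questionCategory);
    }
}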


The Privacy-by-Design Layer

Governance in a regulated domain also requires that personal data never enters the AI pipeline uncontrolled.

Three rules:

1. PII is scrubbed before embedding. If borrower documents are ingested into the knowledge base or used as retrieval context, PII fields (SSN, DOB, account numbers) are redacted before the text is chunked and embedded. Azure AI Language's PII detection service runs as a pre-processing step (sketch after these rules).

2. Loan-specific context is not persisted in the LLM. Responses from the LLM are never stored in the vector index. The knowledge base contains guidelines — not individual loan decisions. A future retrieval query cannot surface a previous borrower's information because that information was never indexed.

3. Audit logs are access-controlled. The immutable audit log contains the full LLM context — including any loan-specific data in the system prompt. This log is protected at the same classification level as loan records, not at the application log level.
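A minimal sketch of the rule 1 pre-processing step, using the Azure.AI.TextAnalytics client for the Azure AI Language PII service. The endpoint and key wiring is illustrative:

using Azure;
using Azure.AI.TextAnalytics;

public class PiiScrubber
{
    private readonly TextAnalyticsClient _client;

    public PiiScrubber(Uri endpoint, string apiKey) =>
        _client = new TextAnalyticsClient(endpoint, new AzureKeyCredential(apiKey));

    // Runs before chunking and embedding: the service returns the text with
    // detected PII (SSNs, dates of birth, account numbers, ...) masked out,
    // so only the redacted form ever reaches the vector index.
    public async Task<string> RedactAsync(string documentText)
    {
        Response<PiiEntityCollection> result =
            await _client.RecognizePiiEntitiesAsync(documentText);
        return result.Value.RedactedText;
    }
}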


What I've Seen Fail

1. Governance as a post-launch checklist. The architecture is deployed. Legal asks "can you explain every AI-assisted decision?" The answer is no, because the audit log was not designed in. Retrofitting traceability into a running production system requires rearchitecting the inference path. Build it in from day one.

2. System prompts in a database with no audit trail. A well-intentioned product manager updates the system prompt to be "more helpful." The prompt now produces answers that are less constrained. A regulator asks which prompt was active on a specific date. The database has the current prompt. The previous version is gone. This is a governance failure.

3. Model evaluation as a one-time event. Models are evaluated at launch and never again. Six months later, a model update changes behavior on a subset of questions. The change is not caught because there is no ongoing evaluation process. Evaluation is a continuous process — automated, scheduled, and threshold-gated.

4. Treating the LLM as outside the model risk framework. "It's just a third-party API" is not a model risk management position. The LLM produces outputs that influence decisions. Those outputs must be governed. The fact that the model itself is hosted by Microsoft does not transfer the model risk management obligation.

5. No data classification on audit logs. AI audit logs contain system prompts, retrieved context, and generated responses. In a mortgage system, retrieved context may include loan guideline text that references sensitive scenarios. Storing these logs at standard application log classification — with broad access — is a data governance failure.


The Architecture Implication

AI governance in a regulated industry is not a separate system added alongside the AI. It is a set of properties that must be designed into the AI system from the first request:

  • The evidence chain exists because the middleware logs it
  • Reproducibility holds because model versions and prompts are pinned to releases
  • Model risk is managed because the evaluation pipeline is part of the deployment process
  • Fairness is monitored because the metrics exist in Azure Monitor

The compliance team doesn't build these properties. The architect does. By the time the compliance team is involved, the architecture either has them or it doesn't.

