ai-ml · March 22, 2026 · prompt-engineering · azure-openai · rag · structured-output · grounding

Prompt Engineering Is Software Engineering. Treat It That Way.

A prompt is an API contract with the model. Version it, test it, and evaluate it — the same way you would any other interface your system depends on.

A prompt is a contract. It specifies what you expect the model to do, under what conditions, with what constraints. When that contract is implicit — written once, stored in a config file, never tested — it breaks silently. The model updates. Business logic in the system prompt drifts. A retrieval strategy changes. Output schema shifts. Nobody notices until a user reports a wrong answer.

Treat prompts like code. Version them, test them, evaluate them.


The Problem With "Just Write a Better Prompt"

The advice to "improve your prompt" assumes the prompt is the problem. Usually it isn't.

In a RAG system like MortgageIQ, the model receives: a system message defining its role and constraints, retrieved context chunks, and the user's question. If the answer is wrong, the failure could be in any of those three layers. The most common failure modes — in order of frequency:

  1. Retrieval miss — the wrong chunks were retrieved; no prompt fixes bad context
  2. Prompt drift — the system prompt was updated without updating the evaluation dataset
  3. Schema mismatch — the model returns a different format than the downstream code expects
  4. Model update — the base model behavior shifted; the prompt worked on the old version

Prompt engineering that doesn't account for these failure modes is just guessing. Structured prompt engineering means having a way to detect each failure mode before it reaches production.


The Four Parts of a Production Prompt

Every production prompt has four components. Missing any one of them creates a specific, predictable failure.

1. Role Definition

The role tells the model what kind of expertise to apply and what it doesn't know. The MortgageIQ system prompt is:

You are a helpful mortgage and loan assistant. Give concise, practical guidance,
note when rules vary by lender or location, and avoid pretending to know
borrower-specific facts you were not given.

Two decisions are embedded here. "Note when rules vary by lender or location" prevents confident wrong answers on state-specific rules. "Avoid pretending to know borrower-specific facts" is a hallucination guard for personalized advice the system can't give.

A weak role definition: "You are a helpful assistant." This gives the model no scope boundary. It will answer anything with equal confidence.

2. Constraints

Constraints define what the model must not do. In regulated domains, this is where compliance lives:

Answer only from the provided context.
If the provided context does not address the question,
answer from your general knowledge and say so.
Do not provide personalized financial advice.
Do not quote specific interest rates — these change daily.

The distinction between "answer only from context" and "answer from general knowledge and say so" matters. Strict retrieval-only fails closed on out-of-scope questions: the user gets an "I don't have this information" response that erodes trust. Allowing general knowledge with a disclaimer is more useful — the user gets an answer and understands its source.

3. Retrieved Context

This is the RAG injection point. The format in which context is injected affects answer quality as much as the retrieval algorithm itself.

Format matters: The model uses [SourceName] headers to attribute answers to sources. If context is injected as a wall of text with no structure, the model cannot reliably cite the right source.

# MortgageIQ context format — each chunk labeled with source
Use the following loan knowledge to answer accurately:

[fha loan requirements — Credit Score Requirements]
FHA loans require a minimum credit score of 580 to qualify for the 3.5% down payment...

[credit score guidelines — Credit Score Ranges]
Credit scores are grouped into ranges that affect loan eligibility...

If the provided context does not address the question, answer from your
general knowledge and say so.

The citation is retrieval metadata, not model output. In MortgageIQ, sources[] in the API response is populated from RetrievalResult.SourceName — it comes from the retrieval layer before the model call. The model is asked to reference the source label in its answer, but the source chip the user clicks is guaranteed to be accurate regardless of what the model says. This is the difference between a traceable system and a hallucination-prone one.
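
Here is a minimal sketch of that separation in Python; names like RetrievalResult.source_name and build_api_response are illustrative stand-ins, not the actual MortgageIQ code:

from dataclasses import dataclass

@dataclass
class RetrievalResult:   # illustrative stand-in for the retrieval layer's type
    source_name: str     # e.g. "fha loan requirements — Credit Score Requirements"
    text: str

def build_context_block(results: list[RetrievalResult]) -> str:
    """Inject each chunk under a [SourceName] header the model can cite."""
    labeled = [f"[{r.source_name}]\n{r.text}" for r in results]
    return (
        "Use the following loan knowledge to answer accurately:\n\n"
        + "\n\n".join(labeled)
        + "\n\nIf the provided context does not address the question, "
        "answer from your general knowledge and say so."
    )

def build_api_response(answer: str, results: list[RetrievalResult]) -> dict:
    # sources[] is populated from retrieval metadata, never parsed out of the
    # model's text; the source chips stay accurate whatever the model says.
    return {"answer": answer, "sources": [r.source_name for r in results]}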

4. Output Format

Specifying the output schema is the difference between a prompt and a structured output pipeline.

For a simple chat response, format constraints are implicit. For a pipeline that parses model output — classification, extraction, structured data generation — they must be explicit:

Return your answer in JSON:
{
  "answer": "...",
  "confidence": "high | medium | low",
  "source_used": "source name or null",
  "disclaimer": "optional — include if answer may vary by lender or state"
}

Do not include any text outside the JSON object.

When the model returns invalid JSON — and it will, on edge cases — you need a validation and retry loop, not a crash.
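
A tolerant parser is the first line of that loop. A minimal sketch in Python, assuming the worst the model does is wrap its JSON in a markdown fence or pad it with prose:

import json
import re

def extract_json(raw: str) -> dict | None:
    """Parse model output leniently: accept bare JSON or a fenced ```json block.
    Returns None when nothing parseable is found, which signals a retry."""
    candidate = raw.strip()
    fenced = re.search(r"```(?:json)?\s*(\{.*\})\s*```", candidate, re.DOTALL)
    if fenced:
        candidate = fenced.group(1)
    elif "{" in candidate and "}" in candidate:
        # fall back to the outermost braces when the model pads JSON with prose
        candidate = candidate[candidate.index("{"): candidate.rindex("}") + 1]
    try:
        parsed = json.loads(candidate)
        return parsed if isinstance(parsed, dict) else None
    except json.JSONDecodeError:
        return None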


Validation Loops: What Production Prompt Engineering Looks Like

A validation loop is a programmatic check on model output before it reaches the user. When the check fails, the system retries with a corrective prompt.

What to validate:

  • JSON schema adherence — does the output match the expected structure?
  • Citation coverage — does every factual claim reference a source?
  • Scope compliance — did the model answer from context, or from training data on a scoped query?
  • Confidence threshold — if confidence: low, surface a disclaimer before returning

The corrective prompt pattern:

# First attempt failed schema validation
The previous response did not match the required JSON schema.
Error: missing field "confidence"

Please retry. Return ONLY a valid JSON object with these fields:
{ "answer": "...", "confidence": "high|medium|low", "source_used": "..." }

The corrective prompt includes the specific error. "Please try again" without context produces the same failure. The error message is part of the prompt.

Retry budget: two retries maximum. If the model fails three times, it's a schema problem, not a transient model behavior. Log it, return a fallback, fix the prompt.
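
Putting it together: a sketch of the loop, assuming a call_model(messages) chat-completion wrapper (not shown) and the extract_json helper from earlier:

REQUIRED_FIELDS = {"answer", "confidence", "source_used"}
MAX_RETRIES = 2  # two retries maximum; a third failure is a schema problem

def validate(payload: dict | None) -> str | None:
    """Return a specific error for the corrective prompt, or None if valid."""
    if payload is None:
        return "the response was not a valid JSON object"
    missing = REQUIRED_FIELDS - payload.keys()
    if missing:
        return "missing field " + ", ".join(f'"{m}"' for m in sorted(missing))
    if payload["confidence"] not in {"high", "medium", "low"}:
        return 'field "confidence" must be one of high | medium | low'
    return None

def generate_validated(messages: list[dict]) -> dict:
    for _ in range(MAX_RETRIES + 1):
        raw = call_model(messages)  # assumed chat-completion wrapper
        payload = extract_json(raw)
        error = validate(payload)
        if error is None:
            return payload
        # The corrective prompt carries the specific error; "please try again"
        # without context reproduces the same failure.
        messages = messages + [
            {"role": "assistant", "content": raw},
            {"role": "user", "content": (
                "The previous response did not match the required JSON schema.\n"
                f"Error: {error}\n\n"
                "Please retry. Return ONLY a valid JSON object with these fields:\n"
                '{ "answer": "...", "confidence": "high|medium|low", "source_used": "..." }'
            )},
        ]
    # Budget exhausted: log it, return a fallback, fix the prompt.
    return {"answer": "Sorry, I could not produce a reliable answer to that question.",
            "confidence": "low", "source_used": None}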


Prompt Versioning — Treating Prompts Like Code

A prompt is the interface between your application and the model. Changing it without tracking the change is equivalent to changing an API contract without a version bump.

What prompt versioning requires:

prompts/
  system/
    v1.0.0-base.txt          ← original
    v1.1.0-add-disclaimer.txt ← added "note when rules vary by lender"
    v1.2.0-scope-constraints.txt ← added "do not quote live rates"
  evaluation/
    golden-dataset.jsonl     ← 50 question-answer pairs
    v1.0.0-scores.json       ← RAGAS scores for v1.0.0
    v1.1.0-scores.json       ← RAGAS scores for v1.1.0

What changes between versions must be documented:

  • What changed and why
  • Which evaluation scores changed (faithfulness, relevance, answer correctness)
  • Whether the change was a regression or improvement on the golden dataset

The golden dataset: 50–100 representative questions with expected answers. Not just "happy path" questions — include questions that should produce retrieval misses, questions with ambiguous scope, and questions where the wrong answer has consequences (wrong credit score threshold, wrong DTI limit).
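
Two illustrative golden-dataset entries (the exact fields depend on your evaluation harness):

{"question": "What credit score do I need for an FHA loan?", "expected_answer": "580 minimum for the 3.5% down payment option", "expected_source": "fha loan requirements", "category": "happy-path"}
{"question": "What is the current 30-year fixed rate?", "expected_answer": null, "expected_source": null, "category": "out-of-scope"}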

Run the evaluation suite on every prompt change before merging. A prompt that improves the happy path but increases retrieval miss rate on edge cases is a regression.
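
That check can be a short CI gate. A sketch, assuming an evaluate() wrapper around your RAGAS run (illustrative; not the actual RAGAS API):

import json
import sys

def load_jsonl(path: str) -> list[dict]:
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

def gate(prompt_version: str, golden_path: str, baseline_path: str) -> int:
    """Exit non-zero when the new prompt regresses on the golden dataset."""
    golden = load_jsonl(golden_path)
    scores = evaluate(prompt_version, golden)  # assumed evaluation harness call
    with open(baseline_path) as f:
        baseline = json.load(f)
    failed = False
    for metric in ("faithfulness", "answer_correctness"):
        if scores[metric] < baseline[metric]:
            print(f"REGRESSION {metric}: {baseline[metric]:.3f} -> {scores[metric]:.3f}")
            failed = True
    return 1 if failed else 0

if __name__ == "__main__":
    sys.exit(gate(*sys.argv[1:4]))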


Few-Shot vs. Zero-Shot — When Each Works

Zero-shot prompting gives the model instructions and no examples. It works when the task is unambiguous and the model's training data contains enough similar patterns.

Few-shot prompting provides 2–5 examples of input/output pairs. It works when:

  • The output format is unusual or complex
  • The tone or style must be tightly controlled
  • The model consistently fails a specific pattern zero-shot

For MortgageIQ: zero-shot with explicit format constraints handles most queries. Few-shot is needed for the edge case where the model must distinguish between "this question is out of scope" (live rates) and "this question is in scope but the answer isn't in the knowledge base" (obscure loan type).

# Few-shot examples for scope boundary detection

Example 1:
Question: "What is the current 30-year fixed mortgage rate?"
Response: { "answer": "Mortgage rates change daily and depend on your lender, credit profile, and market conditions. I can't quote a current rate — check with your lender or Bankrate for today's rates.", "confidence": "high", "source_used": null, "out_of_scope": true }

Example 2:
Question: "What credit score do I need for a USDA loan?"
Response: { "answer": "I don't have USDA loan requirements in my current knowledge base. Generally, lenders require a 640+ score, but this varies — confirm with a USDA-approved lender.", "confidence": "medium", "source_used": null, "out_of_scope": false }

The distinction: live rates are always out of scope (external, time-varying data). USDA loans are in scope as a loan type, just not in the current knowledge base. The system behavior differs: one is a hard redirect, one is a soft "here's what I know generally."
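
Downstream, that difference is just a branch on the fields the few-shot examples teach the model to emit (a sketch; field names follow the examples above):

def route(payload: dict) -> str:
    """Hard redirect for out-of-scope queries; soft, disclaimed answer when the
    topic is in scope but absent from the knowledge base."""
    if payload.get("out_of_scope"):
        return payload["answer"]  # redirect text; no source chip is rendered
    if payload.get("source_used") is None:
        return payload["answer"] + "\n\n(Answered from general knowledge, not the current knowledge base.)"
    return payload["answer"]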


Hallucination Prevention — Programmatic, Not Wishful

"Don't hallucinate" in a system prompt is not a hallucination prevention strategy. It's an instruction the model can ignore. Programmatic prevention is the only reliable approach.

Five programmatic guardrails:

1. Context grounding check — After generation, verify the answer contains references to the retrieved context. If the answer cites a source not in sources[], flag it (see the sketch after this list).

2. Citation enforcement — Citations come from the retrieval layer (RetrievalResult.SourceName), not from model output. The UI renders these regardless of what the model says. The model cannot fabricate a source chip.

3. Scope detection — Use response tags (with-retrieval, retrieval-miss) as observability signals. A retrieval miss on an in-scope question means the knowledge base needs expansion, not that the model should guess.

4. Confidence scoring — Ask the model to self-report confidence. Low confidence answers get a disclaimer rendered in the UI before the answer displays.

5. Output validation — Schema validation on every response. If the model returns text instead of JSON, retry with a corrective prompt. If it exhausts the retry budget, return a fallback response.

None of these require trusting the model to behave correctly. They're checks on the output. That's the difference between a prompt and a prompt system.
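
For instance, guardrail 1 reduces to a regex over the [SourceName] labels (a sketch; it assumes answers cite sources in the bracket format shown earlier):

import re

def grounding_check(answer: str, sources: list[str]) -> bool:
    """Flag answers that cite a [SourceName] absent from the retrieval results."""
    cited = re.findall(r"\[([^\]]+)\]", answer)
    known = {s.lower() for s in sources}
    return all(label.lower() in known for label in cited)

# grounding_check("Per [fha loan requirements], a 580 score qualifies...",
#                 ["fha loan requirements"])  -> True; an unknown label -> False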


What I've Seen Fail

1. Business logic in prompts with no tests. A constraint like "do not quote interest rates" lives in the system prompt. When someone edited the prompt to "improve tone," the constraint was inadvertently removed. Nobody noticed for two weeks. Fix: the golden dataset includes questions that test every constraint, not just happy path questions.

2. Format instructions without validation loops. "Return JSON" in the system prompt works 95% of the time. In the remaining 5% — edge cases, long outputs, unusual characters — the model returns markdown-wrapped JSON or adds an explanation before the JSON object. Without a validation loop, that 5% crashes the downstream parser.

3. Retrieval context injected as plain text. No source labels, no structure. The model generates an answer and cites "the document" generically. Users have no traceability. Fix: inject context with structured [SourceName] labels and require the model to reference them.

4. Prompt pinned to a model version, model updated silently. Azure OpenAI model deployments can be configured to auto-update. A prompt tuned for gpt-4o-2024-08-06 may produce different output on gpt-4o-2024-11-20. Fix: pin the model version in deployment config; run evaluation suite on every version update before switching production traffic.


The Minimum Viable Prompt Engineering System

For a team shipping a production RAG system, the minimum viable prompt engineering system is:

  1. Prompts in version control — not in a config file, not in a database — in git, alongside the code that uses them
  2. Golden dataset of 50 questions — covering happy path, retrieval miss, scope boundary, and edge cases
  3. Evaluation on every prompt change — RAGAS faithfulness + answer correctness at minimum
  4. Validation loop — schema check + one retry + fallback on second failure
  5. Response tags (with-retrieval, retrieval-miss, validation-retry) as observable signals

This is not a research project. It's the engineering discipline that separates a prompt that works in a demo from a system that works in production.


MortgageIQ source code: github.com/shivojha/azure-ai-loan-copilot
Project page: MortgageIQ — Azure AI Loan Copilot