ai-ml · March 24, 2026
Tags: fine-tuning, azure, openai, mortgageiq, fintech, responsible-ai, nlp, ai-foundry, borrower-experience

Fine-Tuning GPT-4o on Borrower Conversations: How MortgageIQ Learned to Speak Like a Loan Officer

GPT-4o could explain a payment change. But it explained it like a contract, not like a loan officer. Fine-tuning on 50,000 scrubbed borrower Q&A transcripts changed that — and cut the repeat call rate by 34 percentage points.

The borrower called three times. The answer was technically correct all three times.

A MortgageIQ borrower in Ohio received this AI-generated explanation for a $340 payment increase:

"Your monthly payment adjustment is attributable to an escrow impound account recalculation necessitated by a reassessment of your property tax obligation and a concurrent adjustment to your homeowner's insurance premium impound. The aggregate escrow deficiency has been amortized across your remaining loan term in accordance with RESPA guidelines, resulting in the referenced payment modification."

Every word was accurate. Every regulatory requirement was met. The borrower read it twice, didn't understand it, and called the loan servicer. A loan officer spent 18 minutes on the phone explaining the same thing in plain English. The borrower called again two weeks later with the same question.

The problem was not that GPT-4o was wrong. The problem was that GPT-4o was writing like a compliance document, not like the best loan officer in the room.

Fine-tuning on 50,000 borrower conversation transcripts — scrubbed of PII, labeled by communication quality, validated for RESPA compliance — changed this. The same question now gets:

"Your property taxes went up this year. Your county reassessed your home's value in January, and the new tax bill is $312 more per year. Your lender spread that increase across your monthly payments, which added $26 to each one. The remaining $14 is a small insurance adjustment. Your payment stays at this level unless your taxes change again — which usually happens every 1–3 years."

Same facts. Same compliance. The borrower understood on the first read. The repeat call rate on payment change questions dropped 34 percentage points.

This is the story of how that happened — the data pipeline, the PII problem, the compliance constraint, the labeling challenge, and the Azure AI Foundry fine-tuning process.


Why This Is the Right Use Case for Fine-Tuning an LLM

Before going further — why fine-tune at all? Why not prompt engineering or RAG?

MortgageIQ already uses both. The RAG pipeline grounds every answer in the loan knowledge base. The system prompt constrains the model to answer only from retrieved context. These solved the accuracy problem. They did not solve the communication problem.

Prompt engineering ceiling: Adding "explain in simple terms" to the system prompt improved readability from a 32nd-grade reading level to an 18th-grade level. Still too high. Still not the tone of a trusted loan officer.

RAG ceiling: The knowledge base contains guidelines and policy documents — written in the same compliance language GPT-4o defaulted to. Retrieving more of it didn't help. The model was faithfully reproducing the register of its sources.

What fine-tuning addresses that RAG and prompts cannot: The style and register of communication. How an expert in a regulated domain talks to a non-expert who is anxious, confused, and often making the largest financial decision of their life. This is not in any document. It lives in thousands of loan officer conversations. Fine-tuning is how you extract it.


The Data Problem: Why This Wasn't Done at Launch

Three blockers stood between MortgageIQ and a training dataset at launch.

Blocker 1: Conversations weren't recorded.

Loan officer Q&A happened over phone, email, and in-person. Email threads existed but were scattered across CRM systems with no consistent structure. Phone calls were not transcribed. The richest source of "how our best loan officers communicate" existed only in the memories of the people who made those calls.

Solving this took 8 months: deploying call recording and transcription (Azure AI Speech), building a CRM integration to capture email threads, and implementing a structured Q&A logging system in the MortgageIQ borrower portal. Fine-tuning was planned from the beginning — the infrastructure to collect the training data had to be built first.

Blocker 2: PII contamination.

Every borrower conversation contains PII: names, loan numbers, addresses, SSNs, income figures, account numbers. A training dataset built from raw transcripts would contain all of it. Sending PII to a fine-tuning API — even your own Azure deployment — violates GLBA (Gramm-Leach-Bliley Act) and your borrower privacy policy.

The PII scrubbing pipeline was a prerequisite, not an afterthought:

Azure AI Language's PII detection handles standard entities (names, addresses, SSNs). Custom entity recognition was trained for mortgage-specific PII: loan numbers, property addresses formatted as property descriptions, income figures embedded in explanations.

A 5% sample of redacted transcripts was reviewed by a human reviewer and by the privacy officer before any data entered the training pipeline. This review rate was set by legal, not by engineering.
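The mortgage-specific second pass can be sketched as a regex layer applied after Azure AI Language's standard redaction. The patterns below (a loan-number format and comma-grouped dollar figures) are illustrative stand-ins for the trained custom entity model, not the actual rule set:

```python
import re

# Illustrative patterns only: real loan-number and income formats are
# lender-specific and were handled by a trained custom entity model.
MORTGAGE_PII_PATTERNS = {
    "LOAN_NUMBER": re.compile(r"\b(?:LN|loan\s*#?)\s*\d{8,12}\b", re.IGNORECASE),
    # Requires a comma group, so small amounts like "$26" survive redaction
    "INCOME_FIGURE": re.compile(r"\$\d{1,3}(?:,\d{3})+(?:\.\d{2})?\b"),
}

def scrub_mortgage_pii(text: str) -> str:
    """Second-pass redaction, run after Azure AI Language PII detection."""
    for label, pattern in MORTGAGE_PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```

Note the comma-group requirement on dollar amounts: it redacts income-scale figures while leaving the small payment deltas (like the $26 escrow increase) that explanations legitimately need.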

Blocker 3: No quality labels.

A dataset of 50,000 borrower conversations is not a training dataset. It's a collection of conversations — some excellent, some average, some poor. The model needs to know which responses to emulate, not just that these conversations happened.

Labeling quality required domain expertise. The labeling rubric:

Dimension      1 (Poor)              3 (Good)               5 (Excellent)
Readability    Grade 18+             Grade 10–12            Grade 6–8
Accuracy       Contains errors       Technically correct    Correct + nuanced
Empathy        Transactional         Acknowledges concern   Validates + reassures
Completeness   Partial answer        Answers the question   Answers + anticipates follow-up
Compliance     Missing disclosures   Meets requirements     Meets + explains why

Eight senior loan officers labeled 2,000 transcripts each. Inter-rater agreement was measured — responses with disagreement scores above a threshold were excluded. Only conversations scoring ≥ 4.0 average across all dimensions entered the training set.
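The filtering step can be sketched as follows. The 4.0 average cutoff comes from the process above; the disagreement metric (standard deviation of rater scores per dimension) and its 1.0 threshold are assumptions for illustration:

```python
from statistics import mean, stdev

def passes_label_gate(rater_scores: list[dict[str, float]],
                      min_avg: float = 4.0,
                      max_disagreement: float = 1.0) -> bool:
    """rater_scores: one dict of dimension -> score per rater."""
    for dim in rater_scores[0]:
        scores = [r[dim] for r in rater_scores]
        # Exclude transcripts the raters disagreed on
        if len(scores) > 1 and stdev(scores) > max_disagreement:
            return False
    # Only transcripts averaging >= 4.0 across all dimensions enter training
    avg = mean(s for r in rater_scores for s in r.values())
    return avg >= min_avg
```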

This labeling process took 14 weeks. It cannot be shortcut. The quality of the fine-tuned model is a direct function of the quality of the labels.


Training Data Structure

The final training format for Azure AI Foundry:

{"messages": [
  {
    "role": "system",
    "content": "You are a MortgageIQ loan advisor. Explain mortgage concepts clearly and warmly to borrowers who may be anxious or confused. Use plain English — no jargon without explanation. Always be accurate and RESPA-compliant. Anticipate follow-up questions and address them proactively."
  },
  {
    "role": "user",
    "content": "Why did my payment go up $340 this month? I didn't change anything."
  },
  {
    "role": "assistant",
    "content": "Your property taxes went up this year. Your county reassessed your home's value in January, and the new tax bill is $312 more per year. Your lender spread that increase across your monthly payments, which added $26 to each one. The remaining $14 is a small insurance adjustment.\n\nYour payment stays at this level unless your taxes change again — which usually happens every 1–3 years. If you think the county assessed your home too high, you can appeal the assessment — many homeowners do this successfully. Would you like information on how that process works?"
  }
]}

Note the last sentence — the fine-tuned model proactively offers the appeal option. This was a pattern found consistently in the highest-scoring loan officer responses. The labeling rubric rewarded "anticipates follow-up." The model learned it.
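Before upload, each JSONL line can be checked against the chat format shown above. A minimal validator (the exact checks are a sketch, not the pipeline's actual validation code):

```python
import json

def validate_training_line(line: str) -> bool:
    """Each line must be a JSON object whose messages start with a system
    prompt, contain a user turn, and end with an assistant response."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return False
    messages = record.get("messages", [])
    roles = [m.get("role") for m in messages]
    return (roles[:1] == ["system"]
            and "user" in roles
            and roles[-1] == "assistant"
            and all(isinstance(m.get("content"), str) and m["content"]
                    for m in messages))
```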

Dataset composition:

Raw transcripts collected:      127,483
After PII scrubbing:            124,891
After quality labeling:          62,447  (scored ≥ 4.0)
After format conversion:         58,203
After deduplication:             51,847
Training set (80%):              41,477
Validation set (10%):             5,185
Test set (10% — held out):        5,185

Topic distribution was intentionally balanced:

Topic                                       Training examples   % of queries in prod
Payment changes (escrow, tax, insurance)               12,443                    31%
Payoff quotes and timelines                             8,296                    21%
PMI removal eligibility                                 6,221                    16%
Rate and ARM adjustments                                4,148                    11%
Loan modification options                               3,323                     8%
Forbearance and hardship                                2,489                     6%
Refinancing eligibility                                 2,074                     5%
Other                                                   2,483                     2%

Underrepresented topics (forbearance, hardship) were oversampled to 6% of training data — these are the highest-stakes conversations. A fine-tuned model that performs well on common queries but fails on hardship conversations is not acceptable in a regulated lending context.
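Oversampling to a target share can be done by resampling the underrepresented topic with replacement until it reaches the desired fraction of the dataset. A sketch (the resampling approach is an assumption, not the pipeline's actual code):

```python
import random

def oversample_topic(examples_by_topic: dict[str, list], topic: str,
                     target_share: float, seed: int = 42) -> dict[str, list]:
    """Duplicate examples of `topic` until it makes up `target_share`
    of the combined dataset."""
    rng = random.Random(seed)
    total_other = sum(len(v) for t, v in examples_by_topic.items() if t != topic)
    # Solve n / (n + total_other) = target_share for n
    target_n = round(target_share * total_other / (1 - target_share))
    pool = examples_by_topic[topic]
    resampled = list(pool)
    while len(resampled) < target_n:
        resampled.append(rng.choice(pool))  # sample with replacement
    out = dict(examples_by_topic)
    out[topic] = resampled
    return out
```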


The Compliance Constraint: Training Data That Passes Legal Review

This is the dimension that makes MortgageIQ fine-tuning different from a consumer chatbot fine-tuning project.

Every training response was reviewed for RESPA compliance before entering the dataset:

RESPA requirements that shape the training data:

  • Payment change explanations must reference the escrow account analysis statement
  • Payoff quotes must include the date through which the quote is valid
  • Loss mitigation options must be presented without steering
  • Foreclosure alternatives must be disclosed in hardship conversations

Responses that were warm, clear, and empathetic but missed a required disclosure were excluded from training — regardless of their communication quality score. Compliance was a hard gate, not a weighted dimension.

This created a tension: the most natural-sounding responses sometimes omitted disclosures that feel bureaucratic in conversation. The labeling rubric resolved this by making compliance a binary requirement rather than a scored dimension. A response either passes compliance review or it doesn't appear in the training set.

The result: the fine-tuned model learned to embed required disclosures naturally into the flow of explanation, the way the best loan officers do — not as a tacked-on legal paragraph, but as context that helps the borrower understand:

"...Your lender is required to send you an annual escrow analysis statement that breaks this down in detail — you should receive it within the next 10 days if you haven't already. That document will show exactly how the new payment was calculated."

The disclosure is there. It doesn't feel like a disclosure.


Fine-Tuning on Azure AI Foundry

Azure AI Foundry fine-tuning is driven through the Azure OpenAI SDK: upload the JSONL files, then create the job.

import os
from openai import AzureOpenAI  # Foundry fine-tuning runs through the Azure OpenAI API

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-10-21",
)

# Training and validation files must be uploaded first
train = client.files.create(
    file=open("mortgageiq-conversations-train.jsonl", "rb"), purpose="fine-tune")
val = client.files.create(
    file=open("mortgageiq-conversations-val.jsonl", "rb"), purpose="fine-tune")

job = client.fine_tuning.jobs.create(
    model="gpt-4o-2024-11-20",
    training_file=train.id,
    validation_file=val.id,
    hyperparameters={
        "n_epochs": 4,
        "batch_size": 16,          # smaller batch — preserves nuance in tone
        "learning_rate_multiplier": 0.05,  # very conservative — tone is fragile
    },
    suffix="mortgageiq-advisor-v1",
)
# Training time: ~14 hours
# Cost: ~$890

Why learning rate 0.05 — lower than the dispatch model's 0.1:

Tone is more fragile than factual accuracy. A high learning rate can shift the model's communication style aggressively — you get a model that sounds like your training data but has lost the general language fluency that makes it coherent. For communication tasks, conservative learning rates preserve the underlying language model while nudging the style.

We ran learning rates of 0.01, 0.05, and 0.1 on a 10% data sample:

  • 0.01: Minimal style shift — sounded like the base model with slight improvement
  • 0.05: Target tone achieved — clear, empathetic, compliant
  • 0.1: Overfit to training register — started sounding robotic in edge cases

Evaluation: Beyond Accuracy

Evaluating a communication fine-tune is harder than evaluating a prediction model. There is no single accuracy metric. We measured five dimensions:

1. Readability (Flesch-Kincaid grade level)

                              Base GPT-4o   Fine-Tuned
Payment change explanations    Grade 18.2    Grade 7.1
Payoff quote explanations      Grade 16.8    Grade 6.9
PMI removal                    Grade 15.3    Grade 7.4
Hardship options               Grade 14.1    Grade 6.2

Target was Grade 6–8 — readable by a borrower under stress, reading on a phone screen, possibly at 9PM after opening a payment notice.
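Flesch-Kincaid grade level is a simple formula over word, sentence, and syllable counts: 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. The syllable counter below is a rough vowel-group heuristic, so scores are approximate:

```python
import re

def _syllables(word: str) -> int:
    # Rough heuristic: count vowel groups, minimum one per word
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fk_grade(text: str) -> float:
    """Approximate Flesch-Kincaid grade level of a passage."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(_syllables(w) for w in words)
    return 0.39 * (len(words) / sentences) + 11.8 * (syllables / len(words)) - 15.59
```

Run against the two payment-change explanations from the opening of this post, the compliance-register version scores deep into graduate-school territory while the plain-English version lands in the target band.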

2. RESPA Compliance Pass Rate

Automated compliance checking (rule-based, not LLM-based) against a test set of 5,185 conversations:

                          Base GPT-4o   Fine-Tuned
Compliance pass rate            78.3%        97.1%
Missing disclosure rate         21.7%         2.9%

The base model was compliant 78% of the time — meaning 22% of responses required human review before sending. The fine-tuned model reduced this to 2.9%. Not zero — there will always be edge cases — but the human review burden dropped by 87%.
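A rule-based check of this kind can be as simple as required-disclosure patterns per query type. The patterns below are illustrative stand-ins, not the actual RESPA rule set:

```python
import re

# Illustrative disclosure patterns per query type -- not the real rule set.
REQUIRED_DISCLOSURES = {
    "payment_change": [re.compile(r"escrow\s+(account\s+)?analysis", re.I)],
    "payoff_quote": [re.compile(r"valid\s+through", re.I)],
}

def compliance_pass(query_type: str, response: str) -> bool:
    """True if every required disclosure pattern for this query type appears.
    Query types with no registered rules pass by default."""
    rules = REQUIRED_DISCLOSURES.get(query_type, [])
    return all(p.search(response) for p in rules)
```

Because the check is deterministic, it doubles as both the training-data compliance gate and the production evaluation harness, with no risk of an LLM grader hallucinating a pass.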

3. Borrower Comprehension (A/B test)

500 real borrowers (consented) were randomly assigned base or fine-tuned responses to payment change questions. Follow-up survey after 48 hours:

Question                                 Base GPT-4o   Fine-Tuned
"I understood the explanation clearly"     41% agree    89% agree
"I felt reassured after reading"           28% agree    76% agree
"I needed to call for clarification"         54% yes      18% yes

4. Repeat Call Rate (production metric)

After full deployment, measured over 90 days:

Query Type                 Before Fine-Tuning     After Fine-Tuning      Change
Payment change questions   47% repeat call rate   13% repeat call rate   -34 pts
Payoff quote questions     38% repeat call rate   11% repeat call rate   -27 pts
PMI removal questions      52% repeat call rate   19% repeat call rate   -33 pts

Each avoided repeat call saves approximately 18 minutes of loan officer time. At volume, the fine-tuning project paid for itself in 6 weeks.

5. Groundedness (still required — fine-tuning doesn't remove the need for RAG)

Fine-tuning changed the style. RAG still provides the substance. The fine-tuned model still retrieves from the loan knowledge base. Groundedness scores stayed consistent:

                     Base + RAG   Fine-Tuned + RAG
Groundedness score         0.87               0.89

Fine-tuning did not degrade factual grounding. The model learned to express retrieved facts differently — not to generate different facts.


The Champion-Challenger Deployment

The fine-tuned model was not released to all borrowers immediately:

Phase 1 (weeks 1–2):   5% of payment change queries → fine-tuned
Phase 2 (weeks 3–4):   25% of payment change queries → fine-tuned
Phase 3 (weeks 5–6):   50% of all queries → fine-tuned
Phase 4 (week 7+):     100% rollout after compliance sign-off

Each phase gate required:

  • Compliance pass rate ≥ 95%
  • Repeat call rate ≤ prior phase
  • No new complaint categories in CRM
  • Privacy officer sign-off on PII handling in production
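Deterministic traffic splitting for a rollout like this can be done by hashing the borrower ID into a percentage bucket, so the same borrower always sees the same model within a phase. The hashing scheme below is an assumption for illustration:

```python
import hashlib

def assign_model(borrower_id: str, challenger_pct: int) -> str:
    """Stable champion-challenger assignment: hash the borrower ID into
    a 0-99 bucket and route the lowest `challenger_pct` buckets to the
    fine-tuned model."""
    digest = hashlib.sha256(borrower_id.encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return "fine-tuned" if bucket < challenger_pct else "base"
```

Raising `challenger_pct` from 5 to 25 to 50 to 100 between phases only ever moves borrowers from champion to challenger, never back, which keeps the per-borrower experience consistent across phase gates.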

The phased rollout caught one issue at Phase 2: the fine-tuned model was occasionally adding unsolicited refinancing suggestions to hardship conversations — a pattern it had learned from a subset of training data where loan officers offered alternatives. This was flagged by compliance as potential steering. The training data was filtered to remove those examples, the model was retrained, and the rollout resumed.

This is why phased rollouts exist. The training process is not the last line of defense.


What Fine-Tuning Cannot Fix — The Honest Limits

1. Hallucinated loan officer charm. The model learned empathetic communication patterns. It did not learn genuine understanding of borrower circumstances. A borrower who says "I'm going through a divorce and I'm terrified of losing the house" gets a warmer response than before — but the model is pattern-matching to "distress + housing concern," not actually understanding. Human escalation paths remain essential for high-stakes emotional conversations.

2. Outdated style in a changing regulatory environment. RESPA requirements change. State-specific disclosures change. The fine-tuned model learned compliance patterns from training data with a cutoff date. A regulatory change after that cutoff is not reflected in the model's learned behavior. The RAG knowledge base handles factual updates. Behavioral compliance updates require retraining.

3. The long tail of edge cases. 50,000 training examples cover common scenarios well. Uncommon scenarios — a borrower in active bankruptcy, a loan in a FEMA disaster zone, a reverse mortgage converting to conventional — may not have enough training examples to produce fine-tuned behavior. The model falls back toward base model behavior for these cases. Monitoring for edge case failures is ongoing.


Key Takeaways

The MortgageIQ problem: GPT-4o explained mortgage concepts accurately but wrote like a compliance document. Borrowers didn't understand. Repeat calls cost 18 minutes of loan officer time each. The problem was not accuracy — it was register. The model needed to learn how the best loan officers communicate, not just what they know.

The transferable principle: Fine-tuning a language model is most valuable when the gap is how something is communicated, not what is communicated. RAG solves the "what." Prompt engineering nudges the "how." When the "how" gap is large and domain-specific — regulated financial communication, clinical patient communication, legal plain-language requirements — fine-tuning on expert communication examples is the right tool.

What I'd do differently: Start the call recording and transcript pipeline 12 months before the fine-tuning project. The 8-month delay between deciding to fine-tune and having usable training data was the longest part of the timeline. The model training itself took 14 hours. The data infrastructure took 8 months.

Watch out for: The compliance-quality tension. The most natural-sounding responses sometimes omit required disclosures. If you weight communication quality as the only labeling dimension, you train a model that sounds great and violates RESPA. Compliance must be a hard gate — not a scored dimension — in the labeling rubric. Design the labeling system before you start labeling.

