There are 6 different search algorithms your RAG system can use. Most teams pick one, call it "vector search," and wonder why retrieval misses the obvious answers.
The algorithm you choose — and how you combine them — determines whether your RAG system finds "closing costs" when a user asks about "cash at settlement," whether it finds loan program code "CONV30" in a sea of 50,000 documents, and whether a voice query gets the same precision as a text query.
This is the complete map — every algorithm, how it works at the math level, how it's implemented in Azure AI Search and open source, performance benchmarks, and when to use each.
The Search Algorithm Landscape
1. Keyword Search — BM25
BM25 (Best Match 25) is the gold standard for keyword-based retrieval. It's the algorithm behind Elasticsearch, Azure AI Search's full-text mode, Solr, and every traditional search system built in the last two decades.
How BM25 Works
BM25 scores a document against a query by measuring how often query terms appear in the document, weighted by how rare those terms are across the entire corpus.
BM25(D, Q) = Σ IDF(qᵢ) × [f(qᵢ, D) × (k₁ + 1)] / [f(qᵢ, D) + k₁ × (1 - b + b × |D|/avgdl)]
Where:
- IDF(qᵢ) = inverse document frequency — rare terms score higher
- f(qᵢ, D) = term frequency of qᵢ in document D
- k₁ = term frequency saturation (typically 1.2–2.0) — diminishing returns on repeated terms
- b = length normalization (typically 0.75) — penalizes long documents
- |D|/avgdl = document length relative to the corpus average
The saturation effect: the k₁ parameter means doubling the term frequency doesn't double the score. A document with "FHA" appearing 10 times doesn't score 10x better than one with "FHA" appearing once. This prevents keyword stuffing from dominating results.
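A quick check of that curve, computed straight from the formula above (k₁ = 1.2, b = 0.75, document at average length):
k1, b = 1.2, 0.75

def tf_component(tf: float, doc_len_ratio: float = 1.0) -> float:
    # the tf factor [f × (k₁ + 1)] / [f + k₁ × (1 - b + b × |D|/avgdl)] from BM25
    return (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len_ratio))

print(tf_component(1))   # 1.00
print(tf_component(10))  # ~1.96: 10x the occurrences, only ~2x the score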
Where BM25 wins:
- Exact term matching — loan codes ("CONV30", "FHA203K"), product IDs, named entities
- Rare, high-signal terms — "RESPA", "TRID", "mTLS"
- When the user knows the exact terminology
Where BM25 fails:
- Vocabulary mismatch — "cash upfront" vs "closing costs" scores zero overlap
- Synonyms and paraphrases
- Conceptual queries — "what makes a loan risky?" has no obvious keyword targets
Azure AI Search — Full-Text / BM25
from azure.search.documents import SearchClient
from azure.core.credentials import AzureKeyCredential
client = SearchClient(
endpoint="https://your-search.search.windows.net",
index_name="mortgage-rag-index",
credential=AzureKeyCredential(API_KEY)
)
# Pure BM25 keyword search
results = client.search(
search_text="FHA loan limits 2025",
query_type="simple", # BM25 scoring
search_fields=["content", "title", "section"],
select=["chunk_id", "content", "doc_title", "section"],
top=10
)
for result in results:
print(f"Score: {result['@search.score']:.3f} | {result['doc_title']} — {result['section']}")
BM25 with field boosting — weight title matches higher than body matches:
results = client.search(
search_text="FHA loan limits 2025",
query_type="full", # Lucene query syntax
search_fields=["title^3", "section^2", "content"], # boost title 3x
top=10
)
Open Source — Elasticsearch / OpenSearch
from elasticsearch import Elasticsearch
es = Elasticsearch("http://localhost:9200")
# BM25 search — default in Elasticsearch
response = es.search(
index="mortgage-chunks",
body={
"query": {
"multi_match": {
"query": "FHA loan limits 2025",
"fields": ["title^3", "section^2", "content"],
"type": "best_fields",
"tie_breaker": 0.3
}
},
"size": 10
}
)
Qdrant sparse vectors (BM25-compatible):
from qdrant_client import QdrantClient
from qdrant_client.models import SparseVector, NamedSparseVector
# Qdrant supports sparse vectors for BM25-style retrieval
# Use a sparse encoder (SPLADE, BM25) to generate sparse vectors
sparse_vector = encode_bm25("FHA loan limits 2025") # returns {token_id: score}
client.search(
collection_name="mortgage-chunks",
query_vector=NamedSparseVector(
name="sparse",
vector=SparseVector(indices=list(sparse_vector.keys()), values=list(sparse_vector.values()))
),
limit=10
)
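The encode_bm25 call above is a placeholder. A minimal self-contained sketch of what it could look like, with a toy CORPUS standing in for your indexed chunks (in production you would more likely use a trained sparse encoder such as SPLADE, or let the engine score BM25 natively):
import math
import re
from collections import Counter

CORPUS = [
    "FHA loan limits for 2025 by county",
    "Conventional CONV30 rate sheet effective 2026-Q1",
    "Closing costs and cash at settlement explained",
]

def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())

# Corpus-level statistics: document frequency, average length, token -> integer id
doc_tokens = [tokenize(d) for d in CORPUS]
df = Counter(t for toks in doc_tokens for t in set(toks))
avgdl = sum(len(toks) for toks in doc_tokens) / len(doc_tokens)
N = len(CORPUS)
vocab = {t: i for i, t in enumerate(sorted(df))}

def encode_bm25(text: str, k1: float = 1.2, b: float = 0.75) -> dict[int, float]:
    # BM25-weighted sparse vector: {token_id: weight}; tokens unseen in the corpus are skipped
    toks = tokenize(text)
    out = {}
    for tok, f in Counter(toks).items():
        if tok not in vocab:
            continue
        idf = math.log(1 + (N - df[tok] + 0.5) / (df[tok] + 0.5))
        out[vocab[tok]] = idf * (f * (k1 + 1)) / (f + k1 * (1 - b + b * len(toks) / avgdl))
    return out

sparse_vector = encode_bm25("FHA loan limits 2025")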
2. Vector Search — HNSW and eKNN
Vector search finds documents whose embedding vectors are nearest to the query vector in high-dimensional space. The challenge: searching 50,000 vectors for the nearest neighbors naively is O(n×d) — too slow for production query latency.
Two algorithms solve this: HNSW (approximate, fast) and eKNN (exact, slow).
HNSW — Hierarchical Navigable Small World
HNSW builds a multi-layer graph where each node (vector) connects to its nearest neighbors. Higher layers are sparser — long-range connections for fast traversal. Lower layers are denser — precise local neighborhood search.
Query traversal:
- Enter at the top layer — pick the best-connected entry node
- Greedily move toward the query vector at each layer
- When you can't improve at the current layer, drop to the next layer
- At Layer 0, perform local beam search among the dense neighborhood (a toy sketch follows this list)
- Return top-K results
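Step 4, the Layer-0 beam search, is where most of the work happens. A toy sketch of that step, with graph (node id to neighbor ids) and vectors as hypothetical stand-ins for a real index structure:
import heapq
import numpy as np

def search_layer(graph: dict[int, list[int]], vectors: np.ndarray,
                 entry_point: int, query: np.ndarray, ef: int = 50) -> list[int]:
    # Greedy beam search on one HNSW layer; ef is the beam width (ef_search at query time)
    dist = lambda i: float(np.linalg.norm(vectors[i] - query))
    visited = {entry_point}
    d0 = dist(entry_point)
    candidates = [(d0, entry_point)]   # min-heap: closest unexplored node first
    results = [(-d0, entry_point)]     # max-heap via negation: worst kept result on top
    while candidates:
        d, node = heapq.heappop(candidates)
        if d > -results[0][0]:
            break                      # nearest unexplored node is worse than the worst kept result
        for nb in graph[node]:
            if nb in visited:
                continue
            visited.add(nb)
            d_nb = dist(nb)
            if len(results) < ef or d_nb < -results[0][0]:
                heapq.heappush(candidates, (d_nb, nb))
                heapq.heappush(results, (-d_nb, nb))
                if len(results) > ef:
                    heapq.heappop(results)   # evict the current worst
    return [i for _, i in sorted((-d, i) for d, i in results)]
A wider beam (larger ef) explores more of the neighborhood before stopping; that is exactly the ef_search knob in the parameter table below.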
HNSW parameters:
| Parameter | Default | Effect |
|---|---|---|
| m (connections per node) | 16 | Higher = better recall, larger index, slower build |
| ef_construction (build beam width) | 200 | Higher = better index quality, slower build |
| ef_search (query beam width) | 50 | Higher = better recall, slower queries |
The recall-latency tradeoff:
At ef_search=50, you get 93% recall at ~5ms. At ef_search=400, you get 99.5% recall at ~25ms. Default of 50 is wrong for production RAG — at 93% recall, 7% of correct answers are missed. Set ef_search=100–200 for enterprise retrieval.
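To see this tradeoff on your own data, compute exact ground truth by brute force and sweep ef_search. A sketch using hnswlib (an assumption; any HNSW implementation works, and recall on random vectors will differ from real embeddings):
import numpy as np
import hnswlib

dim, n_docs, n_queries, k = 512, 50_000, 200, 10
docs = np.random.rand(n_docs, dim).astype(np.float32)
queries = np.random.rand(n_queries, dim).astype(np.float32)

# Exact (eKNN) ground truth via brute-force cosine similarity
docs_n = docs / np.linalg.norm(docs, axis=1, keepdims=True)
queries_n = queries / np.linalg.norm(queries, axis=1, keepdims=True)
truth = np.argsort(-(queries_n @ docs_n.T), axis=1)[:, :k]

# Approximate HNSW index
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=n_docs, M=16, ef_construction=200)
index.add_items(docs)

for ef in (50, 100, 200, 400):
    index.set_ef(ef)
    labels, _ = index.knn_query(queries, k=k)
    recall = np.mean([len(set(labels[i]) & set(truth[i])) / k for i in range(n_queries)])
    print(f"ef_search={ef}: recall@{k} = {recall:.3f}")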
eKNN — Exhaustive K-Nearest Neighbors
eKNN computes the exact distance from the query vector to every vector in the index. 100% recall — never misses a result. O(n) per query — impractical at scale.
When to use eKNN:
- Small datasets (under 10,000 vectors)
- Offline batch evaluation — compute ground truth for HNSW recall measurement
- High-stakes retrieval where approximate results are unacceptable and latency is not a constraint
Azure AI Search — Vector Search
from azure.search.documents.models import VectorizedQuery
# Generate query embedding
from openai import AzureOpenAI
openai_client = AzureOpenAI(...)
query = "What are FHA loan limits for 2025?"
embedding_response = openai_client.embeddings.create(
model="text-embedding-3-large",
input=query,
dimensions=512 # Matryoshka truncation
)
query_vector = embedding_response.data[0].embedding
# HNSW vector search
vector_query = VectorizedQuery(
vector=query_vector,
k_nearest_neighbors=50, # candidate set size (not final top-K)
fields="content_vector", # index field containing embeddings
exhaustive=False # False = HNSW (approximate), True = eKNN (exact)
)
results = client.search(
search_text=None, # no keyword search
vector_queries=[vector_query],
select=["chunk_id", "content", "doc_title", "section"],
top=10 # final results after scoring
)
for result in results:
print(f"Score: {result['@search.score']:.4f} | {result['doc_title']}")
eKNN for small indexes or ground truth:
vector_query = VectorizedQuery(
vector=query_vector,
k_nearest_neighbors=10,
fields="content_vector",
exhaustive=True # exact search — 100% recall, higher latency
)
Open Source — Qdrant HNSW
from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, Distance, SearchParams
client = QdrantClient("localhost", port=6333)
# Create collection with HNSW config
client.create_collection(
collection_name="mortgage-chunks",
vectors_config=VectorParams(
size=512, # embedding dimensions
distance=Distance.COSINE
),
hnsw_config={
"m": 16, # connections per node
"ef_construct": 200, # build quality
"full_scan_threshold": 10000 # switch to flat below this count
}
)
# Query with ef tuning
results = client.search(
collection_name="mortgage-chunks",
query_vector=query_vector,
limit=10,
search_params={"hnsw_ef": 128, "exact": False} # ef_search=128 for better recall
)
3. Hybrid Search — BM25 + Vector + RRF
Hybrid search runs BM25 and vector search in parallel and merges the ranked result lists. The merger algorithm is Reciprocal Rank Fusion (RRF).
Why Neither Alone Is Enough
BM25 scores zero overlap on paraphrases ("cash at settlement" shares no terms with "closing costs"), while vector search blurs exact identifiers like "CONV30" into semantically similar neighbors. Running both in parallel lets each cover the other's failure mode.
How RRF Works
RRF doesn't normalize or compare raw scores from BM25 and cosine similarity — they're on completely different scales and distributions. Instead, it uses rank positions only:
RRF_score(doc) = Σ 1 / (rank_i + k)
Where rank_i is the document's position in each result list and k is a smoothing constant (default 60 in Azure AI Search).
Why k=60: the smoothing constant prevents a #1 rank from overwhelming all other signals. With k=0, rank #1 scores 1.0 and rank #2 scores 0.5 — a 2x gap. With k=60, rank #1 scores 0.0164 and rank #2 scores 0.0161 — a 1.8% gap. Small differences in rank matter less; both signals get fair weight.
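RRF is simple enough to write out. A minimal merge over two ranked id lists, with k = 60 to match Azure's default (the doc ids are hypothetical):
def rrf_merge(ranked_lists: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    # Sum 1 / (rank + k) across every list a document appears in
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rank + k)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

bm25_ranking = ["doc_7", "doc_2", "doc_9"]    # best first
vector_ranking = ["doc_2", "doc_5", "doc_7"]
print(rrf_merge([bm25_ranking, vector_ranking]))
# doc_2 (rank 2 + rank 1) edges out doc_7 (rank 1 + rank 3): near the top of both lists beats #1 in just one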
Azure AI Search — Hybrid Search
from azure.search.documents.models import VectorizedQuery
# Hybrid: BM25 keyword + HNSW vector, merged with RRF
results = client.search(
search_text="CONV30 rate 720 credit score", # BM25 query
vector_queries=[
VectorizedQuery(
vector=query_vector,
k_nearest_neighbors=50,
fields="content_vector",
exhaustive=False
)
],
query_type="simple",
select=["chunk_id", "content", "doc_title", "section"],
top=10
# RRF fusion is automatic when both search_text and vector_queries are provided
)
for result in results:
print(f"RRF Score: {result['@search.score']:.4f} | {result['doc_title']}")
Hybrid with field filtering — scope the search to specific document types before hybrid runs:
results = client.search(
search_text="CONV30 rate 720 credit score",
vector_queries=[VectorizedQuery(vector=query_vector, k_nearest_neighbors=50, fields="content_vector")],
filter="doc_type eq 'rate_sheet' and doc_version eq '2026-Q1'", # pre-filter
top=10
)
Open Source — Qdrant Hybrid Search
from qdrant_client.models import (
    SparseVector, Prefetch, FusionQuery, Fusion
)
# Qdrant native hybrid search with RRF (Query API, Qdrant 1.10+)
results = client.query_points(
    collection_name="mortgage-chunks",
    prefetch=[
        Prefetch(
            query=SparseVector(indices=sparse_ids, values=sparse_weights),  # BM25-style sparse
            using="sparse",
            limit=20
        ),
        Prefetch(
            query=query_vector,  # dense vector
            using="dense",
            limit=20
        )
    ],
    query=FusionQuery(fusion=Fusion.RRF),  # merge with RRF
    limit=10
)
Elasticsearch hybrid search:
response = es.search(
index="mortgage-chunks",
body={
"query": {
"bool": {
"should": [
{
"multi_match": {
"query": "CONV30 rate 720 credit score",
"fields": ["content^2", "section"]
}
}
]
}
},
"knn": {
"field": "content_vector",
"query_vector": query_vector,
"k": 10,
"num_candidates": 50
},
"rank": {
"rrf": {
"window_size": 50,
"rank_constant": 60 # k parameter
}
},
"size": 10
}
)
4. Semantic Reranking — The Precision Layer
Hybrid search gives you a merged candidate set. Semantic reranking reorders it for precision. The reranker reads the query and each candidate together — full joint attention — and assigns a relevance score.
Azure AI Search — Semantic Ranker (L2 Reranking)
Azure's semantic ranker is a Microsoft-hosted cross-encoder, fine-tuned on Bing search data. It takes the top 50 results from BM25/hybrid and reorders them using a cross-encoder model.
from azure.search.documents.models import VectorizedQuery, QueryType, QueryCaptionType, QueryAnswerType
# Full pipeline: hybrid retrieval + semantic reranking
results = client.search(
search_text="FHA DTI limit compensating factors",
vector_queries=[
VectorizedQuery(
vector=query_vector,
k_nearest_neighbors=50,
fields="content_vector"
)
],
query_type=QueryType.SEMANTIC, # enables semantic ranker
semantic_configuration_name="mortgage-semantic-config",
query_caption=QueryCaptionType.EXTRACTIVE, # extract relevant passages
query_answer=QueryAnswerType.EXTRACTIVE, # extract direct answers
top=5 # final results after reranking
)
# Semantic results include rerank score + extracted captions
for result in results:
print(f"Rerank Score: {result['@search.reranker_score']:.4f}")
print(f"Content: {result['content'][:200]}")
if result.get('@search.captions'):
for caption in result['@search.captions']:
print(f"Caption: {caption.text}")
print()
# Extractive answers — direct answer passages from top result
answers = results.get_answers()
if answers:
for answer in answers:
print(f"Answer: {answer.text} (confidence: {answer.score:.3f})")
Configure semantic ranker — index definition:
from azure.search.documents.indexes.models import (
SearchIndex, SemanticConfiguration, SemanticSearch,
SemanticPrioritizedFields, SemanticField
)
semantic_config = SemanticConfiguration(
name="mortgage-semantic-config",
prioritized_fields=SemanticPrioritizedFields(
title_field=SemanticField(field_name="doc_title"),
keywords_fields=[SemanticField(field_name="section")],
content_fields=[SemanticField(field_name="content")]
)
)
index = SearchIndex(
name="mortgage-rag-index",
fields=[...],
semantic_search=SemanticSearch(configurations=[semantic_config])
)
Open Source — Cross-Encoder Reranker (HuggingFace)
from sentence_transformers import CrossEncoder
# ms-marco models are trained on MS MARCO passage retrieval dataset
# MiniLM-L-6 = fast (CPU-runnable), MiniLM-L-12 = better precision
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
def rerank(query: str, candidates: list[dict], top_k: int = 5, threshold: float = 0.0) -> list[dict]:
# Prepare query-document pairs
pairs = [(query, c["content"]) for c in candidates]
# Score all pairs — joint attention over query + document
scores = reranker.predict(pairs)
# Sort by score descending
ranked = sorted(
zip(scores, candidates),
key=lambda x: x[0],
reverse=True
)
# Apply threshold filter
return [
{**doc, "rerank_score": float(score)}
for score, doc in ranked[:top_k]
if score >= threshold
]
# Usage
hybrid_candidates = get_hybrid_results(query, top_k=50)
final_results = rerank(query, hybrid_candidates, top_k=5, threshold=0.0)
Cohere Rerank API — managed cross-encoder, no GPU required:
import cohere
co = cohere.Client(COHERE_API_KEY)
rerank_results = co.rerank(
query="FHA DTI limit compensating factors",
documents=[c["content"] for c in hybrid_candidates],
top_n=5,
model="rerank-english-v3.0"
)
for result in rerank_results.results:
print(f"Index: {result.index} | Score: {result.relevance_score:.4f}")
5. Multimodal Search — Text + Image + Voice
Enterprise knowledge isn't only text. Diagrams in architecture docs, photos in property appraisals, voice queries from mobile users — all require multimodal search.
Text-to-Image Search
Multimodal embedding models (CLIP, Azure AI Vision) embed text and images into the same vector space. A text query can retrieve relevant images, and an image can retrieve relevant text — cross-modal retrieval.
Azure AI Vision multimodal embeddings:
from azure.ai.vision.imageanalysis import ImageAnalysisClient
from azure.core.credentials import AzureKeyCredential
import httpx
vision_client = ImageAnalysisClient(
endpoint="https://your-vision.cognitiveservices.azure.com",
credential=AzureKeyCredential(VISION_KEY)
)
# Embed an image for indexing
def embed_image(image_url: str) -> list[float]:
    # Optional caption analysis, useful as searchable metadata alongside the vector
    analysis = vision_client.analyze_from_url(
        image_url=image_url,
        visual_features=["Caption", "DenseCaptions"]
    )
    # The embedding itself comes from the Azure AI Vision vectorizeImage endpoint
    response = httpx.post(
        f"{VISION_ENDPOINT}/computervision/retrieval:vectorizeImage?api-version=2023-02-01-preview",
        headers={"Ocp-Apim-Subscription-Key": VISION_KEY},
        json={"url": image_url}
    )
    return response.json()["vector"]
# Embed a text query — same embedding space as images
def embed_text_for_image_search(text: str) -> list[float]:
response = httpx.post(
f"{VISION_ENDPOINT}/computervision/retrieval:vectorizeText?api-version=2023-02-01-preview",
headers={"Ocp-Apim-Subscription-Key": VISION_KEY},
json={"text": text}
)
return response.json()["vector"]
# Text query → retrieve images
query_vector = embed_text_for_image_search("property with detached garage FHA eligible")
image_results = client.search(
search_text=None,
vector_queries=[VectorizedQuery(
vector=query_vector,
k_nearest_neighbors=10,
fields="image_vector"
)],
top=5
)
Open Source — CLIP:
from transformers import CLIPProcessor, CLIPModel
from PIL import Image
import torch
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
# Embed image
image = Image.open("property_photo.jpg")
image_inputs = processor(images=image, return_tensors="pt")
image_embedding = model.get_image_features(**image_inputs)
image_embedding = image_embedding / image_embedding.norm(dim=-1, keepdim=True) # normalize
# Embed text query — same CLIP space
text_inputs = processor(text=["FHA eligible property with garage"], return_tensors="pt")
text_embedding = model.get_text_features(**text_inputs)
text_embedding = text_embedding / text_embedding.norm(dim=-1, keepdim=True)
# Cosine similarity
similarity = torch.cosine_similarity(text_embedding, image_embedding)
Voice / Audio Search
Voice queries follow a Speech-to-Text → Embedding → Vector Search pipeline. The STT step is where accuracy matters most — domain-specific vocabulary ("RESPA", "CONV30", "mTLS") requires custom vocabulary or domain-adapted STT models.
Azure Speech → Search pipeline:
import azure.cognitiveservices.speech as speechsdk
def voice_to_search(audio_file: str) -> list[dict]:
# Step 1: Speech to Text
speech_config = speechsdk.SpeechConfig(
subscription=SPEECH_KEY,
region=SPEECH_REGION
)
# Custom speech model for domain vocabulary (optional but recommended)
speech_config.endpoint_id = CUSTOM_SPEECH_ENDPOINT_ID
audio_config = speechsdk.audio.AudioConfig(filename=audio_file)
recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)
result = recognizer.recognize_once()
transcribed_text = result.text
print(f"Transcribed: {transcribed_text}")
# Step 2: Embed transcribed text
query_vector = embed_text(transcribed_text)
# Step 3: Hybrid search — same pipeline as text
return hybrid_search(transcribed_text, query_vector, top_k=5)
Open Source — Whisper (OpenAI):
import whisper
model = whisper.load_model("large-v3") # or "medium" for speed/accuracy tradeoff
def voice_to_text(audio_file: str) -> str:
result = model.transcribe(
audio_file,
language="en",
initial_prompt="mortgage FHA VA conventional loan RESPA TRID DTI LTV"
# initial_prompt biases toward domain vocabulary — critical for accuracy
)
return result["text"]
The initial_prompt trick: Whisper uses the initial_prompt as context to bias transcription toward expected vocabulary. Without it, "RESPA" transcribes as "Respa" or "Resp-a." With it, accuracy on domain terms improves significantly.
6. Full Pipeline — All Search Types Together
This is how all the search algorithms compose in a production RAG system: query (voice queries are transcribed to text first) → query embedding → hybrid BM25 + HNSW retrieval merged with RRF → semantic reranking → relevance threshold → top-K chunks to the LLM. The MortgageIQ pipeline further down implements this flow end to end.
Performance Comparison
Latency (p50, 512-dim vectors, 50K documents)
| Algorithm | Azure AI Search | Qdrant (GPU) | Elasticsearch | Notes |
|---|---|---|---|---|
| BM25 keyword | 5–15ms | — | 5–20ms | Inverted index, near-instant |
| HNSW vector (ef=50) | 20–40ms | 10–20ms | 25–45ms | Approximate, fast |
| HNSW vector (ef=200) | 40–80ms | 20–40ms | 50–90ms | Better recall, higher latency |
| eKNN (exact) | 150–500ms | 80–200ms | 200–600ms | Scales with corpus size |
| Hybrid BM25 + HNSW + RRF | 30–60ms | 20–50ms | 40–80ms | Parallel execution |
| Hybrid + Semantic Reranker | 130–200ms | 120–200ms* | 150–250ms* | +100ms for cross-encoder |
*Open source semantic reranker latency depends on GPU availability and batch size.
Recall@10 Benchmark (1M vectors, 768-dim, BEIR dataset)
| Algorithm | Recall@10 | Precision@10 | Notes |
|---|---|---|---|
| BM25 only | 0.71 | 0.68 | Strong on keyword-heavy queries |
| HNSW only (ef=50) | 0.78 | 0.74 | Good semantic, misses exact terms |
| HNSW only (ef=200) | 0.85 | 0.81 | Better recall, same semantic gap |
| Hybrid BM25 + HNSW | 0.89 | 0.86 | Best of both, consistent |
| Hybrid + Reranker | 0.93 | 0.91 | Production standard |
| eKNN + Reranker | 0.95 | 0.93 | Maximum precision, highest latency |
Algorithm Selection Guide
| Query Type | Recommended Algorithm | Why |
|---|---|---|
| Exact term / code lookup | BM25 | "CONV30" must exact-match |
| Semantic / natural language | HNSW vector | Vocabulary gap handled |
| Mixed (most production queries) | Hybrid BM25 + HNSW + RRF | Covers both failure modes |
| High precision required | Hybrid + Semantic Reranker | +5–8 points precision |
| Maximum precision (offline) | eKNN + Cross-Encoder | No latency constraint |
| Text + image corpus | Multimodal CLIP embedding | Same vector space |
| Voice input | STT → Hybrid | Transcription first |
| Regulated industry | Hybrid + Reranker + Threshold | Auditability + precision |
Azure AI Search vs Open Source — Full Comparison
| Capability | Azure AI Search | Elasticsearch | Qdrant |
|---|---|---|---|
| BM25 keyword | ✓ Native | ✓ Native | Via sparse vectors |
| HNSW vector | ✓ Native | ✓ Native (8.x+) | ✓ Native, configurable |
| eKNN exact | ✓ exhaustive=True | ✓ exact=True | ✓ exact=True |
| Hybrid BM25 + vector | ✓ Native, one API call | ✓ Native (8.x+) | ✓ Native (1.7+) |
| RRF fusion | ✓ Automatic | ✓ Configurable k | ✓ Configurable |
| Semantic reranker | ✓ Managed (Microsoft model) | Via Cohere/Voyage plugin | External model |
| Multimodal (text+image) | ✓ Azure AI Vision integration | Via custom vectors | Via CLIP custom vectors |
| Voice / STT | ✓ Azure Speech integration | External | External |
| Managed infra | ✓ Fully managed | Self-hosted or Elastic Cloud | Self-hosted or Qdrant Cloud |
| Permission trimming | ✓ Native AAD / SharePoint ACL | Custom filter | Custom filter |
| Compliance (SOC2, HIPAA) | ✓ Azure certified | Elastic Cloud only | Self-managed |
| Cost model | Per SKU + per query | Per node-hour | Per node-hour / cloud credits |
| Best for | Azure enterprise, .NET shops | Self-hosted, OSS-first | Vector-native, high-performance |
What We Run at MortgageIQ
Query pipeline:
async def search(query: str, query_type: str = "text") -> list[dict]:
# Step 1: Voice → text (if needed)
if query_type == "voice":
query = await transcribe(query) # Azure Speech + domain vocabulary
# Step 2: Embed query
query_vector = await embed(query) # text-embedding-3-large, 512-dim
# Step 3: Hybrid search — BM25 + HNSW + RRF
candidates = await client.search(
search_text=query,
vector_queries=[VectorizedQuery(
vector=query_vector,
k_nearest_neighbors=50,
fields="content_vector",
exhaustive=False # HNSW, ef=200 configured at index level
)],
query_type=QueryType.SEMANTIC, # Step 4: semantic reranker
semantic_configuration_name="mortgage-semantic-config",
query_caption=QueryCaptionType.EXTRACTIVE,
query_answer=QueryAnswerType.EXTRACTIVE,
filter=build_filter(query), # doc_type, version, section scoping
top=5
)
# Step 5: Threshold — don't pass low-confidence chunks to LLM
filtered = [r async for r in candidates if r["@search.reranker_score"] > 0.7]
if not filtered:
return [] # surface "no reliable information" — not hallucination
return filtered
Results after moving from pure vector to full hybrid + semantic reranker:
- Recall on loan program code queries (exact term): 82% → 99% (BM25 addition)
- Precision on semantic queries: 74% → 91% (semantic reranker addition)
- False answer rate (hallucination from low-relevance context): 8% → 1.2% (threshold filter)
- Voice query accuracy on domain terms: 61% → 89% (domain vocabulary in STT prompt)
Key Takeaways
- BM25 is not legacy — it's essential. Every production RAG system needs BM25 alongside vector search. Exact term queries on product codes, IDs, and named entities will never be well-served by approximate nearest neighbor alone.
- HNSW default settings are wrong for RAG. The default ef_search=50 gives 93% recall. Set ef_search=100–200 for production retrieval — the latency cost is 10–20ms, and the 7% missed answers compound across every query.
- RRF is the right merger for hybrid search — not score averaging or linear combination. Scores from BM25 and cosine similarity are incomparable; rank positions are not.
- Azure AI Search's semantic ranker is a managed cross-encoder — it adds ~100ms and 5–8 precision points. Worth it for every use case where accuracy matters more than raw throughput.
- Multimodal search requires a shared embedding space — text and image queries only work together if both are embedded by the same multimodal model (CLIP, Azure AI Vision). You can't mix embedding models across modalities.
- Voice search accuracy on domain vocabulary requires an initial prompt or custom STT model — without it, "RESPA" becomes "Respa" and retrieval fails silently.
Coming Up in This Series
- Day 6: Evaluation — RAGAS, context recall, answer faithfulness, and how to run a retrieval A/B test
- Day 7: Production Patterns — caching, index freshness, multi-tenant isolation, and cost governance