ai-ml · April 19, 2026 · Tags: rag, azure-ai-search, vector-search, hybrid-search, bm25, hnsw, rrf, semantic-ranker, multimodal, qdrant, elasticsearch, enterprise-ai

Every Search Algorithm Explained — Azure AI Search vs Open Source, With Code

A complete guide to every search algorithm used in enterprise RAG — keyword BM25, vector HNSW/eKNN, hybrid RRF, semantic reranking, and multimodal search — with Python code for Azure AI Search and open source stacks, and a performance comparison of each.

There are 6 different search algorithms your RAG system can use. Most teams pick one, call it "vector search," and wonder why retrieval misses the obvious answers.

The algorithm you choose — and how you combine them — determines whether your RAG system finds "closing costs" when a user asks about "cash at settlement," whether it finds loan program code "CONV30" in a sea of 50,000 documents, and whether a voice query gets the same precision as a text query.

This is the complete map — every algorithm, how it works at the math level, how it's implemented in Azure AI Search and open source, performance benchmarks, and when to use each.


The Search Algorithm Landscape


1. Keyword Search — BM25

BM25 (Best Match 25) is the gold standard for keyword-based retrieval. It's the algorithm behind Elasticsearch, Azure AI Search's full-text mode, Solr, and every traditional search system built in the last two decades.

How BM25 Works

BM25 scores a document against a query by measuring how often query terms appear in the document, weighted by how rare those terms are across the entire corpus.

BM25(D, Q) = Σ IDF(qᵢ) × [f(qᵢ, D) × (k₁ + 1)] / [f(qᵢ, D) + k₁ × (1 - b + b × |D|/avgdl)]

Where:

  • IDF(qᵢ) = Inverse Document Frequency — rare terms score higher
  • f(qᵢ, D) = Term frequency in document D
  • k₁ = term frequency saturation (typically 1.2–2.0) — diminishing returns on repeated terms
  • b = length normalization (typically 0.75) — penalizes long documents
  • |D|/avgdl = document length relative to corpus average

The saturation effect: the k₁ parameter means doubling the term frequency doesn't double the score. A document with "FHA" appearing 10 times doesn't score 10x better than one with "FHA" appearing once. This prevents keyword stuffing from dominating results.
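The saturation effect is easy to see numerically. Below is a toy scoring function for a single query term, using the formula above (the corpus counts are hypothetical, and this is an illustration, not a drop-in BM25 implementation):

```python
import math

def bm25_term_score(tf: int, idf: float, doc_len: int, avgdl: float,
                    k1: float = 1.2, b: float = 0.75) -> float:
    """One summand of the BM25 formula: the contribution of a single query term."""
    length_norm = k1 * (1 - b + b * doc_len / avgdl)
    return idf * (tf * (k1 + 1)) / (tf + length_norm)

# Toy IDF: suppose "FHA" appears in 500 of 50,000 documents
idf = math.log(50_000 / 500)

once = bm25_term_score(tf=1, idf=idf, doc_len=300, avgdl=300)
ten_times = bm25_term_score(tf=10, idf=idf, doc_len=300, avgdl=300)
print(f"tf=1 -> {once:.3f}, tf=10 -> {ten_times:.3f}, ratio {ten_times / once:.2f}x")
```

With k₁ = 1.2 and an average-length document, ten occurrences score only about 2x a single occurrence, not 10x.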

Where BM25 wins:

  • Exact term matching — loan codes ("CONV30", "FHA203K"), product IDs, named entities
  • Rare, high-signal terms — "RESPA", "TRID", "mTLS"
  • When the user knows the exact terminology

Where BM25 fails:

  • Vocabulary mismatch — "cash upfront" vs "closing costs" scores zero overlap
  • Synonyms and paraphrases
  • Conceptual queries — "what makes a loan risky?" has no obvious keyword targets

Azure AI Search — Full-Text / BM25

from azure.search.documents import SearchClient
from azure.core.credentials import AzureKeyCredential

client = SearchClient(
    endpoint="https://your-search.search.windows.net",
    index_name="mortgage-rag-index",
    credential=AzureKeyCredential(API_KEY)
)

# Pure BM25 keyword search
results = client.search(
    search_text="FHA loan limits 2025",
    query_type="simple",           # BM25 scoring
    search_fields=["content", "title", "section"],
    select=["chunk_id", "content", "doc_title", "section"],
    top=10
)

for result in results:
    print(f"Score: {result['@search.score']:.3f} | {result['doc_title']} — {result['section']}")

BM25 with field boosting — weight title matches higher than body matches:

results = client.search(
    search_text="FHA loan limits 2025",
    query_type="full",             # Lucene query syntax
    search_fields=["title^3", "section^2", "content"],  # boost title 3x
    top=10
)

Open Source — Elasticsearch / OpenSearch

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# BM25 search — default in Elasticsearch
response = es.search(
    index="mortgage-chunks",
    body={
        "query": {
            "multi_match": {
                "query": "FHA loan limits 2025",
                "fields": ["title^3", "section^2", "content"],
                "type": "best_fields",
                "tie_breaker": 0.3
            }
        },
        "size": 10
    }
)

Qdrant sparse vectors (BM25-compatible):

from qdrant_client import QdrantClient
from qdrant_client.models import SparseVector, NamedSparseVector

qdrant = QdrantClient("localhost", port=6333)

# Qdrant supports sparse vectors for BM25-style retrieval
# Use a sparse encoder (SPLADE, BM25) to generate sparse vectors
sparse_vector = encode_bm25("FHA loan limits 2025")  # returns {token_id: score}

qdrant.search(
    collection_name="mortgage-chunks",
    query_vector=NamedSparseVector(
        name="sparse",
        vector=SparseVector(
            indices=list(sparse_vector.keys()),   # dict views must be materialized
            values=list(sparse_vector.values())
        )
    ),
    limit=10
)

2. Vector Search — HNSW and eKNN

Vector search finds documents whose embedding vectors are nearest to the query vector in high-dimensional space. The challenge: searching 50,000 vectors for the nearest neighbors naively is O(n×d) — too slow for production query latency.

Two algorithms solve this: HNSW (approximate, fast) and eKNN (exact, slow).

HNSW — Hierarchical Navigable Small World

HNSW builds a multi-layer graph where each node (vector) connects to its nearest neighbors. Higher layers are sparser — long-range connections for fast traversal. Lower layers are denser — precise local neighborhood search.

Query traversal:

  1. Enter at the top layer — pick the best-connected entry node
  2. Greedily move toward the query vector at each layer
  3. When you can't improve at the current layer, drop to the next layer
  4. At Layer 0, perform local beam search among the dense neighborhood
  5. Return top-K results
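The greedy step at the heart of this traversal can be sketched on a tiny hand-built graph. This shows a single layer only; real HNSW adds the layer hierarchy, beam search at Layer 0, and graph construction, and the coordinates here are made up for illustration:

```python
from math import dist  # Euclidean distance, Python 3.8+

# Toy single-layer graph: node id -> 2-D coordinates, and neighbor lists
vectors = {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (2.0, 0.5), 3: (3.0, 1.0), 4: (0.5, 2.0)}
graph = {0: [1, 4], 1: [0, 2], 2: [1, 3], 3: [2], 4: [0]}

def greedy_search(query: tuple, entry: int) -> int:
    """Greedy walk: hop to whichever neighbor is closest to the query;
    stop when no neighbor improves (the 'drop a layer' condition in HNSW)."""
    current = entry
    while True:
        candidates = graph[current] + [current]
        best = min(candidates, key=lambda n: dist(vectors[n], query))
        if best == current:
            return current
        current = best

print(greedy_search((2.9, 0.9), entry=0))  # walks 0 -> 1 -> 2 -> 3, returns 3
```

Each hop roughly halves the remaining distance, which is why HNSW queries touch only a logarithmic fraction of the index.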

HNSW parameters:

| Parameter | Default | Effect |
| --- | --- | --- |
| m (connections per node) | 16 | Higher = better recall, larger index, slower build |
| ef_construction (build beam width) | 200 | Higher = better index quality, slower build |
| ef_search (query beam width) | 50 | Higher = better recall, slower queries |

The recall-latency tradeoff:

At ef_search=50, you get 93% recall at ~5ms. At ef_search=400, you get 99.5% recall at ~25ms. Default of 50 is wrong for production RAG — at 93% recall, 7% of correct answers are missed. Set ef_search=100–200 for enterprise retrieval.

eKNN — Exhaustive K-Nearest Neighbors

eKNN computes the exact distance from the query vector to every vector in the index. 100% recall — never misses a result. O(n) per query — impractical at scale.

When to use eKNN:

  • Small datasets (under 10,000 vectors)
  • Offline batch evaluation — compute ground truth for HNSW recall measurement
  • High-stakes retrieval where approximate results are unacceptable and latency is not a constraint
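A brute-force eKNN pass is short enough to write in plain Python, and it doubles as the ground truth for the recall@K measurement mentioned above (the vectors are random toys and the "ANN" result is simulated, so only the mechanics are meaningful):

```python
import random
from math import sqrt

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

def eknn(query: list[float], vectors: list[list[float]], k: int = 10) -> list[int]:
    """Exact search: score every vector, O(n*d) per query, 100% recall."""
    ranked = sorted(range(len(vectors)), key=lambda i: cosine(query, vectors[i]), reverse=True)
    return ranked[:k]

def recall_at_k(approx_ids: list[int], exact_ids: list[int]) -> float:
    """Fraction of the exact top-K that an approximate index recovered."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

random.seed(0)
corpus = [[random.gauss(0, 1) for _ in range(32)] for _ in range(1000)]
query = [random.gauss(0, 1) for _ in range(32)]

truth = eknn(query, corpus, k=10)                   # eKNN ground truth
strays = [i for i in range(len(corpus)) if i not in truth][:2]
simulated_ann = truth[:8] + strays                  # an ANN index that missed 2
print(f"recall@10 = {recall_at_k(simulated_ann, truth):.2f}")  # 0.80
```

In practice you run eKNN offline over a query sample, run HNSW at each candidate ef_search, and pick the smallest ef that clears your recall target.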

Azure AI Search — Vector Search

from azure.search.documents.models import VectorizedQuery

# Generate query embedding
from openai import AzureOpenAI
openai_client = AzureOpenAI(...)

query = "What are FHA loan limits for 2025?"
embedding_response = openai_client.embeddings.create(
    model="text-embedding-3-large",
    input=query,
    dimensions=512  # Matryoshka truncation
)
query_vector = embedding_response.data[0].embedding

# HNSW vector search
vector_query = VectorizedQuery(
    vector=query_vector,
    k_nearest_neighbors=50,        # candidate set size (not final top-K)
    fields="content_vector",       # index field containing embeddings
    exhaustive=False               # False = HNSW (approximate), True = eKNN (exact)
)

results = client.search(
    search_text=None,              # no keyword search
    vector_queries=[vector_query],
    select=["chunk_id", "content", "doc_title", "section"],
    top=10                         # final results after scoring
)

for result in results:
    print(f"Score: {result['@search.score']:.4f} | {result['doc_title']}")

eKNN for small indexes or ground truth:

vector_query = VectorizedQuery(
    vector=query_vector,
    k_nearest_neighbors=10,
    fields="content_vector",
    exhaustive=True    # exact search — 100% recall, higher latency
)

Open Source — Qdrant HNSW

from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, Distance

client = QdrantClient("localhost", port=6333)

# Create collection with HNSW config
client.create_collection(
    collection_name="mortgage-chunks",
    vectors_config=VectorParams(
        size=512,                  # embedding dimensions
        distance=Distance.COSINE
    ),
    hnsw_config={
        "m": 16,                   # connections per node
        "ef_construct": 200,       # build quality
        "full_scan_threshold": 10000  # switch to flat below this count
    }
)

# Query with ef tuning
results = client.search(
    collection_name="mortgage-chunks",
    query_vector=query_vector,
    limit=10,
    search_params={"hnsw_ef": 128, "exact": False}  # ef_search=128 for better recall
)

3. Hybrid Search — BM25 + Vector + RRF

Hybrid search runs BM25 and vector search in parallel and merges the ranked result lists. The merger algorithm is Reciprocal Rank Fusion (RRF).

Why Neither Alone Is Enough

BM25 fails on vocabulary mismatch ("cash upfront" and "closing costs" share no terms), while pure vector search can blur exact, high-signal tokens like loan program codes. Running both retrievers in parallel covers each one's failure mode.

How RRF Works

RRF doesn't normalize or compare raw scores from BM25 and cosine similarity — they're on completely different scales and distributions. Instead, it uses rank positions only:

RRF_score(doc) = Σ 1 / (rank_i + k)

Where rank_i is the document's position in each result list and k is a smoothing constant (default 60 in Azure AI Search).

Why k=60: the smoothing constant prevents a #1 rank from overwhelming all other signals. With k=0, rank #1 scores 1.0 and rank #2 scores 0.5 — a 2x gap. With k=60, rank #1 scores 1/61 ≈ 0.0164 and rank #2 scores 1/62 ≈ 0.0161 — a gap of roughly 1.6%. Small differences in rank matter less; both signals get fair weight.
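A minimal RRF implementation makes the mechanics concrete (the document ids below are hypothetical; k=60 matches the Azure AI Search default):

```python
def rrf_fuse(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists by summing 1 / (rank + k) per document (ranks start at 1)."""
    scores: dict[str, float] = {}
    for results in ranked_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rank + k)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical ids: BM25 and vector search each return a ranked list
bm25_top = ["rate_sheet_07", "faq_12", "guide_03"]
vector_top = ["guide_03", "rate_sheet_07", "policy_44"]

fused = rrf_fuse([bm25_top, vector_top])
print(fused[0])  # rate_sheet_07: in the top 2 of BOTH lists
```

Documents near the top of both lists accumulate two reciprocal-rank contributions and outrank documents that only one retriever surfaced, which is exactly the behavior hybrid search needs.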

Azure AI Search — Hybrid Search

from azure.search.documents.models import VectorizedQuery

# Hybrid: BM25 keyword + HNSW vector, merged with RRF
results = client.search(
    search_text="CONV30 rate 720 credit score",    # BM25 query
    vector_queries=[
        VectorizedQuery(
            vector=query_vector,
            k_nearest_neighbors=50,
            fields="content_vector",
            exhaustive=False
        )
    ],
    query_type="simple",
    select=["chunk_id", "content", "doc_title", "section"],
    top=10
    # RRF fusion is automatic when both search_text and vector_queries are provided
)

for result in results:
    print(f"RRF Score: {result['@search.score']:.4f} | {result['doc_title']}")

Hybrid with field filtering — scope the search to specific document types before hybrid runs:

results = client.search(
    search_text="CONV30 rate 720 credit score",
    vector_queries=[VectorizedQuery(vector=query_vector, k_nearest_neighbors=50, fields="content_vector")],
    filter="doc_type eq 'rate_sheet' and doc_version eq '2026-Q1'",  # pre-filter
    top=10
)

Open Source — Qdrant Hybrid Search

from qdrant_client.models import (
    SparseVector, NamedVector, NamedSparseVector,
    Prefetch, FusionQuery, Fusion
)

# Qdrant native hybrid search with RRF (v1.7+)
results = client.query_points(
    collection_name="mortgage-chunks",
    prefetch=[
        Prefetch(
            query=NamedSparseVector(      # BM25-style sparse
                name="sparse",
                vector=SparseVector(indices=sparse_ids, values=sparse_weights)
            ),
            limit=20
        ),
        Prefetch(
            query=query_vector,           # dense vector
            using="dense",
            limit=20
        )
    ],
    query=FusionQuery(fusion=Fusion.RRF), # merge with RRF
    limit=10
)

Elasticsearch hybrid search:

response = es.search(
    index="mortgage-chunks",
    body={
        "query": {
            "bool": {
                "should": [
                    {
                        "multi_match": {
                            "query": "CONV30 rate 720 credit score",
                            "fields": ["content^2", "section"]
                        }
                    }
                ]
            }
        },
        "knn": {
            "field": "content_vector",
            "query_vector": query_vector,
            "k": 10,
            "num_candidates": 50
        },
        "rank": {
            "rrf": {
                "window_size": 50,
                "rank_constant": 60      # k parameter
            }
        },
        "size": 10
    }
)

4. Semantic Reranking — The Precision Layer

Hybrid search gives you a merged candidate set. Semantic reranking reorders it for precision. The reranker reads the query and each candidate together — full joint attention — and assigns a relevance score.

Azure AI Search — Semantic Ranker (L2 Reranking)

Azure's semantic ranker is a Microsoft-hosted cross-encoder, fine-tuned on Bing search data. It takes the top 50 results from BM25/hybrid and reorders them using a cross-encoder model.

from azure.search.documents.models import VectorizedQuery, QueryType, QueryCaptionType, QueryAnswerType

# Full pipeline: hybrid retrieval + semantic reranking
results = client.search(
    search_text="FHA DTI limit compensating factors",
    vector_queries=[
        VectorizedQuery(
            vector=query_vector,
            k_nearest_neighbors=50,
            fields="content_vector"
        )
    ],
    query_type=QueryType.SEMANTIC,             # enables semantic ranker
    semantic_configuration_name="mortgage-semantic-config",
    query_caption=QueryCaptionType.EXTRACTIVE, # extract relevant passages
    query_answer=QueryAnswerType.EXTRACTIVE,   # extract direct answers
    top=5                                      # final results after reranking
)

# Semantic results include rerank score + extracted captions
for result in results:
    print(f"Rerank Score: {result['@search.reranker_score']:.4f}")
    print(f"Content: {result['content'][:200]}")
    if result.get('@search.captions'):
        for caption in result['@search.captions']:
            print(f"Caption: {caption.text}")
    print()

# Extractive answers — direct answer passages from top result
answers = results.get_answers()
if answers:
    for answer in answers:
        print(f"Answer: {answer.text} (confidence: {answer.score:.3f})")

Configure semantic ranker — index definition:

from azure.search.documents.indexes.models import (
    SearchIndex, SemanticConfiguration, SemanticSearch,
    SemanticPrioritizedFields, SemanticField
)

semantic_config = SemanticConfiguration(
    name="mortgage-semantic-config",
    prioritized_fields=SemanticPrioritizedFields(
        title_field=SemanticField(field_name="doc_title"),
        keywords_fields=[SemanticField(field_name="section")],
        content_fields=[SemanticField(field_name="content")]
    )
)

index = SearchIndex(
    name="mortgage-rag-index",
    fields=[...],
    semantic_search=SemanticSearch(configurations=[semantic_config])
)

Open Source — Cross-Encoder Reranker (HuggingFace)

from sentence_transformers import CrossEncoder

# ms-marco models are trained on MS MARCO passage retrieval dataset
# MiniLM-L-6 = fast (CPU-runnable), MiniLM-L-12 = better precision
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[dict], top_k: int = 5, threshold: float = 0.0) -> list[dict]:
    # Prepare query-document pairs
    pairs = [(query, c["content"]) for c in candidates]
    
    # Score all pairs — joint attention over query + document
    scores = reranker.predict(pairs)
    
    # Sort by score descending
    ranked = sorted(
        zip(scores, candidates),
        key=lambda x: x[0],
        reverse=True
    )
    
    # Apply threshold filter
    return [
        {**doc, "rerank_score": float(score)}
        for score, doc in ranked[:top_k]
        if score >= threshold
    ]

# Usage
hybrid_candidates = get_hybrid_results(query, top_k=50)
final_results = rerank(query, hybrid_candidates, top_k=5, threshold=0.0)

Cohere Rerank API — managed cross-encoder, no GPU required:

import cohere

co = cohere.Client(COHERE_API_KEY)

rerank_results = co.rerank(
    query="FHA DTI limit compensating factors",
    documents=[c["content"] for c in hybrid_candidates],
    top_n=5,
    model="rerank-english-v3.0"
)

for result in rerank_results.results:
    print(f"Index: {result.index} | Score: {result.relevance_score:.4f}")

5. Multimodal Search — Text + Image + Voice

Enterprise knowledge isn't only text. Diagrams in architecture docs, photos in property appraisals, voice queries from mobile users — all require multimodal search.

Text-to-Image Search

Multimodal embedding models (CLIP, Azure AI Vision) embed text and images into the same vector space. A text query can retrieve relevant images, and an image can retrieve relevant text — cross-modal retrieval.

Azure AI Vision multimodal embeddings:

import httpx

# Embed an image for indexing via the Azure AI Vision multimodal
# retrieval (vectorize) endpoint — images and text land in the same space
def embed_image(image_url: str) -> list[float]:
    response = httpx.post(
        f"{VISION_ENDPOINT}/computervision/retrieval:vectorizeImage?api-version=2023-02-01-preview",
        headers={"Ocp-Apim-Subscription-Key": VISION_KEY},
        json={"url": image_url}
    )
    return response.json()["vector"]

# Embed a text query — same embedding space as images
def embed_text_for_image_search(text: str) -> list[float]:
    response = httpx.post(
        f"{VISION_ENDPOINT}/computervision/retrieval:vectorizeText?api-version=2023-02-01-preview",
        headers={"Ocp-Apim-Subscription-Key": VISION_KEY},
        json={"text": text}
    )
    return response.json()["vector"]

# Text query → retrieve images
query_vector = embed_text_for_image_search("property with detached garage FHA eligible")
image_results = client.search(
    search_text=None,
    vector_queries=[VectorizedQuery(
        vector=query_vector,
        k_nearest_neighbors=10,
        fields="image_vector"
    )],
    top=5
)

Open Source — CLIP:

from transformers import CLIPProcessor, CLIPModel
from PIL import Image
import torch

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

# Embed image
image = Image.open("property_photo.jpg")
image_inputs = processor(images=image, return_tensors="pt")
image_embedding = model.get_image_features(**image_inputs)
image_embedding = image_embedding / image_embedding.norm(dim=-1, keepdim=True)  # normalize

# Embed text query — same CLIP space
text_inputs = processor(text=["FHA eligible property with garage"], return_tensors="pt")
text_embedding = model.get_text_features(**text_inputs)
text_embedding = text_embedding / text_embedding.norm(dim=-1, keepdim=True)

# Cosine similarity
similarity = torch.cosine_similarity(text_embedding, image_embedding)

Voice / Audio Search

Voice queries follow a Speech-to-Text → Embedding → Vector Search pipeline. The STT step is where accuracy matters most — domain-specific vocabulary ("RESPA", "CONV30", "mTLS") requires custom vocabulary or domain-adapted STT models.

Azure Speech → Search pipeline:

import azure.cognitiveservices.speech as speechsdk

def voice_to_search(audio_file: str) -> list[dict]:
    # Step 1: Speech to Text
    speech_config = speechsdk.SpeechConfig(
        subscription=SPEECH_KEY,
        region=SPEECH_REGION
    )
    # Custom speech model for domain vocabulary (optional but recommended)
    speech_config.endpoint_id = CUSTOM_SPEECH_ENDPOINT_ID
    
    audio_config = speechsdk.audio.AudioConfig(filename=audio_file)
    recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)
    
    result = recognizer.recognize_once()
    transcribed_text = result.text
    print(f"Transcribed: {transcribed_text}")
    
    # Step 2: Embed transcribed text
    query_vector = embed_text(transcribed_text)
    
    # Step 3: Hybrid search — same pipeline as text
    return hybrid_search(transcribed_text, query_vector, top_k=5)

Open Source — Whisper (OpenAI):

import whisper

model = whisper.load_model("large-v3")  # or "medium" for speed/accuracy tradeoff

def voice_to_text(audio_file: str) -> str:
    result = model.transcribe(
        audio_file,
        language="en",
        initial_prompt="mortgage FHA VA conventional loan RESPA TRID DTI LTV"
        # initial_prompt biases toward domain vocabulary — critical for accuracy
    )
    return result["text"]

The initial_prompt trick: Whisper uses the initial_prompt as context to bias transcription toward expected vocabulary. Without it, "RESPA" transcribes as "Respa" or "Resp-a." With it, accuracy on domain terms improves significantly.


6. Full Pipeline — All Search Types Together

This is how all search algorithms compose in a production RAG system: a voice query is first transcribed to text (STT), the text is embedded, BM25 and HNSW retrieval run in parallel and merge via RRF, the semantic reranker reorders the merged candidates, and a relevance threshold gates what reaches the LLM.


Performance Comparison

Latency (p50, 512-dim vectors, 50K documents)

| Algorithm | Azure AI Search | Qdrant (GPU) | Elasticsearch | Notes |
| --- | --- | --- | --- | --- |
| BM25 keyword | 5–15ms | n/a (sparse vectors) | 5–20ms | Inverted index, near-instant |
| HNSW vector (ef=50) | 20–40ms | 10–20ms | 25–45ms | Approximate, fast |
| HNSW vector (ef=200) | 40–80ms | 20–40ms | 50–90ms | Better recall, higher latency |
| eKNN (exact) | 150–500ms | 80–200ms | 200–600ms | Scales with corpus size |
| Hybrid BM25 + HNSW + RRF | 30–60ms | 20–50ms | 40–80ms | Parallel execution |
| Hybrid + Semantic Reranker | 130–200ms | 120–200ms* | 150–250ms* | +100ms for cross-encoder |

*Open source semantic reranker latency depends on GPU availability and batch size.

Recall@10 Benchmark (1M vectors, 768-dim, BEIR dataset)

| Algorithm | Recall@10 | Precision@10 | Notes |
| --- | --- | --- | --- |
| BM25 only | 0.71 | 0.68 | Strong on keyword-heavy queries |
| HNSW only (ef=50) | 0.78 | 0.74 | Good semantic, misses exact terms |
| HNSW only (ef=200) | 0.85 | 0.81 | Better recall, same semantic gap |
| Hybrid BM25 + HNSW | 0.89 | 0.86 | Best of both, consistent |
| Hybrid + Reranker | 0.93 | 0.91 | Production standard |
| eKNN + Reranker | 0.95 | 0.93 | Maximum precision, highest latency |

Algorithm Selection Guide

| Query Type | Recommended Algorithm | Why |
| --- | --- | --- |
| Exact term / code lookup | BM25 | "CONV30" must exact-match |
| Semantic / natural language | HNSW vector | Vocabulary gap handled |
| Mixed (most production queries) | Hybrid BM25 + HNSW + RRF | Covers both failure modes |
| High precision required | Hybrid + Semantic Reranker | +5–8 points precision |
| Maximum precision (offline) | eKNN + Cross-Encoder | No latency constraint |
| Text + image corpus | Multimodal CLIP embedding | Same vector space |
| Voice input | STT → Hybrid | Transcription first |
| Regulated industry | Hybrid + Reranker + Threshold | Auditability + precision |

Azure AI Search vs Open Source — Full Comparison

| Capability | Azure AI Search | Elasticsearch | Qdrant |
| --- | --- | --- | --- |
| BM25 keyword | ✓ Native | ✓ Native | Via sparse vectors |
| HNSW vector | ✓ Native | ✓ Native (8.x+) | ✓ Native, configurable |
| eKNN exact | exhaustive=True | exact=True | exact=True |
| Hybrid BM25 + vector | ✓ Native, one API call | ✓ Native (8.x+) | ✓ Native (1.7+) |
| RRF fusion | ✓ Automatic | ✓ Configurable k | ✓ Configurable |
| Semantic reranker | ✓ Managed (Microsoft model) | Via Cohere/Voyage plugin | External model |
| Multimodal (text+image) | ✓ Azure AI Vision integration | Via custom vectors | Via CLIP custom vectors |
| Voice / STT | ✓ Azure Speech integration | External | External |
| Managed infra | ✓ Fully managed | Self-hosted or Elastic Cloud | Self-hosted or Qdrant Cloud |
| Permission trimming | ✓ Native AAD / SharePoint ACL | Custom filter | Custom filter |
| Compliance (SOC2, HIPAA) | ✓ Azure certified | Elastic Cloud only | Self-managed |
| Cost model | Per SKU + per query | Per node-hour | Per node-hour / cloud credits |
| Best for | Azure enterprise, .NET shops | Self-hosted, OSS-first | Vector-native, high-performance |

What We Run at MortgageIQ

Query pipeline:

async def search(query: str, query_type: str = "text") -> list[dict]:
    
    # Step 1: Voice → text (if needed)
    if query_type == "voice":
        query = await transcribe(query)  # Azure Speech + domain vocabulary
    
    # Step 2: Embed query
    query_vector = await embed(query)    # text-embedding-3-large, 512-dim
    
    # Step 3: Hybrid search — BM25 + HNSW + RRF
    candidates = await client.search(
        search_text=query,
        vector_queries=[VectorizedQuery(
            vector=query_vector,
            k_nearest_neighbors=50,
            fields="content_vector",
            exhaustive=False           # HNSW, ef=200 configured at index level
        )],
        query_type=QueryType.SEMANTIC,                      # Step 4: semantic reranker
        semantic_configuration_name="mortgage-semantic-config",
        query_caption=QueryCaptionType.EXTRACTIVE,
        query_answer=QueryAnswerType.EXTRACTIVE,
        filter=build_filter(query),    # doc_type, version, section scoping
        top=5
    )
    
    # Step 5: Threshold — don't pass low-confidence chunks to LLM
    # (async client returns an async iterator, so iterate with `async for`)
    filtered = [r async for r in candidates if r["@search.reranker_score"] > 0.7]
    
    if not filtered:
        return []  # surface "no reliable information" — not hallucination
    
    return filtered

Results after moving from pure vector to full hybrid + semantic reranker:

  • Recall on loan program code queries (exact term): 82% → 99% (BM25 addition)
  • Precision on semantic queries: 74% → 91% (semantic reranker addition)
  • False answer rate (hallucination from low-relevance context): 8% → 1.2% (threshold filter)
  • Voice query accuracy on domain terms: 61% → 89% (domain vocabulary in STT prompt)

Key Takeaways

  • BM25 is not legacy — it's essential. Every production RAG system needs BM25 alongside vector search. Exact term queries on product codes, IDs, and named entities will never be well-served by approximate nearest neighbor alone.
  • HNSW default settings are wrong for RAG. The default ef_search=50 gives 93% recall. Set ef_search=100–200 for production retrieval — the latency cost is 10–20ms, and the 7% missed answers compound across every query.
  • RRF is the right merger for hybrid search — not score averaging or linear combination. Scores from BM25 and cosine similarity are incomparable; rank positions are not.
  • Azure AI Search's semantic ranker is a managed cross-encoder — it adds ~100ms and 5–8 precision points. Worth it for every use case where accuracy matters more than raw throughput.
  • Multimodal search requires a shared embedding space — text and image queries only work together if both are embedded by the same multimodal model (CLIP, Azure AI Vision). You can't mix embedding models across modalities.
  • Voice search accuracy on domain vocabulary requires an initial prompt or custom STT model — without it, "RESPA" becomes "Respa" and retrieval fails silently.

Coming Up in This Series

  • Day 6: Evaluation — RAGAS, context recall, answer faithfulness, and how to run a retrieval A/B test
  • Day 7: Production Patterns — caching, index freshness, multi-tenant isolation, and cost governance