ai-ml · April 19, 2026 · Tags: rag, azure-ai-search, vector-search, hybrid-search, bm25, hnsw, rrf, semantic-ranker, multimodal, qdrant, elasticsearch, enterprise-ai

Every Search Algorithm Explained — Azure AI Search vs Open Source, With Code

A complete guide to every search algorithm used in enterprise RAG — keyword BM25, vector HNSW/eKNN, hybrid RRF, semantic reranking, and multimodal search — with Python code for Azure AI Search and open source stacks, and a performance comparison of each.

There are 6 different search algorithms your RAG system can use. Most teams pick one, call it "vector search," and wonder why retrieval misses the obvious answers.

The algorithm you choose — and how you combine them — determines whether your RAG system finds "closing costs" when a user asks about "cash at settlement," whether it finds loan program code "CONV30" in a sea of 50,000 documents, and whether a voice query gets the same precision as a text query.

This is the complete map — every algorithm, how it works at the math level, how it's implemented in Azure AI Search and open source, performance benchmarks, and when to use each.


The Search Algorithm Landscape


1. Keyword Search — BM25

BM25 (Best Match 25) is the gold standard for keyword-based retrieval. It's the algorithm behind Elasticsearch, Azure AI Search's full-text mode, Solr, and every traditional search system built in the last two decades.

How BM25 Works

BM25 scores a document against a query by measuring how often query terms appear in the document, weighted by how rare those terms are across the entire corpus.

BM25(D, Q) = Σ IDF(qᵢ) × [f(qᵢ, D) × (k₁ + 1)] / [f(qᵢ, D) + k₁ × (1 - b + b × |D|/avgdl)]

Where:

  • IDF(qᵢ) = Inverse Document Frequency — rare terms score higher
  • f(qᵢ, D) = Term frequency in document D
  • k₁ = term frequency saturation (typically 1.2–2.0) — diminishing returns on repeated terms
  • b = length normalization (typically 0.75) — penalizes long documents
  • |D|/avgdl = document length relative to corpus average

The saturation effect: the k₁ parameter means doubling the term frequency doesn't double the score. A document with "FHA" appearing 10 times doesn't score 10x better than one with "FHA" appearing once. This prevents keyword stuffing from dominating results.
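The saturation effect is easy to see numerically. Below is a toy scoring function for a single query term, using the formula above (the corpus counts are hypothetical, and this is an illustration, not a drop-in BM25 implementation):

```python
import math

def bm25_term_score(tf: int, idf: float, doc_len: int, avgdl: float,
                    k1: float = 1.2, b: float = 0.75) -> float:
    """One summand of the BM25 formula: the contribution of a single query term."""
    length_norm = k1 * (1 - b + b * doc_len / avgdl)
    return idf * (tf * (k1 + 1)) / (tf + length_norm)

# Toy IDF: suppose "FHA" appears in 500 of 50,000 documents
idf = math.log(50_000 / 500)

once = bm25_term_score(tf=1, idf=idf, doc_len=300, avgdl=300)
ten_times = bm25_term_score(tf=10, idf=idf, doc_len=300, avgdl=300)
print(f"tf=1 -> {once:.3f}, tf=10 -> {ten_times:.3f}, ratio {ten_times / once:.2f}x")
```

With k₁ = 1.2 and an average-length document, ten occurrences score only about 2x a single occurrence, not 10x.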

Where BM25 wins:

  • Exact term matching — loan codes ("CONV30", "FHA203K"), product IDs, named entities
  • Rare, high-signal terms — "RESPA", "TRID", "mTLS"
  • When the user knows the exact terminology

Where BM25 fails:

  • Vocabulary mismatch — "cash upfront" vs "closing costs" scores zero overlap
  • Synonyms and paraphrases
  • Conceptual queries — "what makes a loan risky?" has no obvious keyword targets

Azure AI Search — Full-Text / BM25

from azure.search.documents import SearchClient
from azure.core.credentials import AzureKeyCredential

client = SearchClient(
    endpoint="https://your-search.search.windows.net",
    index_name="mortgage-rag-index",
    credential=AzureKeyCredential(API_KEY)
)

# Pure BM25 keyword search
results = client.search(
    search_text="FHA loan limits 2025",
    query_type="simple",           # BM25 scoring
    search_fields=["content", "title", "section"],
    select=["chunk_id", "content", "doc_title", "section"],
    top=10
)

for result in results:
    print(f"Score: {result['@search.score']:.3f} | {result['doc_title']} — {result['section']}")

BM25 with field boosting — weight title matches higher than body matches:

results = client.search(
    search_text="FHA loan limits 2025",
    query_type="full",             # Lucene query syntax
    search_fields=["title^3", "section^2", "content"],  # boost title 3x
    top=10
)

Open Source — Elasticsearch / OpenSearch

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# BM25 search — default in Elasticsearch
response = es.search(
    index="mortgage-chunks",
    body={
        "query": {
            "multi_match": {
                "query": "FHA loan limits 2025",
                "fields": ["title^3", "section^2", "content"],
                "type": "best_fields",
                "tie_breaker": 0.3
            }
        },
        "size": 10
    }
)

Qdrant sparse vectors (BM25-compatible):

from qdrant_client import QdrantClient
from qdrant_client.models import SparseVector, NamedSparseVector

qdrant = QdrantClient("localhost", port=6333)

# Qdrant supports sparse vectors for BM25-style retrieval
# Use a sparse encoder (SPLADE, BM25) to generate sparse vectors
sparse_vector = encode_bm25("FHA loan limits 2025")  # returns {token_id: score}

qdrant.search(
    collection_name="mortgage-chunks",
    query_vector=NamedSparseVector(
        name="sparse",
        vector=SparseVector(
            indices=list(sparse_vector.keys()),   # dict views must be materialized
            values=list(sparse_vector.values())
        )
    ),
    limit=10
)

2. Vector Search — HNSW and eKNN

Vector search finds documents whose embedding vectors are nearest to the query vector in high-dimensional space. The challenge: searching 50,000 vectors for the nearest neighbors naively is O(n×d) — too slow for production query latency.

Two algorithms solve this: HNSW (approximate, fast) and eKNN (exact, slow).

HNSW — Hierarchical Navigable Small World

HNSW builds a multi-layer graph where each node (vector) connects to its nearest neighbors. Higher layers are sparser — long-range connections for fast traversal. Lower layers are denser — precise local neighborhood search.

Query traversal:

  1. Enter at the top layer — pick the best-connected entry node
  2. Greedily move toward the query vector at each layer
  3. When you can't improve at the current layer, drop to the next layer
  4. At Layer 0, perform local beam search among the dense neighborhood
  5. Return top-K results
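The greedy step at the heart of this traversal can be sketched on a tiny hand-built graph. This shows a single layer only; real HNSW adds the layer hierarchy, beam search at Layer 0, and graph construction, and the coordinates here are made up for illustration:

```python
from math import dist  # Euclidean distance, Python 3.8+

# Toy single-layer graph: node id -> 2-D coordinates, and neighbor lists
vectors = {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (2.0, 0.5), 3: (3.0, 1.0), 4: (0.5, 2.0)}
graph = {0: [1, 4], 1: [0, 2], 2: [1, 3], 3: [2], 4: [0]}

def greedy_search(query: tuple, entry: int) -> int:
    """Greedy walk: hop to whichever neighbor is closest to the query;
    stop when no neighbor improves (the 'drop a layer' condition in HNSW)."""
    current = entry
    while True:
        candidates = graph[current] + [current]
        best = min(candidates, key=lambda n: dist(vectors[n], query))
        if best == current:
            return current
        current = best

print(greedy_search((2.9, 0.9), entry=0))  # walks 0 -> 1 -> 2 -> 3, returns 3
```

Each hop roughly halves the remaining distance, which is why HNSW queries touch only a logarithmic fraction of the index.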

HNSW parameters:

| Parameter | Default | Effect |
| --- | --- | --- |
| m (connections per node) | 16 | Higher = better recall, larger index, slower build |
| ef_construction (build beam width) | 200 | Higher = better index quality, slower build |
| ef_search (query beam width) | 50 | Higher = better recall, slower queries |

The recall-latency tradeoff:

At ef_search=50, you get 93% recall at ~5ms. At ef_search=400, you get 99.5% recall at ~25ms. Default of 50 is wrong for production RAG — at 93% recall, 7% of correct answers are missed. Set ef_search=100–200 for enterprise retrieval.

eKNN — Exhaustive K-Nearest Neighbors

eKNN computes the exact distance from the query vector to every vector in the index. 100% recall — never misses a result. O(n) per query — impractical at scale.

When to use eKNN:

  • Small datasets (under 10,000 vectors)
  • Offline batch evaluation — compute ground truth for HNSW recall measurement
  • High-stakes retrieval where approximate results are unacceptable and latency is not a constraint
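A brute-force eKNN pass is short enough to write in plain Python, and it doubles as the ground truth for the recall@K measurement mentioned above (the vectors are random toys and the "ANN" result is simulated, so only the mechanics are meaningful):

```python
import random
from math import sqrt

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

def eknn(query: list[float], vectors: list[list[float]], k: int = 10) -> list[int]:
    """Exact search: score every vector, O(n*d) per query, 100% recall."""
    ranked = sorted(range(len(vectors)), key=lambda i: cosine(query, vectors[i]), reverse=True)
    return ranked[:k]

def recall_at_k(approx_ids: list[int], exact_ids: list[int]) -> float:
    """Fraction of the exact top-K that an approximate index recovered."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

random.seed(0)
corpus = [[random.gauss(0, 1) for _ in range(32)] for _ in range(1000)]
query = [random.gauss(0, 1) for _ in range(32)]

truth = eknn(query, corpus, k=10)                   # eKNN ground truth
strays = [i for i in range(len(corpus)) if i not in truth][:2]
simulated_ann = truth[:8] + strays                  # an ANN index that missed 2
print(f"recall@10 = {recall_at_k(simulated_ann, truth):.2f}")  # 0.80
```

In practice you run eKNN offline over a query sample, run HNSW at each candidate ef_search, and pick the smallest ef that clears your recall target.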

Azure AI Search — Vector Search

from azure.search.documents.models import VectorizedQuery

# Generate query embedding
from openai import AzureOpenAI
openai_client = AzureOpenAI(...)

query = "What are FHA loan limits for 2025?"
embedding_response = openai_client.embeddings.create(
    model="text-embedding-3-large",
    input=query,
    dimensions=512  # Matryoshka truncation
)
query_vector = embedding_response.data[0].embedding

# HNSW vector search
vector_query = VectorizedQuery(
    vector=query_vector,
    k_nearest_neighbors=50,        # candidate set size (not final top-K)
    fields="content_vector",       # index field containing embeddings
    exhaustive=False               # False = HNSW (approximate), True = eKNN (exact)
)

results = client.search(
    search_text=None,              # no keyword search
    vector_queries=[vector_query],
    select=["chunk_id", "content", "doc_title", "section"],
    top=10                         # final results after scoring
)

for result in results:
    print(f"Score: {result['@search.score']:.4f} | {result['doc_title']}")

eKNN for small indexes or ground truth:

vector_query = VectorizedQuery(
    vector=query_vector,
    k_nearest_neighbors=10,
    fields="content_vector",
    exhaustive=True    # exact search — 100% recall, higher latency
)

Open Source — Qdrant HNSW

from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, Distance

client = QdrantClient("localhost", port=6333)

# Create collection with HNSW config
client.create_collection(
    collection_name="mortgage-chunks",
    vectors_config=VectorParams(
        size=512,                  # embedding dimensions
        distance=Distance.COSINE
    ),
    hnsw_config={
        "m": 16,                   # connections per node
        "ef_construct": 200,       # build quality
        "full_scan_threshold": 10000  # switch to flat below this count
    }
)

# Query with ef tuning
results = client.search(
    collection_name="mortgage-chunks",
    query_vector=query_vector,
    limit=10,
    search_params={"hnsw_ef": 128, "exact": False}  # ef_search=128 for better recall
)

3. Hybrid Search — BM25 + Vector + RRF

Hybrid search runs BM25 and vector search in parallel and merges the ranked result lists. The merger algorithm is Reciprocal Rank Fusion (RRF).

Why Neither Alone Is Enough

BM25 fails on vocabulary mismatch ("cash upfront" and "closing costs" share no terms), while pure vector search can blur exact, high-signal tokens like loan program codes. Running both retrievers in parallel covers each one's failure mode.

How RRF Works

RRF doesn't normalize or compare raw scores from BM25 and cosine similarity — they're on completely different scales and distributions. Instead, it uses rank positions only:

RRF_score(doc) = Σ 1 / (rank_i + k)

Where rank_i is the document's position in each result list and k is a smoothing constant (default 60 in Azure AI Search).

Why k=60: the smoothing constant prevents a #1 rank from overwhelming all other signals. With k=0, rank #1 scores 1.0 and rank #2 scores 0.5 — a 2x gap. With k=60, rank #1 scores 1/61 ≈ 0.0164 and rank #2 scores 1/62 ≈ 0.0161 — a gap of roughly 1.6%. Small differences in rank matter less; both signals get fair weight.
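A minimal RRF implementation makes the mechanics concrete (the document ids below are hypothetical; k=60 matches the Azure AI Search default):

```python
def rrf_fuse(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists by summing 1 / (rank + k) per document (ranks start at 1)."""
    scores: dict[str, float] = {}
    for results in ranked_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rank + k)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical ids: BM25 and vector search each return a ranked list
bm25_top = ["rate_sheet_07", "faq_12", "guide_03"]
vector_top = ["guide_03", "rate_sheet_07", "policy_44"]

fused = rrf_fuse([bm25_top, vector_top])
print(fused[0])  # rate_sheet_07: in the top 2 of BOTH lists
```

Documents near the top of both lists accumulate two reciprocal-rank contributions and outrank documents that only one retriever surfaced, which is exactly the behavior hybrid search needs.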

Azure AI Search — Hybrid Search

from azure.search.documents.models import VectorizedQuery

# Hybrid: BM25 keyword + HNSW vector, merged with RRF
results = client.search(
    search_text="CONV30 rate 720 credit score",    # BM25 query
    vector_queries=[
        VectorizedQuery(
            vector=query_vector,
            k_nearest_neighbors=50,
            fields="content_vector",
            exhaustive=False
        )
    ],
    query_type="simple",
    select=["chunk_id", "content", "doc_title", "section"],
    top=10
    # RRF fusion is automatic when both search_text and vector_queries are provided
)

for result in results:
    print(f"RRF Score: {result['@search.score']:.4f} | {result['doc_title']}")

Hybrid with field filtering — scope the search to specific document types before hybrid runs:

results = client.search(
    search_text="CONV30 rate 720 credit score",
    vector_queries=[VectorizedQuery(vector=query_vector, k_nearest_neighbors=50, fields="content_vector")],
    filter="doc_type eq 'rate_sheet' and doc_version eq '2026-Q1'",  # pre-filter
    top=10
)

Open Source — Qdrant Hybrid Search

from qdrant_client.models import (
    SparseVector, NamedVector, NamedSparseVector,
    Prefetch, FusionQuery, Fusion
)

# Qdrant native hybrid search with RRF (v1.7+)
results = client.query_points(
    collection_name="mortgage-chunks",
    prefetch=[
        Prefetch(
            query=NamedSparseVector(      # BM25-style sparse
                name="sparse",
                vector=SparseVector(indices=sparse_ids, values=sparse_weights)
            ),
            limit=20
        ),
        Prefetch(
            query=query_vector,           # dense vector
            using="dense",
            limit=20
        )
    ],
    query=FusionQuery(fusion=Fusion.RRF), # merge with RRF
    limit=10
)

Elasticsearch hybrid search:

response = es.search(
    index="mortgage-chunks",
    body={
        "query": {
            "bool": {
                "should": [
                    {
                        "multi_match": {
                            "query": "CONV30 rate 720 credit score",
                            "fields": ["content^2", "section"]
                        }
                    }
                ]
            }
        },
        "knn": {
            "field": "content_vector",
            "query_vector": query_vector,
            "k": 10,
            "num_candidates": 50
        },
        "rank": {
            "rrf": {
                "window_size": 50,
                "rank_constant": 60      # k parameter
            }
        },
        "size": 10
    }
)

4. Semantic Reranking — The Precision Layer

Hybrid search gives you a merged candidate set. Semantic reranking reorders it for precision. The reranker reads the query and each candidate together — full joint attention — and assigns a relevance score.

Azure AI Search — Semantic Ranker (L2 Reranking)

Azure's semantic ranker is a Microsoft-hosted cross-encoder, fine-tuned on Bing search data. It takes the top 50 results from BM25/hybrid and reorders them using a cross-encoder model.

from azure.search.documents.models import VectorizedQuery, QueryType, QueryCaptionType, QueryAnswerType

# Full pipeline: hybrid retrieval + semantic reranking
results = client.search(
    search_text="FHA DTI limit compensating factors",
    vector_queries=[
        VectorizedQuery(
            vector=query_vector,
            k_nearest_neighbors=50,
            fields="content_vector"
        )
    ],
    query_type=QueryType.SEMANTIC,             # enables semantic ranker
    semantic_configuration_name="mortgage-semantic-config",
    query_caption=QueryCaptionType.EXTRACTIVE, # extract relevant passages
    query_answer=QueryAnswerType.EXTRACTIVE,   # extract direct answers
    top=5                                      # final results after reranking
)

# Semantic results include rerank score + extracted captions
for result in results:
    print(f"Rerank Score: {result['@search.reranker_score']:.4f}")
    print(f"Content: {result['content'][:200]}")
    if result.get('@search.captions'):
        for caption in result['@search.captions']:
            print(f"Caption: {caption.text}")
    print()

# Extractive answers — direct answer passages from top result
answers = results.get_answers()
if answers:
    for answer in answers:
        print(f"Answer: {answer.text} (confidence: {answer.score:.3f})")

Configure semantic ranker — index definition:

from azure.search.documents.indexes.models import (
    SearchIndex, SemanticConfiguration, SemanticSearch,
    SemanticPrioritizedFields, SemanticField
)

semantic_config = SemanticConfiguration(
    name="mortgage-semantic-config",
    prioritized_fields=SemanticPrioritizedFields(
        title_field=SemanticField(field_name="doc_title"),
        keywords_fields=[SemanticField(field_name="section")],
        content_fields=[SemanticField(field_name="content")]
    )
)

index = SearchIndex(
    name="mortgage-rag-index",
    fields=[...],
    semantic_search=SemanticSearch(configurations=[semantic_config])
)

Open Source — Cross-Encoder Reranker (HuggingFace)

from sentence_transformers import CrossEncoder

# ms-marco models are trained on MS MARCO passage retrieval dataset
# MiniLM-L-6 = fast (CPU-runnable), MiniLM-L-12 = better precision
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[dict], top_k: int = 5, threshold: float = 0.0) -> list[dict]:
    # Prepare query-document pairs
    pairs = [(query, c["content"]) for c in candidates]
    
    # Score all pairs — joint attention over query + document
    scores = reranker.predict(pairs)
    
    # Sort by score descending
    ranked = sorted(
        zip(scores, candidates),
        key=lambda x: x[0],
        reverse=True
    )
    
    # Apply threshold filter
    return [
        {**doc, "rerank_score": float(score)}
        for score, doc in ranked[:top_k]
        if score >= threshold
    ]

# Usage
hybrid_candidates = get_hybrid_results(query, top_k=50)
final_results = rerank(query, hybrid_candidates, top_k=5, threshold=0.0)

Cohere Rerank API — managed cross-encoder, no GPU required:

import cohere

co = cohere.Client(COHERE_API_KEY)

rerank_results = co.rerank(
    query="FHA DTI limit compensating factors",
    documents=[c["content"] for c in hybrid_candidates],
    top_n=5,
    model="rerank-english-v3.0"
)

for result in rerank_results.results:
    print(f"Index: {result.index} | Score: {result.relevance_score:.4f}")

5. Multimodal Search — Text + Image + Voice

Enterprise knowledge isn't only text. Diagrams in architecture docs, photos in property appraisals, voice queries from mobile users — all require multimodal search.

Text-to-Image Search

Multimodal embedding models (CLIP, Azure AI Vision) embed text and images into the same vector space. A text query can retrieve relevant images, and an image can retrieve relevant text — cross-modal retrieval.

Azure AI Vision multimodal embeddings:

import httpx

# Embed an image for indexing via the Azure AI Vision multimodal
# retrieval (vectorize) endpoint — images and text land in the same space
def embed_image(image_url: str) -> list[float]:
    response = httpx.post(
        f"{VISION_ENDPOINT}/computervision/retrieval:vectorizeImage?api-version=2023-02-01-preview",
        headers={"Ocp-Apim-Subscription-Key": VISION_KEY},
        json={"url": image_url}
    )
    return response.json()["vector"]

# Embed a text query — same embedding space as images
def embed_text_for_image_search(text: str) -> list[float]:
    response = httpx.post(
        f"{VISION_ENDPOINT}/computervision/retrieval:vectorizeText?api-version=2023-02-01-preview",
        headers={"Ocp-Apim-Subscription-Key": VISION_KEY},
        json={"text": text}
    )
    return response.json()["vector"]

# Text query → retrieve images
query_vector = embed_text_for_image_search("property with detached garage FHA eligible")
image_results = client.search(
    search_text=None,
    vector_queries=[VectorizedQuery(
        vector=query_vector,
        k_nearest_neighbors=10,
        fields="image_vector"
    )],
    top=5
)

Open Source — CLIP:

from transformers import CLIPProcessor, CLIPModel
from PIL import Image
import torch

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

# Embed image
image = Image.open("property_photo.jpg")
image_inputs = processor(images=image, return_tensors="pt")
image_embedding = model.get_image_features(**image_inputs)
image_embedding = image_embedding / image_embedding.norm(dim=-1, keepdim=True)  # normalize

# Embed text query — same CLIP space
text_inputs = processor(text=["FHA eligible property with garage"], return_tensors="pt")
text_embedding = model.get_text_features(**text_inputs)
text_embedding = text_embedding / text_embedding.norm(dim=-1, keepdim=True)

# Cosine similarity
similarity = torch.cosine_similarity(text_embedding, image_embedding)

Voice / Audio Search

Voice queries follow a Speech-to-Text → Embedding → Vector Search pipeline. The STT step is where accuracy matters most — domain-specific vocabulary ("RESPA", "CONV30", "mTLS") requires custom vocabulary or domain-adapted STT models.

Azure Speech → Search pipeline:

import azure.cognitiveservices.speech as speechsdk

def voice_to_search(audio_file: str) -> list[dict]:
    # Step 1: Speech to Text
    speech_config = speechsdk.SpeechConfig(
        subscription=SPEECH_KEY,
        region=SPEECH_REGION
    )
    # Custom speech model for domain vocabulary (optional but recommended)
    speech_config.endpoint_id = CUSTOM_SPEECH_ENDPOINT_ID
    
    audio_config = speechsdk.audio.AudioConfig(filename=audio_file)
    recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)
    
    result = recognizer.recognize_once()
    transcribed_text = result.text
    print(f"Transcribed: {transcribed_text}")
    
    # Step 2: Embed transcribed text
    query_vector = embed_text(transcribed_text)
    
    # Step 3: Hybrid search — same pipeline as text
    return hybrid_search(transcribed_text, query_vector, top_k=5)

Open Source — Whisper (OpenAI):

import whisper

model = whisper.load_model("large-v3")  # or "medium" for speed/accuracy tradeoff

def voice_to_text(audio_file: str) -> str:
    result = model.transcribe(
        audio_file,
        language="en",
        initial_prompt="mortgage FHA VA conventional loan RESPA TRID DTI LTV"
        # initial_prompt biases toward domain vocabulary — critical for accuracy
    )
    return result["text"]

The initial_prompt trick: Whisper uses the initial_prompt as context to bias transcription toward expected vocabulary. Without it, "RESPA" transcribes as "Respa" or "Resp-a." With it, accuracy on domain terms improves significantly.


6. Full Pipeline — All Search Types Together

This is how all search algorithms compose in a production RAG system: a voice query is first transcribed to text (STT), the text is embedded, BM25 and HNSW retrieval run in parallel and merge via RRF, the semantic reranker reorders the merged candidates, and a relevance threshold gates what reaches the LLM.


Performance Comparison

Latency (p50, 512-dim vectors, 50K documents)

| Algorithm | Azure AI Search | Qdrant (GPU) | Elasticsearch | Notes |
| --- | --- | --- | --- | --- |
| BM25 keyword | 5–15ms | n/a (sparse vectors) | 5–20ms | Inverted index, near-instant |
| HNSW vector (ef=50) | 20–40ms | 10–20ms | 25–45ms | Approximate, fast |
| HNSW vector (ef=200) | 40–80ms | 20–40ms | 50–90ms | Better recall, higher latency |
| eKNN (exact) | 150–500ms | 80–200ms | 200–600ms | Scales with corpus size |
| Hybrid BM25 + HNSW + RRF | 30–60ms | 20–50ms | 40–80ms | Parallel execution |
| Hybrid + Semantic Reranker | 130–200ms | 120–200ms* | 150–250ms* | +100ms for cross-encoder |

*Open source semantic reranker latency depends on GPU availability and batch size.

Recall@10 Benchmark (1M vectors, 768-dim, BEIR dataset)

| Algorithm | Recall@10 | Precision@10 | Notes |
| --- | --- | --- | --- |
| BM25 only | 0.71 | 0.68 | Strong on keyword-heavy queries |
| HNSW only (ef=50) | 0.78 | 0.74 | Good semantic, misses exact terms |
| HNSW only (ef=200) | 0.85 | 0.81 | Better recall, same semantic gap |
| Hybrid BM25 + HNSW | 0.89 | 0.86 | Best of both, consistent |
| Hybrid + Reranker | 0.93 | 0.91 | Production standard |
| eKNN + Reranker | 0.95 | 0.93 | Maximum precision, highest latency |

Algorithm Selection Guide

| Query Type | Recommended Algorithm | Why |
| --- | --- | --- |
| Exact term / code lookup | BM25 | "CONV30" must exact-match |
| Semantic / natural language | HNSW vector | Vocabulary gap handled |
| Mixed (most production queries) | Hybrid BM25 + HNSW + RRF | Covers both failure modes |
| High precision required | Hybrid + Semantic Reranker | +5–8 points precision |
| Maximum precision (offline) | eKNN + Cross-Encoder | No latency constraint |
| Text + image corpus | Multimodal CLIP embedding | Same vector space |
| Voice input | STT → Hybrid | Transcription first |
| Regulated industry | Hybrid + Reranker + Threshold | Auditability + precision |

Azure AI Search vs Open Source — Full Comparison

| Capability | Azure AI Search | Elasticsearch | Qdrant |
| --- | --- | --- | --- |
| BM25 keyword | ✓ Native | ✓ Native | Via sparse vectors |
| HNSW vector | ✓ Native | ✓ Native (8.x+) | ✓ Native, configurable |
| eKNN exact | exhaustive=True | exact=True | exact=True |
| Hybrid BM25 + vector | ✓ Native, one API call | ✓ Native (8.x+) | ✓ Native (1.7+) |
| RRF fusion | ✓ Automatic | ✓ Configurable k | ✓ Configurable |
| Semantic reranker | ✓ Managed (Microsoft model) | Via Cohere/Voyage plugin | External model |
| Multimodal (text+image) | ✓ Azure AI Vision integration | Via custom vectors | Via CLIP custom vectors |
| Voice / STT | ✓ Azure Speech integration | External | External |
| Managed infra | ✓ Fully managed | Self-hosted or Elastic Cloud | Self-hosted or Qdrant Cloud |
| Permission trimming | ✓ Native AAD / SharePoint ACL | Custom filter | Custom filter |
| Compliance (SOC2, HIPAA) | ✓ Azure certified | Elastic Cloud only | Self-managed |
| Cost model | Per SKU + per query | Per node-hour | Per node-hour / cloud credits |
| Best for | Azure enterprise, .NET shops | Self-hosted, OSS-first | Vector-native, high-performance |

What We Run at MortgageIQ

Query pipeline:

async def search(query: str, query_type: str = "text") -> list[dict]:
    
    # Step 1: Voice → text (if needed)
    if query_type == "voice":
        query = await transcribe(query)  # Azure Speech + domain vocabulary
    
    # Step 2: Embed query
    query_vector = await embed(query)    # text-embedding-3-large, 512-dim
    
    # Step 3: Hybrid search — BM25 + HNSW + RRF
    candidates = await client.search(
        search_text=query,
        vector_queries=[VectorizedQuery(
            vector=query_vector,
            k_nearest_neighbors=50,
            fields="content_vector",
            exhaustive=False           # HNSW, ef=200 configured at index level
        )],
        query_type=QueryType.SEMANTIC,                      # Step 4: semantic reranker
        semantic_configuration_name="mortgage-semantic-config",
        query_caption=QueryCaptionType.EXTRACTIVE,
        query_answer=QueryAnswerType.EXTRACTIVE,
        filter=build_filter(query),    # doc_type, version, section scoping
        top=5
    )
    
    # Step 5: Threshold — don't pass low-confidence chunks to LLM
    # (async client returns an async iterator, so iterate with `async for`)
    filtered = [r async for r in candidates if r["@search.reranker_score"] > 0.7]
    
    if not filtered:
        return []  # surface "no reliable information" — not hallucination
    
    return filtered

Results after moving from pure vector to full hybrid + semantic reranker:

  • Recall on loan program code queries (exact term): 82% → 99% (BM25 addition)
  • Precision on semantic queries: 74% → 91% (semantic reranker addition)
  • False answer rate (hallucination from low-relevance context): 8% → 1.2% (threshold filter)
  • Voice query accuracy on domain terms: 61% → 89% (domain vocabulary in STT prompt)

Key Takeaways

  • BM25 is not legacy — it's essential. Every production RAG system needs BM25 alongside vector search. Exact term queries on product codes, IDs, and named entities will never be well-served by approximate nearest neighbor alone.
  • HNSW default settings are wrong for RAG. The default ef_search=50 gives 93% recall. Set ef_search=100–200 for production retrieval — the latency cost is 10–20ms, and the 7% missed answers compound across every query.
  • RRF is the right merger for hybrid search — not score averaging or linear combination. Scores from BM25 and cosine similarity are incomparable; rank positions are not.
  • Azure AI Search's semantic ranker is a managed cross-encoder — it adds ~100ms and 5–8 precision points. Worth it for every use case where accuracy matters more than raw throughput.
  • Multimodal search requires a shared embedding space — text and image queries only work together if both are embedded by the same multimodal model (CLIP, Azure AI Vision). You can't mix embedding models across modalities.
  • Voice search accuracy on domain vocabulary requires an initial prompt or custom STT model — without it, "RESPA" becomes "Respa" and retrieval fails silently.

Coming Up in This Series

  • Day 6: Evaluation — RAGAS, context recall, answer faithfulness, and how to run a retrieval A/B test
  • Day 7: Production Patterns — caching, index freshness, multi-tenant isolation, and cost governance