Best Search Ranking Stack for AI Products

The best search ranking stack is a funnel. Cheap recall comes first. Expensive precision comes last. Most teams get into trouble when they replace lexical search with embeddings instead of combining them.

My default: start with BM25 and filters. Add dense retrieval for semantic recall. Fuse sparse and dense candidates with Reciprocal Rank Fusion (RRF) or a similar rank-based method. Rerank the shortlist with a cross-encoder. Use LLM reranking only for a tiny candidate set where the quality gain pays for the latency and cost.

| Stage | Default | Job |
| --- | --- | --- |
| Filtering | Structured filters | Enforce tenant, permissions, product, language, time, and availability. |
| Lexical retrieval | BM25 | Exact names, IDs, error codes, legal terms, and high-precision tokens. |
| Dense retrieval | Embeddings | Synonyms, paraphrases, fuzzy intent, and semantic recall. |
| Fusion | Reciprocal Rank Fusion or weighted retriever composition | Merge sparse and dense candidates without pretending scores are comparable. |
| Reranking | Cross-encoder | Reorder 20 to 100 candidates with query-document interaction. |
| Final precision | LLM reranker or answer model | Resolve nuanced relevance only after the list is small. |
| Evaluation | Recall@k, nDCG, MRR, click labels, human labels | Prove each stage improves the previous one. |

Use-case defaults

| Product surface | Good default | Why |
| --- | --- | --- |
| Documentation search | BM25 plus embeddings plus cross-encoder | Exact API names and semantic questions both matter. |
| RAG retrieval | Hybrid retrieval plus reranker plus citation checks | Missing evidence is usually worse than slow generation. |
| Product search | Lexical filters plus hybrid retrieval plus business features | Availability, price, popularity, and exact facets matter. |
| Support search | Hybrid retrieval plus freshness and ticket metadata | Similar wording and current policy both matter. |
| Internal knowledge base | BM25 baseline, then dense retrieval from query logs | Start measurable before adding model cost. |
| Legal or compliance search | Lexical baseline plus strict filters, then careful semantic expansion | False positives and false negatives both have high cost. |

Why BM25 still belongs in the stack

Embeddings are good at semantic similarity. They are not reliable replacements for exact tokens. Error codes, function names, product SKUs, legal phrases, and person names often carry intent through exact text. BM25 remains a strong baseline because it rewards terms the user actually typed.
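To make the "rewards terms the user actually typed" point concrete, here is a minimal sketch of Okapi BM25 scoring in plain Python. It assumes whitespace tokenization and the common parameter choices `k1=1.5` and `b=0.75`; the function name `bm25_scores` and the sample documents are made up for illustration. A production system would use a search engine's built-in BM25 rather than this.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each doc (a list of tokens) against the query tokens with BM25."""
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    # Document frequency: how many docs contain each term at least once.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue  # terms absent from the doc contribute nothing
            # Non-negative IDF variant: rare terms weigh more.
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical mini-corpus: only the first doc contains the exact error code.
docs = [
    "connection reset error E1042 when uploading".split(),
    "how to fix network problems during upload".split(),
    "billing and invoices overview".split(),
]
scores = bm25_scores("E1042".split(), docs)
```

Only the document containing the literal token `E1042` gets a non-zero score. The second document is about the same problem, but BM25 cannot see that, which is the gap dense retrieval is meant to cover.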

Dense retrieval adds recall when users do not know the exact vocabulary. The right pattern is not a choice between BM25 and embeddings. Use BM25 for lexical recall, embeddings for semantic recall, and fusion to combine them.
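A minimal sketch of Reciprocal Rank Fusion shows why fusion can work on ranks alone, without comparing BM25 scores to embedding similarities. The constant `k=60` is a commonly used default, and the document IDs and the function name `reciprocal_rank_fusion` are hypothetical.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs into one list.

    Each doc earns 1 / (k + rank) from every list it appears in,
    so only rank positions matter, never the raw retriever scores.
    """
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    # Sort by fused score descending; break ties by doc ID for determinism.
    return sorted(fused, key=lambda d: (-fused[d], d))

# Hypothetical candidate lists from the two retrievers.
bm25_hits = ["doc_a", "doc_b", "doc_c"]
dense_hits = ["doc_c", "doc_a", "doc_d"]
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
```

Documents that both retrievers return (`doc_a`, `doc_c`) rise above documents that only one of them found. Weighted variants multiply each list's contribution by a per-retriever weight, which is one way to read the "weighted retriever composition" option.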

When to add a reranker

Add a cross-encoder when the right documents appear somewhere in the top 50 but not near the top. That is the cleanest signal that candidate generation works and ranking needs help.
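One way to check for that signal on a labeled query set is to count how often the first relevant document is buried: present in the deep candidate list but absent from the top few positions. This is a sketch under assumptions: the cutoffs `shallow=5` and `deep=50`, the function name `buried_rate`, and the sample data are illustrative, not prescribed values.

```python
def buried_rate(results, relevant, shallow=5, deep=50):
    """Fraction of queries with a relevant doc in the top `deep`
    results but none in the top `shallow` results.

    results:  dict of query -> ranked list of doc IDs
    relevant: dict of query -> set of relevant doc IDs
    """
    if not results:
        return 0.0
    buried = 0
    for query, ranked_ids in results.items():
        rel = relevant.get(query, set())
        in_shallow = any(d in rel for d in ranked_ids[:shallow])
        in_deep = any(d in rel for d in ranked_ids[:deep])
        if in_deep and not in_shallow:
            buried += 1
    return buried / len(results)

# Hypothetical labeled sample; shallow=2 keeps the example short.
results = {
    "q1": ["d9", "d8", "d7", "d1"],  # relevant doc retrieved but buried
    "q2": ["d2", "d5"],              # relevant doc already at the top
    "q3": ["d6", "d4"],              # relevant doc never retrieved
}
relevant = {"q1": {"d1"}, "q2": {"d2"}, "q3": {"d0"}}
gap = buried_rate(results, relevant, shallow=2, deep=50)
```

A high buried rate points at ranking. Queries like `q3`, where the relevant document never appears at all, point at candidate generation instead, and a reranker will not fix them.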

Do not add an LLM reranker before a cross-encoder unless the candidate set is tiny and the relevance judgment is subtle enough to justify the cost. LLM reranking can help, but it is slower and more expensive than a cross-encoder. Measure it against the cheaper reranker before committing to it.

Evaluation sequence

  1. Label a small set of real queries.
  2. Measure BM25 alone.
  3. Add dense retrieval and measure recall delta.
  4. Add fusion and measure nDCG and Recall@k.
  5. Add cross-encoder reranking and measure Precision@1 and nDCG.
  6. Add LLM reranking only if it improves quality after cost and latency are included.
  7. Watch production metrics: zero-result rate, reformulation rate, click-through, answer correction, p95 latency, and cost.
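The offline metrics named in the steps above can be sketched in a few lines of plain Python. The helper names and sample labels are hypothetical, and nDCG here uses linear gains with a log2 discount, which is one common formulation among several.

```python
import math

def recall_at_k(ranked, relevant, k):
    """Share of the relevant docs that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant doc; average over queries for MRR."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked, gains, k):
    """nDCG@k with graded relevance; gains maps doc ID -> relevance grade."""
    dcg = sum(
        gains.get(d, 0) / math.log2(i + 1)
        for i, d in enumerate(ranked[:k], start=1)
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 1) for i, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical labels for one query, and two stages to compare.
gains = {"d1": 2, "d2": 1}
relevant = set(gains)
bm25_only = ["d3", "d2", "d1"]
reranked = ["d1", "d2", "d3"]

before = ndcg_at_k(bm25_only, gains, k=3)
after = ndcg_at_k(reranked, gains, k=3)
```

Running the same functions on the output of each stage, over the same labeled queries, gives the per-stage deltas the sequence asks for. Cost and latency have to be tracked separately; these functions only measure ranking quality.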
