A ranking function that scores document relevance using term frequency, inverse document frequency, and document length normalization. The default in Elasticsearch, Solr, and most search infrastructure.

Is BM25 better than vector search?

They solve different problems. BM25 handles exact keyword matching, vector search handles semantic similarity. Many production systems use both together because the combination tends to outperform either alone.

What do k1 and b mean in BM25?

k1 controls how quickly term frequency saturates. b controls how much document length affects scoring. Standard defaults are k1=1.2, b=0.75.

Can BM25 handle typos or synonyms?

Not natively. BM25 matches exact tokens. Typo tolerance needs fuzzy matching. Synonym handling needs a synonym map or query expansion layer.

Why not just use Algolia or a hosted search API?

For a static blog, a hosted API adds cost, latency, and a dependency. BM25 with a pre-built index runs entirely in the browser with zero API calls.

Is BM25 used in RAG pipelines?

Yes. Many production RAG systems use BM25 as a first-pass retriever alongside vector similarity or reranking. It catches exact keyword matches that embeddings miss.

The 1994 Algorithm That Still Ranks Your Search Results

A surprising share of the search results you’ve clicked were ranked by an algorithm from 1994.

BM25. Short for “Best Matching 25.” It’s the default ranking function in Elasticsearch, Solr, and Lucene. It’s the retrieval backbone behind most RAG pipelines. It powers the search bar on this blog. And the entire thing is built on three ideas simple enough to explain to anyone.

I recently added BM25 search to this site and went down the rabbit hole of understanding how it actually works. Here’s what I learned, why this 30-year-old formula keeps winning against newer approaches, and where it shows up in places you probably didn’t expect.

Three ideas that run everything

BM25 answers one question: given a search query and a document, how relevant is this document?

It does this by combining three intuitions that are so obvious they almost feel too simple.

Rare words matter more. The word “the” appears in every document. The word “kubernetes” appears in very few. BM25 gives more weight to terms that are rare across all your content. If your search term appears in 2 out of 1,000 documents, that’s a strong signal. If it appears in 900 out of 1,000, it’s basically noise. This is called Inverse Document Frequency (IDF), and it’s the reason BM25 doesn’t treat all words equally.
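To make the rare-versus-common intuition concrete, here is a minimal sketch using the IDF variant that Lucene-based engines use (one of several in circulation). The function name `idf` is illustrative, and the counts are the 2-in-1,000 and 900-in-1,000 cases from the paragraph above.

```python
import math

def idf(num_docs: int, docs_with_term: int) -> float:
    """Lucene-style BM25 IDF: always positive, larger for rarer terms."""
    return math.log(1 + (num_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))

rare = idf(1000, 2)      # term appears in 2 of 1,000 documents
common = idf(1000, 900)  # term appears in 900 of 1,000 documents
print(f"rare: {rare:.2f}, common: {common:.2f}")
```

Under this variant the rare term gets a weight of roughly 6.0 and the common one roughly 0.1, so a single match on the rare term outweighs dozens of matches on the common one.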

Repetition has limits. If a document mentions “agent” once, that’s relevant. If it mentions “agent” five times, that’s more relevant. But if it mentions “agent” 200 times? That shouldn’t score 200x higher than once. BM25 uses a saturation curve: the score rises fast for the first few occurrences, then flattens. This is what kills keyword stuffing. No matter how many times you repeat a word, the returns diminish quickly.
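The saturation curve is easy to see in isolation. This sketch assumes the standard BM25 term-frequency factor with k1 = 1.2 and leaves length normalization out for now; `tf_component` is an illustrative name.

```python
def tf_component(tf: float, k1: float = 1.2) -> float:
    """Saturating term-frequency factor (length normalization left out)."""
    return tf * (k1 + 1) / (tf + k1)

# One, five, and 200 mentions of the same term.
for tf in (1, 5, 200):
    print(tf, round(tf_component(tf), 3))
```

With these settings the factor can never exceed k1 + 1 = 2.2, so 200 mentions score a little over twice as much as one mention, not 200 times as much.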

Chart comparing BM25 saturating term frequency curve versus linear TF-IDF, showing how BM25 flattens after a few term occurrences while TF-IDF keeps climbing


Short documents get a fair shot. A 10,000-word post has more chances to mention any given term than a 500-word post. That’s not relevance, that’s just volume. BM25 normalizes for document length so a focused 500-word article that mentions your search term three times can rank higher than a sprawling document that mentions it the same number of times but buries it among thousands of other words.
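Here is a rough sketch of how the length adjustment plays out for the 500-word and 10,000-word documents above. The average document length of 2,000 words is an assumption made up for the example, and `tf_with_length_norm` is an illustrative name.

```python
def tf_with_length_norm(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Term-frequency factor with BM25 length normalization applied."""
    # norm > 1 for longer-than-average documents, < 1 for shorter ones.
    norm = 1 - b + b * doc_len / avg_doc_len
    return tf * (k1 + 1) / (tf + k1 * norm)

# Same term count (3), very different document lengths.
short_doc = tf_with_length_norm(3, doc_len=500, avg_doc_len=2000)
long_doc = tf_with_length_norm(3, doc_len=10_000, avg_doc_len=2000)
print(f"short: {short_doc:.2f}, long: {long_doc:.2f}")
```

Setting b to 0 switches the adjustment off entirely, and setting it to 1 applies it in full; 0.75 sits in between.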

That’s the entire algorithm. Three ideas, combined into one score. The formula has two tuning parameters (called k1 and b), and the conventional defaults (1.2 and 0.75, the values Lucene and Elasticsearch ship with) are still what almost everyone uses today. Tuning them for a specific corpus can help, but nobody has found a better general-purpose starting point in 30 years.
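For reference, the three ideas combine into the formula below, written in one common notation. The symbols are notational choices rather than anything fixed by this post: f(q, D) is how often query term q appears in document D, |D| is the document’s length in words, and avgdl is the average document length across the collection.

```latex
\mathrm{score}(D, Q) = \sum_{q \in Q} \mathrm{IDF}(q) \cdot
  \frac{f(q, D)\,(k_1 + 1)}{f(q, D) + k_1 \left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)}
```

IDF(q) is the rarity weight, the fraction is the saturating term-frequency factor, and the bracket in the denominator is the length adjustment. Implementations differ slightly in how they define IDF(q), so scores from different engines are not directly comparable.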

Why it keeps winning

BM25 has properties that newer, fancier approaches struggle to match.

No training data required. You don’t need labeled examples, embedding models, or GPU time. Point BM25 at a new collection of documents and it works immediately. The first document you add is already searchable. There’s no cold start problem.

Deterministic. Same query, same corpus, same results. Every time. No temperature parameter, no randomness, no “it worked yesterday but gives different results today.” When your search results look wrong, you can trace exactly why a document scored the way it did.

Fast. The core operation is looking up words in an index (essentially a dictionary lookup) and doing basic math. No matrix operations, no forward passes through a neural network. BM25 returns results in milliseconds even on massive datasets because the hard work (building the index) happens once, and scoring is just arithmetic.
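A toy version shows why this is cheap. The sketch below assumes whitespace tokenization and three made-up documents; `build_index` and `search` are illustrative names, not any library’s API.

```python
import math
from collections import Counter, defaultdict

def build_index(docs: dict[str, str]):
    """Build the inverted index once: term -> {doc_id: term frequency}."""
    postings = defaultdict(dict)
    lengths = {}
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        lengths[doc_id] = len(tokens)
        for term, tf in Counter(tokens).items():
            postings[term][doc_id] = tf
    return postings, lengths

def search(query, postings, lengths, k1=1.2, b=0.75):
    """Score only the documents that contain at least one query term."""
    n_docs = len(lengths)
    avg_len = sum(lengths.values()) / n_docs
    scores = defaultdict(float)
    for term in query.lower().split():
        matches = postings.get(term, {})  # the dictionary lookup
        idf = math.log(1 + (n_docs - len(matches) + 0.5) / (len(matches) + 0.5))
        for doc_id, tf in matches.items():
            norm = 1 - b + b * lengths[doc_id] / avg_len
            scores[doc_id] += idf * tf * (k1 + 1) / (tf + k1 * norm)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Made-up documents for illustration.
docs = {
    "a": "agent security basics",
    "b": "the agent framework guide for building an agent",
    "c": "cooking pasta at home",
}
postings, lengths = build_index(docs)
results = search("agent security", postings, lengths)
print(results)
```

Document "c" is never touched at query time because none of its terms appear in the query, which is the property that lets real engines skip the vast majority of a large corpus.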

Explainable. Every result comes with a breakdown showing exactly how much each word contributed to the score. “Agent” in the title added 3.2 points. “Security” in the body added 1.1 points. You can see exactly why result #1 beat result #2. Try getting that from an embedding model.
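Because the score is a plain sum over query terms, a breakdown falls out almost for free. This is a sketch with hypothetical corpus statistics (100 documents, average length 7 words, made-up document frequencies); `explain` is an illustrative function, not a specific engine’s API.

```python
import math

def explain(query, doc_tokens, doc_freq, n_docs, avg_len, k1=1.2, b=0.75):
    """Return each query term's contribution to one document's BM25 score."""
    norm = 1 - b + b * len(doc_tokens) / avg_len
    breakdown = {}
    for term in query.lower().split():
        tf = doc_tokens.count(term)
        if tf == 0:
            continue  # terms missing from the document contribute nothing
        n = doc_freq.get(term, 0)
        idf = math.log(1 + (n_docs - n + 0.5) / (n + 0.5))
        breakdown[term] = idf * tf * (k1 + 1) / (tf + k1 * norm)
    return breakdown

# Hypothetical document and corpus statistics, for illustration only.
doc = "agent security checklist for every agent deployment".split()
parts = explain("agent security", doc, {"agent": 40, "security": 5}, n_docs=100, avg_len=7)
for term, points in sorted(parts.items(), key=lambda kv: -kv[1]):
    print(f"{term}: +{points:.2f}")
print(f"total: {sum(parts.values()):.2f}")
```

In this made-up case “security” contributes more than “agent” even though “agent” appears twice, because “security” is much rarer across the collection.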

Compare the infrastructure requirements. Neural search needs a trained model, labeled data, and inference compute. Vector search needs an embedding pipeline and a vector database. LLM-powered search needs an API call (and payment) per query. BM25 needs a dictionary and multiplication.

Where BM25 runs (and you didn’t know it)

BM25 isn’t niche. It’s the default ranking algorithm across most of the search infrastructure you interact with daily.

Elasticsearch and OpenSearch. BM25 became the default scoring function in Elasticsearch 5.0 back in 2016, replacing the older TF-IDF model. Every Elasticsearch query you’ve run since then uses BM25 unless you explicitly overrode it. Companies running Elasticsearch clusters with billions of documents? BM25 is doing the ranking. No training pipeline. No embedding model. Just the same formula from 1994.

RAG retrieval pipelines. This is where BM25 is having its biggest moment right now. Many production Retrieval-Augmented Generation systems don’t rely on vector search alone. They use BM25 as a first-pass retriever to find candidate documents quickly, then apply vector similarity or a reranker for the final ordering. The reason is simple: BM25 catches exact keyword matches that embedding models sometimes miss. If you search for “error code 5032,” a vector model might return results about “error handling in general.” BM25 returns the document that actually contains “5032.”

E-commerce product search. When you search “blue running shoes size 10” on a shopping site, BM25 handles the keyword matching that narrows millions of products to hundreds of candidates. Filters handle the constraints. Neural reranking handles personalization. But BM25 does the heavy lifting of initial retrieval, and it does it fast enough to feel instant.

Code search. GitHub’s code search, Sourcegraph, and most internal code search tools use BM25-style scoring under the hood. Code has a naturally strong structure for this: function names are like titles (high signal), comments are like descriptions (medium signal), and the body is the bulk text. Field-weighted variants of BM25, such as BM25F, can weight different parts of a document differently, and that maps naturally to code.

Log and observability search. Log tools usually sort by time rather than relevance, so the BM25 score itself matters less here, but log stacks built on Elasticsearch or OpenSearch run on the same inverted-index machinery. When you’re searching logs for an error message during an incident, speed matters. You need results fast across terabytes of log data, and the index delivers because the core operation is dictionary lookups and arithmetic, not model inference.

Static site search. Pagefind, Lunr.js, and this blog. Client-side search for static sites almost always uses BM25 or a close variant because it offers the best quality-to-complexity ratio you can get without a server. The index is a JSON file, the scoring logic is minimal, and the results are genuinely good.

BM25 + vectors: the hybrid that actually works

The modern pattern that’s emerging across search and RAG isn’t “BM25 vs. vector search.” It’s both, combined.

BM25 handles lexical matching: exact terms, specific identifiers, precise phrases. Vector search handles semantic matching: synonyms, paraphrases, conceptual similarity. Neither alone covers the full range of what users search for.

The typical architecture: BM25 retrieves the top candidates by keyword relevance. A vector model retrieves the top candidates by meaning similarity. A fusion step merges and reorders the combined set.
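The post doesn’t name a specific fusion method, so as one common option here is a sketch of reciprocal rank fusion (RRF), which merges lists using only rank positions. The document IDs and both result lists are hypothetical, and k = 60 is the conventional constant rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of doc IDs by summing 1 / (k + rank) per list."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical results from the two retrievers.
bm25_hits = ["doc_5032", "doc_errors", "doc_logging"]
vector_hits = ["doc_errors", "doc_debugging", "doc_5032"]

fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
print(fused)
```

RRF is popular partly because it never compares raw scores, which sidesteps the problem that BM25 scores and cosine similarities live on different scales. Documents that both retrievers surface float to the top.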

This hybrid approach consistently outperforms either method alone in benchmarks and production systems. It’s why Pinecone, Weaviate, and Qdrant all added keyword search alongside their vector indexes. It’s why LangChain and LlamaIndex offer hybrid retrievers out of the box.

The intuition is straightforward. Searching for “error code 5032” with vector-only search might return results about “debugging and error handling” (semantically close, but missing the exact code). BM25 returns the document that literally contains “5032.” Searching for “how to make search results better” with BM25-only might miss a document titled “Improving Retrieval Quality” because the exact words don’t match. Vector search catches that semantic overlap.

You need both. “Just use vector search” is a common recommendation that usually comes from people who haven’t debugged why their RAG pipeline missed an obvious document.

Honest trade-offs

BM25 has no semantic understanding. “Car” and “automobile” are completely different terms. “Authentication” and “login” don’t match. If the user’s vocabulary doesn’t overlap with the document’s vocabulary, BM25 returns nothing. No amount of tuning fixes this; it’s fundamental to how the algorithm works.

It’s a bag-of-words model. Word order doesn’t matter. “Dog bites man” and “man bites dog” get identical scores. For most search use cases this is fine, but it means BM25 can’t capture meaning that depends on structure.
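A quick way to convince yourself: reduce both sentences to term counts, which is all BM25 ever sees. The helper name `bag_of_words` and the whitespace tokenization are assumptions for the sketch.

```python
from collections import Counter

def bag_of_words(text: str) -> Counter:
    """Reduce text to term counts, discarding word order."""
    return Counter(text.lower().split())

same = bag_of_words("Dog bites man") == bag_of_words("man bites dog")
print(same)  # True: identical term counts, so identical BM25 scores
```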

The ranking depends on good text processing. English is straightforward (split on whitespace and punctuation), but CJK languages, compound words in German, and domain-specific terminology all need special handling. Bad text processing makes BM25 useless regardless of how good the scoring math is.
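The sketch below shows how a naive tokenizer that is fine for English falls apart elsewhere. It assumes a simple regex approach; real engines use language-specific analyzers, and `tokenize` is just an illustrative name.

```python
import re

def tokenize(text: str) -> list[str]:
    """Naive tokenizer: lowercase, then keep runs of letters, digits, and underscores."""
    return re.findall(r"\w+", text.lower())

print(tokenize("Error code 5032!"))       # three separate tokens
print(tokenize("Donaudampfschifffahrt"))  # one token for the whole German compound
print(tokenize("東京都に住んでいます"))     # one token for the whole Japanese sentence
```

With the last two inputs, a query for a word inside the compound or the sentence finds nothing, because the index only contains the single long token. No scoring formula can recover from that.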

For this blog with a handful of posts, BM25 is honestly more than what’s needed. A simple text search on the title would probably work fine. But the algorithm scales to thousands (or billions) of documents without changing anything, and the ranking quality is dramatically better than substring matching once the content grows.

The weighting (title matches matter more than body matches, rare words matter more than common words) is configured, not learned. There’s no feedback loop, no click-through optimization. If the weights are wrong, the results are wrong, and you won’t know until someone searches for something and the best match doesn’t show up first.

The algorithm that just works

BM25 is 30 years old. It predates Google. It predates every transformer, every embedding model, every LLM. Stephen Robertson and Karen Spärck Jones developed the probabilistic framework it’s built on in the 1970s. Robertson, Steve Walker, and their colleagues at City University London turned it into BM25 in 1994 as part of the Okapi retrieval system.

Three decades later, it’s still the default in Lucene, Elasticsearch, and Solr, a first-pass retriever in many RAG pipelines, and the ranking function behind a huge share of production search infrastructure.

Sometimes the best algorithm is the one that just works.