How Hybrid Search Works — Building the Engine Behind Memory Vault
Every AI memory system needs search. But not all search is equal. I tried pure vector search first — semantic embeddings, cosine similarity, the standard approach. It worked for vague "find me something about X" queries. It failed badly for exact matches.
Ask for "RRF merging" and vector search returns results about "combining search results" and "ranking algorithms" from months ago. Relevant? Loosely. But it misses the chunk that literally contains the words "RRF merging" because it scored slightly lower on cosine similarity.
That's what made me build hybrid search. And after two months of running it daily, it's the single biggest improvement I've made to how I work with AI.
The two arms
Memory Vault runs every search query through two parallel paths:
Vector search — the query gets converted to a 384-dimensional embedding using all-MiniLM-L6-v2 (runs locally on CPU, no API calls). That embedding gets compared against every stored memory using an HNSW index in PostgreSQL via pgvector. This finds conceptually similar content even when the words are completely different.
Full-text search — the same query gets broken into keywords, filtered through stop words, and run against a tsvector column with a GIN index. Standard PostgreSQL full-text search. This finds exact word matches that vector search might rank lower.
Neither method is good enough alone. Vector search misses exact keywords. Full-text search misses semantic meaning. The magic is in how you combine them.
Reciprocal Rank Fusion
The merge strategy is called RRF — Reciprocal Rank Fusion. It's simple and it works:
Each search arm produces a ranked list. For each result, its RRF score is calculated as:
score = 1 / (k + rank)
Where k is a constant (I use 60, the standard value from the literature). Results that appear in both lists get both scores added together.
Why this works: RRF doesn't care about the actual similarity scores from each arm — only the rank positions. This means you don't need to figure out how to normalize cosine similarity against ts_rank. You don't need a weight parameter to tune. A chunk that's #3 in vector search and #5 in full-text search gets a combined score that pushes it above something that's #1 in vector but doesn't appear in full-text at all.
The result: better recall than either method alone, with almost no tuning required.
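In code, the merge is just a few lines. Here's a minimal Python sketch (the function name and sample chunk IDs are illustrative, not the actual implementation):

```python
def rrf_merge(ranked_lists, k=60):
    """Combine ranked result lists with Reciprocal Rank Fusion.

    Each list is ordered best-first; ranks start at 1. A chunk that
    appears in several lists accumulates a score from each.
    """
    scores = {}
    for results in ranked_lists:
        for rank, chunk_id in enumerate(results, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest combined score wins
    return sorted(scores, key=scores.get, reverse=True)

vector_arm = ["c3", "c1", "c7"]    # best-first results from vector search
fulltext_arm = ["c1", "c9", "c3"]  # best-first results from full-text search
merged = rrf_merge([vector_arm, fulltext_arm])
# merged == ["c1", "c3", "c9", "c7"]: chunks in both lists rise to the top
```

Note that the rank-only scoring does exactly what the prose above describes: `c1` and `c3` appear in both arms and outrank `c7` and `c9`, which each appear in only one.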
Query enrichment
Before the search even runs, Memory Vault generates up to three variations of your query.
The original query goes through as-is. Then the embedding model's WordPiece tokenizer extracts key terms: words that break into multiple subword tokens are likely technical or domain-specific, which makes them more valuable for search. These get assembled into a keyword-focused variation. A third variation strips question words and restructures the query into a statement.
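The subword heuristic is easy to sketch. Here's a toy version, with a hand-rolled tokenizer standing in for the real WordPiece tokenizer (everything here is illustrative):

```python
def extract_key_terms(query, tokenize):
    """Keep words the tokenizer splits into multiple subword pieces;
    multi-piece words tend to be technical or domain-specific."""
    return [w for w in query.lower().split() if len(tokenize(w)) > 1]

# Toy stand-in for a WordPiece tokenizer: common words stay whole,
# everything else breaks into two pieces.
def toy_tokenize(word):
    common = {"how", "does", "the", "handle", "files"}
    return [word] if word in common else [word[:4], "##" + word[4:]]

terms = extract_key_terms(
    "how does the ingestion pipeline handle markdown files", toy_tokenize)
# terms == ["ingestion", "pipeline", "markdown"]
```

The real tokenizer's vocabulary does the filtering for free: frequent English words exist as single tokens, while rarer technical terms get broken apart.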
All three get embedded and searched in parallel (UNION ALL), then deduplicated by chunk ID, keeping the best similarity score. This catches results that one phrasing would miss.
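The dedup step boils down to a group-by-chunk-id, keep-the-max. In my setup this happens inside the SQL query; the Python below is just an illustrative sketch:

```python
def dedupe_best(rows):
    """Collapse hits from all query variations into one row per chunk,
    keeping the best similarity score seen for each chunk."""
    best = {}
    for chunk_id, score in rows:
        if score > best.get(chunk_id, float("-inf")):
            best[chunk_id] = score
    return best

# Hits from three query variations, with overlap on chunk c1
rows = [("c1", 0.72), ("c2", 0.55), ("c1", 0.81), ("c3", 0.49)]
# dedupe_best(rows) == {"c1": 0.81, "c2": 0.55, "c3": 0.49}
```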
What this looks like in practice
A search for "how does the ingestion pipeline handle markdown files" fans out into three queries:
- The original query finds chunks about ingestion and document processing (semantic match)
- The keyword variation — "ingestion pipeline markdown" — finds chunks with those exact terms (keyword match via full-text)
- The broad variation finds chunks about file parsing and adapter patterns (broader semantic match)
RRF merges all three into a single ranked list. The chunk that talks specifically about the markdown adapter in the ingestion pipeline floats to the top because it appeared in multiple arms.
The stack
Everything runs in PostgreSQL. No separate vector database. No Elasticsearch sidecar. One database handles both vector similarity (pgvector HNSW index) and full-text search (tsvector GIN index) in a single query.
The embedding model (all-MiniLM-L6-v2) runs locally on CPU. 384 dimensions. Fast enough for real-time search, small enough to run on any machine. No API calls, no data leaving your machine.
The ingestion pipeline handles three formats out of the box:
- Markdown — splits by headings, preserves document structure
- Plain text — paragraph-based chunking with smart merging
- Claude JSON — parses Claude conversation exports directly
Each file goes through: detect format → parse into chunks → batch embed → store with metadata.
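The markdown path is the most interesting of the three. Splitting by headings can be sketched like this (a simplification of the real adapter; the function name is mine):

```python
import re

def chunk_markdown(text):
    """Split a markdown document at heading boundaries, keeping each
    heading attached to the body text that follows it."""
    parts = re.split(r"(?m)^(?=#{1,6} )", text)
    return [p.strip() for p in parts if p.strip()]

doc = "# Ingestion\nOverview text.\n## Markdown adapter\nSplits by headings."
chunks = chunk_markdown(doc)
# Two chunks, each a heading plus its body
```

Keeping the heading inside the chunk matters: it gives the embedding model the section's topic, so a chunk like "## Markdown adapter" embeds close to queries about markdown even if the body text never repeats the word.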
What's next
The code for all of this just shipped in Milestone 2. You can clone the repo, run the migrations, and start ingesting and searching right now (manual setup — Docker comes in M3).
Next up: Docker one-command setup (M3), then MCP integration so Claude can use this as a live memory system during conversations (M4).