The systems behind modern AI memory, recommendation engines, semantic search, and retrieval-augmented generation all share a foundation that's remarkably simple. At the core of each one is a piece of math that most engineers learned in a linear algebra course and promptly forgot: the dot product. Understanding why this works, and where it stops working, is the difference between building retrieval systems that actually perform and ones that look good in demos.
Embeddings: Turning Meaning Into Geometry
An embedding is a vector: a list of numbers, typically 768 to 3072 of them, that represents the meaning of a piece of text, an image, a user profile, or anything else you can encode. The embedding model learns to place similar things close together in this high-dimensional space and dissimilar things far apart.
The word "coffee" and the phrase "morning espresso" end up near each other. "Corporate tax law" ends up somewhere else entirely. This is learned, statistical proximity. The model doesn't understand meaning in any deep sense. It has learned that certain words and phrases co-occur in similar contexts, and it encodes that pattern as geometric distance.
The math is a dot product
To find how similar two embeddings are, you compute the cosine similarity: the dot product of the two vectors divided by the product of their magnitudes. The result is a number between -1 and 1, where 1 means identical direction, 0 means orthogonal (unrelated), and -1 means opposite.
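In code, the whole computation is a few lines. A minimal NumPy sketch (the vectors here are toy three-dimensional examples; real embeddings have hundreds or thousands of dimensions):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Dot product of a and b, divided by the product of their magnitudes."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Identical direction -> 1.0; orthogonal -> 0.0; opposite -> -1.0
cosine_similarity(np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0]))  # 1.0
cosine_similarity(np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0]))  # 0.0
```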
That's it. The entire foundation of semantic search, recommendation systems, and RAG is a dot product followed by sorting. You embed your query, compute the cosine similarity against every stored vector, and return the closest matches. The elegance is that this single operation captures semantic relationships that keyword search misses entirely. A search for "reducing cloud spend" will match a document about "infrastructure cost optimization" even though the two phrases share zero words.
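The embed-compare-rank loop fits in a few lines once the vectors are normalized. A sketch with NumPy (the embedding step is assumed to have already happened; the corpus here is a toy matrix standing in for stored embeddings):

```python
import numpy as np

def top_k_similar(query: np.ndarray, corpus: np.ndarray, k: int = 3) -> list[int]:
    """Return indices of the k corpus rows most similar to the query.

    With unit-normalized vectors, cosine similarity against the whole
    corpus reduces to one matrix-vector product followed by a sort.
    """
    q = query / np.linalg.norm(query)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    scores = c @ q                       # one dot product per stored vector
    return np.argsort(scores)[::-1][:k].tolist()

# Toy corpus of four 3-d "embeddings"
corpus = np.array([[1.0, 0.0, 0.0],
                   [0.9, 0.1, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0]])
top_k_similar(np.array([1.0, 0.05, 0.0]), corpus, k=2)  # [0, 1]
```

This brute-force scan is exactly what approximate-nearest-neighbor indexes replace at scale.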
Why this simple math works at scale
The reason dot products power billion-dollar recommendation systems is that they're embarrassingly parallelizable. GPUs are built for exactly this operation: multiplying matrices and summing the results. A nearest-neighbor search across millions of vectors can execute in milliseconds on modern hardware. Approximate nearest-neighbor algorithms like HNSW and IVF make it feasible at even larger scale by trading a small amount of recall for orders-of-magnitude speed gains.
This is the fundamental insight: a simple mathematical operation, applied to learned representations, produces results that feel intelligent. Netflix recommendations, Spotify's Discover Weekly, Google's semantic search, and every RAG system in production all reduce to variations of "embed, compare, rank."
Vector Databases: Choosing the Right Storage Layer
Vector databases exist because traditional databases aren't optimized for nearest-neighbor search across high-dimensional vectors. Pinecone, Chroma, Weaviate, Qdrant, and Milvus all solve the same core problem: store millions of vectors, index them for fast approximate search, and filter results by metadata.
What they actually do
A vector database handles three things:
- Storage of vectors alongside metadata (the source document, timestamps, user IDs, categories).
- Indexing using algorithms like HNSW that organize vectors for fast retrieval.
- Filtered search, where you combine vector similarity with traditional filters ("find the most similar documents, published after January 2025, in the legal category").
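A filtered search can be sketched against an in-memory store. The metadata keys (`category`, `published`) are illustrative; a real vector database pushes the filter into the index itself rather than scanning after the fact:

```python
import numpy as np

def filtered_search(query, vectors, metadata, category, after, k=5):
    """Combine similarity ranking with metadata filters.

    vectors:  (n, d) array of stored embeddings
    metadata: one dict per vector, e.g. {"category": ..., "published": ...}
    """
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = v @ q
    # Keep only vectors whose metadata passes both filters, then rank by score.
    candidates = [i for i, m in enumerate(metadata)
                  if m["category"] == category and m["published"] > after]
    return sorted(candidates, key=lambda i: scores[i], reverse=True)[:k]
```

With pgvector the same query is a single SQL statement combining a `WHERE` clause with an `ORDER BY` on the distance operator.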
The build-versus-buy question
Postgres with pgvector handles this for many teams. If your vector count is under a few million and your query volume is moderate, pgvector inside your existing Postgres instance eliminates an entire infrastructure dependency. The dedicated vector databases justify themselves at scale: hundreds of millions of vectors, thousands of queries per second, or when you need features like real-time index updates and sophisticated hybrid search.
The pattern mirrors the database specialization question more broadly. Start with what you have. Move to a specialized system when the performance gap is measurable and the operational cost is justified.
RAG: How to Ground AI Responses in Your Own Data
Retrieval-augmented generation solves a fundamental limitation of language models: they only know what they were trained on. Your company's internal documents, recent data, proprietary knowledge, and anything that changed after the training cutoff are invisible to the model.
RAG addresses this by retrieving relevant context before generating a response. The process: embed the user's query, search a vector database for similar documents, inject the top results into the prompt, and let the model generate an answer grounded in that retrieved context.
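The loop can be sketched end to end. The embedding model and LLM are stubbed out here — only the structure (retrieve, build a grounded prompt, generate) is the point, and every name is illustrative:

```python
import numpy as np

def retrieve(query_vec, doc_vecs, docs, k=3):
    """Rank stored documents by cosine similarity to the query vector."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    order = np.argsort(d @ q)[::-1][:k]
    return [docs[i] for i in order]

def build_prompt(question: str, context_docs: list[str]) -> str:
    """Inject the retrieved context ahead of the user's question."""
    context = "\n\n".join(context_docs)
    return (f"Answer using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}")

# In a real system: answer = llm.generate(build_prompt(q, retrieve(embed(q), ...)))
```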
Why RAG beats fine-tuning for most knowledge tasks
Fine-tuning bakes knowledge into model weights. This is expensive, requires retraining when data changes, and makes it hard to update individual facts. RAG keeps knowledge external. Update a document in your vector store, and the next query that retrieves it gets the updated information. There's no retraining cycle. The source material is auditable. You can trace exactly which documents informed a given answer.
For knowledge that changes frequently, RAG is the practical choice. Fine-tuning still has its place for teaching a model new behaviors or domain-specific reasoning patterns, but for factual recall and document-grounded answers, RAG is simpler and more maintainable.
Where vanilla RAG breaks down
Vanilla RAG has well-documented problems:
- Semantic false positives. The retrieval step can return documents that are semantically similar to the query but contextually irrelevant. A question about "Apple's revenue" might retrieve documents about apple farming because the embedding space places them in overlapping regions.
- Chunking sensitivity. Split documents too small and you lose context. Too large and you dilute relevance. The boundary choices have an outsized impact on retrieval quality.
- Hallucination despite context. The model can hallucinate even when the correct answer is in the retrieved documents, particularly when those documents are contradictory or ambiguous.
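Chunking itself is usually a sliding window with overlap; the window and overlap sizes are the knobs with outsized impact. A minimal word-based sketch (real pipelines typically split on tokens or sentence boundaries instead):

```python
def chunk_words(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into overlapping word-window chunks.

    Overlap keeps a thought that straddles a boundary present in both
    neighboring chunks, at the cost of some duplicated storage.
    """
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]
```

Tuning `size` down sharpens precision for factual lookup; tuning it up preserves context for synthesis — which is exactly why one fixed setting rarely serves every query type.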
Memory Systems: Making AI Remember Users Across Sessions
The next layer above RAG is persistent memory. Tools like Mem0 add a memory layer that lets AI applications remember context across sessions. Instead of treating each conversation as stateless, the system stores facts about users, extracts preferences from past interactions, and retrieves relevant history when the user returns.
How memory databases work
Mem0 and similar systems combine several retrieval strategies:
- Semantic search. Memories are stored as embeddings, so the system can retrieve contextually relevant facts even when the wording differs.
- Knowledge graphs. Structured relationships ("this user works at Company X, which is in the fintech sector") allow for relational reasoning that pure vector search can't handle.
- Temporal weighting. Recent memories rank higher than old ones, so the system surfaces what's currently relevant rather than what was relevant six months ago.
When a user starts a new conversation, the system retrieves relevant memories and injects them as context. The math underneath is still cosine similarity and graph traversal. The intelligence comes from the extraction pipeline: deciding what's worth remembering, how to structure it, and when to surface it. A well-implemented memory system makes AI assistants feel genuinely personalized. A poorly implemented one surfaces irrelevant or outdated context and erodes trust.
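Temporal weighting is typically an exponential decay multiplied into the similarity score. A sketch — the half-life and the multiplicative blend are assumptions that real systems tune, not a documented Mem0 formula:

```python
def memory_score(similarity: float, age_days: float,
                 half_life_days: float = 30.0) -> float:
    """Blend semantic similarity with recency.

    A memory loses half its recency weight every `half_life_days`,
    so an equally similar recent memory outranks a stale one.
    """
    recency = 0.5 ** (age_days / half_life_days)
    return similarity * recency

# An old memory needs much higher similarity to beat a fresh one:
memory_score(0.9, age_days=90)   # heavily decayed
memory_score(0.7, age_days=1)    # nearly full weight
```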
The compound effect
Memory systems compound in value over time. The first conversation has no context. By the tenth, the system knows the user's role, their current projects, their communication preferences, and the decisions they've already made. This is the same dynamic that makes recommendation engines better with more data. Each interaction produces signal that improves future retrieval.
Intent-Based RAG: Classifying Before Retrieving
Standard RAG treats every query the same way: embed it, search, retrieve, generate. This is a problem because queries have fundamentally different intents, and the optimal retrieval strategy depends on the intent.
Why one retrieval strategy fails
Consider three queries to a company knowledge base:
- "What is our refund policy?" is a factual lookup. It needs a precise, narrow match from a single document.
- "Why did we change the pricing model last quarter?" requires synthesis across multiple documents and possibly meeting notes from different sources.
- "What should we price the new enterprise tier at?" is an analytical question that needs market data, cost data, and strategic context, possibly with different ranking criteria.
A single retrieval strategy can't serve all three well. The factual lookup drowns in broad results. The synthesis question starves with narrow retrieval. The analytical question needs structured data that vector search alone won't find.
How intent routing works
Intent-based RAG adds a classification step before retrieval. A lightweight model (or even a rule-based classifier) categorizes the incoming query by type: factual lookup, comparison, synthesis, analysis, conversational, or procedural. Each category maps to a different retrieval configuration.
Factual queries use strict similarity thresholds and return fewer, more precise chunks. Synthesis queries cast a wider net, retrieving from multiple document collections and potentially using recursive summarization (as in RAPTOR) to build context from hierarchical document structures. Analytical queries combine vector search with structured database queries to pull in numerical data alongside narrative context.
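A rule-based router is enough to start. The categories, thresholds, and index names below are all illustrative, and production systems often swap the keyword rules for a small classifier model:

```python
# Hypothetical per-intent retrieval settings
RETRIEVAL_CONFIGS = {
    "factual":   {"k": 3,  "min_similarity": 0.80, "indexes": ["policies"]},
    "synthesis": {"k": 12, "min_similarity": 0.60, "indexes": ["policies", "meetings"]},
    "analysis":  {"k": 8,  "min_similarity": 0.65, "indexes": ["meetings", "financials"]},
}

def classify_intent(query: str) -> str:
    """Crude keyword routing -- a stand-in for a learned classifier."""
    q = query.lower()
    if q.startswith(("what is", "when", "where", "who")):
        return "factual"
    if q.startswith(("why", "how did")) or "explain" in q:
        return "synthesis"
    return "analysis"

def retrieval_config(query: str) -> dict:
    """Map an incoming query to the retrieval settings for its intent."""
    return RETRIEVAL_CONFIGS[classify_intent(query)]
```

Applied to the three example queries above, the refund-policy question routes to the narrow factual config, the pricing-change question to the wide synthesis config, and the pricing-decision question to the analysis config.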
The measurable improvement
Teams that implement intent routing consistently report better answer quality, lower hallucination rates, and more efficient token usage. The reason is straightforward: matching the retrieval strategy to the query type means the model receives more relevant context and less noise. When the context window is filled with precisely the right information, the model generates better answers. When it's filled with loosely related documents, it guesses.
Building multi-index architectures for production
Intent routing naturally leads to multi-index architectures. Instead of one vector store for everything, you maintain separate indexes optimized for different content types:
- Policy documents chunked small and indexed precisely for exact-match questions.
- Meeting transcripts chunked larger, with speaker metadata, for synthesis and context questions.
- Financial data stored in structured form with temporal filters for analytical queries.
The intent classifier determines which indexes to query and how to merge the results.
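Merging across indexes is not a straight score sort, because raw similarities from differently configured indexes aren't comparable. One common approach (not specific to any system named here) is reciprocal rank fusion, which ignores raw scores and merges by rank; the constant 60 is the conventional default:

```python
def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked result lists from multiple indexes.

    Each document earns 1 / (k + rank) per list it appears in, so
    documents ranked well by several indexes float to the top.
    """
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# "b" ranks near the top of both lists, so it wins overall:
reciprocal_rank_fusion([["a", "b", "c"], ["b", "d", "a"]])  # "b" first
```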
This adds complexity. You're maintaining multiple indexes, a classifier, and routing logic. The tradeoff is worthwhile when answer quality directly impacts business outcomes: customer support accuracy, internal knowledge management, compliance queries, or any domain where a wrong answer has consequences.
Putting the Full Stack Together
A modern AI retrieval system layers these components, each building on the one below it:
- Embedding models convert raw content into vectors.
- Vector storage (pgvector, Pinecone, Chroma) indexes and serves similarity search.
- A memory layer (Mem0 or custom) persists context across interactions.
- An intent classifier routes queries to the appropriate retrieval strategy.
- RAG orchestration retrieves context, constructs prompts, and manages the generation step.
- Evaluation infrastructure measures retrieval quality and answer accuracy over time.
Each layer is built on the same mathematical foundation: learned representations compared via dot products. The sophistication is in the orchestration, the chunking strategy, the intent classification, and the evaluation loops. The math stays simple.
Three Things That Matter More Than Your Vector Database
The teams that build effective retrieval systems tend to focus on three things:
- Chunking strategy. How you split documents matters more than which vector database you use. Chunk boundaries that break mid-thought produce poor retrievals regardless of how fast your index is.
- Evaluation. Without systematic measurement of retrieval quality and answer accuracy, you're guessing about whether your system works. Build eval harnesses before optimizing infrastructure.
- Intent awareness. Once your basic RAG pipeline works, the highest-leverage improvement is usually routing different query types to different retrieval strategies.
The math behind all of it is a dot product. Everything else is engineering.
