Building a RAG-Powered MCP Server for Document Search

Build an MCP server that exposes vector search as a tool, letting AI assistants perform Retrieval-Augmented Generation (RAG) on your documents. The server embeds documents using OpenAI's embedding API, stores vectors in pgvector or Pinecone, and provides a search_documents tool that returns relevant chunks with similarity scores. This gives any MCP client grounded, factual answers from your knowledge base.
Retrieval-Augmented Generation (RAG) lets AI assistants answer questions using your actual documents instead of relying solely on training data. This tutorial builds an MCP server that ingests documents, generates embeddings, stores them in a vector database, and exposes a search_documents tool that any MCP client (Claude Desktop, Cursor, Windsurf) can call. The AI sends a natural language query, your server returns the most relevant document chunks, and the AI uses those chunks to generate grounded answers.
Prerequisites
- Node.js 18+ and npm installed
- An OpenAI API key for generating embeddings
- Either PostgreSQL with pgvector extension or a Pinecone account
- Basic understanding of vector embeddings and similarity search
- A collection of documents (Markdown, text, or PDF) to index
Step-by-step guide
Set up the project and install dependencies
Create the project and install the MCP SDK, OpenAI client for embeddings, and your chosen vector database client. The openai package generates text embeddings. Use pg with pgvector for self-hosted PostgreSQL, or @pinecone-database/pinecone for managed Pinecone. Also install zod for input validation and a text splitter for chunking documents.
```shell
mkdir mcp-rag-server && cd mcp-rag-server
npm init -y
npm install @modelcontextprotocol/sdk zod openai
npm install @pinecone-database/pinecone   # or: npm install pg pgvector
npm install -D typescript @types/node
npx tsc --init
```
Expected result: Project initialized with all dependencies for MCP server, embeddings, and vector storage.
Build the document chunking and embedding pipeline
Documents must be split into chunks before embedding because embedding models have token limits and smaller chunks produce more precise search results. Split documents into overlapping chunks of roughly 500-1000 tokens; the code below approximates this with character counts (800-character chunks with 200 characters of overlap). Then generate embeddings for each chunk using OpenAI's text-embedding-3-small model, which produces 1536-dimensional vectors. Store the chunk text, embedding vector, and metadata (source file, position) together.
```typescript
// src/embeddings.ts
import OpenAI from "openai";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

export interface DocumentChunk {
  id: string;
  text: string;
  source: string;
  chunkIndex: number;
  embedding?: number[];
}

export function chunkText(
  text: string,
  source: string,
  chunkSize: number = 800,
  overlap: number = 200
): DocumentChunk[] {
  const chunks: DocumentChunk[] = [];
  let start = 0;
  let index = 0;

  while (start < text.length) {
    const end = Math.min(start + chunkSize, text.length);
    chunks.push({
      id: `${source}-chunk-${index}`,
      text: text.slice(start, end),
      source,
      chunkIndex: index,
    });
    start += chunkSize - overlap;
    index++;
  }
  return chunks;
}

export async function embedTexts(texts: string[]): Promise<number[][]> {
  const response = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: texts,
  });
  return response.data.map(d => d.embedding);
}

export async function embedQuery(query: string): Promise<number[]> {
  const [embedding] = await embedTexts([query]);
  return embedding;
}
```
Expected result: Functions that chunk documents and generate embeddings using OpenAI's API.
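As a sanity check on the chunking logic, note that each chunk starts chunkSize − overlap characters after the previous one, so the chunk count follows directly from the step size. A minimal standalone sketch (it re-implements the same slicing as chunkText above, without the metadata):

```typescript
// Standalone sketch of chunkText's slicing: each chunk starts
// (chunkSize - overlap) characters after the previous one.
function sliceChunks(text: string, chunkSize: number, overlap: number): string[] {
  const out: string[] = [];
  for (let start = 0; start < text.length; start += chunkSize - overlap) {
    out.push(text.slice(start, start + chunkSize));
  }
  return out;
}

// 2000 characters with 800-char chunks and 200-char overlap:
// starts at 0, 600, 1200, 1800 -> 4 chunks, the last one shorter.
const chunks = sliceChunks("x".repeat(2000), 800, 200);
```

The last 200 characters of each chunk reappear at the start of the next, and a text of length N yields ceil(N / (chunkSize − overlap)) chunks.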
Create the vector store integration with pgvector
Set up a PostgreSQL table with a vector column using the pgvector extension. The table stores chunk text, source metadata, and the embedding vector. Create an insert function for indexing and a search function that uses cosine distance for similarity queries. The search function takes a query embedding and returns the top-k most similar chunks with their similarity scores.
```typescript
// src/vector-store.ts
import pg from "pg";

const pool = new pg.Pool({
  connectionString: process.env.DATABASE_URL,
});

export async function initializeStore(): Promise<void> {
  await pool.query(`CREATE EXTENSION IF NOT EXISTS vector`);
  await pool.query(`
    CREATE TABLE IF NOT EXISTS document_chunks (
      id TEXT PRIMARY KEY,
      text TEXT NOT NULL,
      source TEXT NOT NULL,
      chunk_index INTEGER NOT NULL,
      embedding vector(1536) NOT NULL,
      created_at TIMESTAMP DEFAULT NOW()
    )
  `);
  await pool.query(`
    CREATE INDEX IF NOT EXISTS chunks_embedding_idx
    ON document_chunks USING ivfflat (embedding vector_cosine_ops)
    WITH (lists = 100)
  `);
}

export async function insertChunks(
  chunks: { id: string; text: string; source: string; chunkIndex: number; embedding: number[] }[]
): Promise<void> {
  for (const chunk of chunks) {
    await pool.query(
      `INSERT INTO document_chunks (id, text, source, chunk_index, embedding)
       VALUES ($1, $2, $3, $4, $5)
       ON CONFLICT (id) DO UPDATE SET text = $2, embedding = $5`,
      [chunk.id, chunk.text, chunk.source, chunk.chunkIndex, JSON.stringify(chunk.embedding)]
    );
  }
}

export async function searchSimilar(
  queryEmbedding: number[],
  topK: number = 5,
  sourceFilter?: string
): Promise<{ text: string; source: string; score: number }[]> {
  const filterClause = sourceFilter ? `WHERE source = $3` : "";
  const params = sourceFilter
    ? [JSON.stringify(queryEmbedding), topK, sourceFilter]
    : [JSON.stringify(queryEmbedding), topK];

  const result = await pool.query(
    `SELECT text, source, 1 - (embedding <=> $1::vector) AS score
     FROM document_chunks ${filterClause}
     ORDER BY embedding <=> $1::vector
     LIMIT $2`,
    params
  );
  return result.rows;
}
```
Expected result: A vector store module that can insert document chunks and search by similarity using pgvector.
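The score column in searchSimilar works because pgvector's `<=>` operator returns cosine *distance*, which is 1 minus cosine similarity. A plain-TypeScript sketch of that relationship (helper names here are illustrative, not part of pgvector):

```typescript
// Cosine similarity: dot(a, b) / (|a| * |b|), in [-1, 1].
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// pgvector's <=> returns cosine distance = 1 - cosine similarity,
// which is why the SQL selects 1 - (embedding <=> $1::vector) AS score.
function cosineDistance(a: number[], b: number[]): number {
  return 1 - cosineSimilarity(a, b);
}

const a = [1, 0, 0];
const parallel = [2, 0, 0];    // same direction: similarity 1, distance 0
const orthogonal = [0, 3, 0];  // orthogonal: similarity 0, distance 1
```

Ordering by distance ascending (as the SQL does) is therefore the same as ordering by similarity descending.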
Register the search_documents MCP tool
Create the main MCP tool that clients will call. The search_documents tool takes a natural language query, an optional number of results, and an optional source filter. It embeds the query, searches the vector store, and returns the top results formatted as text with source citations and similarity scores. The tool description is critical — it tells the AI what the tool does and when to use it.
```typescript
// src/index.ts
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";
import { embedQuery } from "./embeddings.js";
import { searchSimilar, initializeStore } from "./vector-store.js";

const server = new McpServer({
  name: "rag-document-server",
  version: "1.0.0",
});

server.tool(
  "search_documents",
  "Search the knowledge base using natural language. Returns the most relevant document chunks with source citations and similarity scores. Use this to answer questions about the indexed documents.",
  {
    query: z.string().describe("Natural language search query"),
    topK: z.number().min(1).max(20).default(5).describe("Number of results to return"),
    source: z.string().optional().describe("Filter by source filename"),
  },
  async ({ query, topK, source }) => {
    try {
      const queryEmbedding = await embedQuery(query);
      const results = await searchSimilar(queryEmbedding, topK, source);

      if (results.length === 0) {
        return { content: [{ type: "text", text: "No relevant documents found for this query." }] };
      }

      const formatted = results
        .map((r, i) => `[${i + 1}] (score: ${r.score.toFixed(3)}) [Source: ${r.source}]\n${r.text}`)
        .join("\n\n---\n\n");

      return { content: [{ type: "text", text: formatted }] };
    } catch (error) {
      return {
        content: [{ type: "text", text: `Error: ${error instanceof Error ? error.message : String(error)}` }],
        isError: true,
      };
    }
  }
);

// Register any additional tools above this point, then initialize
// storage and connect the stdio transport so the server actually starts:
async function main() {
  await initializeStore();
  const transport = new StdioServerTransport();
  await server.connect(transport);
  console.error("RAG MCP server running");
}
main().catch((e) => { console.error(e); process.exit(1); });
```
Expected result: A search_documents tool that accepts natural language queries and returns relevant document chunks.
Add a document ingestion tool for indexing new content
Expose an ingest_document tool that accepts document text and source metadata, chunks it, generates embeddings, and stores the vectors. This lets the AI (or an admin) add new documents to the knowledge base without restarting the server. Include a count of chunks created so the caller knows the operation succeeded.
```typescript
// src/index.ts (continued) — also add these imports at the top of the file:
// import { chunkText, embedTexts } from "./embeddings.js";
// import { insertChunks } from "./vector-store.js";

server.tool(
  "ingest_document",
  "Add a document to the knowledge base for future searches. Chunks the text, generates embeddings, and stores vectors.",
  {
    text: z.string().describe("Full document text to index"),
    source: z.string().describe("Source identifier (e.g., filename or URL)"),
    chunkSize: z.number().default(800).describe("Characters per chunk"),
  },
  async ({ text, source, chunkSize }) => {
    try {
      const chunks = chunkText(text, source, chunkSize);
      const texts = chunks.map(c => c.text);
      const embeddings = await embedTexts(texts);

      const chunksWithEmbeddings = chunks.map((c, i) => ({
        ...c,
        embedding: embeddings[i],
      }));

      await insertChunks(chunksWithEmbeddings);

      return {
        content: [{ type: "text", text: `Ingested "${source}": ${chunks.length} chunks indexed.` }],
      };
    } catch (error) {
      return {
        content: [{ type: "text", text: `Error: ${error instanceof Error ? error.message : String(error)}` }],
        isError: true,
      };
    }
  }
);
```
Expected result: An ingest_document tool that chunks, embeds, and stores documents in the vector database.
Configure Claude Desktop or Cursor to use the RAG server
Add the server to your MCP client configuration. For Claude Desktop, edit claude_desktop_config.json; for Cursor, edit .cursor/mcp.json. Set the OPENAI_API_KEY and DATABASE_URL environment variables so the server can connect to the embedding API and vector database at startup.
The file lives at ~/Library/Application Support/Claude/claude_desktop_config.json for Claude Desktop, or .cursor/mcp.json for Cursor:

```json
{
  "mcpServers": {
    "rag-documents": {
      "command": "node",
      "args": ["dist/index.js"],
      "env": {
        "OPENAI_API_KEY": "sk-...",
        "DATABASE_URL": "postgresql://user:pass@localhost:5432/ragdb"
      }
    }
  }
}
```
Expected result: The RAG server appears in your MCP client's tool list and responds to search queries with relevant document chunks.
Complete working example
```typescript
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";
import OpenAI from "openai";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// --- Chunking ---
function chunkText(text: string, source: string, size = 800, overlap = 200) {
  const chunks: { id: string; text: string; source: string; index: number }[] = [];
  let start = 0, i = 0;
  while (start < text.length) {
    chunks.push({ id: `${source}-${i}`, text: text.slice(start, start + size), source, index: i });
    start += size - overlap;
    i++;
  }
  return chunks;
}

// --- Embeddings ---
async function embed(texts: string[]): Promise<number[][]> {
  const res = await openai.embeddings.create({ model: "text-embedding-3-small", input: texts });
  return res.data.map(d => d.embedding);
}

// --- In-memory vector store (replace with pgvector/Pinecone in production) ---
const store: { id: string; text: string; source: string; embedding: number[] }[] = [];

function cosineSim(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) { dot += a[i]*b[i]; na += a[i]*a[i]; nb += b[i]*b[i]; }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function search(queryEmb: number[], topK: number, source?: string) {
  const items = source ? store.filter(s => s.source === source) : store;
  return items
    .map(item => ({ ...item, score: cosineSim(queryEmb, item.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}

// --- MCP Server ---
const server = new McpServer({ name: "rag-server", version: "1.0.0" });

server.tool("search_documents", "Search knowledge base with natural language", {
  query: z.string().describe("Natural language query"),
  topK: z.number().min(1).max(20).default(5),
  source: z.string().optional(),
}, async ({ query, topK, source }) => {
  const [qEmb] = await embed([query]);
  const results = search(qEmb, topK, source);
  if (!results.length) return { content: [{ type: "text", text: "No results found." }] };
  const text = results.map((r, i) =>
    `[${i+1}] (${r.score.toFixed(3)}) [${r.source}]\n${r.text}`
  ).join("\n---\n");
  return { content: [{ type: "text", text }] };
});

server.tool("ingest_document", "Add a document to the knowledge base", {
  text: z.string(), source: z.string(), chunkSize: z.number().default(800),
}, async ({ text, source, chunkSize }) => {
  const chunks = chunkText(text, source, chunkSize);
  const embeddings = await embed(chunks.map(c => c.text));
  chunks.forEach((c, i) => store.push({ ...c, embedding: embeddings[i] }));
  return { content: [{ type: "text", text: `Indexed ${chunks.length} chunks from ${source}` }] };
});

server.tool("list_sources", "List all indexed document sources", {}, async () => {
  const sources = [...new Set(store.map(s => s.source))];
  return { content: [{ type: "text", text: JSON.stringify(sources, null, 2) }] };
});

async function main() {
  const transport = new StdioServerTransport();
  await server.connect(transport);
  console.error("RAG MCP server running");
}
main().catch(e => { console.error(e); process.exit(1); });
```

Common mistakes when building an MCP server for RAG
Mistake: Using chunks that are too large (2000+ tokens).
Why it's a problem: Oversized chunks produce poor search precision.
How to avoid: Keep chunks between 500 and 1000 characters with 100-200 characters of overlap for the best balance of precision and context.

Mistake: Not including overlap between chunks.
Why it's a problem: Information at chunk boundaries is split across chunks and lost.
How to avoid: Use 20-25% overlap (e.g., 200 characters of overlap for 800-character chunks) to ensure boundary content is captured.

Mistake: Embedding the query with a different model than the documents.
Why it's a problem: Vectors from different models live in different spaces, so mixing models produces meaningless similarity scores.
How to avoid: Always use the same embedding model for both document indexing and query embedding.

Mistake: Not handling empty search results gracefully.
Why it's a problem: An empty array or a raw error gives the AI nothing useful to work with.
How to avoid: Return a clear message like "No relevant documents found" when no results match.
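The overlap point can be checked directly: with 200 characters of overlap, a phrase that straddles a chunk boundary still appears intact in the following chunk, while with zero overlap it is split. A small sketch using the same slicing scheme as the tutorial's chunker:

```typescript
function sliceChunks(text: string, chunkSize: number, overlap: number): string[] {
  const out: string[] = [];
  for (let start = 0; start < text.length; start += chunkSize - overlap) {
    out.push(text.slice(start, start + chunkSize));
  }
  return out;
}

// Place a 30-char marker phrase across the first chunk boundary (index 800).
const phrase = "RETRIEVAL AUGMENTED GENERATION";
const text = "a".repeat(790) + phrase + "b".repeat(1000);

// With 200-char overlap, the second chunk spans indices 600-1399
// and captures the whole phrase.
const withOverlap = sliceChunks(text, 800, 200);
const found = withOverlap.some(c => c.includes(phrase));

// With zero overlap, the phrase is split between chunk 0 (ends at 800)
// and chunk 1 (starts at 800) and appears whole in neither.
const noOverlap = sliceChunks(text, 800, 0);
const lost = noOverlap.some(c => c.includes(phrase));
```

A query about the marker phrase would only retrieve a complete match from the overlapping index.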
Best practices
- Use the same embedding model for both indexing and querying — never mix models
- Chunk documents at 500-1000 characters with 20% overlap for optimal retrieval
- Include source citations and similarity scores in search results so the AI can assess relevance
- Filter by source metadata to narrow search scope when the user specifies a document
- Store raw text alongside vectors so you can re-embed when upgrading to a new model
- Use IVFFlat or HNSW indexes in pgvector for sub-100ms search on large collections
- Batch embedding requests to reduce API calls and latency
- Log query terms and result counts to stderr for monitoring search quality
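The batching advice above can be sketched as a small helper that groups chunk texts before calling the embeddings API (batchItems is a hypothetical name; OpenAI's embeddings endpoint accepts an array of inputs per request, so each batch is one call):

```typescript
// Group an array into batches of at most `batchSize` items, so N chunks
// need ceil(N / batchSize) embedding requests instead of N.
function batchItems<T>(items: T[], batchSize: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

// e.g. 250 chunk texts embedded 100 at a time -> 3 API calls.
const texts = Array.from({ length: 250 }, (_, i) => `chunk ${i}`);
const batches = batchItems(texts, 100);
```

Inside embedTexts you would loop over the batches, await one embeddings.create call per batch, and concatenate the returned vectors in order.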
Still stuck?
Copy one of these prompts to get a personalized, step-by-step explanation.
Build an MCP server in TypeScript that provides RAG capabilities. It should have tools to ingest documents (chunk + embed using OpenAI text-embedding-3-small) and search them using cosine similarity with pgvector. Return results with similarity scores and source citations.
Create a RAG MCP server with search_documents and ingest_document tools. Use OpenAI embeddings, pgvector for storage, and return ranked results with scores. Include Zod schemas for all inputs.
Frequently asked questions
What embedding model should I use for RAG with MCP?
OpenAI's text-embedding-3-small is the best balance of cost and quality for most use cases at $0.02 per million tokens. For higher accuracy on technical content, use text-embedding-3-large at $0.13 per million tokens.
How many documents can I index in a single MCP RAG server?
With pgvector and proper indexing (IVFFlat or HNSW), you can search millions of chunks with sub-100ms latency. The bottleneck is usually embedding generation speed, not search speed.
Should I use pgvector or Pinecone for MCP RAG?
Use pgvector if you already have PostgreSQL or want to self-host. Use Pinecone if you want a fully managed service with built-in scaling. Both work well as MCP tool backends.
How do I update documents that have already been indexed?
Delete the old chunks by source identifier, then re-ingest the updated document. Use ON CONFLICT DO UPDATE in PostgreSQL to handle upserts automatically.
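Applied to this tutorial's schema, the delete-then-re-ingest flow might look like the sketch below. deleteBySource is a hypothetical helper that builds the parameterized statement (node-postgres accepts a { text, values } query config object); the table name assumes the document_chunks table defined earlier:

```typescript
// Hypothetical helper: build the parameterized DELETE that removes all
// chunks for one source before the updated document is re-ingested.
function deleteBySource(source: string): { text: string; values: string[] } {
  return {
    text: "DELETE FROM document_chunks WHERE source = $1",
    values: [source],
  };
}

// With the pool from src/vector-store.ts, the update flow would be:
//   await pool.query(deleteBySource("handbook.md"));
//   // ...then chunk, embed, and insertChunks() the new version.
const q = deleteBySource("handbook.md");
```

Deleting by source first avoids stale chunks lingering when the updated document produces fewer chunks than the original (upserts alone would leave the extras behind).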
Can I use local embedding models instead of OpenAI?
Yes. Use Ollama with a model like nomic-embed-text for fully local embeddings. Replace the OpenAI embed function with an HTTP call to Ollama's embedding endpoint.
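A hedged sketch of that swap is below. It assumes Ollama's POST /api/embeddings route, which takes { model, prompt } and returns { embedding: number[] }; check your Ollama version's API before relying on it. Note that nomic-embed-text produces 768-dimensional vectors, so the pgvector column would need to be vector(768) and the corpus re-embedded:

```typescript
// Drop-in alternative to the OpenAI embed function, calling a local
// Ollama server instead. Endpoint and payload shape are assumptions
// based on Ollama's /api/embeddings route — verify for your version.
async function embedWithOllama(
  text: string,
  model: string = "nomic-embed-text",
  baseUrl: string = "http://localhost:11434"
): Promise<number[]> {
  const res = await fetch(`${baseUrl}/api/embeddings`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model, prompt: text }),
  });
  if (!res.ok) throw new Error(`Ollama embedding failed: ${res.status}`);
  const data = await res.json() as { embedding: number[] };
  return data.embedding;
}
```

Remember the same-model rule: once you switch embedding models, every stored vector must be regenerated with the new model.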
What chunk size works best for code documentation?
For code and technical documentation, use smaller chunks (400-600 characters) with higher overlap (30%). Code has more information density per character than prose, so smaller chunks improve precision.