
How to build an MCP server for RAG

Build an MCP server that exposes vector search as a tool, letting AI assistants perform RAG (Retrieval-Augmented Generation) on your documents. The server embeds documents using OpenAI's embedding API, stores vectors in pgvector or Pinecone, and provides a search_documents tool that returns relevant chunks with similarity scores. This gives any MCP client grounded, factual answers from your knowledge base.

What you'll learn

  • How to embed documents and store vectors for MCP-based RAG
  • How to build a search_documents MCP tool with similarity scoring
  • How to integrate pgvector or Pinecone as vector backends
  • How to chunk documents for optimal retrieval quality
  • How to combine semantic search with metadata filtering
Advanced · 11 min read · 40-60 min · MCP TypeScript SDK v1.x, Node.js 18+, OpenAI API, pgvector or Pinecone · March 2026 · RapidDev Engineering Team

Building a RAG-Powered MCP Server for Document Search

Retrieval-Augmented Generation (RAG) lets AI assistants answer questions using your actual documents instead of relying solely on training data. This tutorial builds an MCP server that ingests documents, generates embeddings, stores them in a vector database, and exposes a search_documents tool that any MCP client (Claude Desktop, Cursor, Windsurf) can call. The AI sends a natural language query, your server returns the most relevant document chunks, and the AI uses those chunks to generate grounded answers.

Prerequisites

  • Node.js 18+ and npm installed
  • An OpenAI API key for generating embeddings
  • Either PostgreSQL with pgvector extension or a Pinecone account
  • Basic understanding of vector embeddings and similarity search
  • A collection of documents (Markdown, text, or PDF) to index

Step-by-step guide

1. Set up the project and install dependencies

Create the project and install the MCP SDK, the OpenAI client for embeddings, and your chosen vector database client. The openai package generates text embeddings. Use pg with pgvector for self-hosted PostgreSQL, or @pinecone-database/pinecone for managed Pinecone. Also install zod for input validation; document chunking is handled by a small splitter function you will write in the next step.

bash
mkdir mcp-rag-server && cd mcp-rag-server
npm init -y
npm install @modelcontextprotocol/sdk zod openai
npm install @pinecone-database/pinecone # or: npm install pg pgvector
npm install -D typescript @types/node
npx tsc --init

Expected result: Project initialized with all dependencies for MCP server, embeddings, and vector storage.
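
The later steps import local modules with a .js extension and the client config in Step 6 runs dist/index.js, which implies an ES-module build compiled to dist/. One configuration that fits is sketched below — it is an assumption, not the only valid setup, so adjust it to your own tooling:

json
// tsconfig.json (relevant options only)
{
  "compilerOptions": {
    "module": "Node16",
    "moduleResolution": "Node16",
    "target": "ES2022",
    "outDir": "dist",
    "rootDir": "src",
    "strict": true
  }
}

// package.json additions
{
  "type": "module",
  "scripts": {
    "build": "tsc",
    "start": "node dist/index.js"
  }
}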

2. Build the document chunking and embedding pipeline

Documents must be split into chunks before embedding because embedding models have token limits and smaller chunks produce more precise search results. Split documents into overlapping chunks of roughly 500-1,000 characters (the code below chunks by characters for simplicity). Then generate embeddings for each chunk using OpenAI's text-embedding-3-small model, which produces 1536-dimensional vectors by default. Store the chunk text, embedding vector, and metadata (source file, position) together.

typescript
// src/embeddings.ts
import OpenAI from "openai";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

export interface DocumentChunk {
  id: string;
  text: string;
  source: string;
  chunkIndex: number;
  embedding?: number[];
}

export function chunkText(
  text: string,
  source: string,
  chunkSize: number = 800,
  overlap: number = 200
): DocumentChunk[] {
  const chunks: DocumentChunk[] = [];
  let start = 0;
  let index = 0;

  while (start < text.length) {
    const end = Math.min(start + chunkSize, text.length);
    chunks.push({
      id: `${source}-chunk-${index}`,
      text: text.slice(start, end),
      source,
      chunkIndex: index,
    });
    start += chunkSize - overlap;
    index++;
  }
  return chunks;
}

export async function embedTexts(texts: string[]): Promise<number[][]> {
  const response = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: texts,
  });
  return response.data.map(d => d.embedding);
}

export async function embedQuery(query: string): Promise<number[]> {
  const [embedding] = await embedTexts([query]);
  return embedding;
}

Expected result: Functions that chunk documents and generate embeddings using OpenAI's API.
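
Before wiring these functions into MCP, it can help to run them once against a real file. A quick sanity-check sketch is shown below — the file path is only an example, and the script assumes you run it with a TypeScript runner such as npx tsx or compile it first:

typescript
// scripts/check-embeddings.ts — quick check of chunking + embedding (example path)
import { readFileSync } from "node:fs";
import { chunkText, embedTexts } from "../src/embeddings.js";

const text = readFileSync("docs/handbook.md", "utf8"); // any Markdown or plain-text file
const chunks = chunkText(text, "handbook.md");
const embeddings = await embedTexts(chunks.map(c => c.text));
console.log(`${chunks.length} chunks, ${embeddings[0].length}-dimensional vectors`);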

3. Create the vector store integration with pgvector

Set up a PostgreSQL table with a vector column using the pgvector extension. The table stores chunk text, source metadata, and the embedding vector. Create an insert function for indexing and a search function that uses cosine distance for similarity queries. The search function takes a query embedding and returns the top-k most similar chunks with their similarity scores.

typescript
// src/vector-store.ts
import pg from "pg";

const pool = new pg.Pool({
  connectionString: process.env.DATABASE_URL,
});

export async function initializeStore(): Promise<void> {
  await pool.query(`CREATE EXTENSION IF NOT EXISTS vector`);
  await pool.query(`
    CREATE TABLE IF NOT EXISTS document_chunks (
      id TEXT PRIMARY KEY,
      text TEXT NOT NULL,
      source TEXT NOT NULL,
      chunk_index INTEGER NOT NULL,
      embedding vector(1536) NOT NULL,
      created_at TIMESTAMP DEFAULT NOW()
    )
  `);
  await pool.query(`
    CREATE INDEX IF NOT EXISTS chunks_embedding_idx
    ON document_chunks USING ivfflat (embedding vector_cosine_ops)
    WITH (lists = 100)
  `);
}

export async function insertChunks(
  chunks: { id: string; text: string; source: string; chunkIndex: number; embedding: number[] }[]
): Promise<void> {
  for (const chunk of chunks) {
    await pool.query(
      `INSERT INTO document_chunks (id, text, source, chunk_index, embedding)
       VALUES ($1, $2, $3, $4, $5)
       ON CONFLICT (id) DO UPDATE SET text = $2, embedding = $5`,
      [chunk.id, chunk.text, chunk.source, chunk.chunkIndex, JSON.stringify(chunk.embedding)]
    );
  }
}

export async function searchSimilar(
  queryEmbedding: number[],
  topK: number = 5,
  sourceFilter?: string
): Promise<{ text: string; source: string; score: number }[]> {
  const filterClause = sourceFilter ? `WHERE source = $3` : "";
  const params = sourceFilter
    ? [JSON.stringify(queryEmbedding), topK, sourceFilter]
    : [JSON.stringify(queryEmbedding), topK];

  const result = await pool.query(
    `SELECT text, source, 1 - (embedding <=> $1::vector) AS score
     FROM document_chunks ${filterClause}
     ORDER BY embedding <=> $1::vector
     LIMIT $2`,
    params
  );
  return result.rows;
}

Expected result: A vector store module that can insert document chunks and search by similarity using pgvector.
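
If you chose Pinecone instead of pgvector, the same two operations map onto upsert and query calls against an index created with dimension 1536 and the cosine metric. The sketch below is an assumption-laden outline — the index name is a placeholder, and the exact client API should be verified against the current @pinecone-database/pinecone documentation:

typescript
// src/vector-store-pinecone.ts — sketch of a Pinecone-backed store (index name is a placeholder)
import { Pinecone } from "@pinecone-database/pinecone";

const pc = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! });
const index = pc.index("rag-docs"); // created with dimension 1536, metric "cosine"

export async function insertChunks(
  chunks: { id: string; text: string; source: string; embedding: number[] }[]
): Promise<void> {
  await index.upsert(
    chunks.map(c => ({ id: c.id, values: c.embedding, metadata: { text: c.text, source: c.source } }))
  );
}

export async function searchSimilar(queryEmbedding: number[], topK = 5, sourceFilter?: string) {
  const res = await index.query({
    vector: queryEmbedding,
    topK,
    includeMetadata: true,
    ...(sourceFilter ? { filter: { source: { $eq: sourceFilter } } } : {}),
  });
  return (res.matches ?? []).map(m => ({
    text: String(m.metadata?.text ?? ""),
    source: String(m.metadata?.source ?? ""),
    score: m.score ?? 0,
  }));
}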

4. Register the search_documents MCP tool

Create the main MCP tool that clients will call. The search_documents tool takes a natural language query, an optional number of results, and an optional source filter. It embeds the query, searches the vector store, and returns the top results formatted as text with source citations and similarity scores. The tool description is critical — it tells the AI what the tool does and when to use it.

typescript
// src/index.ts
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";
import { embedQuery } from "./embeddings.js";
import { searchSimilar, initializeStore } from "./vector-store.js";

const server = new McpServer({
  name: "rag-document-server",
  version: "1.0.0",
});

server.tool(
  "search_documents",
  "Search the knowledge base using natural language. Returns the most relevant document chunks with source citations and similarity scores. Use this to answer questions about the indexed documents.",
  {
    query: z.string().describe("Natural language search query"),
    topK: z.number().min(1).max(20).default(5).describe("Number of results to return"),
    source: z.string().optional().describe("Filter by source filename"),
  },
  async ({ query, topK, source }) => {
    try {
      const queryEmbedding = await embedQuery(query);
      const results = await searchSimilar(queryEmbedding, topK, source);

      if (results.length === 0) {
        return { content: [{ type: "text", text: "No relevant documents found for this query." }] };
      }

      const formatted = results
        .map((r, i) => `[${i + 1}] (score: ${r.score.toFixed(3)}) [Source: ${r.source}]\n${r.text}`)
        .join("\n\n---\n\n");

      return { content: [{ type: "text", text: formatted }] };
    } catch (error) {
      return {
        content: [{ type: "text", text: `Error: ${error instanceof Error ? error.message : String(error)}` }],
        isError: true,
      };
    }
  }
);

Expected result: A search_documents tool that accepts natural language queries and returns relevant document chunks.

5. Add a document ingestion tool for indexing new content

Expose an ingest_document tool that accepts document text and source metadata, chunks it, generates embeddings, and stores the vectors. This lets the AI (or an admin) add new documents to the knowledge base without restarting the server. Include a count of chunks created so the caller knows the operation succeeded.

typescript
// src/index.ts (continued) — also add these to the imports at the top of the file:
// import { chunkText, embedTexts } from "./embeddings.js";
// import { insertChunks } from "./vector-store.js";

server.tool(
  "ingest_document",
  "Add a document to the knowledge base for future searches. Chunks the text, generates embeddings, and stores vectors.",
  {
    text: z.string().describe("Full document text to index"),
    source: z.string().describe("Source identifier (e.g., filename or URL)"),
    chunkSize: z.number().default(800).describe("Characters per chunk"),
  },
  async ({ text, source, chunkSize }) => {
    try {
      const chunks = chunkText(text, source, chunkSize);
      const texts = chunks.map(c => c.text);
      const embeddings = await embedTexts(texts);

      const chunksWithEmbeddings = chunks.map((c, i) => ({
        ...c,
        embedding: embeddings[i],
      }));

      await insertChunks(chunksWithEmbeddings);

      return {
        content: [{ type: "text", text: `Ingested "${source}": ${chunks.length} chunks indexed.` }],
      };
    } catch (error) {
      return {
        content: [{ type: "text", text: `Error: ${error instanceof Error ? error.message : String(error)}` }],
        isError: true,
      };
    }
  }
);

Expected result: An ingest_document tool that chunks, embeds, and stores documents in the vector database.
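
One detail not shown above: the pgvector version of src/index.ts still needs to create the table on startup and connect the stdio transport before any client can reach the tools. A minimal bootstrap for the bottom of the file, mirroring the complete example further down:

typescript
// End of src/index.ts: create the table/index if needed, then serve over stdio
async function main() {
  await initializeStore();
  const transport = new StdioServerTransport();
  await server.connect(transport);
  console.error("RAG MCP server running"); // log to stderr so stdout stays free for MCP messages
}

main().catch(err => {
  console.error(err);
  process.exit(1);
});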

6. Configure Claude Desktop or Cursor to use the RAG server

Add the server to your MCP client configuration. For Claude Desktop, edit claude_desktop_config.json. For Cursor, edit .cursor/mcp.json. Set the OPENAI_API_KEY and DATABASE_URL environment variables so the server can connect to the embedding API and vector database at startup. Teams building complex RAG pipelines often work with RapidDev to optimize chunking strategies and embedding model selection for their specific document types.

json
// Claude Desktop: ~/Library/Application Support/Claude/claude_desktop_config.json
// Cursor: .cursor/mcp.json
{
  "mcpServers": {
    "rag-documents": {
      "command": "node",
      "args": ["dist/index.js"],
      "env": {
        "OPENAI_API_KEY": "sk-...",
        "DATABASE_URL": "postgresql://user:pass@localhost:5432/ragdb"
      }
    }
  }
}

Expected result: The RAG server appears in your MCP client's tool list and responds to search queries with relevant document chunks.

Complete working example

src/index.ts
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";
import OpenAI from "openai";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// --- Chunking ---
function chunkText(text: string, source: string, size = 800, overlap = 200) {
  const chunks: { id: string; text: string; source: string; index: number }[] = [];
  let start = 0, i = 0;
  while (start < text.length) {
    chunks.push({ id: `${source}-${i}`, text: text.slice(start, start + size), source, index: i });
    start += size - overlap;
    i++;
  }
  return chunks;
}

// --- Embeddings ---
async function embed(texts: string[]): Promise<number[][]> {
  const res = await openai.embeddings.create({ model: "text-embedding-3-small", input: texts });
  return res.data.map(d => d.embedding);
}

// --- In-memory vector store (replace with pgvector/Pinecone in production) ---
const store: { id: string; text: string; source: string; embedding: number[] }[] = [];

function cosineSim(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) { dot += a[i]*b[i]; na += a[i]*a[i]; nb += b[i]*b[i]; }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function search(queryEmb: number[], topK: number, source?: string) {
  let items = source ? store.filter(s => s.source === source) : store;
  return items
    .map(item => ({ ...item, score: cosineSim(queryEmb, item.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}

// --- MCP Server ---
const server = new McpServer({ name: "rag-server", version: "1.0.0" });

server.tool("search_documents", "Search knowledge base with natural language", {
  query: z.string().describe("Natural language query"),
  topK: z.number().min(1).max(20).default(5),
  source: z.string().optional(),
}, async ({ query, topK, source }) => {
  const [qEmb] = await embed([query]);
  const results = search(qEmb, topK, source);
  if (!results.length) return { content: [{ type: "text", text: "No results found." }] };
  const text = results.map((r, i) =>
    `[${i + 1}] (${r.score.toFixed(3)}) [${r.source}]\n${r.text}`
  ).join("\n---\n");
  return { content: [{ type: "text", text }] };
});

server.tool("ingest_document", "Add a document to the knowledge base", {
  text: z.string(), source: z.string(), chunkSize: z.number().default(800),
}, async ({ text, source, chunkSize }) => {
  const chunks = chunkText(text, source, chunkSize);
  const embeddings = await embed(chunks.map(c => c.text));
  chunks.forEach((c, i) => store.push({ ...c, embedding: embeddings[i] }));
  return { content: [{ type: "text", text: `Indexed ${chunks.length} chunks from ${source}` }] };
});

server.tool("list_sources", "List all indexed document sources", {}, async () => {
  const sources = [...new Set(store.map(s => s.source))];
  return { content: [{ type: "text", text: JSON.stringify(sources, null, 2) }] };
});

async function main() {
  const transport = new StdioServerTransport();
  await server.connect(transport);
  console.error("RAG MCP server running");
}
main().catch(e => { console.error(e); process.exit(1); });

Common mistakes when building an MCP server for RAG

Mistake: Using chunks that are too large (2000+ tokens), which hurts search precision.

How to avoid: Keep chunks between 500 and 1,000 characters with 100-200 characters of overlap for the best balance of precision and context.

Mistake: Not including overlap between chunks, so information at chunk boundaries is lost.

How to avoid: Use 20-25% overlap (e.g., 200 characters of overlap for 800-character chunks) to ensure boundary content is captured.

Mistake: Embedding the query with a different model than the documents.

How to avoid: Always use the same embedding model for both document indexing and query embedding. Mixing models produces meaningless similarity scores.

Mistake: Not handling empty search results gracefully.

How to avoid: Return a clear message like "No relevant documents found" instead of an empty array or an error when nothing matches.

Best practices

  • Use the same embedding model for both indexing and querying — never mix models
  • Chunk documents at 500-1000 characters with 20% overlap for optimal retrieval
  • Include source citations and similarity scores in search results so the AI can assess relevance
  • Filter by source metadata to narrow search scope when the user specifies a document
  • Store raw text alongside vectors so you can re-embed when upgrading to a new model
  • Use IVFFlat or HNSW indexes in pgvector for sub-100ms search on large collections
  • Batch embedding requests to reduce API calls and latency (see the sketch after this list)
  • Log query terms and result counts to stderr for monitoring search quality
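
Expanding on the batching recommendation above, a thin wrapper around the embedTexts function from Step 2 keeps each request to a bounded number of inputs. The batch size of 100 is an arbitrary conservative choice; the embeddings API enforces its own per-request limits, so treat this as a sketch:

typescript
// Sketch: embed many chunks in fixed-size batches instead of one request per chunk
export async function embedTextsBatched(texts: string[], batchSize = 100): Promise<number[][]> {
  const all: number[][] = [];
  for (let i = 0; i < texts.length; i += batchSize) {
    const batch = texts.slice(i, i + batchSize);
    all.push(...(await embedTexts(batch))); // embedTexts (Step 2) already sends one array per request
  }
  return all;
}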

Still stuck?

Copy one of these prompts to get a personalized, step-by-step explanation.

ChatGPT Prompt

Build an MCP server in TypeScript that provides RAG capabilities. It should have tools to ingest documents (chunk + embed using OpenAI text-embedding-3-small) and search them using cosine similarity with pgvector. Return results with similarity scores and source citations.

MCP Prompt

Create a RAG MCP server with search_documents and ingest_document tools. Use OpenAI embeddings, pgvector for storage, and return ranked results with scores. Include Zod schemas for all inputs.

Frequently asked questions

What embedding model should I use for RAG with MCP?

OpenAI's text-embedding-3-small is the best balance of cost and quality for most use cases at $0.02 per million tokens. For higher accuracy on technical content, use text-embedding-3-large at $0.13 per million tokens.

How many documents can I index in a single MCP RAG server?

With pgvector and proper indexing (IVFFlat or HNSW), you can search millions of chunks with sub-100ms latency. The bottleneck is usually embedding generation speed, not search speed.
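
If you do grow into millions of chunks, an HNSW index (available in pgvector 0.5.0 and later) typically gives better recall than IVFFlat at comparable query speed. A sketch of the change inside initializeStore from Step 3 — the m and ef_construction values shown are pgvector's defaults and only a starting point:

typescript
// In initializeStore(): HNSW instead of IVFFlat (requires pgvector >= 0.5.0)
await pool.query(`
  CREATE INDEX IF NOT EXISTS chunks_embedding_hnsw_idx
  ON document_chunks USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64)
`);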

Should I use pgvector or Pinecone for MCP RAG?

Use pgvector if you already have PostgreSQL or want to self-host. Use Pinecone if you want a fully managed service with built-in scaling. Both work well as MCP tool backends.

How do I update documents that have already been indexed?

Delete the old chunks by source identifier, then re-ingest the updated document. Use ON CONFLICT DO UPDATE in PostgreSQL to handle upserts automatically.
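
As a sketch, the delete-then-reingest flow only needs one extra helper in the pgvector store from Step 3; the function name is illustrative, and you could also expose it as a delete_document MCP tool:

typescript
// src/vector-store.ts — remove all chunks for a source before re-ingesting it
export async function deleteBySource(source: string): Promise<number> {
  const result = await pool.query(
    `DELETE FROM document_chunks WHERE source = $1`,
    [source]
  );
  return result.rowCount ?? 0; // number of chunks removed
}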

Can I use local embedding models instead of OpenAI?

Yes. Use Ollama with a model like nomic-embed-text for fully local embeddings. Replace the OpenAI embed function with an HTTP call to Ollama's embedding endpoint.
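
As a rough sketch, a drop-in replacement for embedTexts from Step 2 could call Ollama's embedding endpoint one text at a time; the endpoint and field names below should be verified against your Ollama version's documentation. Note that nomic-embed-text returns 768-dimensional vectors, so the vector(1536) column and index from Step 3 would need to be recreated with the matching dimension:

typescript
// src/embeddings-ollama.ts — local embeddings via Ollama (verify endpoint shape for your version)
export async function embedTexts(texts: string[]): Promise<number[][]> {
  const results: number[][] = [];
  for (const text of texts) {
    const res = await fetch("http://localhost:11434/api/embeddings", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ model: "nomic-embed-text", prompt: text }),
    });
    const data = (await res.json()) as { embedding: number[] };
    results.push(data.embedding);
  }
  return results;
}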

What chunk size works best for code documentation?

For code and technical documentation, use smaller chunks (400-600 characters) with higher overlap (30%). Code has more information density per character than prose, so smaller chunks improve precision.
