Building a RAG-Powered MCP Server for Document Search

Build an MCP server that exposes vector search as a tool, letting AI assistants perform Retrieval-Augmented Generation (RAG) on your documents. The server embeds documents using OpenAI's embedding API, stores vectors in pgvector or Pinecone, and provides a search_documents tool that returns relevant chunks with similarity scores. This gives any MCP client grounded, factual answers from your knowledge base.
Retrieval-Augmented Generation (RAG) lets AI assistants answer questions using your actual documents instead of relying solely on training data. This tutorial builds an MCP server that ingests documents, generates embeddings, stores them in a vector database, and exposes a search_documents tool that any MCP client (Claude Desktop, Cursor, Windsurf) can call. The AI sends a natural language query, your server returns the most relevant document chunks, and the AI uses those chunks to generate grounded answers.
Prerequisites
- Node.js 18+ and npm installed
- An OpenAI API key for generating embeddings
- Either PostgreSQL with pgvector extension or a Pinecone account
- Basic understanding of vector embeddings and similarity search
- A collection of documents (Markdown, text, or PDF) to index
Step-by-step guide
Set up the project and install dependencies
Create the project and install the MCP SDK, OpenAI client for embeddings, and your chosen vector database client. The openai package generates text embeddings. Use pg with pgvector for self-hosted PostgreSQL, or @pinecone-database/pinecone for managed Pinecone. Also install zod for input validation and a text splitter for chunking documents.
```shell
mkdir mcp-rag-server && cd mcp-rag-server
npm init -y
npm install @modelcontextprotocol/sdk zod openai
npm install @pinecone-database/pinecone   # or: npm install pg pgvector
npm install -D typescript @types/node
npx tsc --init
```
Expected result: Project initialized with all dependencies for MCP server, embeddings, and vector storage.
Build the document chunking and embedding pipeline
Documents must be split into chunks before embedding because embedding models have token limits and smaller chunks produce more precise search results. Split documents into overlapping chunks of roughly 500-1000 tokens; the code below approximates this with character counts (800-character chunks with 200 characters of overlap). Then generate embeddings for each chunk using OpenAI's text-embedding-3-small model, which produces 1536-dimensional vectors. Store the chunk text, embedding vector, and metadata (source file, position) together.
```typescript
// src/embeddings.ts
import OpenAI from "openai";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

export interface DocumentChunk {
  id: string;
  text: string;
  source: string;
  chunkIndex: number;
  embedding?: number[];
}

export function chunkText(
  text: string,
  source: string,
  chunkSize: number = 800,
  overlap: number = 200
): DocumentChunk[] {
  const chunks: DocumentChunk[] = [];
  let start = 0;
  let index = 0;

  while (start < text.length) {
    const end = Math.min(start + chunkSize, text.length);
    chunks.push({
      id: `${source}-chunk-${index}`,
      text: text.slice(start, end),
      source,
      chunkIndex: index,
    });
    start += chunkSize - overlap;
    index++;
  }
  return chunks;
}

export async function embedTexts(texts: string[]): Promise<number[][]> {
  const response = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: texts,
  });
  return response.data.map(d => d.embedding);
}

export async function embedQuery(query: string): Promise<number[]> {
  const [embedding] = await embedTexts([query]);
  return embedding;
}
```
Expected result: Functions that chunk documents and generate embeddings using OpenAI's API.
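As a sanity check on the chunking logic, note that each chunk starts chunkSize − overlap characters after the previous one, so the chunk count follows directly from the step size. A minimal standalone sketch (it re-implements the same slicing as chunkText above, without the metadata):

```typescript
// Standalone sketch of chunkText's slicing: each chunk starts
// (chunkSize - overlap) characters after the previous one.
function sliceChunks(text: string, chunkSize: number, overlap: number): string[] {
  const out: string[] = [];
  for (let start = 0; start < text.length; start += chunkSize - overlap) {
    out.push(text.slice(start, start + chunkSize));
  }
  return out;
}

// 2000 characters with 800-char chunks and 200-char overlap:
// starts at 0, 600, 1200, 1800 -> 4 chunks, the last one shorter.
const chunks = sliceChunks("x".repeat(2000), 800, 200);
```

The last 200 characters of each chunk reappear at the start of the next, and a text of length N yields ceil(N / (chunkSize − overlap)) chunks.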
Create the vector store integration with pgvector
Set up a PostgreSQL table with a vector column using the pgvector extension. The table stores chunk text, source metadata, and the embedding vector. Create an insert function for indexing and a search function that uses cosine distance for similarity queries. The search function takes a query embedding and returns the top-k most similar chunks with their similarity scores.
```typescript
// src/vector-store.ts
import pg from "pg";

const pool = new pg.Pool({
  connectionString: process.env.DATABASE_URL,
});

export async function initializeStore(): Promise<void> {
  await pool.query(`CREATE EXTENSION IF NOT EXISTS vector`);
  await pool.query(`
    CREATE TABLE IF NOT EXISTS document_chunks (
      id TEXT PRIMARY KEY,
      text TEXT NOT NULL,
      source TEXT NOT NULL,
      chunk_index INTEGER NOT NULL,
      embedding vector(1536) NOT NULL,
      created_at TIMESTAMP DEFAULT NOW()
    )
  `);
  await pool.query(`
    CREATE INDEX IF NOT EXISTS chunks_embedding_idx
    ON document_chunks USING ivfflat (embedding vector_cosine_ops)
    WITH (lists = 100)
  `);
}

export async function insertChunks(
  chunks: { id: string; text: string; source: string; chunkIndex: number; embedding: number[] }[]
): Promise<void> {
  for (const chunk of chunks) {
    await pool.query(
      `INSERT INTO document_chunks (id, text, source, chunk_index, embedding)
       VALUES ($1, $2, $3, $4, $5)
       ON CONFLICT (id) DO UPDATE SET text = $2, embedding = $5`,
      [chunk.id, chunk.text, chunk.source, chunk.chunkIndex, JSON.stringify(chunk.embedding)]
    );
  }
}

export async function searchSimilar(
  queryEmbedding: number[],
  topK: number = 5,
  sourceFilter?: string
): Promise<{ text: string; source: string; score: number }[]> {
  const filterClause = sourceFilter ? `WHERE source = $3` : "";
  const params = sourceFilter
    ? [JSON.stringify(queryEmbedding), topK, sourceFilter]
    : [JSON.stringify(queryEmbedding), topK];

  const result = await pool.query(
    `SELECT text, source, 1 - (embedding <=> $1::vector) AS score
     FROM document_chunks ${filterClause}
     ORDER BY embedding <=> $1::vector
     LIMIT $2`,
    params
  );
  return result.rows;
}
```
Expected result: A vector store module that can insert document chunks and search by similarity using pgvector.
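The score column in searchSimilar works because pgvector's `<=>` operator returns cosine *distance*, which is 1 minus cosine similarity. A plain-TypeScript sketch of that relationship (helper names here are illustrative, not part of pgvector):

```typescript
// Cosine similarity: dot(a, b) / (|a| * |b|), in [-1, 1].
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// pgvector's <=> returns cosine distance = 1 - cosine similarity,
// which is why the SQL selects 1 - (embedding <=> $1::vector) AS score.
function cosineDistance(a: number[], b: number[]): number {
  return 1 - cosineSimilarity(a, b);
}

const a = [1, 0, 0];
const parallel = [2, 0, 0];    // same direction: similarity 1, distance 0
const orthogonal = [0, 3, 0];  // orthogonal: similarity 0, distance 1
```

Ordering by distance ascending (as the SQL does) is therefore the same as ordering by similarity descending.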
Register the search_documents MCP tool
Create the main MCP tool that clients will call. The search_documents tool takes a natural language query, an optional number of results, and an optional source filter. It embeds the query, searches the vector store, and returns the top results formatted as text with source citations and similarity scores. The tool description is critical — it tells the AI what the tool does and when to use it.
```typescript
// src/index.ts
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";
import { embedQuery } from "./embeddings.js";
import { searchSimilar, initializeStore } from "./vector-store.js";

const server = new McpServer({
  name: "rag-document-server",
  version: "1.0.0",
});

server.tool(
  "search_documents",
  "Search the knowledge base using natural language. Returns the most relevant document chunks with source citations and similarity scores. Use this to answer questions about the indexed documents.",
  {
    query: z.string().describe("Natural language search query"),
    topK: z.number().min(1).max(20).default(5).describe("Number of results to return"),
    source: z.string().optional().describe("Filter by source filename"),
  },
  async ({ query, topK, source }) => {
    try {
      const queryEmbedding = await embedQuery(query);
      const results = await searchSimilar(queryEmbedding, topK, source);

      if (results.length === 0) {
        return { content: [{ type: "text", text: "No relevant documents found for this query." }] };
      }

      const formatted = results
        .map((r, i) => `[${i + 1}] (score: ${r.score.toFixed(3)}) [Source: ${r.source}]\n${r.text}`)
        .join("\n\n---\n\n");

      return { content: [{ type: "text", text: formatted }] };
    } catch (error) {
      return {
        content: [{ type: "text", text: `Error: ${error instanceof Error ? error.message : String(error)}` }],
        isError: true,
      };
    }
  }
);

// Register any additional tools above this point, then initialize
// storage and connect the stdio transport so the server actually starts:
async function main() {
  await initializeStore();
  const transport = new StdioServerTransport();
  await server.connect(transport);
  console.error("RAG MCP server running");
}
main().catch((e) => { console.error(e); process.exit(1); });
```
Expected result: A search_documents tool that accepts natural language queries and returns relevant document chunks.
Add a document ingestion tool for indexing new content
Expose an ingest_document tool that accepts document text and source metadata, chunks it, generates embeddings, and stores the vectors. This lets the AI (or an admin) add new documents to the knowledge base without restarting the server. Include a count of chunks created so the caller knows the operation succeeded.
```typescript
// src/index.ts (continued) — also add these imports at the top of the file:
// import { chunkText, embedTexts } from "./embeddings.js";
// import { insertChunks } from "./vector-store.js";

server.tool(
  "ingest_document",
  "Add a document to the knowledge base for future searches. Chunks the text, generates embeddings, and stores vectors.",
  {
    text: z.string().describe("Full document text to index"),
    source: z.string().describe("Source identifier (e.g., filename or URL)"),
    chunkSize: z.number().default(800).describe("Characters per chunk"),
  },
  async ({ text, source, chunkSize }) => {
    try {
      const chunks = chunkText(text, source, chunkSize);
      const texts = chunks.map(c => c.text);
      const embeddings = await embedTexts(texts);

      const chunksWithEmbeddings = chunks.map((c, i) => ({
        ...c,
        embedding: embeddings[i],
      }));

      await insertChunks(chunksWithEmbeddings);

      return {
        content: [{ type: "text", text: `Ingested "${source}": ${chunks.length} chunks indexed.` }],
      };
    } catch (error) {
      return {
        content: [{ type: "text", text: `Error: ${error instanceof Error ? error.message : String(error)}` }],
        isError: true,
      };
    }
  }
);
```
Expected result: An ingest_document tool that chunks, embeds, and stores documents in the vector database.
Configure Claude Desktop or Cursor to use the RAG server
Add the server to your MCP client configuration. For Claude Desktop, edit claude_desktop_config.json; for Cursor, edit .cursor/mcp.json. Set the OPENAI_API_KEY and DATABASE_URL environment variables so the server can connect to the embedding API and vector database at startup.
The file lives at ~/Library/Application Support/Claude/claude_desktop_config.json for Claude Desktop, or .cursor/mcp.json for Cursor:

```json
{
  "mcpServers": {
    "rag-documents": {
      "command": "node",
      "args": ["dist/index.js"],
      "env": {
        "OPENAI_API_KEY": "sk-...",
        "DATABASE_URL": "postgresql://user:pass@localhost:5432/ragdb"
      }
    }
  }
}
```
Expected result: The RAG server appears in your MCP client's tool list and responds to search queries with relevant document chunks.
Complete working example
```typescript
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";
import OpenAI from "openai";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// --- Chunking ---
function chunkText(text: string, source: string, size = 800, overlap = 200) {
  const chunks: { id: string; text: string; source: string; index: number }[] = [];
  let start = 0, i = 0;
  while (start < text.length) {
    chunks.push({ id: `${source}-${i}`, text: text.slice(start, start + size), source, index: i });
    start += size - overlap;
    i++;
  }
  return chunks;
}

// --- Embeddings ---
async function embed(texts: string[]): Promise<number[][]> {
  const res = await openai.embeddings.create({ model: "text-embedding-3-small", input: texts });
  return res.data.map(d => d.embedding);
}

// --- In-memory vector store (replace with pgvector/Pinecone in production) ---
const store: { id: string; text: string; source: string; embedding: number[] }[] = [];

function cosineSim(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) { dot += a[i]*b[i]; na += a[i]*a[i]; nb += b[i]*b[i]; }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function search(queryEmb: number[], topK: number, source?: string) {
  const items = source ? store.filter(s => s.source === source) : store;
  return items
    .map(item => ({ ...item, score: cosineSim(queryEmb, item.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}

// --- MCP Server ---
const server = new McpServer({ name: "rag-server", version: "1.0.0" });

server.tool("search_documents", "Search knowledge base with natural language", {
  query: z.string().describe("Natural language query"),
  topK: z.number().min(1).max(20).default(5),
  source: z.string().optional(),
}, async ({ query, topK, source }) => {
  const [qEmb] = await embed([query]);
  const results = search(qEmb, topK, source);
  if (!results.length) return { content: [{ type: "text", text: "No results found." }] };
  const text = results.map((r, i) =>
    `[${i+1}] (${r.score.toFixed(3)}) [${r.source}]\n${r.text}`
  ).join("\n---\n");
  return { content: [{ type: "text", text }] };
});

server.tool("ingest_document", "Add a document to the knowledge base", {
  text: z.string(), source: z.string(), chunkSize: z.number().default(800),
}, async ({ text, source, chunkSize }) => {
  const chunks = chunkText(text, source, chunkSize);
  const embeddings = await embed(chunks.map(c => c.text));
  chunks.forEach((c, i) => store.push({ ...c, embedding: embeddings[i] }));
  return { content: [{ type: "text", text: `Indexed ${chunks.length} chunks from ${source}` }] };
});

server.tool("list_sources", "List all indexed document sources", {}, async () => {
  const sources = [...new Set(store.map(s => s.source))];
  return { content: [{ type: "text", text: JSON.stringify(sources, null, 2) }] };
});

async function main() {
  const transport = new StdioServerTransport();
  await server.connect(transport);
  console.error("RAG MCP server running");
}
main().catch(e => { console.error(e); process.exit(1); });
```

Common mistakes when building an MCP server for RAG
Mistake: Using chunks that are too large (2000+ tokens).
Why it's a problem: Oversized chunks produce poor search precision.
How to avoid: Keep chunks between 500 and 1000 characters with 100-200 characters of overlap for the best balance of precision and context.

Mistake: Not including overlap between chunks.
Why it's a problem: Information at chunk boundaries is split across chunks and lost.
How to avoid: Use 20-25% overlap (e.g., 200 characters of overlap for 800-character chunks) to ensure boundary content is captured.

Mistake: Embedding the query with a different model than the documents.
Why it's a problem: Vectors from different models live in different spaces, so mixing models produces meaningless similarity scores.
How to avoid: Always use the same embedding model for both document indexing and query embedding.

Mistake: Not handling empty search results gracefully.
Why it's a problem: An empty array or a raw error gives the AI nothing useful to work with.
How to avoid: Return a clear message like "No relevant documents found" when no results match.
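The overlap point can be checked directly: with 200 characters of overlap, a phrase that straddles a chunk boundary still appears intact in the following chunk, while with zero overlap it is split. A small sketch using the same slicing scheme as the tutorial's chunker:

```typescript
function sliceChunks(text: string, chunkSize: number, overlap: number): string[] {
  const out: string[] = [];
  for (let start = 0; start < text.length; start += chunkSize - overlap) {
    out.push(text.slice(start, start + chunkSize));
  }
  return out;
}

// Place a 30-char marker phrase across the first chunk boundary (index 800).
const phrase = "RETRIEVAL AUGMENTED GENERATION";
const text = "a".repeat(790) + phrase + "b".repeat(1000);

// With 200-char overlap, the second chunk spans indices 600-1399
// and captures the whole phrase.
const withOverlap = sliceChunks(text, 800, 200);
const found = withOverlap.some(c => c.includes(phrase));

// With zero overlap, the phrase is split between chunk 0 (ends at 800)
// and chunk 1 (starts at 800) and appears whole in neither.
const noOverlap = sliceChunks(text, 800, 0);
const lost = noOverlap.some(c => c.includes(phrase));
```

A query about the marker phrase would only retrieve a complete match from the overlapping index.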
Best practices
- Use the same embedding model for both indexing and querying — never mix models
- Chunk documents at 500-1000 characters with 20% overlap for optimal retrieval
- Include source citations and similarity scores in search results so the AI can assess relevance
- Filter by source metadata to narrow search scope when the user specifies a document
- Store raw text alongside vectors so you can re-embed when upgrading to a new model
- Use IVFFlat or HNSW indexes in pgvector for sub-100ms search on large collections
- Batch embedding requests to reduce API calls and latency
- Log query terms and result counts to stderr for monitoring search quality
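The batching advice above can be sketched as a small helper that groups chunk texts before calling the embeddings API (batchItems is a hypothetical name; OpenAI's embeddings endpoint accepts an array of inputs per request, so each batch is one call):

```typescript
// Group an array into batches of at most `batchSize` items, so N chunks
// need ceil(N / batchSize) embedding requests instead of N.
function batchItems<T>(items: T[], batchSize: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

// e.g. 250 chunk texts embedded 100 at a time -> 3 API calls.
const texts = Array.from({ length: 250 }, (_, i) => `chunk ${i}`);
const batches = batchItems(texts, 100);
```

Inside embedTexts you would loop over the batches, await one embeddings.create call per batch, and concatenate the returned vectors in order.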
Still stuck?
Copy one of these prompts to get a personalized, step-by-step explanation.
Build an MCP server in TypeScript that provides RAG capabilities. It should have tools to ingest documents (chunk + embed using OpenAI text-embedding-3-small) and search them using cosine similarity with pgvector. Return results with similarity scores and source citations.
Create a RAG MCP server with search_documents and ingest_document tools. Use OpenAI embeddings, pgvector for storage, and return ranked results with scores. Include Zod schemas for all inputs.
Frequently asked questions
What embedding model should I use for RAG with MCP?
OpenAI's text-embedding-3-small is the best balance of cost and quality for most use cases at $0.02 per million tokens. For higher accuracy on technical content, use text-embedding-3-large at $0.13 per million tokens.
How many documents can I index in a single MCP RAG server?
With pgvector and proper indexing (IVFFlat or HNSW), you can search millions of chunks with sub-100ms latency. The bottleneck is usually embedding generation speed, not search speed.
Should I use pgvector or Pinecone for MCP RAG?
Use pgvector if you already have PostgreSQL or want to self-host. Use Pinecone if you want a fully managed service with built-in scaling. Both work well as MCP tool backends.
How do I update documents that have already been indexed?
Delete the old chunks by source identifier, then re-ingest the updated document. Use ON CONFLICT DO UPDATE in PostgreSQL to handle upserts automatically.
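Applied to this tutorial's schema, the delete-then-re-ingest flow might look like the sketch below. deleteBySource is a hypothetical helper that builds the parameterized statement (node-postgres accepts a { text, values } query config object); the table name assumes the document_chunks table defined earlier:

```typescript
// Hypothetical helper: build the parameterized DELETE that removes all
// chunks for one source before the updated document is re-ingested.
function deleteBySource(source: string): { text: string; values: string[] } {
  return {
    text: "DELETE FROM document_chunks WHERE source = $1",
    values: [source],
  };
}

// With the pool from src/vector-store.ts, the update flow would be:
//   await pool.query(deleteBySource("handbook.md"));
//   // ...then chunk, embed, and insertChunks() the new version.
const q = deleteBySource("handbook.md");
```

Deleting by source first avoids stale chunks lingering when the updated document produces fewer chunks than the original (upserts alone would leave the extras behind).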
Can I use local embedding models instead of OpenAI?
Yes. Use Ollama with a model like nomic-embed-text for fully local embeddings. Replace the OpenAI embed function with an HTTP call to Ollama's embedding endpoint.
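A hedged sketch of that swap is below. It assumes Ollama's POST /api/embeddings route, which takes { model, prompt } and returns { embedding: number[] }; check your Ollama version's API before relying on it. Note that nomic-embed-text produces 768-dimensional vectors, so the pgvector column would need to be vector(768) and the corpus re-embedded:

```typescript
// Drop-in alternative to the OpenAI embed function, calling a local
// Ollama server instead. Endpoint and payload shape are assumptions
// based on Ollama's /api/embeddings route — verify for your version.
async function embedWithOllama(
  text: string,
  model: string = "nomic-embed-text",
  baseUrl: string = "http://localhost:11434"
): Promise<number[]> {
  const res = await fetch(`${baseUrl}/api/embeddings`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model, prompt: text }),
  });
  if (!res.ok) throw new Error(`Ollama embedding failed: ${res.status}`);
  const data = await res.json() as { embedding: number[] };
  return data.embedding;
}
```

Remember the same-model rule: once you switch embedding models, every stored vector must be regenerated with the new model.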
What chunk size works best for code documentation?
For code and technical documentation, use smaller chunks (400-600 characters) with higher overlap (30%). Code has more information density per character than prose, so smaller chunks improve precision.