Add Semantic Search to Claude Code with pmem: A Local RAG That Cuts Token Costs 75%

Install pmem, a local RAG MCP server, to give Claude Code instant semantic search over your entire project's history, slashing token usage for file retrieval.

Gala Smith & AI Research Desk · 7h ago · 4 min read · AI-Generated
Source: dev.to · Widely Reported

The Problem: Claude Code Can't See Your Project's Memory

Your project isn't just code. It's ROADMAP.md, ARCHITECTURE.md, hundreds of CLAUDE.md session logs, and task folders full of lessons learned. Claude Code can only read what you point it at, or it burns 20,000+ tokens on exploratory searches through your file system. For projects with 500+ markdown files, this token overhead makes deep historical context prohibitively expensive.

The Solution: pmem — Local RAG as an MCP Server

pmem (project memory) is an open-source tool that gives Claude Code semantic search over your entire project. It works as a Model Context Protocol (MCP) server, a technology Claude Code uses extensively to integrate external tools. The flow is simple: you ask a question in natural language, pmem finds the most relevant document chunks from your project's history, and returns them with citations.
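Conceptually, the retrieval step is just similarity search over embedded chunks. The sketch below uses a toy bag-of-words "embedding" and cosine similarity from the standard library; pmem's actual pipeline uses Ollama's nomic-embed-text vectors and ChromaDB, so treat the `embed`, `cosine`, and `query` helpers here as illustrative stand-ins, not pmem's API.

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words vector standing in for a real embedding model.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def query(question, chunks, top_k=2):
    # Rank stored chunks by similarity to the question; return with citations.
    q = embed(question)
    scored = sorted(chunks, key=lambda c: cosine(q, embed(c["text"])), reverse=True)
    return [(c["source"], c["text"]) for c in scored[:top_k]]

chunks = [
    {"source": "ARCHITECTURE.md#cdn", "text": "we chose CloudFront for edge caching"},
    {"source": "ROADMAP.md#q3", "text": "ship the onboarding flow redesign"},
    {"source": "CLAUDE.md#2024-05", "text": "rate limiting uses a token bucket per API key"},
]
print(query("why did we choose CloudFront", chunks, top_k=1))
```

The citation (the `source` field) is what lets Claude Code tell you where an answer came from rather than just asserting it.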

Key Results from the Source:

  • Query: "identify governance-related blog posts"
  • Fresh Claude Code Search: ~90 seconds, ~22,000 tokens, 11 posts found
  • pmem Search: ~20 seconds, ~5,500 tokens, 18 posts found

That's a 75% reduction in token cost and retrieval more than four times faster (90s down to 20s), with more thorough results, because semantic search finds connections that keyword matching misses.

How To Install and Use It Now

1. Install Prerequisites

You need Python 3.11+, Ollama running locally, and the nomic-embed-text embedding model.

pip install pmem-project-memory
ollama pull nomic-embed-text


2. Initialize Your Project

Navigate to your project root and run:

cd ~/your-project
pmem init
pmem index

The first index command walks through your .md and .txt files, splits them using header-aware chunking (so a section stays with its heading), and creates a local vector database using ChromaDB. No data leaves your machine.
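A minimal sketch of header-aware chunking, using only the standard library: split at markdown headings so each section travels with its title. pmem's real splitter likely adds size limits and nested-heading handling, so this is an assumption about the general technique, not its exact implementation.

```python
import re

def chunk_by_headers(markdown):
    """Split a markdown document so each chunk keeps its heading."""
    chunks, current = [], []
    for line in markdown.splitlines():
        # Start a new chunk whenever a heading line (#, ##, ...) appears.
        if re.match(r"^#{1,6}\s", line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return chunks

doc = """# Why we chose CloudFront
Edge caching cut latency by 40%.

## Alternatives considered
Fastly and plain S3."""

for c in chunk_by_headers(doc):
    print(repr(c))
```

Compare this with naive fixed-size splitting, which could cut "Why we chose CloudFront" away from its rationale and make the embedding of either half much less useful.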

3. Install the Session Skills & Configure MCP

Run the command to install convenient slash commands and get the MCP server configuration block:

pmem install-skills

Add the provided MCP config to your ~/.claude.json file for global access or to a project-specific .mcp.json. This registers pmem as a tool Claude Code can call directly.
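For orientation, an MCP server entry typically looks like the fragment below. The server name, command, and args here are illustrative assumptions; use the exact block that `pmem install-skills` prints rather than this sketch.

```json
{
  "mcpServers": {
    "pmem": {
      "command": "pmem",
      "args": ["mcp"]
    }
  }
}
```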

4. Integrate Into Your Workflow

Three slash commands automate memory management:

  • /welcome: Run at session start to refresh the index with recent changes.
  • /sleep: Run at session end to capture everything you just worked on.
  • /reindex: Force a full re-index mid-session if needed.

Once configured, Claude Code can call the memory_query tool. In a session, you can now prompt: "Use the memory tool to find all past discussions about our API rate limiting decisions."

Why This Architecture Works

The author made several key decisions that make pmem practical for daily Claude Code use:

  1. No LangChain: The RAG pipeline is ~2,000 lines of pure Python—embed, store, search, (optionally) synthesize. No heavyweight framework overhead.
  2. ChromaDB: A file-based vector database with no server process. It's persistent and lives in your project's .memory directory.
  3. Header-Aware Chunking: Unlike naive character splitting, this preserves semantic units. A section titled "Why we chose CloudFront" stays intact.
  4. Incremental Indexing: The tool tracks file hashes. Subsequent index commands only process changed files, taking under a second for most updates.
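The incremental-indexing idea in point 4 can be sketched with `hashlib` and a JSON manifest: hash each file, compare against the last recorded hash, and re-embed only what changed. The manifest format and file layout here are assumptions for illustration; pmem's on-disk bookkeeping may differ.

```python
import hashlib
import json
import pathlib

def file_hash(path):
    # Content hash, so a touched-but-unchanged file is not reindexed.
    return hashlib.sha256(path.read_bytes()).hexdigest()

def changed_files(root, manifest_path):
    """Return .md files whose content differs from the stored manifest."""
    root = pathlib.Path(root)
    mp = pathlib.Path(manifest_path)
    manifest = json.loads(mp.read_text()) if mp.exists() else {}
    changed = []
    for path in sorted(root.rglob("*.md")):
        digest = file_hash(path)
        if manifest.get(str(path)) != digest:
            changed.append(path)
            manifest[str(path)] = digest
    mp.write_text(json.dumps(manifest))
    return changed
```

On the first run everything is "changed"; afterwards, an unchanged project yields an empty list, which is why repeat `pmem index` runs can finish in under a second.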

The Prompt That Built a Better Tool

The source highlights a critical lesson for working with Claude Code: finish thinking before you start asking.

The vague prompt "I want to give agents better memory" leads to wasted time. The effective prompt specified:

  • The exact problem (token/time sink for large projects)
  • The technology (Ollama embeddings, RAG)
  • The integration point (a tool for Claude Code)
  • Constraints (local only, focus on .md/.txt first)
  • Scope (no external APIs)

This precision allowed the agent to build exactly what was needed, not a sprawling infrastructure project.

When To Use pmem vs. Claude's Built-in Search

Use pmem for targeted retrieval questions where you know the answer is in your project's history:

  • "What was the rationale for choosing library X over Y?"
  • "Find all bug post-mortems related to database timeouts."
  • "What decisions did we make about the user onboarding flow?"

Use Claude Code's native Explore agent for open-ended exploration of unfamiliar codebases or when you need reasoning across search results. pmem complements, doesn't replace, the core agent.

What's Next

Phase 2 features like pmem watch for auto-reindexing are nearly complete. The roadmap includes multi-collection support, non-markdown file parsing, and a pmem diff to track how answers evolve. The tool is MIT licensed and available on GitHub.

AI Analysis

Claude Code users managing complex, documentation-heavy projects should install `pmem` today. The setup is a 5-minute investment that pays off immediately in reduced token costs and faster context retrieval.

**Change your workflow:** Start every session with `/welcome` and end with `/sleep`. This bakes index maintenance into your ritual. When you need historical context, explicitly prompt Claude to "use the memory tool to find..." instead of letting it burn tokens on a manual file search. For large projects, this can save tens of thousands of tokens per week.

**Think about prompts differently:** The source's contrast between vague and precise prompts is the real lesson. When using Claude Code to *build* tools, emulate the successful prompt: define the problem, technology, integration, constraints, and scope *before* you write the first line. This turns the agent from a wandering brainstormer into a precise engineer.