How This Developer Built a Production-Ready RAG System with Claude Code in One Weekend

A developer used Claude Code to build a structured JSON-to-PDF knowledge base of 105 quotes, demonstrating a repeatable workflow for assembling RAG-ready datasets.

Ggentic.news Editorial · 3h ago · 4 min read
Source: github.com via hn_claude_code (single source)


The Technique — Structured Data Generation as a Claude Code Workflow

The developer behind "Gaokao Mentor Wisdom" didn't just scrape quotes; they built a complete structured knowledge base with Claude Code. The project contains 105 quotes from Chinese education expert Zhang Xuefeng, organized into six categories: 专业选择 (major selection), 就业前景 (job prospects), 人生哲理 (life philosophy), 院校推荐 (school recommendations), 学习建议 (study advice), and 志愿填报策略 (application strategy). Each quote carries bilingual text, tags, source attribution, sentiment analysis, and target-audience metadata.

What makes this noteworthy for Claude Code users is the workflow: starting from raw content (videos, articles, transcripts), they used Claude Code to extract, structure, validate, and export the data into multiple formats ready for immediate use.

Why It Works — Claude Code's Multi-File Editing and Validation

This project demonstrates Claude Code's strength at handling structured data workflows that involve:

  1. Consistent schema application across 105+ items
  2. Multi-file coordination (6 JSON files, validation scripts, export scripts)
  3. Data transformation pipelines (JSON → Markdown → PDF → training data)

Check the repository structure:

data/
├── mentors.json           # mentor profiles
├── categories.json        # the six categories
├── zhangxuefeng/          # Zhang Xuefeng quotes (105 items)
│   ├── zhuanye.json       # 🎯 major selection (28 items)
│   ├── jiuye.json         # 💼 job prospects (18 items)
│   └── ...
└── _template/             # template for new mentors

Each quote follows this exact schema:

{
  "id": "zxf-zhuanye-001",
  "text": "如果你家境一般,没有任何社会资源,不要学金融。",
  "text_en": "If your family is average with no connections, don't study finance.",
  "context": "直播中谈论金融专业就业门槛",
  "tags": ["金融", "家境", "专业选择"],
  "target_audience": ["高考生", "家长"],
  "source": { "platform": "抖音", "type": "直播" },
  "sentiment": "cautionary",
  "confidence": "attributed",
  "related_majors": ["金融学", "经济学"]
}
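Before batch-generating records in this shape, it helps to encode the required fields as a programmatic check. A minimal sketch (our own illustration, not code from the repository; the field list is read off the sample record above):

```python
# Required fields taken from the sample record above; the check
# itself is an illustrative sketch, not code from the repository.
REQUIRED_FIELDS = {
    "id", "text", "text_en", "context", "tags",
    "target_audience", "source", "sentiment",
    "confidence", "related_majors",
}

def missing_fields(quote: dict) -> set:
    """Return the required field names absent from a quote record."""
    return REQUIRED_FIELDS - quote.keys()

# A complete record passes; a partial one reports what is missing.
complete = {field: "..." for field in REQUIRED_FIELDS}
partial = {"id": "zxf-zhuanye-001", "text": "..."}

print(missing_fields(complete))            # set()
print(sorted(missing_fields(partial)))     # the eight absent fields
```

Keeping the schema in one place like this is what lets Claude Code apply it consistently across 105 items instead of drifting after the first few.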

How To Apply It — Your RAG Dataset Blueprint

Step 1: Start with a CLAUDE.md Template

Create a CLAUDE.md file in your project:

# RAG Dataset Creation Workflow

## Schema Requirements
- Each item must have: id, text, context, tags, source, sentiment
- All text fields should include English translations
- Tags should be consistent across similar items
- Source attribution must include platform and content type

## Validation Rules
- Run `python scripts/validate.py` after each batch
- Check for duplicate IDs
- Ensure all required fields are present
- Verify bilingual consistency

## Export Targets
1. Markdown documentation
2. PDF ebook
3. RAG chunks (JSONL)
4. Fine-tuning data

Step 2: Use Claude Code for Batch Processing

Instead of manually structuring each quote, prompt Claude Code:

claude "Take the raw quotes in raw_quotes.txt and structure them according to our schema. Create one JSON file per category. Add appropriate tags based on content. Generate English translations for all Chinese text."

Step 3: Build the Validation Pipeline

The project includes scripts/validate.py—a perfect task for Claude Code:

claude "Create a Python validation script that checks all JSON files in the data directory. It should verify: 1) schema compliance, 2) no duplicate IDs, 3) all required fields present, 4) tags are from approved list. Output validation report."
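The repository's actual script isn't reproduced in this article, but a minimal validator covering the first three checks from that prompt (schema compliance, duplicate IDs, required fields) might look like this; the required-field names are assumed from the schema shown earlier, and the approved-tag check is omitted:

```python
"""Sketch of a validator like scripts/validate.py (our own minimal
version; the repository's actual script may differ)."""
import json
import sys
from pathlib import Path

REQUIRED = {"id", "text", "context", "tags", "source", "sentiment"}

def validate(data_dir: str) -> list:
    """Check every *.json file under data_dir for required fields
    and globally duplicate IDs; return a list of error strings."""
    root = Path(data_dir)
    errors, seen_ids = [], set()
    files = sorted(root.rglob("*.json")) if root.is_dir() else []
    for path in files:
        items = json.loads(path.read_text(encoding="utf-8"))
        if isinstance(items, dict):          # allow single-object files
            items = [items]
        for item in items:
            missing = REQUIRED - item.keys()
            if missing:
                errors.append(f"{path}: {item.get('id', '?')} missing {sorted(missing)}")
            if item.get("id") in seen_ids:
                errors.append(f"{path}: duplicate id {item['id']}")
            seen_ids.add(item.get("id"))
    return errors

if __name__ == "__main__":
    target = sys.argv[1] if len(sys.argv) > 1 else "data"
    problems = validate(target)
    print("\n".join(problems) or "OK: all files valid")
```

Running it after every batch, as the CLAUDE.md rules above suggest, catches drift while the bad batch is still small.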

Step 4: Generate Multiple Output Formats

The real power comes from the export scripts:

  • json2md.py: Converts structured JSON to readable Markdown
  • export-training.py: Creates rag_chunks.jsonl and fine_tune.jsonl

These are exactly the kind of repetitive, schema-aware tasks where Claude Code excels.
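The export scripts themselves aren't shown in the article, but the core transformation is small. A hedged sketch of what a rag_chunks.jsonl exporter in the spirit of export-training.py might do (field names assumed from the schema above; the real script's output layout may differ):

```python
import json

def to_rag_chunk(quote: dict) -> dict:
    """Flatten one structured quote into a retrieval-ready chunk:
    the bilingual text becomes the embeddable content, the rest
    travels along as filterable metadata."""
    return {
        "id": quote["id"],
        "content": f"{quote['text']} / {quote.get('text_en', '')}".strip(" /"),
        "metadata": {
            "tags": quote.get("tags", []),
            "source": quote.get("source", {}),
            "sentiment": quote.get("sentiment"),
            "target_audience": quote.get("target_audience", []),
        },
    }

def export_jsonl(quotes, out_path: str) -> None:
    """Write one chunk per line, keeping non-ASCII text readable."""
    with open(out_path, "w", encoding="utf-8") as f:
        for q in quotes:
            f.write(json.dumps(to_rag_chunk(q), ensure_ascii=False) + "\n")
```

Because every quote already carries its metadata, the exporter is a pure reshaping step; no extra annotation pass is needed at export time.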

The Takeaway — RAG Systems Start with Quality Data

While most developers focus on the retrieval and generation parts of RAG, this project highlights the often-overlooked foundation: well-structured source data. The developer didn't just create another RAG tutorial—they built a production-ready dataset that's immediately usable for:

  • AI Chatbots: Import exports/rag_chunks.jsonl directly
  • Fine-tuning: Use exports/fine_tune.jsonl with Claude or other models
  • Documentation: Generated Markdown and PDF versions
  • Future expansion: Template system for adding more experts
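To illustrate the "import directly" claim, here is a naive, embedding-free lookup over such a JSONL file. The one-object-per-line layout with `content` and `metadata` keys is our assumption about the export format; a real chatbot would embed the `content` field rather than match keywords:

```python
import json

def load_chunks(path: str) -> list:
    """Read one JSON object per line from a rag_chunks-style file."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def search(chunks, query: str, top_k: int = 3) -> list:
    """Rank chunks by how many query words hit the content or tags.
    A stand-in for embedding retrieval, to show the data flow."""
    words = query.lower().split()

    def score(chunk):
        haystack = " ".join(
            [chunk.get("content", "")] + chunk.get("metadata", {}).get("tags", [])
        ).lower()
        return sum(w in haystack for w in words)

    ranked = sorted((c for c in chunks if score(c) > 0), key=score, reverse=True)
    return ranked[:top_k]
```

Swapping `score` for a vector similarity is the only change needed to turn this into the retrieval half of a real RAG pipeline; the data layer stays identical.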

This follows the enterprise trend toward preferring RAG over fine-tuning in production (as reported on 2026-03-24), but adds a crucial insight: RAG success depends on the quality of your source data.

Try This Weekend Project

Clone the repository and examine the structure:

git clone https://github.com/dongsheng123132/gaokao-mentor-wisdom.git
cd gaokao-mentor-wisdom
# Study the schema and scripts
cat data/zhangxuefeng/zhuanye.json | head -20
# Run the validation
python scripts/validate.py
# Generate outputs
python scripts/json2md.py
python scripts/export-training.py

Then adapt this pattern for your own domain: customer support transcripts, technical documentation, legal precedents, medical guidelines—anywhere you need structured knowledge extraction.

The key insight isn't the specific content (Chinese education advice) but the workflow: using Claude Code to transform unstructured information into RAG-ready structured data at scale.

AI Analysis

Claude Code users should immediately adopt this structured data workflow for any RAG project. Instead of starting with vector databases and retrieval logic, begin by using Claude Code to build your knowledge base with consistent schemas. **Specific workflow change:** When starting a new RAG project, first create your data schema in a CLAUDE.md file, then use Claude Code to extract and structure your source material. The validation scripts are crucial; have Claude Code write them early to ensure data quality from the start. **Command to try today:** `claude "Create a JSON schema for [your domain] quotes with fields: id, text, context, tags, source, sentiment. Then write a Python script to validate files against this schema."` This gives you the foundation that makes RAG systems actually work well. **Connect to recent trends:** This aligns with the March 24th report showing enterprise preference for RAG over fine-tuning, but adds the missing piece: RAG requires excellent source data. Use Claude Code to build that data layer systematically, not as an afterthought.