How This Developer Built a Production-Ready RAG System with Claude Code in One Weekend
The Technique — Structured Data Generation as a Claude Code Workflow
The developer behind "Gaokao Mentor Wisdom" didn't just scrape quotes; they built a complete structured knowledge base with Claude Code. The project contains 105 quotes from Chinese education expert Zhang Xuefeng, organized into six categories (major selection, job prospects, life philosophy, school recommendations, study advice, and application strategy), each with bilingual text, tags, source attribution, sentiment analysis, and target-audience metadata.
What makes this noteworthy for Claude Code users is the workflow: starting from raw content (videos, articles, transcripts), they used Claude Code to extract, structure, validate, and export the data into multiple formats ready for immediate use.
Why It Works — Claude Code's Multi-File Editing and Validation
This project demonstrates Claude Code's strength at handling structured data workflows that involve:
- Consistent schema application across 105+ items
- Multi-file coordination (6 JSON files, validation scripts, export scripts)
- Data transformation pipelines (JSON → Markdown → PDF → training data)
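To make the first transformation stage concrete, here is a minimal JSON → Markdown sketch. The function name and output layout are my own illustration, not taken from the project's scripts:

```python
def quotes_to_markdown(category_name, quotes):
    """Render one category's quote records as a Markdown section."""
    lines = [f"## {category_name}", ""]
    for q in quotes:
        # One bullet per quote, with the English translation nested beneath
        lines.append(f"- **{q['id']}**: {q['text']}")
        if q.get("text_en"):
            lines.append(f"  - EN: {q['text_en']}")
    return "\n".join(lines)
```

The same pattern (iterate records, emit one formatted block each) extends naturally to the PDF and training-data stages.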
Check the repository structure:
```
data/
├── mentors.json        # mentor profiles
├── categories.json     # the 6 main categories
├── zhangxuefeng/       # Zhang Xuefeng quotes (105 items)
│   ├── zhuanye.json    # 🎯 major selection (28 items)
│   ├── jiuye.json      # 💼 job prospects (18 items)
│   └── ...
└── _template/          # template for adding new mentors
```
Each quote follows this exact schema:
```json
{
  "id": "zxf-zhuanye-001",
  "text": "如果你家境一般,没有任何社会资源,不要学金融。",
  "text_en": "If your family is average with no connections, don't study finance.",
  "context": "直播中谈论金融专业就业门槛",
  "tags": ["金融", "家境", "专业选择"],
  "target_audience": ["高考生", "家长"],
  "source": { "platform": "抖音", "type": "直播" },
  "sentiment": "cautionary",
  "confidence": "attributed",
  "related_majors": ["金融学", "经济学"]
}
```
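Records in this shape map cleanly onto typed Python. A minimal sketch, assuming you want in-memory objects rather than raw dicts (the `Quote` dataclass is my own illustration, not part of the project):

```python
from dataclasses import dataclass, field

@dataclass
class Quote:
    """Mirrors the quote schema shown above."""
    id: str
    text: str
    text_en: str
    context: str
    tags: list
    target_audience: list
    source: dict
    sentiment: str
    confidence: str
    related_majors: list = field(default_factory=list)

# The example record from the schema above
record = {
    "id": "zxf-zhuanye-001",
    "text": "如果你家境一般,没有任何社会资源,不要学金融。",
    "text_en": "If your family is average with no connections, don't study finance.",
    "context": "直播中谈论金融专业就业门槛",
    "tags": ["金融", "家境", "专业选择"],
    "target_audience": ["高考生", "家长"],
    "source": {"platform": "抖音", "type": "直播"},
    "sentiment": "cautionary",
    "confidence": "attributed",
    "related_majors": ["金融学", "经济学"],
}
quote = Quote(**record)  # raises TypeError if a required field is absent
```

Because `Quote(**record)` fails loudly on missing keys, the dataclass doubles as a cheap first-pass schema check.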
How To Apply It — Your RAG Dataset Blueprint
Step 1: Start with a CLAUDE.md Template
Create a CLAUDE.md file in your project:
```markdown
# RAG Dataset Creation Workflow

## Schema Requirements
- Each item must have: id, text, context, tags, source, sentiment
- All text fields should include English translations
- Tags should be consistent across similar items
- Source attribution must include platform and content type

## Validation Rules
- Run `python scripts/validate.py` after each batch
- Check for duplicate IDs
- Ensure all required fields are present
- Verify bilingual consistency

## Export Targets
1. Markdown documentation
2. PDF ebook
3. RAG chunks (JSONL)
4. Fine-tuning data
```
Step 2: Use Claude Code for Batch Processing
Instead of manually structuring each quote, prompt Claude Code:
```bash
claude code "Take the raw quotes in raw_quotes.txt and structure them according to our schema. Create one JSON file per category. Add appropriate tags based on content. Generate English translations for all Chinese text."
```
Step 3: Build the Validation Pipeline
The project includes `scripts/validate.py`, a perfect task for Claude Code:
```bash
claude code "Create a Python validation script that checks all JSON files in the data directory. It should verify: 1) schema compliance, 2) no duplicate IDs, 3) all required fields present, 4) tags are from approved list. Output validation report."
```
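A sketch of what such a validator might produce. Field names follow the CLAUDE.md schema requirements above; the exact checks in the project's `scripts/validate.py` may differ:

```python
import json
from pathlib import Path

REQUIRED = {"id", "text", "context", "tags", "source", "sentiment"}

def validate_items(items):
    """Return a list of human-readable validation errors for one batch."""
    errors, seen_ids = [], set()
    for item in items:
        missing = REQUIRED - item.keys()
        if missing:
            errors.append(f"{item.get('id', '<no id>')}: missing {sorted(missing)}")
        if item.get("id") in seen_ids:
            errors.append(f"duplicate id: {item['id']}")
        seen_ids.add(item.get("id"))
    return errors

def validate_dir(data_dir="data"):
    """Validate every JSON file under data_dir and collect all errors."""
    errors = []
    for path in Path(data_dir).rglob("*.json"):
        items = json.loads(path.read_text(encoding="utf-8"))
        if isinstance(items, list):
            errors += validate_items(items)
    return errors

if __name__ == "__main__":
    for e in validate_dir():
        print("FAIL:", e)
```

Running it after each batch, as the CLAUDE.md rules prescribe, catches schema drift before it compounds across 105+ items.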
Step 4: Generate Multiple Output Formats
The real power comes from the export scripts:
- `json2md.py`: converts structured JSON to readable Markdown
- `export-training.py`: creates `rag_chunks.jsonl` and `fine_tune.jsonl`
These are exactly the kind of repetitive, schema-aware tasks where Claude Code excels.
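As an illustration of what a `rag_chunks.jsonl` exporter might do, here is a minimal sketch. The chunk shape and function names are assumptions, not the project's actual `export-training.py`:

```python
import json

def quote_to_chunk(q):
    """Flatten one structured quote into a retrieval chunk with metadata."""
    return {
        "id": q["id"],
        # Combine both languages so retrieval works for bilingual queries
        "text": f'{q["text"]} ({q.get("text_en", "")})',
        "metadata": {"tags": q.get("tags", []), "sentiment": q.get("sentiment")},
    }

def export_jsonl(quotes, out_path="rag_chunks.jsonl"):
    """Write one JSON object per line, the standard JSONL layout."""
    with open(out_path, "w", encoding="utf-8") as f:
        for q in quotes:
            f.write(json.dumps(quote_to_chunk(q), ensure_ascii=False) + "\n")
```

Keeping tags and sentiment in a `metadata` field lets a vector store filter chunks before similarity search.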
The Takeaway — RAG Systems Start with Quality Data
While most developers focus on the retrieval and generation parts of RAG, this project highlights the often-overlooked foundation: well-structured source data. The developer didn't just create another RAG tutorial—they built a production-ready dataset that's immediately usable for:
- AI chatbots: import `exports/rag_chunks.jsonl` directly
- Fine-tuning: use `exports/fine_tune.jsonl` with Claude or other models
- Documentation: generated Markdown and PDF versions
- Future expansion: template system for adding more experts
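Once exported, the chunks file is trivial to consume. A naive keyword-overlap retriever sketch (a production system would use embeddings; `load_chunks` and `retrieve` are illustrative names, not the project's API):

```python
import json

def load_chunks(path="exports/rag_chunks.jsonl"):
    """Read one chunk dict per line from a JSONL file."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def retrieve(chunks, query, k=3):
    """Rank chunks by how many query words appear in their text."""
    words = set(query.lower().split())
    def score(chunk):
        return sum(w in chunk["text"].lower() for w in words)
    return sorted(chunks, key=score, reverse=True)[:k]
```

The top-k chunks would then be pasted into the model's context as grounding for the chatbot's answer.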
This matches the enterprise trend we've covered, where RAG is preferred over fine-tuning for production AI systems (as reported on 2026-03-24), but with a crucial insight: RAG success depends entirely on the quality of your source data.
Try This Weekend Project
Clone the repository and examine the structure:
```bash
git clone https://github.com/dongsheng123132/gaokao-mentor-wisdom.git
cd gaokao-mentor-wisdom

# Study the schema and scripts
head -20 data/zhangxuefeng/zhuanye.json

# Run the validation
python scripts/validate.py

# Generate outputs
python scripts/json2md.py
python scripts/export-training.py
```
Then adapt this pattern for your own domain: customer support transcripts, technical documentation, legal precedents, medical guidelines—anywhere you need structured knowledge extraction.
The key insight isn't the specific content (Chinese education advice) but the workflow: using Claude Code to transform unstructured information into RAG-ready structured data at scale.



