A developer has released an open-source project that enables Anthropic's Claude Code to run entirely locally on Apple Silicon Macs, completely bypassing API fees and cloud dependencies. The breakthrough comes from replacing the conventional proxy-based approach with a native 200-line server that speaks Anthropic's API protocol directly.
What's New: Native API Server Eliminates Proxy Bottleneck

The key innovation is architectural: rather than placing a translation proxy between Claude Code's interface and the local model (the standard approach, which adds latency), this implementation uses a minimal server that speaks Anthropic's API schema natively. That eliminates the translation layer that typically adds significant overhead.
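For context, a Messages-style request in Anthropic's API schema looks roughly like this. The field names follow Anthropic's public Messages API; the model name is a hypothetical placeholder:

```python
# Illustrative request body in Anthropic's Messages API schema.
# Claude Code POSTs a body like this to /v1/messages; a native
# server parses it directly instead of translating it for a proxy.
request_body = {
    "model": "local-122b",  # hypothetical local model identifier
    "max_tokens": 1024,
    "system": "You are a coding assistant.",
    "messages": [
        {"role": "user", "content": "Write a hello-world in Python."},
    ],
}
print(request_body["messages"][0]["content"])
```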
According to the developer, previous local Claude Code setups suffered from "133-second wait times" due to proxy bottlenecks. The native server approach reduces this to "17.6 seconds per task" on Apple Silicon hardware.
Technical Details: Hardware Requirements and Performance
The implementation requires specific hardware and software:
Hardware Requirements:
- M2, M3, M4, or M5 Max chip
- 64–128 GB unified memory
- Local storage for model weights
Software Stack:
- Python 3.12+
- Claude Code application installed
- The 200-line server implementation (MIT licensed)
Performance Metrics:
- Model Size: 122 billion parameters
- Inference Speed: 65 tokens per second
- Task Completion: 17.6 seconds per task (vs. 133 seconds with proxy)
- Connectivity: Works fully offline once configured
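A quick sanity check ties these figures together: at 65 tokens per second, a 17.6-second task corresponds to roughly 1,100 generated tokens, assuming generation time dominates the task:

```python
# Back-of-envelope: tokens produced per task at the reported speeds.
tokens_per_second = 65
seconds_per_task = 17.6
tokens_per_task = tokens_per_second * seconds_per_task
print(round(tokens_per_task))  # prints 1144
```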
How It Works: Direct API Communication
The technical approach is straightforward but effective. Claude Code expects to communicate with Anthropic's servers using its API protocol. Instead of:
Claude Code → Proxy → Local Model
The new implementation uses:
Claude Code → Native Server → Local Model
The native server understands Anthropic's API schema and translates requests directly to the local model's inference interface without intermediate translation layers. This reduces latency from multiple serialization/deserialization steps to a single direct call.
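At its core, such a server has two jobs: parse the Messages-style request Claude Code sends, and wrap the local model's output in the response shape Claude Code expects. A minimal sketch of those two translation steps (function names and the prompt-flattening strategy are illustrative assumptions, not the project's actual code):

```python
def anthropic_to_prompt(body: dict) -> str:
    """Flatten an Anthropic Messages-style request into a plain
    prompt string for a local model's inference interface."""
    parts = []
    if body.get("system"):
        parts.append(f"System: {body['system']}")
    for msg in body.get("messages", []):
        content = msg["content"]
        if isinstance(content, list):  # content may arrive as text blocks
            content = "".join(
                block.get("text", "")
                for block in content
                if block.get("type") == "text"
            )
        parts.append(f"{msg['role'].capitalize()}: {content}")
    parts.append("Assistant:")
    return "\n\n".join(parts)


def wrap_as_anthropic_response(text: str, model: str) -> dict:
    """Wrap local model output in the Messages-style response
    shape that Claude Code expects back."""
    return {
        "type": "message",
        "role": "assistant",
        "model": model,
        "content": [{"type": "text", "text": text}],
        "stop_reason": "end_turn",
    }
```

Because both steps are plain dictionary manipulation, the request can go from Claude Code to the model in a single direct call, with no intermediate serialization round-trips.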
Setup and Usage: One-Command Installation
Setup appears remarkably simple for those with compatible hardware:
- Clone the repository
- Run a single setup command
- Configure Claude Code to point to the local server
- Use normally with no API keys or subscriptions
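In practice, the "point Claude Code at the local server" step usually comes down to environment variables. A hedged sketch, assuming the server listens on a local port (the port number and placeholder key are assumptions, not from the source):

```shell
# Point Claude Code at the local server instead of Anthropic's cloud.
# ANTHROPIC_BASE_URL overrides the API endpoint; port is an assumption.
export ANTHROPIC_BASE_URL="http://localhost:8080"
# A placeholder key so the client does not prompt for credentials.
export ANTHROPIC_API_KEY="local-placeholder"
claude
```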
An interesting bonus feature: the setup enables remote control via iMessage. Users can send coding tasks from their iPhone while their Mac handles the inference, with responses returning to the mobile device.
Limitations and Caveats

While impressive, this approach has clear limitations:
- Hardware Requirements: Only works on high-end Apple Silicon Macs with substantial unified memory (64-128GB). Most consumer Macs have 8-24GB.
- Model Compatibility: Currently supports specific model configurations that fit within memory constraints.
- Maintenance Burden: Users must manage model updates, server maintenance, and potential compatibility issues with Claude Code updates.
- Performance Trade-offs: While faster than proxy-based approaches, it may still lag behind Anthropic's optimized cloud infrastructure for complex tasks.
Competitive Context: The Local AI Movement
This development fits into the broader trend of moving AI inference on-device. Over the past year, we've seen:
- LM Studio and Ollama making local model deployment accessible
- Apple's MLX framework optimizing for Apple Silicon
- Microsoft's Phi models designed for edge deployment
- Meta's Llama models with increasingly efficient variants
What distinguishes this project is its focus on compatibility with existing commercial interfaces rather than creating new ones.
gentic.news Analysis
This development represents a significant milestone in the democratization of AI tooling, but its practical impact may be narrower than initial excitement suggests. The 122B parameter model requires substantial memory (likely quantized to 4-bit or similar), placing it out of reach for most developers who don't own $3,000+ MacBook Pros with maxed-out memory configurations.
Technically, the approach is clever but not revolutionary. Creating native API servers for local models has been done before for OpenAI's API (with projects like LocalAI and llama.cpp's server mode). The innovation here is specifically targeting Anthropic's API schema and Claude Code's workflow. What's more interesting is the timing: this comes as Anthropic has been aggressively expanding its enterprise offerings and API pricing, creating demand for cost-effective alternatives.
From a business perspective, this poses an interesting challenge for Anthropic. While most enterprise customers will continue paying for cloud reliability and support, individual developers and small teams might increasingly opt for local solutions as hardware capabilities improve. This mirrors the trajectory we saw with Stable Diffusion in image generation: initial excitement about local deployment, followed by market segmentation between convenience (cloud) and control (local).
Looking at the broader ecosystem, this development aligns with Apple's strategic push into on-device AI. With rumors of Apple developing its own large language models and the upcoming macOS Sequoia featuring enhanced AI capabilities, the hardware requirements for this project (M-series Max chips with 64+ GB RAM) suggest Apple's high-end machines are becoming viable platforms for serious AI development work.
Frequently Asked Questions
Can I run Claude Code locally on a MacBook Air?
No, this implementation requires M2/M3/M4/M5 Max chips with 64-128GB unified memory. Most MacBook Air models have 8-24GB RAM and less powerful chips, making them incompatible with the 122B parameter model.
Is this legal given Anthropic's terms of service?
The project uses an open-source MIT license and runs local models, not Anthropic's proprietary models. However, using Claude Code's interface with local models might violate Anthropic's terms if you're bypassing their authentication system. The legal gray area involves whether the API protocol itself is protected intellectual property.
How does performance compare to Anthropic's cloud service?
While the developer reports 65 tokens/second and 17.6-second task completion, this likely represents best-case scenarios. Anthropic's cloud infrastructure benefits from optimized hardware, model parallelism across multiple GPUs, and continuous updates that local deployments can't match for complex or lengthy tasks.
What local model is actually running?
The source doesn't specify which 122B parameter model is being used, but likely candidates include Llama 3.1 70B (quantized), Mixtral 8x22B, or a fine-tuned variant. The exact model would need to be compatible with the inference framework and fit within the memory constraints.