local inference
30 articles about local inference in AI news
Hermes Agent Hits 140K GitHub Stars, Nvidia RTX as Local Inference Bedrock
Hermes Agent hit 140K GitHub stars, most-used on OpenRouter. Runs locally on Nvidia RTX with self-evolving skills and Qwen 3.6 models that beat prior 120B-parameter models.
Atomic Chat's TurboQuant Enables Gemma 4 Local Inference on 16GB MacBook Air
Atomic Chat's new TurboQuant algorithm aggressively compresses the KV cache, allowing models requiring 32GB+ RAM to run on 16GB MacBook Airs at 25 tokens/sec, advancing local AI deployment.
Gemma 4 Ported to MLX-Swift, Runs Locally on Apple Silicon
Google's Gemma 4 language model has been ported to the MLX-Swift framework by a community developer, making it available for local inference on Apple Silicon Macs and iOS devices through the LocallyAI app.
Ollama Now Supports Apple MLX Backend for Local LLM Inference on macOS
Ollama, the popular framework for running large language models locally, has added support for Apple's MLX framework as a backend. This enables more efficient execution of models like Llama 3.2 and Mistral on Apple Silicon Macs.
7 Free GitHub Repos for Running LLMs Locally on Laptop Hardware
A developer shared a list of seven key GitHub repositories, including AnythingLLM and llama.cpp, that allow users to run LLMs locally without cloud costs. This reflects the growing trend of efficient, private on-device AI inference.
Text-to-Speech Cost Plummets from $0.15/Word to Free Local Models Using 3GB RAM
High-quality text-to-speech has shifted from a $0.15 per word cloud service to free, local models requiring only 3GB of RAM in 12 months, signaling a broader price collapse in AI inference.
Median Coding Agent Hits 96k Input Tokens, Rewriting Inference Economics
SemiAnalysis found median coding agent uses 96k input tokens from 432k requests, shifting inference cost focus from output to context.
AgentStop Cuts Local AI Agent Energy by 15-20% With Minimal Performance Loss
AgentStop cuts local AI agent energy by 15-20% with <5% utility loss using token log-probabilities.
Ollama Now Runs Codex Locally: DeepSeek V4, Gemma 4, Qwen 3.6 Supported
Ollama integrates Codex support for DeepSeek V4, Gemma 4, Qwen 3.6, enabling free local code generation, challenging OpenAI's API model.
DeepSeek-V4 Ported to MLX for Apple Silicon Inference
A developer has ported DeepSeek-V4 to Apple's MLX framework, allowing the large language model to run on Apple Silicon Macs. Early results show functional inference with room for optimization.
Qwen3.6-27B: How to Run a 17GB Local Model That Beats 397B MoE on Coding Tasks
Qwen3.6-27B delivers flagship-level coding performance in a 55.6GB model that can be quantized to 16.8GB, making high-quality local coding assistance accessible.
Prefill-as-a-Service Paper Claims to Decouple LLM Inference Bottleneck
A research paper proposes a 'Prefill-as-a-Service' architecture to separate the heavy prefill computation from the lighter decoding phase in LLM inference. This could enable new deployment models where resource-constrained devices handle only the decoding step.
Modly Desktop App Generates 3D Models from Images, Runs Locally
A developer has launched Modly, a desktop application that creates 3D models from images and processes them entirely on a user's local machine, eliminating cloud dependency.
Claude Code Runs 100% Locally on Mac via Native 200-Line API Server
A developer created a 200-line server that speaks Anthropic's API natively, allowing Claude Code to run entirely locally on M-series Macs at 65 tokens/second with no cloud dependency.
Project N.O.M.A.D. Solar-Powered Mini PC Packs Local AI, Wikipedia, Khan Academy
Project N.O.M.A.D. is a 100% open-source, solar-powered mini PC designed for offline operation. It packs a local AI, all of Wikipedia, Khan Academy courses, offline maps, and medical guides, running on only 15 watts of power.
Mac Studio Runs 122B-Parameter AI Model Locally, Beats AWS on Cost
A developer demonstrated that a $3,999 Mac Studio can run a 122B-parameter AI model locally. Compared to a $5/hour AWS instance, the Mac pays for itself in roughly five weeks of continuous use.
MLX Enables Local Grounded Reasoning for Satellite, Security, Robotics AI
Apple's MLX framework is enabling 'local grounded reasoning' for AI applications in satellite imagery, security systems, and robotics, moving complex tasks from the cloud to on-device processing.
Open-Source 'Claude Cowork' Alternative Emerges with Local Voice & Agent Features
Developers have launched a free, open-source alternative to Anthropic's Claude Cowork. It runs 100% locally, supports voice, background agents, and connects to any LLM.
GPT4All Hits 77K GitHub Stars, Adds DeepSeek R1 for Free Local AI
The GPT4All project has surpassed 77,000 GitHub stars as it adds support for distilled DeepSeek R1 models, enabling reasoning-capable AI to run locally on consumer CPUs with zero API costs.
Open-Source AI Crew Replaces Notion, Obsidian with 8 Local Agents
A researcher has built a fully local, open-source system of 8 specialized AI agents that work together to manage an Obsidian vault—handling notes, inboxes, meetings, and deadlines. It replaces separate tools like Notion and inbox triagers with an autonomous, interconnected crew.
OpenCAD Browser Tool Enables Local, Private Text-to-CAD Conversion Without Cloud API
A developer has released an open-source text-to-CAD tool that runs entirely in a user's browser, enabling private, local 3D model generation from natural language descriptions. This approach bypasses cloud API costs and data privacy issues inherent in most current AI CAD solutions.
Browser-Based Text-to-CAD Tool Emerges, Enabling Local 3D Model Generation from Prompts
A developer has built a text-to-CAD application that operates entirely within a web browser, enabling local generation and manipulation of 3D models from natural language descriptions. This approach eliminates cloud dependency and could lower barriers for rapid prototyping.
Open-Source AI Assistant Runs Locally on MacBook Air M4 with 16GB RAM, No API Keys Required
A developer showcased a complete AI assistant running entirely on a MacBook Air M4 with 16GB RAM, using open-source models with no cloud API calls. This demonstrates the feasibility of capable local AI on consumer-grade Apple Silicon hardware.
Apple M5 Max NPU Benchmarks 2x Faster Than Intel Panther Lake NPU in Parakeet v3 AI Inference Test
A leaked benchmark using the Parakeet v3 AI speech recognition model shows Apple's next-generation M5 Max Neural Processing Unit (NPU) delivering double the inference speed of Intel's competing Panther Lake NPU. This real-world test provides early performance data in the intensifying on-device AI hardware race.
Atomic Chat Launches Hermes Agent: A Free, Local Agent Stack Powered by Gemma 4
Atomic Chat has launched Hermes Agent, an open-source agent stack powered by Google's Gemma 4 model that runs entirely locally and is free to use. This makes advanced AI agent functionality accessible without cloud dependencies or API costs.
Sam3 + MLX Enables Local, Multi-Object Video Tracking Without Cloud APIs
A developer has combined Meta's Segment Anything 3 (Sam3) with Apple's MLX framework to enable local, on-device object tracking in videos. This bypasses cloud API costs and latency for computer vision tasks.
mlx-vlm v0.4.2 Adds SAM3, DOTS-MOCR Models and Critical Fixes for Vision-Language Inference on Apple Silicon
mlx-vlm v0.4.2 released with support for Meta's SAM3 segmentation model and DOTS-MOCR document OCR, plus fixes for Qwen3.5, LFM2-VL, and Magistral models. Enables efficient vision-language inference on Apple Silicon via MLX framework.
Atomic Chat Integrates Google TurboQuant for Local Qwen3.5-9B, Claims 3x Speed Boost on M4 MacBook Air
Atomic Chat now runs Qwen3.5-9B with Google's TurboQuant locally, claiming a 3x processing speed increase and support for 100k+ context windows on consumer hardware like the M4 MacBook Air.
arXiv Survey Maps KV Cache Optimization Landscape: 5 Strategies for Million-Token LLM Inference
A comprehensive arXiv review categorizes five principal KV cache optimization techniques—eviction, compression, hybrid memory, novel attention, and combinations—to address the linear memory scaling bottleneck in long-context LLM inference. The analysis finds no single dominant solution, with optimal strategy depending on context length, hardware, and workload.
TTQ: A New Framework for On-the-Fly Quantization of LLMs at Inference Time
Researchers propose TTQ, a test-time quantization method that compresses large language models dynamically during inference. It uses efficient online calibration to adapt to any prompt, aiming to solve domain-shift issues and accelerate inference without retraining.