RunAnywhere's MetalRT Engine Delivers Breakthrough AI Performance on Apple Silicon

RunAnywhere has launched MetalRT, a proprietary GPU inference engine built to accelerate on-device AI workloads on Apple Silicon. Its open-source RCLI tool demonstrates voice AI pipelines with sub-200ms latency, reportedly outperforming existing solutions such as llama.cpp and Apple's MLX.

Ala SMITH & AI Research Desk · Mar 10, 2026 · 5 min read · AI-Generated

Source: github.com, via Hacker News (single source)

A new player in the AI inference space is making waves with claims of unprecedented performance on Apple's proprietary hardware. RunAnywhere, a Y Combinator W26 startup founded by Sanchit and Shubham, has developed MetalRT—a custom inference engine that reportedly outperforms established solutions like llama.cpp, Apple's MLX, Ollama, and sherpa-onnx across multiple AI modalities.

The Performance Breakthrough

According to benchmarks run on an M4 Max with 64GB of RAM, MetalRT is faster than the alternatives it was compared against on three AI workloads:

Large Language Model Inference:

Qwen3-0.6B: 658 tokens/second (1.19x faster than MLX, 2.23x faster than llama.cpp)
Qwen3-4B: 186 tokens/second (1.09x faster than MLX, 2.14x faster than llama.cpp)
LFM2.5-1.2B: 570 tokens/second (1.12x faster than MLX, 1.53x faster than llama.cpp)
Time-to-first-token: Just 6.6 milliseconds

Speech-to-Text Processing:
MetalRT achieves what the developers call "714x real-time" transcription—processing 70 seconds of audio in just 101 milliseconds. This represents a 4.6x speed improvement over mlx-whisper.

Text-to-Speech Synthesis:
At 178 milliseconds for synthesis, MetalRT performs 2.8x faster than both mlx-audio and sherpa-onnx.
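
The ratios above can be turned back into approximate baseline figures, which makes the comparison easier to sanity-check. What follows is a minimal Python sketch that uses only the numbers quoted in this article. The helper names are illustrative, and the derived baselines are back-of-the-envelope estimates, not measurements.

```python
# Figures quoted in the article: MetalRT tokens/second, then the claimed
# speedup over MLX and over llama.cpp.
LLM_RESULTS = {
    "Qwen3-0.6B": (658, 1.19, 2.23),
    "Qwen3-4B": (186, 1.09, 2.14),
    "LFM2.5-1.2B": (570, 1.12, 1.53),
}


def implied_baseline(metalrt_tps: float, speedup: float) -> float:
    """Throughput the slower engine must have had for the speedup to hold."""
    return metalrt_tps / speedup


def real_time_factor(audio_seconds: float, processing_seconds: float) -> float:
    """Seconds of audio handled per second of compute."""
    return audio_seconds / processing_seconds


for model, (tps, vs_mlx, vs_llama) in LLM_RESULTS.items():
    mlx = implied_baseline(tps, vs_mlx)
    llama = implied_baseline(tps, vs_llama)
    print(f"{model}: MLX ~{mlx:.0f} tok/s, llama.cpp ~{llama:.0f} tok/s")

# 70 seconds of audio in 101 ms, as quoted for speech-to-text.
print(f"STT real-time factor: {real_time_factor(70, 0.101):.0f}x")
```

With these rounded inputs the speech-to-text factor comes out near 693x rather than the quoted 714x. The quoted figure presumably reflects unrounded measurements.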

The Technical Approach: Going Straight to Metal

What sets MetalRT apart is its architectural approach. Rather than building on existing frameworks, the team went "straight to Metal"—Apple's low-level graphics and compute API. By writing custom Metal shaders and eliminating framework overhead, they've created an inference engine specifically optimized for Apple Silicon's unified memory architecture and GPU capabilities.

"We built this because demoing on-device AI is easy but shipping it is brutal," the founders explain in their launch announcement. "Voice is the hardest test: you're chaining STT, LLM, and TTS sequentially, and if any stage is slow, the user feels it."

RCLI: The Complete On-Device Voice AI Pipeline

To demonstrate their technology's capabilities, RunAnywhere has open-sourced RCLI (RunAnywhere Command Line Interface), which they describe as "the fastest end-to-end voice AI pipeline on Apple Silicon." The tool enables microphone-to-spoken-response interactions entirely on-device, requiring no cloud services or API keys.

RCLI's pipeline is made up of the following components:

Voice Activity Detection: Silero
Speech-to-Text: streaming Zipformer, plus Whisper or Parakeet for offline transcription
LLM Processing: Qwen3, LFM2, or Qwen3.5 models with KV cache continuation and Flash Attention
Text-to-Speech: double-buffered, sentence-level synthesis
Tool Calling: native LLM tool call formats
Multi-turn Memory: sliding-window conversation history
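
The last item in the list, sliding-window conversation history, is simple to illustrate. The Python sketch below is an assumption about how such a window might work, not RCLI's implementation: the class name, the turn-based limit, and the prompt format are all hypothetical. A real system might bound the window by token count instead of by turns.

```python
from collections import deque


class SlidingWindowMemory:
    """Keeps only the most recent turns of a conversation."""

    def __init__(self, max_turns: int) -> None:
        # A deque with maxlen drops the oldest entry automatically on append.
        self._turns = deque(maxlen=max_turns)

    def add(self, role: str, text: str) -> None:
        self._turns.append((role, text))

    def as_prompt(self) -> str:
        """Render the retained turns as a plain-text prompt prefix."""
        return "\n".join(f"{role}: {text}" for role, text in self._turns)


memory = SlidingWindowMemory(max_turns=4)
for i in range(1, 4):
    memory.add("user", f"question {i}")
    memory.add("assistant", f"answer {i}")

# Six turns were added, but only the newest four remain.
print(memory.as_prompt())
```

Bounding the history keeps the prompt, and therefore the work the model does per turn, from growing without limit as a conversation continues.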

The system supports 43 macOS actions via voice commands and includes local RAG (Retrieval-Augmented Generation) capabilities for querying documents. Users can interact through push-to-talk, continuous listening, or text-based commands.

The Latency Challenge in Voice AI

The founders highlight a critical insight that drove their development: "The thing that's hard to solve is latency compounding. In a voice pipeline, you're stacking three models in sequence. If each adds 200ms, you're at 600ms before the user hears a word, and that feels broken."

This compounding effect explains why many teams, in the founders' words, "fall back to cloud APIs not because local models are bad, but because local inference infrastructure is": the tooling, not the models, is what fails to deliver responsive user experiences. By optimizing every stage of the pipeline and running the stages on the Metal GPU across three concurrent threads, RunAnywhere reports sub-200ms end-to-end latency.
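
To make the idea of stacked stages and concurrent threads concrete, here is a small Python sketch of a three-stage pipeline in which each stage runs on its own thread and hands work to the next through a queue. It is only an illustration under stated assumptions: the stage names, the queue layout, and the 20 ms simulated cost are hypothetical, `time.sleep` stands in for real inference, and nothing here reflects RCLI's actual code.

```python
import queue
import threading
import time

STAGE_SECONDS = 0.02  # simulated cost per stage per chunk (hypothetical)
DONE = object()       # sentinel that tells a stage to shut down


def stage(name, inbox, outbox):
    """Pull items from inbox, simulate work, push results to outbox."""
    while True:
        item = inbox.get()
        if item is DONE:
            outbox.put(DONE)  # pass the shutdown signal downstream
            return
        time.sleep(STAGE_SECONDS)
        outbox.put(f"{name}({item})")


def run_pipeline(chunks):
    """Run STT -> LLM -> TTS as three threads linked by queues."""
    q_in, q_text, q_reply, q_audio = (queue.Queue() for _ in range(4))
    threads = [
        threading.Thread(target=stage, args=("stt", q_in, q_text)),
        threading.Thread(target=stage, args=("llm", q_text, q_reply)),
        threading.Thread(target=stage, args=("tts", q_reply, q_audio)),
    ]
    for t in threads:
        t.start()
    start = time.perf_counter()
    for chunk in chunks:
        q_in.put(chunk)
    q_in.put(DONE)

    results = []
    while (item := q_audio.get()) is not DONE:
        results.append(item)
    elapsed = time.perf_counter() - start
    for t in threads:
        t.join()
    return results, elapsed


chunks = ["c1", "c2", "c3", "c4", "c5"]
results, elapsed = run_pipeline(chunks)
one_at_a_time = len(chunks) * 3 * STAGE_SECONDS
print(results[0])
print(f"pipelined: {elapsed * 1000:.0f} ms, "
      f"one at a time: {one_at_a_time * 1000:.0f} ms")
```

The first chunk still pays for all three stages in sequence, so overlap only helps from the second chunk onward. That is why the speed of each individual stage still determines how quickly the user hears the first word.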

Getting Started with RCLI

Installation is straightforward via either Homebrew or a direct shell script:

# Homebrew method
brew tap RunanywhereAI/rcli https://github.com/RunanywhereAI/RCLI.git
brew install rcli
rcli setup   # downloads ~1 GB of models
rcli         # interactive mode with push-to-talk

# Direct installation
curl -fsSL https://raw.githubusercontent.com/RunanywhereAI/RCLI/main/install.sh | bash

The setup process downloads approximately 1GB of models. The system requires macOS 13+ running on Apple Silicon (M1 or later processors).
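
Before installing, the stated requirements can be checked from a script. The following Python sketch uses the standard `platform` module; the function names are illustrative and the check is not part of RCLI. One caveat: a Python interpreter running under Rosetta reports `x86_64` even on Apple Silicon, so a negative result is not conclusive in that case.

```python
import platform


def parse_major(release: str) -> int:
    """Return the major version from a string such as '14.5'; 0 if empty."""
    head = release.split(".")[0]
    return int(head) if head.isdigit() else 0


def meets_requirements(system: str, machine: str, mac_release: str) -> bool:
    """True for macOS 13 or newer on an arm64 (Apple Silicon) machine."""
    return (
        system == "Darwin"
        and machine == "arm64"
        and parse_major(mac_release) >= 13
    )


# platform.mac_ver()[0] is an empty string on non-macOS systems.
ok = meets_requirements(
    platform.system(), platform.machine(), platform.mac_ver()[0]
)
print("supported" if ok else "not supported")
```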

Implications for On-Device AI Development

RunAnywhere's release comes at a pivotal moment in AI development. As privacy concerns grow and cloud costs remain significant, on-device AI offers compelling advantages. Until now, however, performance limitations have constrained practical applications, particularly latency-sensitive ones such as voice interfaces.

MetalRT's performance claims, if validated by independent testing, could accelerate the shift toward edge computing for AI applications. The technology demonstrates that with proper hardware-specific optimization, Apple Silicon devices—from MacBooks to future iPhone and iPad models—could become powerful AI inference platforms without relying on cloud services.

The Competitive Landscape

RunAnywhere enters a competitive field that includes:

llama.cpp: A widely used C/C++ inference library, originally written to run Meta's LLaMA models
Apple MLX: Apple's own machine learning framework for Apple Silicon
Ollama: A popular tool for running large language models locally
sherpa-onnx: An open-source speech toolkit covering both speech recognition and text-to-speech

What distinguishes MetalRT is its singular focus on Apple hardware optimization. While other solutions aim for cross-platform compatibility, RunAnywhere has sacrificed generality for maximum performance on a specific hardware platform—a strategy that appears to be paying dividends in benchmark results.

Looking Ahead

The open-sourcing of RCLI represents both a demonstration of capability and an invitation to the developer community. By providing a complete, working implementation of their technology, RunAnywhere enables developers to experience the performance improvements firsthand while potentially building their own applications on the MetalRT engine.

As AI continues its trajectory toward ubiquity, breakthroughs in inference efficiency like those claimed by RunAnywhere could prove as important as advances in model architecture. Faster, more efficient inference enables new applications, reduces costs, and brings sophisticated AI capabilities to devices without constant internet connectivity.

For now, developers interested in on-device AI for Apple platforms have a new tool to explore—one that promises to make "talk to your Mac" experiences not just possible, but pleasantly responsive.

Source: gentic.news · Mar 10, 2026 · Ala SMITH

AI-assisted reporting. Generated by gentic.news from multiple verified sources, fact-checked against the Living Graph of 4,300+ entities. Edited by Ala SMITH.

AI Analysis

RunAnywhere's MetalRT represents a significant technical achievement in the optimization of AI inference for specific hardware architectures. By bypassing higher-level frameworks and writing directly to Apple's Metal API, the team has demonstrated that substantial performance gains are possible when developers tailor their solutions to the underlying hardware rather than targeting broad compatibility.

The implications extend beyond mere benchmark improvements. Sub-200ms latency for complete voice AI pipelines enables truly conversational interfaces that feel responsive rather than laggy. This could accelerate adoption of voice-controlled applications across productivity, accessibility, and entertainment domains. The compounding latency problem that RunAnywhere addresses is fundamental to multi-stage AI pipelines, and their solution demonstrates that careful engineering at every stage can yield dramatic overall improvements.

From a market perspective, RunAnywhere's approach highlights an emerging trend: as AI hardware becomes more specialized, software must become equally specialized to extract maximum performance. This creates opportunities for companies that deeply understand specific hardware platforms, potentially fragmenting the AI inference landscape but also pushing performance boundaries that benefit end users. The open-source release of RCLI is strategically smart: it provides immediate credibility while potentially establishing MetalRT as a de facto standard for high-performance AI on Apple Silicon.
