npu

30 articles about NPU in AI news

llada.cpp Cuts LLaDA-8B Latency 17-42x on Mobile NPU

llada.cpp, the first NPU-aware dLLM inference framework, cuts LLaDA-8B latency 17-42x on smartphones, enabling real-time on-device generation.

Jun 15, 2026 · 84% relevant

Median Coding Agent Hits 96k Input Tokens, Rewriting Inference Economics

SemiAnalysis found that the median coding agent uses 96k input tokens, based on a sample of 432k requests, shifting the focus of inference cost from output to context.

May 22, 2026 · 95% relevant

DOE Seeks Input on AI Infrastructure for Federal Lands

The U.S. Department of Energy has published a Request for Information (RFI) to solicit input on developing AI and high-performance computing infrastructure on DOE-owned lands. This marks a significant step in the federal government's strategy to directly address the national AI compute shortage.

Apr 17, 2026 · 72% relevant

HUOZIIME: A Research Framework for On-Device LLM-Powered Input Methods

A new research paper introduces HUOZIIME, a personalized on-device input method powered by a lightweight LLM. It uses a hierarchical memory mechanism to capture user-specific input history, enabling privacy-preserving, real-time text generation tailored to individual writing styles.

Apr 17, 2026 · 76% relevant

GPT-5.4 Spends 3 Hours Optimizing Embedding Model for Qualcomm NPU

An X user observed GPT-5.4 working for three hours to optimize an embedding model specifically for the Qualcomm NPU. This suggests a practical application of advanced AI for hardware-specific model tuning.

Apr 15, 2026 · 85% relevant

Qualcomm NPU Shows 6-8x OCR Speed-Up Over CPU in Mobile Workload

A benchmark shows Qualcomm's dedicated NPU processing OCR workloads 6-8 times faster than the device's CPU. This highlights the growing efficiency gap for AI tasks on mobile silicon.

Apr 5, 2026 · 85% relevant

Developer Ranks NPU Model Compilation Ease: Apple 1st, AMD Last

Developer @mweinbach ranked the ease of using AI coding agents to compile ML models for NPUs. Apple's ecosystem was rated easiest, while AMD's tooling was ranked most difficult.

Apr 5, 2026 · 75% relevant

X Post Reveals Audible Quality Differences in GPU vs. NPU AI Inference

A developer demonstrated audible quality differences in AI text-to-speech output when run on GPU, CPU, and NPU hardware, highlighting a key efficiency vs. fidelity trade-off for on-device AI.

Apr 5, 2026 · 75% relevant

Apple M5 Max NPU Benchmarks 2x Faster Than Intel Panther Lake NPU in Parakeet v3 AI Inference Test

A leaked benchmark using the Parakeet v3 AI speech recognition model shows Apple's next-generation M5 Max Neural Processing Unit (NPU) delivering double the inference speed of Intel's competing Panther Lake NPU. This real-world test provides early performance data in the intensifying on-device AI hardware race.

Apr 3, 2026 · 85% relevant

Open-Source 'Manus Alternative' Emerges: Fully Local AI Agent with Web Browsing, Code Execution, and Voice Input

An open-source project has been released that replicates core features of AI agent platforms like Manus—autonomous web browsing, multi-language code execution, and voice input—while running entirely locally on user hardware with no external API dependencies.

Mar 26, 2026 · 85% relevant

Cursor Launches Composer 2 with $0.50/M Input Token Pricing, Claims Major Benchmark Gains

Cursor has released Composer 2, a coding AI model priced at $0.50 per million input tokens and $2.50 per million output tokens. The company reports significant benchmark improvements over previous versions across CursorBench, Terminal-Bench 2.0, and SWE-bench Multilingual.

Mar 19, 2026 · 95% relevant

AI Medical Chatbots' Accuracy Plummets to 35% with Real Human Input

New evidence shows AI chatbots for health advice achieve ~95% accuracy on structured cases but crash to ~35% with the messy, partial descriptions typical of real patients. This reveals a fundamental brittleness in deploying LLMs for frontline medical triage.

Apr 19, 2026 · 85% relevant

How Structured JSON Inputs Eliminated Hallucinations in a Fine-Tuned 7B Code Model

A developer fine-tuned a 7B code model on consumer hardware to generate Laravel PHP files. Hallucinations persisted until prompts were replaced with structured JSON specs, which eliminated ambiguous gap-filling errors and reduced debugging time dramatically.

Mar 31, 2026 · 92% relevant

Slap to Submit: The Physical Input Hack That Makes Claude Code Approval 10x Faster

Install slapclaude.com to use your MacBook's accelerometer for instant prompt submission and tool call approval in Claude Code.

Mar 27, 2026 · 97% relevant

WSL 3 Preview: Cut Claude Code's Local Inference Latency on Windows

WSL 3 preview delivers near-native GPU/NPU for Claude Code + Ollama on Copilot+ laptops, but WSL 2 still handles NVIDIA CUDA fine for desktop users.

Jun 23, 2026 · 78% relevant

AMD's Lemonade v10.8 Adds MCP Support, Letting Claude Desktop and Cursor Route Tasks to Local AMD GPUs

AMD-backed Lemonade v10.8, released June 17, now exposes a Model Context Protocol server, letting Claude Desktop, Cursor, and GitHub Copilot route inference tasks to local AMD Ryzen AI NPUs, Radeon GPUs, or plain CPUs — no cloud API required. The update also adds Moonshine speech-to-text, expanded R

Jun 17, 2026 · 70% relevant

How to Build Claude Code Tools That Ask Users Questions Mid-Execution

Datasette Agent 0.2a0's `context.ask_user()` lets tools pause for user input mid-execution. Claude Code users can adopt this pattern for safer, more interactive tool workflows.

Jun 10, 2026 · 85% relevant

mlx-vlm v0.6.2 Adds Gemma 4 QAT Support for Local GPUs

mlx-vlm v0.6.2 adds launch-day support for Google DeepMind's Gemma 4 QAT checkpoints, enabling local inference on consumer GPUs and edge devices with video input for the 12B model.

Jun 5, 2026 · 100% relevant

ModelBest Drops BitCPM-CANN: First 1.58-bit LLM on Ascend 910B

ModelBest released BitCPM-CANN, the first 1.58-bit ternary LLM on Ascend 910B NPUs, using 6× less VRAM than BF16 with minimal capability loss.

May 24, 2026 · 87% relevant

DeepSeek v4 Pricing Cuts 75%: $0.43/M Tokens In

DeepSeek has permanently cut v4 API pricing by 75%, to $0.43/M input and $0.87/M output tokens, a reduction enabled by v4 needing 27% of the compute and 10% of the cache of v3.2.

May 22, 2026 · 100% relevant

Anthropic Study: Model Character Needs Clergy, Not Just Coders

Anthropic's study argues frontier AI needs input from clergy and philosophers, treating model behavior as moral formation. A self-reminder tool lowered misaligned behavior in internal tests.

May 20, 2026 · 86% relevant

Opus 4.7's Tokenizer Change: How to Measure Your Real Claude Code Costs

Claude Opus 4.7's updated tokenizer means the same input can cost 40%+ more than 4.6. Use the Claude Token Counter to measure real costs before upgrading.

Apr 20, 2026 · 100% relevant

NVIDIA's Audio Flamingo Next: 30-Min Audio, Time-Grounded Reasoning

NVIDIA has launched Audio Flamingo Next, a next-generation open audio-language model supporting 30-minute audio inputs and time-grounded reasoning. Trained on over 1 million hours of data, it reportedly outperforms larger models on key audio understanding benchmarks.

Apr 19, 2026 · 95% relevant

The 270-Second Rule: How to Cut Claude Code API Costs by 90% with Smart

Anthropic's prompt cache has a 5-minute TTL. Orchestrator loops running faster than 270 seconds pay ~10% of full input token costs.

Apr 16, 2026 · 100% relevant

Tencent's HY-World 2.0 Generates Navigable 3D Worlds in Single Forward Pass

Tencent has open-sourced HY-World 2.0 on Hugging Face, a 3D world model that generates navigable 3D environments from text or image inputs in a single forward pass, advancing beyond video generation.

Apr 15, 2026 · 95% relevant

ByteDance's OmniShow Unifies Text, Image, Audio, Pose for Video Gen

ByteDance introduced OmniShow, a unified multimodal framework for video generation that accepts text, reference images, audio, and pose inputs simultaneously. It claims state-of-the-art performance across diverse conditioning settings.

Apr 14, 2026 · 85% relevant

Claude Mythos Preview Priced at $25/$125 Per Million Tokens

Anthropic's Claude Mythos model is available in private preview at $25 per million input tokens and $125 per million output tokens. This positions it as a premium but competitively priced option in the high-performance LLM market.

Apr 9, 2026 · 97% relevant

Developer Open-Sources 'Prompt-to-3D' Tool for Instant, Navigable World Generation

A developer has released an open-source tool that creates interactive 3D worlds from text or image inputs. This moves 3D asset generation from static models to instant, explorable environments.

Apr 3, 2026 · 91% relevant

Qwen3.5-Omni Demonstrates 'Audio-Visual Vibe Coding' as an Emergent Ability

Alibaba's Qwen3.5-Omni model appears to have developed an emergent ability to generate code from combined audio and visual inputs without specific training. This suggests a significant leap in multimodal reasoning for a model already positioned as a strong GPT-4 competitor.

Apr 1, 2026 · 85% relevant

Qwen 3.6 Plus Preview Launches on OpenRouter with Free 1M Token Context, Disrupting API Pricing

Alibaba's Qwen team has released a preview of Qwen 3.6 Plus on OpenRouter with a 1 million token context window, charging $0 for both input and output tokens. This directly undercuts paid long-context offerings from Anthropic and OpenAI.

Mar 30, 2026 · 97% relevant
