single gpu
30 articles about single gpu in AI news
train-llm-from-scratch: 1B-Parameter LLM on a Single GPU
train-llm-from-scratch trains billion-parameter LLMs on a single GPU, cutting costs from $10M+ to consumer hardware.
LeWorldModel Solves JEPA Collapse with 15M Params, Trains on Single GPU
Researchers published LeWorldModel, solving the representation collapse problem in Yann LeCun's JEPA architecture. The 15M-parameter model trains on a single GPU and demonstrates intrinsic physics understanding.
Karpathy's 'Autoresearch' Tool Democratizes AI Research: One GPU, One Night, 100 Experiments
Andrej Karpathy has open-sourced 'autoresearch,' a tool that enables AI to autonomously improve its own training code. By writing simple prompts in Markdown, researchers can have AI agents run hundreds of experiments overnight on a single GPU, dramatically accelerating the research process.
LeCun's Team Publishes LeWorldModel: A 15M-Parameter World Model That Mathematically Prevents Training Collapse
Yann LeCun's team has open-sourced LeWorldModel, a 15M-parameter world model that uses a novel SIGReg regularizer to make representation collapse mathematically impossible. It trains on a single GPU in hours and enables efficient physical prediction for robotics and autonomous systems.
Karpathy's Autoresearch: Democratizing AI Experimentation with Minimalist Agentic Tools
Andrej Karpathy releases 'autoresearch,' a 630-line Python tool enabling AI agents to autonomously conduct machine learning experiments on single GPUs. This minimalist framework transforms how researchers approach iterative ML optimization.
RoundPipe: Full Fine-Tune 32B Models on a Single 24GB GPU
RoundPipe fine-tunes 32B models on a single 24GB GPU with 1.5-2.2× speedups via round-robin pipeline dispatch.
ByteDance's Helios: A 14B Parameter Video Generation Model Running at 19.5 FPS on a Single H100 GPU
ByteDance has introduced Helios, a 14-billion parameter video generation model that reportedly runs at 19.5 frames per second on a single NVIDIA H100 GPU. This represents a significant step in making high-quality, real-time video synthesis more computationally accessible.
FASTER Method Compresses Multi-Step Denoising to Single Step, Enabling 10x Faster Action Sampling for Real-Time VLAs
The FASTER method compresses multi-step denoising into a single step, achieving 10x faster action sampling for real-time Vision-Language-Action models. This enables immediate reaction in dynamic tasks like table tennis on consumer GPUs like the RTX 4060.
Cerebras IPO Challenges GPU Scaling Orthodoxy
Cerebras filed for IPO on April 21, betting wafer-scale chips can disrupt Nvidia's GPU cluster model for AI workloads.
NHN Deploys 7,656-GPU AI Cluster in Seoul
NHN launched a 7,656-GPU cluster in Seoul, South Korea, for domestic enterprise AI workloads. The cluster targets inference and training, competing with Naver and Kakao.
OpenAI's MRC Protocol Sprays Packets Across 100+ Paths to Fix GPU Stragglers
OpenAI open-sourced MRC, a networking protocol that sprays packets across hundreds of paths to reduce GPU idle time from congestion and failures, contributed to OCP.
SemiAnalysis: NVIDIA's Customer Data Drives Disaggregated Inference, LPU Surpasses GPU
SemiAnalysis states NVIDIA's direct customer feedback is leading the industry toward disaggregated inference architectures. In this model, specialized LPUs can outperform GPUs for specific pipeline tasks.
Gur Singh Claims 7 M4 MacBooks Match A100, Calls Cloud GPU Training a 'Scam'
Developer Gur Singh posted that seven M4 MacBooks (2.9 TFLOPS each) match an NVIDIA A100's performance, calling cloud GPU training a 'scam' and advocating for distributed, consumer-hardware approaches.
Claude MCP GPU Debugging: AI Agent Identifies PyTorch Bottleneck in Kernel
A developer used an AI agent powered by Claude Code and the Model Context Protocol (MCP) to diagnose a severe GPU performance bottleneck. The agent analyzed system kernel traces, pinpointing excessive CPU context switches as the culprit, demonstrating a practical application of agentic AI for complex technical debugging.
AI Compute Crisis: GPU Prices Up 48%, Anthropic API at 98.95% Uptime
The AI industry faces a severe compute capacity crisis, with GPU prices up 48%, Anthropic API uptime falling to 98.95%, and OpenAI shutting down Sora to reallocate resources. Demand for agentic AI is outstripping supply, forcing rationing and product cancellations.
A Practical Guide to Fine-Tuning an LLM on RunPod H100 GPUs with QLoRA
The source is a technical tutorial on using QLoRA for parameter-efficient fine-tuning of an LLM, leveraging RunPod's cloud H100 GPUs. It focuses on the practical setup and execution steps for engineers.
Intel, SambaNova Blueprint Pairs GPUs for AI Prefill, RDUs for Decoding
Intel and SambaNova Systems have outlined a new inference architecture for agentic AI workloads. It splits tasks between GPUs for 'prefill' and SambaNova's Reconfigurable Dataflow Units (RDUs) for high-throughput token generation.
Google's 5M H100-Equivalent GPU Fleet Powers Anthropic's AI Expansion
An analyst estimates Google's compute capacity at ~5 million Nvidia H100-equivalent GPUs, providing the infrastructure backbone for Anthropic's model deployment and growth. This highlights the strategic shift where foundational AI labs rely on hyperscaler scale for distribution.
Nvidia Claims MLPerf Inference v6.0 Records with 288-GPU Blackwell Ultra Systems, Highlights 2.7x Software Gains
MLCommons released MLPerf Inference v6.0 results, introducing multimodal and video model tests. Nvidia set records using 288-GPU Blackwell Ultra systems and achieved a 2.7x performance jump on DeepSeek-R1 via software optimizations alone.
Open-source AI system running on $500 GPU reportedly outperforms Claude Sonnet
An open-source AI system running on consumer-grade $500 GPU hardware claims to outperform Anthropic's Claude Sonnet model while costing only $0.004 per task, eliminating cloud dependencies and API costs.
Claude Code, Gemini, and 50+ Dev Tools Dockerized into Single AI Coding Workstation
A developer packaged Claude Code's browser UI, Gemini, Codex, Cursor, TaskMaster CLIs, Playwright with Chromium, and 50+ development tools into a single Docker Compose setup, creating a pre-configured AI coding environment that uses existing Claude subscriptions.
Google's TurboQuant Compresses LLM KV Cache 6x with Zero Accuracy Loss, Cutting GPU Memory by 80%
Google researchers introduced TurboQuant, a method that compresses LLM KV cache from 32-bit to 3-bit precision without accuracy degradation. This reduces GPU memory consumption by over 80% and speeds up inference 8x on H100 GPUs.
Sparton: A New GPU Kernel Dramatically Speeds Up Learned Sparse Retrieval
Researchers propose Sparton, a fused Triton GPU kernel for Learned Sparse Retrieval models like Splade. It avoids materializing a massive vocabulary-sized matrix, achieving up to 4.8x speedups and 26x larger batch sizes. This is a core infrastructure breakthrough for efficient AI-powered search.
NVIDIA Releases NVPanoptix-3D on Hugging Face: Single-Image 3D Indoor Scene Reconstruction
NVIDIA has open-sourced NVPanoptix-3D, a model that reconstructs complete 3D indoor scenes—including panoptic segmentation, depth, and geometry—from a single RGB image in one forward pass.
How a GPU Memory Leak Nearly Cost an AI Team a Major Client During a Live Demo
A detailed post-mortem of a critical AI inference failure during a client demo reveals how silent GPU memory leaks, inadequate health checks, and missing circuit breakers can bring down a production pipeline. The author shares the architectural fixes implemented to prevent recurrence.
98× Faster LLM Routing Without a Dedicated GPU: Technical Breakthrough for vLLM Semantic Router
New research presents a three-stage optimization pipeline for the vLLM Semantic Router, achieving 98× speedup and enabling long-context classification on shared GPUs. This solves critical memory and latency bottlenecks for system-level LLM routing.
Qualcomm's Arduino Ventuno Q: A Powerhouse Single-Board Computer for the Next Wave of Physical AI
Qualcomm and Arduino have launched the Ventuno Q, a high-performance single-board computer designed specifically for robotics and physical AI applications. Powered by the Dragonwing IQ8 processor with a dedicated NPU and paired with a low-latency microcontroller, it enables complex, offline AI tasks like object tracking and gesture recognition for systems that interact with the real world.
Lilly's AI Factory: How a 9,000+ GPU SuperPOD is Rewriting Pharmaceutical Discovery
Eli Lilly has launched 'LillyPod,' the world's most powerful privately-owned AI factory for drug discovery. Powered by NVIDIA's new DGX B300 systems with over 1,000 Blackwell Ultra GPUs, it promises to accelerate medical breakthroughs at unprecedented scale.
Cloud GPU vs. Colocation: H100 Costs $8k/Month on Google Cloud vs. $1k Colo
A technical founder highlights the stark economics: renting one H100 on Google Cloud costs ~$8,000/month, while the retail hardware is ~$30,000. At that rate, 4 months of cloud rental equals the cost of outright ownership, making colocation at ~$1k/month a compelling alternative for sustained AI workloads.
AI Data Centers Now Consume 10% of US Electricity, With Single Facilities Reaching 400+ Megawatt Loads
Data centers powering AI and cloud computing now account for 10% of total U.S. electricity consumption, with individual facilities reaching 400+ megawatt capacities. New half-mile-long structures require advanced water-cooling systems to manage chips generating 2kW of heat each.