gentic.news — AI News Intelligence Platform

gpu memory

30 articles about gpu memory in AI news

Google's TurboQuant Compresses LLM KV Cache 6x with Zero Accuracy Loss, Cutting GPU Memory by 80%

Google researchers introduced TurboQuant, a method that compresses LLM KV cache from 32-bit to 3-bit precision without accuracy degradation. This reduces GPU memory consumption by over 80% and speeds up inference 8x on H100 GPUs.

97% relevant

How a GPU Memory Leak Nearly Cost an AI Team a Major Client During a Live Demo

A detailed post-mortem of a critical AI inference failure during a client demo reveals how silent GPU memory leaks, inadequate health checks, and missing circuit breakers can bring down a production pipeline. The author shares the architectural fixes implemented to prevent recurrence.

95% relevant

Flash-KMeans Achieves 200x Speedup Over FAISS by Targeting GPU Memory Bottlenecks

Flash-KMeans is an IO-aware GPU implementation of exact k-means that runs 30x faster than cuML and 200x faster than FAISS. At million-scale datasets, it completes iterations in milliseconds, enabling dynamic re-indexing and real-time quantization.

95% relevant

Flash-KMeans: An IO-Aware GPU Implementation That Rethinks K-Means Memory Access

Flash-KMeans is a new, exact k-means clustering implementation designed for GPUs. It focuses on optimizing memory access patterns to overcome I/O bottlenecks that limit performance.

85% relevant

Google's TurboQuant Cuts LLM KV Cache Memory by 6x, Enables 3-Bit Storage Without Accuracy Loss

Google released TurboQuant, a novel two-stage quantization algorithm that compresses the KV cache in long-context LLMs. It reduces memory by 6x, achieves 3-bit storage with no accuracy drop, and speeds up attention scoring by up to 8x on H100 GPUs.

95% relevant

98× Faster LLM Routing Without a Dedicated GPU: Technical Breakthrough for vLLM Semantic Router

New research presents a three-stage optimization pipeline for the vLLM Semantic Router, achieving 98× speedup and enabling long-context classification on shared GPUs. This solves critical memory and latency bottlenecks for system-level LLM routing.

80% relevant

Flash-KMeans Revolutionizes GPU Clustering with 200x Speedup Over FAISS

New Flash-KMeans algorithm achieves dramatic speed improvements in GPU-based clustering through innovative IO-aware FlashAssign kernels that eliminate memory bottlenecks and atomic contention, potentially transforming large-scale data analysis.

85% relevant

Cerebras Hits 981 Tokens/sec on 1T-Parameter Kimi K2.6, Claims 6.7× GPU Cloud Speedup

Cerebras reported 981 tokens/sec on the 1T-parameter Kimi K2.6 model, a 6.7× speedup over the next-fastest GPU cloud, validated by an independent third party.

93% relevant

vLLM Optimizations Cut Voice AI Latency by 40% on 6-GPU Cluster

vLLM optimizations on a 6-GPU cluster reduced voice AI latency by 40% for a Qwen-based system, enabling 500 concurrent sessions per node without hardware upgrades.

82% relevant

MLX CUDA Backend Passes All Tests, Closing Apple GPU Gap

The MLX CUDA backend now passes all tests, enabling NVIDIA GPU support. The milestone bridges the Apple Silicon and CUDA ecosystems for ML workloads.

77% relevant

Roundhill Memory ETF (DRAM) Surges 90% in 36 Days, Fastest ETF Ever

The Roundhill Memory ETF has surged 90% since April 2, reaching $6.5B in assets in 36 days (the fastest ETF ever), driven by AI demand for DRAM.

75% relevant

OpenAI's MRC Protocol Sprays Packets Across 100+ Paths to Fix GPU Stragglers

OpenAI open-sourced MRC, a networking protocol that sprays packets across hundreds of paths to reduce GPU idle time caused by congestion and failures, and contributed it to the Open Compute Project (OCP).

88% relevant

RoundPipe: Full Fine-Tune 32B Models on a Single 24GB GPU

RoundPipe fine-tunes 32B models on a single 24GB GPU with 1.5-2.2× speedups via round-robin pipeline dispatch.

85% relevant

Open-Weight 1T Model Inference Margins Hit 88% on Rented GPUs

Renting a 128-GPU cluster to serve a 1T-parameter open-weight model yields ~88% margin on tokens sold at $0.002/1K, exposing a structural arbitrage over proprietary APIs.

85% relevant

SemiAnalysis: NVIDIA's Customer Data Drives Disaggregated Inference, LPU Surpasses GPU

SemiAnalysis states NVIDIA's direct customer feedback is leading the industry toward disaggregated inference architectures. In this model, specialized LPUs can outperform GPUs for specific pipeline tasks.

85% relevant

Cisco Reveals Scale-Across GPU Networking Needs 14x DCI Bandwidth

Cisco's chief architect detailed the massive bandwidth requirements for connecting AI clusters via 'scale-across' GPU networking, which needs 14x the capacity of traditional data center interconnects. This shift is creating a multi-billion dollar market for 800G coherent pluggables and deep-buffered switches.

85% relevant

Gur Singh Claims 7 M4 MacBooks Match A100, Calls Cloud GPU Training a 'Scam'

Developer Gur Singh posted that seven M4 MacBooks (2.9 TFLOPS each) match an NVIDIA A100's performance, calling cloud GPU training a 'scam' and advocating for distributed, consumer-hardware approaches.

77% relevant

Claude MCP GPU Debugging: AI Agent Identifies PyTorch Bottleneck in Kernel

A developer used an AI agent powered by Claude Code and the Model Context Protocol (MCP) to diagnose a severe GPU performance bottleneck. The agent analyzed system kernel traces, pinpointing excessive CPU context switches as the culprit, demonstrating a practical application of agentic AI for complex technical debugging.

72% relevant

Nvidia to Ship 1.19 Exabytes of HBM in 2026, Apple iPhone Memory 2x Larger

An analysis projects that Nvidia will ship ~1.19 exabytes of HBM in 2026 for AI infrastructure, while Apple will ship ~2.4 exabytes of LPDDR5 for iPhones, putting AI's hardware scale in perspective against the consumer market.

85% relevant

A Practical Guide to Fine-Tuning an LLM on RunPod H100 GPUs with QLoRA

This technical tutorial covers QLoRA-based, parameter-efficient fine-tuning of an LLM on RunPod's cloud H100 GPUs. It focuses on the practical setup and execution steps for engineers.

76% relevant

Intel, SambaNova Blueprint Pairs GPUs for AI Prefill, RDUs for Decoding

Intel and SambaNova Systems have outlined a new inference architecture for agentic AI workloads. It splits tasks between GPUs for 'prefill' and SambaNova's Reconfigurable Dataflow Units (RDUs) for high-throughput token generation.

85% relevant

Cursor AI Claims 1.84x Faster MoE Inference on NVIDIA Blackwell GPUs

Cursor AI announced a rebuilt inference engine for Mixture-of-Experts models on NVIDIA's new Blackwell GPUs, resulting in a claimed 1.84x speedup and improved output accuracy.

85% relevant

X Post Reveals Audible Quality Differences in GPU vs. NPU AI Inference

A developer demonstrated audible quality differences in AI text-to-speech output when run on GPU, CPU, and NPU hardware, highlighting a key efficiency vs. fidelity trade-off for on-device AI.

75% relevant

Fine-Tuning an LLM on a 4GB GPU: A Practical Guide for Resource-Constrained Engineers

A Medium article provides a practical, constraint-driven guide for fine-tuning LLMs on a 4GB GPU, covering model selection, quantization, and parameter-efficient methods. This makes bespoke AI model development more accessible without high-end cloud infrastructure.

100% relevant

Sparton: A New GPU Kernel Dramatically Speeds Up Learned Sparse Retrieval

Researchers propose Sparton, a fused Triton GPU kernel for Learned Sparse Retrieval models such as SPLADE. It avoids materializing a massive vocabulary-sized matrix, achieving up to 4.8x speedups and 26x larger batch sizes. This is a core infrastructure breakthrough for efficient AI-powered search.

72% relevant

Google's TurboQuant AI Research Report Sparks Sell-Off in Micron, Samsung, and SK Hynix Memory Stocks

Google's TurboQuant research blog publication triggered immediate market reaction, with shares of major memory manufacturers dropping 2-4% as investors anticipate AI-driven efficiency gains reducing future memory demand.

85% relevant

Memory Market Squeeze Threatens iPhone Price Hikes as AI Demands Strain Supply

A global RAM shortage and price increases could force Apple to raise iPhone prices by up to $250, according to industry analysis. The tech giant is reportedly unwilling to absorb the cost, passing it directly to consumers amid surging memory demands from AI applications.

85% relevant

AI Gold Rush Strains Apple Hardware: High-Memory Macs Sell Out as Local AI Agents Go Mainstream

A surge in demand for local AI development has created severe inventory shortages for high-memory Apple hardware. Mac Studio orders with 128GB or 512GB RAM face 6+ week delays as consumers buy up every available unit to run powerful AI agents like OpenClaw.

85% relevant

Lilly's AI Factory: How a 9,000+ GPU SuperPOD is Rewriting Pharmaceutical Discovery

Eli Lilly has launched 'LillyPod,' the world's most powerful privately-owned AI factory for drug discovery. Powered by NVIDIA's new DGX B300 systems with over 1,000 Blackwell Ultra GPUs, it promises to accelerate medical breakthroughs at unprecedented scale.

80% relevant

LM Link Bridges the AI Hardware Divide: Secure Remote GPU Access Goes Mainstream

Tailscale and LM Studio have launched 'LM Link,' a zero-configuration service that creates encrypted, point-to-point tunnels to private GPU hardware. This allows developers to securely access powerful local workstations from anywhere, eliminating the productivity gap between location-bound 'Big Rigs' and portable laptops.

70% relevant