quantization

30 articles about quantization in AI news

KV Cache Quantization Silently Breaks Safety Alignment, Paper Shows

KV cache quantization silently breaks LLM safety alignment, with Mistral-7B losing 15.2% refusals at 1.03x perplexity. PCR diagnostic recovers up to 97% alignment in 35 GPU-minutes.

Jun 10, 2026 · 79% relevant

Product Quantization: The Hidden Engine Behind Scalable Vector Search

The article explains Product Quantization (PQ), a method for compressing high-dimensional vectors to enable fast and memory-efficient similarity search. This is a foundational technology for scalable AI applications like semantic search and recommendation engines.

Apr 16, 2026 · 88% relevant
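
As a rough illustration of the Product Quantization idea summarized above, the sketch below splits each vector into sub-vectors, learns a small k-means codebook per sub-space, and stores only the centroid indices. It is a minimal pure-Python sketch, not the article's code: the function names (`train_pq`, `encode`, `decode`), the toy data, and the settings `m=4`, `k=16` are all assumptions chosen for readability.

```python
import random

def split(vec, m):
    """Split a vector into m equal-length sub-vectors."""
    step = len(vec) // m
    return [vec[i * step:(i + 1) * step] for i in range(m)]

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=10, seed=0):
    """Plain Lloyd's k-means; returns k centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            buckets[nearest].append(p)
        for c, bucket in enumerate(buckets):
            if bucket:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(bucket) for col in zip(*bucket)]
    return centroids

def train_pq(vectors, m, k):
    """Learn one codebook of k centroids per sub-space (hypothetical helper)."""
    parts = [split(v, m) for v in vectors]
    return [kmeans([p[j] for p in parts], k, seed=j) for j in range(m)]

def encode(vec, codebooks):
    """Replace each sub-vector with the index of its nearest centroid."""
    return [min(range(len(cb)), key=lambda c: sq_dist(sub, cb[c]))
            for sub, cb in zip(split(vec, len(codebooks)), codebooks)]

def decode(codes, codebooks):
    """Approximate reconstruction by concatenating the chosen centroids."""
    out = []
    for code, cb in zip(codes, codebooks):
        out.extend(cb[code])
    return out

# Toy data: 200 random 8-dimensional vectors (illustrative only).
rng = random.Random(42)
data = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(200)]
codebooks = train_pq(data, m=4, k=16)
codes = encode(data[0], codebooks)    # 4 small integers instead of 8 floats
approx = decode(codes, codebooks)     # lossy reconstruction of data[0]
```

With these toy settings each vector is stored as four indices in the range 0 to 15, which is where the memory saving comes from; search libraries typically compare queries against the codebooks rather than decoding every stored vector.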

TTQ: A New Framework for On-the-Fly Quantization of LLMs at Inference Time

Researchers propose TTQ, a test-time quantization method that compresses large language models dynamically during inference. It uses efficient online calibration to adapt to any prompt, aiming to solve domain-shift issues and accelerate inference without retraining.

Mar 23, 2026 · 70% relevant
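
The summary above does not spell out TTQ's algorithm, so the following is only a generic illustration of calibrating a quantizer from the data seen at inference time: a symmetric absmax quantizer whose scale is derived from the current activations rather than from an offline calibration set. The function names and the sample activations are hypothetical, and this should not be read as the paper's method.

```python
def quantize_absmax(values, bits=8):
    """Symmetric quantization with a scale computed from the input itself."""
    qmax = 2 ** (bits - 1) - 1                 # e.g. 127 for 8 bits
    peak = max(abs(v) for v in values)
    scale = peak / qmax if peak > 0 else 1.0   # avoid dividing by zero
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale

def dequantize(codes, scale):
    """Map integer codes back to approximate real values."""
    return [c * scale for c in codes]

# Hypothetical activations observed for one prompt.
activations = [0.02, -1.5, 0.75, 3.0, -0.4]
codes, scale = quantize_absmax(activations, bits=8)
restored = dequantize(codes, scale)
```

Because the scale is recomputed per input, the quantization grid follows whatever range the current prompt produces, at the cost of one extra pass over the values.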

Efficient Fine-Tuning of Vision-Language Models with LoRA & Quantization

A technical guide details methods for fine-tuning large VLMs like GPT-4V and LLaVA using Low-Rank Adaptation (LoRA) and quantization. This reduces computational cost and memory footprint, making custom VLM training more accessible.

Mar 15, 2026 · 80% relevant
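
To make the memory argument for LoRA concrete, the short sketch below compares the number of trainable parameters in a full weight matrix with the number in a rank-r adapter, which factors the update into two thin matrices of shapes d_out × r and r × d_in. The layer size (4096 × 4096) and rank (8) are assumptions for illustration, not figures from the guide.

```python
def lora_param_counts(d_out, d_in, rank):
    """Return (full, adapter) trainable-parameter counts for one linear layer."""
    full = d_out * d_in                 # updating the whole weight matrix
    adapter = rank * (d_out + d_in)     # B is d_out x rank, A is rank x d_in
    return full, adapter

# Hypothetical layer: 4096 x 4096 weights with a rank-8 adapter.
full, adapter = lora_param_counts(4096, 4096, 8)
ratio = adapter / full                  # fraction of parameters that are trained
```

Quantizing the frozen base weights is a separate saving that stacks on top of this: the adapter shrinks what is trained, while quantization shrinks what is stored.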

The Quantization Paradox: How Compressing Multimodal AI Impacts Reliability

New research reveals that compressing multimodal AI models through quantization significantly reduces their reliability, making them more likely to produce confidently wrong answers. The study identifies methods to mitigate these effects while maintaining efficiency gains.

Feb 17, 2026 · 70% relevant

Paper Details Full-Stack MFM Acceleration: Quant, Spec Decode, HW Co-Design

A research paper details a full-stack approach for accelerating multimodal foundation models, combining hierarchy-aware mixed-precision quantization, structural pruning, speculative decoding, model cascading, and a specialized hardware accelerator. Demonstrated on medical and code generation tasks.

Apr 27, 2026 · 72% relevant

Fine-Tuning an LLM on a 4GB GPU: A Practical Guide for Resource-Constrained Engineers

A Medium article provides a practical, constraint-driven guide for fine-tuning LLMs on a 4GB GPU, covering model selection, quantization, and parameter-efficient methods. This makes bespoke AI model development more accessible without high-end cloud infrastructure.

Apr 2, 2026 · 100% relevant

TurboQuant Ported to Apple MLX, Claims 75% Memory Reduction with Minimal Performance Loss

Developer Prince Canuma has successfully ported the TurboQuant quantization method to Apple's MLX framework, reporting a 75% reduction in memory usage with nearly no performance degradation for on-device AI models.

Mar 26, 2026 · 85% relevant

Google's TurboQuant Cuts LLM KV Cache Memory by 6x, Enables 3-Bit Storage Without Accuracy Loss

Google released TurboQuant, a novel two-stage quantization algorithm that compresses the KV cache in long-context LLMs. It reduces memory by 6x, achieves 3-bit storage with no accuracy drop, and speeds up attention scoring by up to 8x on H100 GPUs.

Mar 25, 2026 · 95% relevant

Flash-KMeans Achieves 200x Speedup Over FAISS by Targeting GPU Memory Bottlenecks

Flash-KMeans is an IO-aware GPU implementation of exact k-means that runs 30x faster than cuML and 200x faster than FAISS. At million-scale datasets, it completes iterations in milliseconds, enabling dynamic re-indexing and real-time quantization.

Mar 20, 2026 · 95% relevant

Quantized Inference Breakthrough for Next-Gen Recommender Systems: OneRec-V2 Achieves 49% Latency Reduction with FP8

New research shows FP8 quantization can dramatically speed up modern generative recommender systems like OneRec-V2, achieving 49% lower latency and 92% higher throughput with no quality loss. This breakthrough bridges the gap between LLM optimization techniques and industrial recommendation workloads.

Mar 13, 2026 · 97% relevant

LeCun's Team Uncovers Hidden Transformer Flaws: How Architectural Artifacts Sabotage AI Efficiency

NYU researchers led by Yann LeCun reveal that Transformer language models contain systematic artifacts—massive activations and attention sinks—that degrade efficiency. These phenomena, stemming from architectural choices rather than fundamental properties, directly impact quantization, pruning, and memory management.

Mar 7, 2026 · 95% relevant

LittleBit-2: How Geometric Alignment Unlocks Ultra-Efficient AI Below 1-Bit

Researchers have developed LittleBit-2, a framework that achieves state-of-the-art performance in sub-1-bit LLM compression by solving latent geometry misalignment. The method uses internal latent rotation and joint iterative quantization to align model parameters with binary representations without inference overhead.

Mar 3, 2026 · 75% relevant

AutoQRA: The Breakthrough That Makes AI Fine-Tuning 4x More Efficient

Researchers have developed AutoQRA, a novel framework that jointly optimizes quantization precision and LoRA adapters for large language models. This breakthrough enables near-full-precision performance with dramatically reduced memory requirements, potentially revolutionizing how organizations fine-tune AI models on limited hardware.

Feb 27, 2026 · 75% relevant

OpenAI Cuts Inference Costs by Half on Some Models

OpenAI cut inference costs by 50%+ on some models for logged-out ChatGPT users, per The Information. The move reduces operational expenses.

Jul 2, 2026 · 91% relevant

NVIDIA Blackwell Cuts DeepSeek V4 Token Costs 5x in One Month

NVIDIA claims its Blackwell inference stack cut DeepSeek V4 token costs 5x in one month, per a newly published report shared by @rohanpaul_ai.

Jun 30, 2026 · 100% relevant

JetSpec hits 1,000 t/s on Qwen-8B with speculative decoding

JetSpec achieves 1,000 t/s on Qwen-8B with a B200 GPU and claims superiority over prior speculative decoding methods, but the result lacks independent verification.

Jun 26, 2026 · 89% relevant

Qualcomm Launches AI Data Center Program With Hyperscaler Customer

Qualcomm launched an AI data center program with a major hyperscaler customer, targeting inference workloads. Financial terms and the partner's identity were not disclosed.

Jun 17, 2026 · 85% relevant

Pareto LoRA Boosts Image Quality 44.9% vs Vanilla LoRA on Emu2

Pareto LoRA reformulates multimodal instruction tuning as bi-objective optimization, achieving up to 44.9% image quality gains on Emu2 while maintaining text performance.

Jun 17, 2026 · 90% relevant

llada.cpp Cuts LLaDA-8B Latency 17-42x on Mobile NPU

llada.cpp, the first NPU-aware dLLM inference framework, cuts LLaDA-8B latency 17-42x on smartphones, enabling real-time on-device generation.

Jun 15, 2026 · 84% relevant

Mirage Probes Paper Reveals Two Distinct VLM Failure Modes

The Mirage Probes paper reveals that VLMs have two distinct failure modes, textual biases and spurious images, which require different mitigations. Text cleaning fixes only the first; the second needs representational interventions.

Jun 15, 2026 · 90% relevant

NVIDIA NVFP4 on Blackwell Cuts JAX Training by 1.8x in MaxText

NVIDIA NVFP4 on Blackwell achieves 1.8x training speedup over FP8 in JAX/MaxText with no claimed accuracy loss for models up to 70B, but larger-scale validation is needed.

Jun 8, 2026 · 85% relevant

mlx-vlm v0.6.2 Adds Gemma 4 QAT Support for Local GPUs

mlx-vlm v0.6.2 adds launch-day support for Google DeepMind's Gemma 4 QAT checkpoints, enabling local inference on consumer GPUs and edge devices with video input for the 12B model.

Jun 5, 2026 · 100% relevant

Nemotron 3 Ultra matches GPT-5.5 on physics test at 10X lower cost

Nemotron 3 Ultra matched GPT-5.5 on a physics test at 10X lower cost ($0.051 vs $0.57), highlighting MoE efficiency.

Jun 5, 2026 · 85% relevant

Google Releases Magenta RealTime 2 for Open-Weight Music Generation

Google released Magenta RealTime 2 on Hugging Face, the only open-weights model for real-time continuous music generation on device with ~200ms latency.

Jun 3, 2026 · 85% relevant

Median Coding Agent Hits 96k Input Tokens, Rewriting Inference Economics

SemiAnalysis found that the median coding agent request uses 96k input tokens, based on a sample of 432k requests, shifting the focus of inference cost from output to context.

May 22, 2026 · 95% relevant

ColPali Beats OCR Pipelines for Document RAG: 8× Storage Cost, 0% Chunking

ColPali eliminates OCR and chunking for document-heavy RAG by encoding each 16×16 image patch into a 128-dim vector. It outperforms prior SOTA on the ViDoRe benchmark but costs 8× more storage per page.

May 18, 2026 · 84% relevant

Qwen 3.6 27B Hits 34 tok/s on M5 Max MacBook Pro

Qwen 3.6 27B hits 34 tok/s on M5 Max MacBook Pro with 90% acceptance rate, per @rohanpaul_ai. Shows viable local LLM inference on Apple Silicon.

May 14, 2026 · 75% relevant

Gemini Flash Rumored at 92% of GPT-5.5 Coding, 15-20x Cheaper

Unconfirmed rumor claims Gemini Flash achieves 92% of GPT-5.5 coding performance at 15-20x lower cost. Source is a single X post; no official confirmation.

May 14, 2026 · 89% relevant

Perplexity Claims 3x Blackwell Inference Throughput for 70B Models

Perplexity AI claims 3x inference throughput for 70B models on Nvidia Blackwell GPUs via FP4 and custom scheduling. The gain exceeds Nvidia's own 2x marketing claim.

May 12, 2026 · 85% relevant
