Inference Optimization
Inference optimization is the engineering discipline of making trained AI models run faster, with less memory, and at lower cost during prediction — without meaningfully degrading their accuracy. Core techniques include quantization (reducing numerical precision of weights), speculative decoding (using a cheap draft model to propose tokens a larger model then verifies), KV-cache management, kernel fusion, and continuous batching. The field spans the full stack from algorithmic changes (pruning, distillation) to hardware-aware runtime tuning (CUDA graphs, TensorRT engines, FlashAttention).
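To make the quantization idea above concrete, here is a minimal sketch of symmetric 8-bit linear quantization in plain Python. The function names and sample weights are hypothetical, chosen only for illustration. Production libraries work on tensors and usually keep one scale per channel or per group rather than a single shared scale.

```python
from typing import List, Tuple


def quantize_symmetric_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    # Guard against an all-zero input, where any scale would do.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate floats from the stored integers."""
    return [v * scale for v in quantized]


# Hypothetical weights; the largest magnitude (1.27) maps to -127.
weights = [0.42, -1.27, 0.05, 0.9]
q, scale = quantize_symmetric_int8(weights)
restored = dequantize(q, scale)

# Rounding to the nearest integer bounds the error by half a step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Each weight is stored in one byte instead of four (for float32), at the price of a rounding error no larger than half the scale. An asymmetric scheme would add a zero-point offset so that a skewed value range uses the integer grid more fully.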
As organizations scale LLM deployments to millions of requests per day, GPU compute often dominates the infrastructure bill; at a fixed request volume, a 2-4x throughput improvement cuts GPU cost per request by roughly 50-75% without changing the model. Serving frameworks such as vLLM, TensorRT-LLM, and SGLang have become production-critical infrastructure, and companies actively hire engineers who can profile bottlenecks, tune batching strategies, and ship quantized models that meet latency SLOs. Regulatory and sustainability pressure to reduce AI energy consumption is also pushing the industry to treat inference efficiency as a first-class product requirement.
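The link between throughput and spend can be worked through with simple arithmetic. The sketch below assumes a fixed request volume and a GPU billed by the hour; the hourly price and baseline throughput are hypothetical placeholders, not benchmarks.

```python
def cost_per_million_tokens(gpu_hourly_usd: float, tokens_per_second: float) -> float:
    """GPU cost in USD to generate one million tokens at a steady rate."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hourly_usd / tokens_per_hour * 1_000_000


def savings_fraction(speedup: float) -> float:
    """Fraction of per-token GPU cost removed by a given throughput speedup."""
    return 1 - 1 / speedup


# Hypothetical figures: a $2.00/hour GPU serving 1,000 tokens per second.
GPU_HOURLY_USD = 2.00
BASELINE_TOKENS_PER_SECOND = 1_000.0

baseline_cost = cost_per_million_tokens(GPU_HOURLY_USD, BASELINE_TOKENS_PER_SECOND)
for speedup in (2, 4):
    new_cost = cost_per_million_tokens(
        GPU_HOURLY_USD, BASELINE_TOKENS_PER_SECOND * speedup
    )
    print(
        f"{speedup}x throughput: ${baseline_cost:.3f} -> ${new_cost:.3f} "
        f"per 1M tokens ({savings_fraction(speedup):.0%} saved)"
    )
```

Under these assumptions a 2x speedup halves the per-token cost and a 4x speedup removes three quarters of it. Real savings depend on whether the freed capacity is actually released or reused.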
🎓 Courses
Quantization Fundamentals with Hugging Face
by Younes Belkada and Marc Sun (Hugging Face)
The entry point for model compression: covers linear quantization, symmetric vs asymmetric modes, and applying quantization to real open-source LLMs with the Quanto library. Free short course, solid foundation before diving deeper.
Quantization in Depth
by Hugging Face (via DeepLearning.AI)
Goes beyond fundamentals: build a custom 8-bit quantizer from scratch in PyTorch, implement per-channel and per-group quantization, and pack 2-bit weights. Directly applicable to production model compression work.
Efficient Inference with SGLang: Text and Image Generation
by SGLang team (via DeepLearning.AI)
Teaches KV cache mechanics and RadixAttention to eliminate redundant computation. Covers both text and vision workloads, and directly addresses cost reduction for production LLM serving.
LLM Inference Optimization — Hugging Face Transformers Docs
by Hugging Face engineering team
Official, continuously updated reference covering static KV cache, torch.compile integration (up to 4x speedup), FlashAttention, and speculative decoding inside the Transformers library. Essential reading for anyone deploying HF models.
Mastering LLM Techniques: Inference Optimization
by NVIDIA engineering team
Comprehensive practitioner guide from NVIDIA covering in-flight batching, tensor parallelism, KV cache quantization, speculative decoding, and tooling such as TensorRT-LLM, vLLM, and SGLang. Written by the team that builds the hardware these systems run on.
📖 Books
Enhancing LLM Performance: Efficacy, Fine-Tuning, and Inference Techniques
Peyman Passban, Mehdi Rezagholizadeh, Andy Way (eds.) · 2025
A recent academic book (Springer, July 2025) on LLM performance, with dedicated chapters on inference acceleration, model compression, and deployment-oriented architecture choices. Suited to researchers and senior engineers who want theoretical grounding alongside practical techniques.
🛠️ Tutorials & Guides
LLM Inference Handbook
Practitioner-oriented online handbook covering tensor, pipeline, and expert parallelism, offline batch inference, and serving architecture trade-offs. Kept up to date by the team behind BentoML, with concrete code examples.
The Complete Guide to LLM Inference Optimization: vLLM, TensorRT-LLM, Speculative Decoding
End-to-end practitioner guide (2026) comparing vLLM PagedAttention, TensorRT-LLM FP8/NVFP4, and speculative decoding with working code. Includes production benchmarks, troubleshooting scenarios, and a clear recommendation on when to use each tool.
torch.compile and CUDA Graphs for LLM Inference: Production PyTorch 2.6 Guide
Concrete guide to two often-overlooked but high-impact PyTorch-native optimizations. Explains when to use CUDA graph capture vs torch.compile, how they interact with vLLM and TensorRT-LLM, and what to avoid. Practical and code-heavy.
Learning resources last updated: June 18, 2026