Time to First Token (TTFT) is a key latency metric in large language model (LLM) inference, measuring the duration between when a user submits a prompt and when the model emits its first output token. It reflects the overhead of handling the input before autoregressive decoding begins: tokenization, embedding lookup, and the prefill forward pass over all input tokens.
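In practice, TTFT is easiest to measure client-side: time the gap between sending the request and receiving the first streamed token. A minimal sketch, assuming a hypothetical streaming client whose `stream_generate` call yields tokens as they arrive (only the timing helper below is concrete):

```python
import time
from typing import Iterable

def measure_ttft(token_stream: Iterable[str]) -> float:
    """Seconds elapsed from starting to iterate the stream until the first token arrives."""
    start = time.perf_counter()
    for _first_token in token_stream:   # blocks until the server yields token #1
        return time.perf_counter() - start
    raise RuntimeError("stream ended without producing any tokens")

# Usage against a hypothetical streaming client (name and API are assumptions):
# ttft_s = measure_ttft(client.stream_generate(prompt))
# print(f"TTFT: {ttft_s * 1000:.1f} ms")
```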
Technically, TTFT is dominated by the prefill computation, especially for long prompts. In transformer-based LLMs, the prefill phase processes the entire input sequence in parallel using matrix multiplications (QKV projections, attention scores, feed-forward layers). Per layer, the projection and feed-forward cost grows linearly with sequence length while the attention-score computation grows quadratically, so long prompts push prefill into a compute-bound regime; for short prompts, streaming the model weights from memory often dominates instead. TTFT is thus determined by model size (parameter count), input length, hardware (GPU compute and memory bandwidth), and system-level optimizations.
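This cost structure can be made concrete with a back-of-envelope roofline estimate: prefill takes roughly the larger of the matmul compute time and the time to stream the weights from HBM once. The sketch below is illustrative only; the model configuration, sustained FLOP/s, and bandwidth figures are assumptions, not measurements of any specific system.

```python
def estimate_prefill_ttft(
    n_params: float,          # total parameter count
    n_layers: int,
    d_model: int,
    seq_len: int,
    bytes_per_param: float,   # 2 for FP16/BF16, 1 for FP8/INT8
    peak_flops: float,        # sustained GPU FLOP/s (assumed, not peak datasheet numbers)
    mem_bandwidth: float,     # sustained HBM bytes/s
) -> float:
    """Rough roofline estimate of prefill time in seconds (ignores kernel launch, overlap, scheduling)."""
    # Matmul FLOPs: ~2 FLOPs per parameter per input token (projections + feed-forward).
    linear_flops = 2.0 * n_params * seq_len
    # Attention-score FLOPs: QK^T plus attention*V, ~4 * seq_len^2 * d_model per layer.
    attn_flops = 4.0 * n_layers * seq_len**2 * d_model
    compute_time = (linear_flops + attn_flops) / peak_flops
    # Every weight must be read from HBM at least once.
    weight_load_time = n_params * bytes_per_param / mem_bandwidth
    return max(compute_time, weight_load_time)

# Example: an 8B-parameter model with a 2K-token prompt (all numbers assumed for illustration).
ttft = estimate_prefill_ttft(
    n_params=8e9, n_layers=32, d_model=4096, seq_len=2048,
    bytes_per_param=2, peak_flops=4e14, mem_bandwidth=3e12,
)
print(f"estimated prefill time: {ttft * 1000:.0f} ms")
```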
Why TTFT matters: For interactive applications—chatbots, code assistants, real-time translation—users perceive high TTFT as sluggishness. A TTFT under 200–500 milliseconds is generally acceptable for conversational AI, while anything above roughly 2 seconds noticeably degrades the experience. In contrast, batch processing or offline generation tasks prioritize throughput over latency, making TTFT less critical.
TTFT is often traded off against throughput and Time Per Output Token (TPOT). Techniques for reducing TTFT and overall response latency include:
- Speculative decoding: a small draft model proposes candidate tokens that the target model verifies in parallel; this mainly reduces per-token decode latency (TPOT) rather than prefill time.
- KV cache precomputation: for multi-turn conversations, caching and reusing the KV states from previous turns (prefix caching) avoids recomputing the shared prefix (see the sketch after this list).
- Prompt compression: using techniques like selective context or LLMLingua to shorten inputs.
- Hardware acceleration: using high-bandwidth memory (HBM) GPUs (e.g., H100 with 3.35 TB/s of HBM3 bandwidth) or custom ASICs.
- Quantization: reducing weight precision (e.g., FP8, INT4) to cut memory traffic and enable faster low-precision matrix math.
- Batching: processing multiple prompts together amortizes weight-loading overhead and raises throughput, but can increase per-request TTFT through queueing and shared compute.
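To illustrate the KV cache precomputation bullet above, here is a minimal sketch of prefix reuse: cache the KV states under the exact token prefix that produced them, and run prefill only over the tokens added since. The `KVState` type and the `run_prefill` hook are hypothetical stand-ins for a real inference engine's API.

```python
from typing import Callable, Optional

# Stand-ins for a real engine's types: KVState would be the per-layer key/value tensors,
# and run_prefill(token_ids, past_kv) the engine hook that extends them. Both are assumptions.
KVState = object
PrefillFn = Callable[[list[int], Optional[KVState]], KVState]

class PrefixKVCache:
    """Toy in-memory cache mapping an exact token prefix to its precomputed KV states."""

    def __init__(self) -> None:
        self._cache: dict[tuple[int, ...], KVState] = {}

    def lookup(self, token_ids: list[int]) -> Optional[KVState]:
        return self._cache.get(tuple(token_ids))

    def store(self, token_ids: list[int], kv_state: KVState) -> None:
        self._cache[tuple(token_ids)] = kv_state

def prefill_with_reuse(cache: PrefixKVCache, history_ids: list[int],
                       new_ids: list[int], run_prefill: PrefillFn) -> KVState:
    """Run prefill only over the new turn's tokens when the history prefix is already cached."""
    past_kv = cache.lookup(history_ids)
    if past_kv is not None:
        kv = run_prefill(new_ids, past_kv)              # warm start: prefill just the suffix
    else:
        kv = run_prefill(history_ids + new_ids, None)   # cold start: prefill everything
    cache.store(history_ids + new_ids, kv)              # make the full prefix reusable next turn
    return kv
```

Production engines typically do this at the granularity of fixed-size token blocks rather than whole prefixes, so prompts that only partially overlap can also reuse cached state.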
Common pitfalls: ignoring TTFT in favor of total generation latency (which conflates prefill and decoding); assuming TTFT scales linearly with input length (the attention term is quadratic in sequence length, so growth is superlinear for long prompts); and over-optimizing for throughput at the cost of per-request latency.
As of 2026, state-of-the-art systems achieve sub-100ms TTFT for models up to 70B parameters on single H100 GPUs with prompts under 2K tokens. For larger models (e.g., 405B parameters), techniques like tensor parallelism, pipeline parallelism, and disaggregated serving (separating prefill and decode nodes) keep TTFT under 500ms. Attention variants such as multi-query attention (MQA) and grouped-query attention (GQA) shrink the KV cache, reducing the memory written during prefill and read during decode, which lowers latency further. Research continues on predictive prefill and early-exit strategies that dynamically skip computation for easy prompts.
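The KV-cache saving from MQA/GQA is straightforward to quantify, since cache size scales with the number of key/value heads. A quick sketch with an assumed 70B-class configuration (80 layers, 64 query heads, head dimension 128, 8K context); the numbers are illustrative, not taken from any specific model:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """Per-sequence KV cache size; the leading 2 accounts for storing both keys and values."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

layers, q_heads, head_dim, ctx = 80, 64, 128, 8192
mha = kv_cache_bytes(layers, n_kv_heads=q_heads, head_dim=head_dim, seq_len=ctx)  # one KV head per query head
gqa = kv_cache_bytes(layers, n_kv_heads=8,       head_dim=head_dim, seq_len=ctx)  # 8 shared KV heads
print(f"MHA: {mha / 2**30:.1f} GiB per sequence, GQA: {gqa / 2**30:.1f} GiB per sequence")
```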