LASAR (Latent Adaptive Semantic Aligned Reasoning), a new SFT-then-RL framework, nearly halves latent reasoning steps while improving recommendation quality, achieving a roughly 20x speedup over explicit CoT text generation on three real-world datasets.
Key facts
- 20x faster than explicit CoT text generation.
- Nearly halves average latent step count.
- Outperforms all baselines on 3 real-world datasets.
- Uses GRPO-based RL + REINFORCE for adaptive depth.
- Policy Head predicts per-sample reasoning depth.
The Latency Problem in Generative Recommendation
Large Language Models (LLMs) have proven powerful for generative recommendation (GenRec) via Chain-of-Thought (CoT) reasoning, but token-by-token generation creates unacceptable latency for real-time systems. Latent reasoning, which performs multi-step inference in continuous hidden-state space instead of decoding intermediate tokens, offers a cheaper alternative (see the sketch below). Applying it to GenRec, however, surfaces three core challenges:
- Semantic ID (SID) symbols lack pre-trained semantics, complicating joint optimization.
- Without supervision from explicit reasoning chains, latent representations drift.
- A fixed global reasoning depth is suboptimal across samples of varying difficulty.
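To make the latency argument concrete, here is a minimal sketch of a latent reasoning loop, assuming a Coconut-style design in PyTorch where the model's last hidden state is appended back as the next input embedding instead of being decoded into text. The function and argument names are illustrative assumptions, not LASAR's actual API.

```python
# Hypothetical latent reasoning loop: each step is one forward pass with no
# token sampling, so there is no autoregressive decoding cost per step.
import torch

def latent_reasoning(model, input_embeds, num_steps):
    """Run num_steps latent steps and return the extended embedding sequence.

    model: a transformer accepting inputs_embeds and returning hidden states
    input_embeds: (batch, seq_len, d_model) embeddings of the user history
    """
    embeds = input_embeds
    for _ in range(num_steps):
        hidden = model(inputs_embeds=embeds).last_hidden_state  # (B, T, d)
        thought = hidden[:, -1:, :]  # last position acts as a "latent thought"
        embeds = torch.cat([embeds, thought], dim=1)
    return embeds  # the sequence now carries the latent reasoning trace
```

Because no intermediate tokens are sampled or detokenized, each reasoning step costs a single forward pass, which is the source of latent reasoning's speed advantage over token-by-token CoT.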
How LASAR Works
LASAR addresses these with a two-stage pipeline: supervised fine-tuning (SFT) followed by reinforcement learning (RL). Stage 1 grounds SID semantics before Stage 2 introduces latent reasoning, which ensures efficient convergence. To mitigate drift, a step-wise bidirectional KL divergence, computed against hidden-state anchors extracted from explicit CoT text, constrains the latent trajectory (a sketch of this alignment loss follows). A Policy Head predicts per-sample reasoning depth, dynamically allocating steps. During the GRPO-based RL phase, terminal-only KL alignment handles variable-length reasoning, while REINFORCE optimizes the Policy Head (also sketched below). [According to the LASAR arXiv preprint]
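A minimal sketch of the step-wise bidirectional KL idea, assuming PyTorch and treating each hidden state as a distribution via a softmax over its dimensions; the temperature, the batch-mean reduction, and the function name are illustrative assumptions, not details from the preprint.

```python
# Symmetric KL alignment between latent reasoning states and anchors
# distilled from explicit CoT hidden states, penalized at every step.
import torch
import torch.nn.functional as F

def bidirectional_kl_alignment(latent_states, cot_anchors, temperature=1.0):
    """latent_states, cot_anchors: (batch, num_steps, d_model), step-aligned."""
    p_latent = F.log_softmax(latent_states / temperature, dim=-1)
    p_anchor = F.log_softmax(cot_anchors / temperature, dim=-1)
    # F.kl_div(input, target) computes KL(target || input) for log-space inputs
    kl_a = F.kl_div(p_latent, p_anchor, log_target=True, reduction="batchmean")  # KL(anchor || latent)
    kl_b = F.kl_div(p_anchor, p_latent, log_target=True, reduction="batchmean")  # KL(latent || anchor)
    return kl_a + kl_b
```

Summing both directions keeps the latent trajectory near the CoT-derived anchors without forcing an exact match in either direction alone.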

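And a hedged sketch of the adaptive-depth component: a Policy Head samples a per-sample depth and is updated with REINFORCE, with a per-step cost so extra latent steps must pay for themselves in reward. The head architecture, the reward shaping, and the `reward_fn` hook are assumptions for illustration; per the preprint, the recommendation backbone itself is trained with GRPO.

```python
# Hypothetical per-sample depth policy trained with REINFORCE.
import torch
import torch.nn as nn

class DepthPolicyHead(nn.Module):
    def __init__(self, d_model, max_depth):
        super().__init__()
        # Logits over max_depth candidate depths (0-indexed step counts)
        self.proj = nn.Linear(d_model, max_depth)

    def forward(self, pooled_state):
        return torch.distributions.Categorical(logits=self.proj(pooled_state))

def reinforce_step(policy_head, pooled_state, reward_fn, optimizer, step_cost=0.01):
    """One REINFORCE update: sample a depth per sample, reward recommendation
    quality minus a per-step cost so the head spends steps only when needed."""
    dist = policy_head(pooled_state)
    depth = dist.sample()                                   # shape (batch,)
    reward = reward_fn(depth) - step_cost * depth.float()   # quality - cost
    loss = -(dist.log_prob(depth) * reward.detach()).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return depth, loss.item()
```

The step cost is what drives the reported reduction in average latent steps: depths that do not improve the recommendation reward become strictly unprofitable.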
Results and Implications
Experiments on three real-world datasets show LASAR outperforming all baselines while adding only marginal inference latency. The headline numbers: roughly 20x faster than generating explicit CoT text, with the average latent step count nearly halved and recommendation quality improved at the same time. This addresses a major practical bottleneck for deploying LLM-based recommenders at scale.

The unique angle: LASAR demonstrates that adaptive, per-sample reasoning depth, rather than a fixed number of latent steps, is critical for both speed and accuracy. This mirrors broader trends in LLM inference optimization (e.g., speculative decoding, dynamic computation) and suggests future GenRec systems may abandon fixed-depth architectures entirely.
What to watch
Watch for open-source implementations of LASAR on GitHub, and for whether production recommender systems (e.g., YouTube, TikTok, Amazon) adopt adaptive latent reasoning within the next 12 months. The metric to track: the latency-vs-accuracy trade-off in A/B tests at scale.