What Happened: A Benchmark for Industrial Generative Recommendation
Tencent has taken a significant step to accelerate research in generative recommender systems (GeneRec) by launching the Tencent Advertising Algorithm Challenge 2025 and publicly releasing two associated datasets: TencentGR-1M and TencentGR-10M. This initiative directly addresses a critical gap in the field: the lack of large-scale, realistic, and fully multi-modal public benchmarks designed specifically for generative recommendation in an industrial advertising context.
The core innovation is the data itself. Constructed from de-identified Tencent Ads logs, these datasets provide sequential user interaction data at a massive scale:
- TencentGR-1M (Preliminary Track): 1 million user sequences, with up to 100 interacted items per user. Each interaction is labeled with exposure and click signals.
- TencentGR-10M (Final Track): Scales to 10 million users and introduces a crucial refinement: it explicitly distinguishes between click and conversion events at both the sequence and target item level. This allows models to be optimized not just for engagement, but for high-value business outcomes.
The datasets are also "all-modality." Each item is represented not only by collaborative identifiers (IDs) but also by rich multi-modal embeddings (likely covering text, image, and video content) extracted using state-of-the-art models. This structure pushes researchers to build systems that fuse traditional collaborative filtering signals with deep semantic content understanding, a necessity for modern luxury and retail platforms.
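To make the "IDs plus multi-modal embeddings" idea concrete, here is a minimal sketch of how one item and one interaction might be represented, and how an ID embedding could be fused with content embeddings. The class names, field names, modality names ("text", "image"), and the concatenation-based fusion are all assumptions for illustration; the actual TencentGR schema and any fusion used by challenge entrants may differ.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Item:
    item_id: int
    # Hypothetical layout: modality name -> precomputed embedding vector.
    modal_emb: Dict[str, List[float]] = field(default_factory=dict)


@dataclass
class Interaction:
    item: Item
    action: str  # e.g. "exposure", "click", "conversion" (label names are assumed)


def l2_normalize(vec: List[float]) -> List[float]:
    """Scale a vector to unit length; all-zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse(id_emb: List[float], item: Item, modal_dims: Dict[str, int]) -> List[float]:
    """Concatenate the ID embedding with each normalized modality embedding.

    Modalities missing for an item are zero-filled so that every item
    yields a vector of the same length.
    """
    fused = list(id_emb)
    for name, dim in modal_dims.items():
        vec = item.modal_emb.get(name)
        fused.extend(l2_normalize(vec) if vec is not None else [0.0] * dim)
    return fused


# Usage: an item with a text embedding but no image embedding.
item = Item(item_id=7, modal_emb={"text": [3.0, 4.0]})
history = [Interaction(item, "click")]
fused = fuse([0.5], history[0].item, {"text": 2, "image": 3})
```

Concatenation is only the simplest option; learned gating or cross-attention over modalities are common alternatives when some modalities are noisy or frequently missing.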
The competition and dataset release have already catalyzed new research, as evidenced by two accompanying papers that tackle persistent challenges in the GeneRec paradigm.
Technical Details: Addressing Core Challenges in Generative Recommendation
The source material highlights three key technical threads emerging from this ecosystem.
1. The TencentGR Datasets & Challenge
The datasets map users' historical behavior into sequences of discrete tokens, each representing an item. The task for GeneRec models is to autoregressively predict the next item a user will interact with, conditioned on their past sequence and the rich multi-modal context. The evaluation protocol uses a weighted metric that counts conversion events more heavily than simple clicks, aligning model performance more directly with business ROI.
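A weighted metric of this kind can be sketched as follows. This is an illustration only: the weight values in ACTION_WEIGHTS are hypothetical (the text says only that conversions count more than clicks), and the choice of hit rate and NDCG at K as the base metrics is an assumption rather than the challenge's published formula.

```python
import math
from typing import Dict, Hashable, List, Sequence

# Hypothetical weights; only the ordering (conversion > click) comes from the text.
ACTION_WEIGHTS: Dict[str, float] = {"click": 1.0, "conversion": 2.5}


def weighted_hit_rate_at_k(
    ranked: Sequence[List[Hashable]],
    targets: Sequence[Hashable],
    actions: Sequence[str],
    k: int = 10,
) -> float:
    """Share of total action weight whose target item appears in the top k."""
    total, hits = 0.0, 0.0
    for items, target, action in zip(ranked, targets, actions):
        w = ACTION_WEIGHTS[action]
        total += w
        if target in items[:k]:
            hits += w
    return hits / total if total else 0.0


def weighted_ndcg_at_k(
    ranked: Sequence[List[Hashable]],
    targets: Sequence[Hashable],
    actions: Sequence[str],
    k: int = 10,
) -> float:
    """Like the weighted hit rate, but each hit is discounted by its rank."""
    total, gain = 0.0, 0.0
    for items, target, action in zip(ranked, targets, actions):
        w = ACTION_WEIGHTS[action]
        total += w
        top = items[:k]
        if target in top:
            rank = top.index(target)  # 0-based position in the list
            gain += w / math.log2(rank + 2)
    return gain / total if total else 0.0
```

With one target item per user, the ideal DCG for each user is simply that user's weight, so dividing by the total weight normalizes the score to the range 0 to 1.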
2. CRAB: Combating Popularity Bias in GeneRec
A major weakness of current GeneRec models is their tendency to amplify popularity bias—over-recommending popular items at the expense of niche or new products. The paper "CRAB" identifies two root causes: (1) imbalanced tokenization that inherits historical bias, and (2) training procedures that favor frequent tokens.
CRAB proposes a post-hoc debiasing strategy. After a model is trained, it rebalances the semantic token codebook by splitting over-popular tokens while preserving their hierarchical semantic relationships. It then introduces a tree-structured regularizer during further training to enhance semantic consistency for unpopular tokens, encouraging more informative representations. This is a critical advancement for luxury retail, where the long tail of products and new collections must be surfaced effectively.
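The codebook rebalancing step can be illustrated with a toy example. The sketch below is not CRAB's actual algorithm: the load threshold, the greedy two-way split, and the "parent/child" naming scheme are assumptions chosen to show the general idea of splitting an over-popular token while keeping its parent prefix. The tree-structured regularizer is not covered.

```python
from collections import defaultdict
from typing import Dict, List


def split_popular_tokens(
    item_token: Dict[str, str],
    item_pop: Dict[str, int],
    max_load: int,
) -> Dict[str, str]:
    """Return a new item -> token map in which overloaded tokens are split.

    A token is "overloaded" when the summed popularity of its items exceeds
    max_load. Its items are divided between two child tokens named
    "<token>/0" and "<token>/1", so the parent stays recoverable as a prefix.
    Only one split pass is performed; a child may still exceed max_load.
    """
    members: Dict[str, List[str]] = defaultdict(list)
    for item, tok in item_token.items():
        members[tok].append(item)

    new_assign: Dict[str, str] = {}
    for tok, items in members.items():
        load = sum(item_pop[i] for i in items)
        if load <= max_load or len(items) < 2:
            for i in items:
                new_assign[i] = tok  # token is fine (or cannot be split)
            continue
        # Greedy balancing: place the heaviest items first, each into the
        # child that currently carries less popularity.
        child_load = [0, 0]
        for i in sorted(items, key=lambda x: (-item_pop[x], x)):
            child = 0 if child_load[0] <= child_load[1] else 1
            new_assign[i] = f"{tok}/{child}"
            child_load[child] += item_pop[i]
    return new_assign
```

In a real semantic-ID tokenizer the split would more plausibly follow embedding similarity (for example, re-clustering the token's members) rather than popularity alone; the popularity-only split here just keeps the example short.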
3. NSGR: A Tree-Based Generative Reranking Framework
Reranking, the final stage in which a candidate set is ordered into a final list, is vital for modeling item-item context. The paper "NSGR" proposes a Next-Scale Generation Reranking framework to solve two problems: generators that lack both a local and a global perspective, and inconsistent goals between the generator and the evaluator during training.
NSGR uses a next-scale generator (NSG) that builds a recommendation list in a coarse-to-fine manner, progressively expanding from broad user interests to specific items. It is guided by a multi-scale evaluator (MSE) that provides scale-specific feedback via a novel tree-based loss. This approach, already deployed on Meituan's platform, creates more coherent and contextually appropriate final lists.
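The coarse-to-fine idea can be illustrated with a hand-written heuristic. This is not the NSGR model, which is learned and guided by an evaluator; the two-scale structure (categories first, then items), the tuple layout, and the round-robin fill are assumptions made purely to show what "broad interests first, specific items second" can look like in code.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Candidate = Tuple[str, str, float]  # (item_id, category, relevance score)


def coarse_to_fine_rerank(candidates: List[Candidate], list_len: int) -> List[str]:
    """Build a list in two scales: rank categories, then fill in items."""
    by_cat: Dict[str, List[Tuple[float, str]]] = defaultdict(list)
    for item_id, cat, score in candidates:
        by_cat[cat].append((score, item_id))

    # Coarse scale: order categories by their total score (ties by name).
    for cat in by_cat:
        by_cat[cat].sort(key=lambda p: (-p[0], p[1]))
    cat_order = sorted(by_cat, key=lambda c: (-sum(s for s, _ in by_cat[c]), c))

    # Fine scale: walk the categories in coarse order, taking each
    # category's best remaining item, until the list is full.
    result: List[str] = []
    depth = 0
    while len(result) < list_len:
        added = False
        for cat in cat_order:
            if depth < len(by_cat[cat]) and len(result) < list_len:
                result.append(by_cat[cat][depth][1])
                added = True
        if not added:
            break  # every category is exhausted
        depth += 1
    return result
```

A pure score sort would place both top-scoring items from the strongest category first; deciding the category layout before the items is what lets the list cover several interests early.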
Retail & Luxury Implications: From Research to Personalization
While these are research papers, they point to the near-future architecture of high-end retail recommendation systems.

The All-Modality Imperative: For luxury, an item's story—craftsmanship, material, heritage—is as important as its collaborative popularity. A system that can tokenize and sequence not just IDs but also visual aesthetics, descriptive copy, and campaign imagery can move beyond "users who bought this also bought" to "users who love this aesthetic and narrative might also appreciate." The TencentGR datasets provide the blueprint for training such systems.
Debiasing for Discovery and Curation: Popularity bias is the enemy of curation and discovery. A system that only recommends best-sellers stifles new designers and fails the savvy customer seeking distinction. Techniques like CRAB are essential for platforms aiming to be tastemakers, ensuring their algorithms can elevate emerging talent and deep-catalog items with strong semantic relevance to a user's refined taste profile.
Reranking as Experiential Design: The final presentation of items is a core part of the digital experience. A generative reranker like NSGR can learn to construct lists that tell a visual or thematic story—curating a capsule wardrobe, building a collection of complementary accessories, or sequencing products in a way that mirrors a brand's narrative journey. This transforms the recommendation shelf from a static set of items into a dynamically generated, context-aware experience.
The path from these arXiv preprints to production is non-trivial, requiring significant MLOps investment and integration with existing e-commerce stacks. However, they clearly delineate the next competitive frontier: recommendation as a holistic, multi-modal, generative user modeling task.









