Scaling Multilingual Semantic Search in Uber Eats Delivery: A Production Blueprint
The Innovation: A Unified Semantic Retrieval Engine
Uber Eats has published a detailed technical paper outlining the architecture and implementation of a production-grade semantic search system designed to handle the complex, multilingual demands of its food and grocery delivery platform. The core challenge addressed is unifying retrieval across three distinct but interconnected verticals: stores (restaurants/retailers), dishes (menu items), and grocery/retail items. This moves beyond traditional keyword matching to understand user intent across languages and contexts.
The system is built on a two-tower encoder architecture, where one tower encodes the user's search query and the other encodes the candidate documents (store pages, dish descriptions, product listings). The foundation model is Qwen2, a state-of-the-art multilingual large language model, which is then fine-tuned specifically for Uber Eats' domain.
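To make the two-tower idea concrete, here is a minimal numpy sketch. The `Tower` class is a toy stand-in for the fine-tuned Qwen2 encoders (a real system would use a transformer); the key structural point it illustrates is that queries and documents are encoded independently and compared with a simple dot product, which is what makes offline indexing and fast nearest-neighbor retrieval possible.

```python
import numpy as np

rng = np.random.default_rng(0)

class Tower:
    """Toy encoder: an embedding lookup plus a linear projection.
    A stand-in for the fine-tuned Qwen2 towers described in the paper."""
    def __init__(self, vocab_size, dim):
        self.emb = rng.normal(size=(vocab_size, dim)) / np.sqrt(dim)
        self.proj = rng.normal(size=(dim, dim)) / np.sqrt(dim)

    def encode(self, token_ids):
        pooled = self.emb[token_ids].mean(axis=0)   # mean-pool token vectors
        vec = pooled @ self.proj
        return vec / np.linalg.norm(vec)            # unit-normalize

query_tower, doc_tower = Tower(1000, 64), Tower(1000, 64)

q = query_tower.encode([12, 47, 5])                 # e.g. a tokenized query
docs = np.stack([doc_tower.encode([12, 47, 5, 9]),  # candidate documents
                 doc_tower.encode([301, 777])])
scores = docs @ q                                   # cosine similarity per doc
best = int(np.argmax(scores))
```

Because the document tower never sees the query, all candidate embeddings can be computed once offline and stored in a vector index; only the query is encoded at request time.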
Technical Architecture: Training at Scale for Real-World Performance
The paper provides a rare look into the industrial-scale machine learning required for a global platform.
1. Data Curation and Training:
The model is fine-tuned using hundreds of millions of query-document interactions, which were aggregated and anonymized. This massive, real-world dataset is the fuel for teaching the model the nuanced relationships between search phrases and relevant results across different markets and verticals.
2. Advanced Loss Functions:
Training employs a sophisticated combination of loss functions to ensure robust learning:
- InfoNCE (information noise-contrastive estimation) on in-batch negatives: This is a standard but effective approach where, within a training batch, all items not paired with a given query are treated as negatives. It's computationally efficient because the negatives come for free from the batch itself.
- Triplet-NCE loss on hard negatives: This is a critical enhancement. Instead of random negatives, the system identifies "hard negatives"—items that are semantically similar to the positive target but are not correct matches (e.g., searching for "pad thai" and retrieving a similar-looking but different noodle dish). Explicitly training the model to separate these hard cases significantly improves precision.
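The two losses above can be sketched in a few lines of numpy. This is an illustrative reconstruction, not Uber's implementation: `info_nce` treats the diagonal of the query-document similarity matrix as the positives and every other column as an in-batch negative, while `triplet_loss` adds a margin penalty against explicitly mined hard negatives. The temperature and margin values are placeholder hyperparameters.

```python
import numpy as np

def info_nce(q, d, temperature=0.05):
    """In-batch InfoNCE: row i of q is paired with row i of d;
    all other rows in the batch serve as negatives."""
    logits = (q @ d.T) / temperature              # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_softmax))         # -log p(correct doc | query)

def triplet_loss(q, pos, hard_neg, margin=0.2):
    """Push each hard negative at least `margin` below its positive."""
    gap = margin - np.sum(q * pos, axis=1) + np.sum(q * hard_neg, axis=1)
    return np.maximum(0.0, gap).mean()

rng = np.random.default_rng(1)
norm = lambda x: x / np.linalg.norm(x, axis=1, keepdims=True)
q = norm(rng.normal(size=(4, 64)))                # 4 queries
d = norm(q + 0.1 * rng.normal(size=(4, 64)))      # positives, near their queries
hard = norm(d + 0.05 * rng.normal(size=(4, 64)))  # hard negatives, near positives

loss = info_nce(q, d) + triplet_loss(q, d, hard)
```

The combination matters: InfoNCE alone rarely sees negatives as confusable as the hard examples, so the triplet term supplies the fine-grained separation signal.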
3. Matryoshka Representation Learning (MRL):
A key innovation for production efficiency is the use of MRL. This technique allows a single model to produce embeddings of multiple dimensions (e.g., 768, 384, 192). The full-size embedding is used for maximum accuracy in offline indexing and candidate generation. For the ultra-low-latency online retrieval phase, a smaller, truncated version of the same embedding can be used, drastically reducing memory footprint and speeding up similarity searches without maintaining separate models.
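The operational trick behind MRL is that the first k coordinates of the full embedding are themselves trained to be a usable embedding, so serving can simply slice and re-normalize. A minimal sketch (the dimensions 768 and 192 follow the example sizes above; the vectors here are random placeholders, not MRL-trained):

```python
import numpy as np

def truncate_embedding(vec, dim):
    """With MRL, the first `dim` coordinates of a full embedding form a
    valid lower-dimensional embedding; re-normalize after truncating so
    cosine similarity remains meaningful."""
    small = vec[..., :dim]
    return small / np.linalg.norm(small, axis=-1, keepdims=True)

rng = np.random.default_rng(2)
full = rng.normal(size=(5, 768))
full /= np.linalg.norm(full, axis=1, keepdims=True)

offline_index = full                           # 768-d: max-accuracy offline use
online_index = truncate_embedding(full, 192)   # 192-d: low-latency serving

q = truncate_embedding(full[0], 192)           # query against the small index
scores = online_index @ q
```

One trained artifact thus serves both regimes: the same stored vectors back the high-accuracy offline index and, after slicing, the memory-light online index.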
4. Multilingual & Multi-Vertical Design:
The system is built from the ground up to handle multiple languages across Uber Eats' international markets. The Qwen2 base provides strong multilingual capabilities, which are then specialized by the fine-tuning data. The "unified" aspect means a single retrieval model understands that a search for "milk" could be relevant in the "grocery items" vertical, while "chocolate milk" might be a "dish" in a restaurant vertical, and both could appear from stores in the "convenience" category.
Evaluation and Results
The paper reports substantial recall gains over a strong baseline across six different markets and the three core verticals (stores, dishes, items). Recall—the ability to retrieve all relevant items—is paramount in the first stage of a search or recommendation system, as items not retrieved here can never be presented to the user later. Improving recall at this stage directly expands the pool of high-quality candidates for downstream ranking models.
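The recall metric behind these gains is simple to state precisely. A small self-contained example (the document IDs are invented for illustration):

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant items that appear in the top-k retrieved list."""
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

# Toy evaluation for one query with 3 relevant documents
retrieved = ["store_7", "dish_2", "item_9", "dish_4", "store_1"]
relevant = ["dish_2", "store_1", "item_33"]

r5 = recall_at_k(retrieved, relevant, 5)   # 2 of 3 relevant items in top-5
r2 = recall_at_k(retrieved, relevant, 2)   # 1 of 3 in top-2
```

Note that `item_33` is unrecoverable downstream once it misses the top-k cut, which is exactly why first-stage recall is the metric that matters here.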
Business Impact: Beyond Food Delivery
While implemented for Uber Eats, the architectural principles and lessons are directly transferable to any complex retail or luxury e-commerce environment facing similar challenges:
Unified Discovery Across Categories: A luxury conglomerate's platform might need to seamlessly search across fashion (apparel, handbags), beauty (skincare, makeup), jewelry, and home goods. A unified semantic model can understand that a search for "gold" could refer to jewelry, a clothing trim, or a home decor finish, retrieving relevant results from all appropriate categories.
Multilingual Search and Discovery: Global brands require search that works intuitively in Paris, Tokyo, and Dubai. A model trained on diverse, anonymized interaction data can capture the varied ways customers describe products in different languages and cultural contexts (e.g., "little black dress" vs. "petite robe noire").
Handling Ambiguous and Lifestyle Queries: Customers often search with intent or lifestyle phrases rather than precise product names (e.g., "outfit for a summer wedding," "gift for a new mother," "capsule wardrobe essentials"). A semantic system trained on real interactions learns to map these queries to relevant products, stores, or curated content.
Implementation Approach & Practical Lessons
The Uber Eats team shares key insights for practitioners:
- Data is Paramount: The scale and quality of the fine-tuning data (hundreds of millions of interactions) are non-negotiable for performance. Curating a similar dataset—from clickstreams, purchases, and search logs—would be the first major hurdle for a retail company.
- Hard Negatives are Critical: Investing in infrastructure to mine hard negative examples during training is a high-return effort for improving model discrimination.
- MRL for Operational Flexibility: Adopting Matryoshka Representation Learning is a strategic choice that simplifies model deployment and maintenance by allowing a trade-off between accuracy and latency from a single trained artifact.
- End-to-End System View: The paper emphasizes the entire pipeline—data curation, model architecture, large-scale training, and evaluation. Success depends on engineering excellence across all components, not just the core AI model.
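On the hard-negatives point above, a common mining pattern (sketched here with brute-force search; a production system would use an approximate nearest-neighbor index, and this is an assumed approach rather than the paper's exact pipeline) is to rank all documents with the current model and keep the top-scoring ones that are not labeled positive:

```python
import numpy as np

def mine_hard_negatives(query_vecs, doc_vecs, positives, n_neg=2):
    """For each query, take its highest-scoring documents (under the current
    model) that are NOT labeled positive: semantically close, but wrong."""
    scores = query_vecs @ doc_vecs.T                 # (Q, D) similarity matrix
    hard = []
    for qi, pos_ids in enumerate(positives):
        ranked = np.argsort(-scores[qi])             # doc indices, best first
        negs = [int(d) for d in ranked if int(d) not in pos_ids][:n_neg]
        hard.append(negs)
    return hard

rng = np.random.default_rng(3)
norm = lambda x: x / np.linalg.norm(x, axis=1, keepdims=True)
docs = norm(rng.normal(size=(10, 32)))
queries = norm(docs[[0, 3]] + 0.05 * rng.normal(size=(2, 32)))
positives = [{0}, {3}]                               # ground-truth doc per query

hard = mine_hard_negatives(queries, docs, positives)
```

Mining is typically iterated: as the model improves, it is re-run to surface fresh negatives the updated model still confuses.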
Governance & Risk Assessment
- Data Privacy & Anonymization: The paper notes the use of "aggregated and anonymized" data for fine-tuning. Any retail implementation must have robust governance to ensure customer interaction data is used ethically and in compliance with regulations like GDPR.
- Bias and Fairness: A model trained on historical interactions can inherit and amplify biases (e.g., favoring certain brands or product types). Continuous monitoring and evaluation across different user segments and markets are essential.
- Model Maturity: The architecture described is production-proven at massive scale. The core components (two-tower encoders, contrastive learning, MRL) are established techniques in industrial ML. The main challenge for other organizations is not the novelty of the tech, but the engineering effort required to collect the data and build the robust training and serving infrastructure.
- Computational Cost: Fine-tuning large foundation models on hundreds of millions of examples requires significant GPU resources. The inference cost, mitigated by MRL, must be calculated against the expected uplift in conversion and engagement.