Scaling Multilingual Semantic Search in Uber Eats Delivery: A Production Blueprint
The Innovation: A Unified Semantic Retrieval Engine
Uber Eats has published a detailed technical paper outlining the architecture and implementation of a production-grade semantic search system designed to handle the complex, multilingual demands of its food and grocery delivery platform. The core challenge addressed is unifying retrieval across three distinct but interconnected verticals: stores (restaurants/retailers), dishes (menu items), and grocery/retail items. This moves beyond traditional keyword matching to understand user intent across languages and contexts.
The system is built on a two-tower encoder architecture, where one tower encodes the user's search query and the other encodes the candidate documents (store pages, dish descriptions, product listings). The foundation model is Qwen2, a state-of-the-art multilingual large language model, which is then fine-tuned specifically for Uber Eats' domain.
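To make the two-tower idea concrete, here is a minimal numpy sketch. The `Tower` class is a toy stand-in for the fine-tuned Qwen2 encoders (a real system would use a transformer); the key structural point it illustrates is that queries and documents are encoded independently and compared with a simple dot product, which is what makes offline indexing and fast nearest-neighbor retrieval possible.

```python
import numpy as np

rng = np.random.default_rng(0)

class Tower:
    """Toy encoder: an embedding lookup plus a linear projection.
    A stand-in for the fine-tuned Qwen2 towers described in the paper."""
    def __init__(self, vocab_size, dim):
        self.emb = rng.normal(size=(vocab_size, dim)) / np.sqrt(dim)
        self.proj = rng.normal(size=(dim, dim)) / np.sqrt(dim)

    def encode(self, token_ids):
        pooled = self.emb[token_ids].mean(axis=0)   # mean-pool token vectors
        vec = pooled @ self.proj
        return vec / np.linalg.norm(vec)            # unit-normalize

query_tower, doc_tower = Tower(1000, 64), Tower(1000, 64)

q = query_tower.encode([12, 47, 5])                 # e.g. a tokenized query
docs = np.stack([doc_tower.encode([12, 47, 5, 9]),  # candidate documents
                 doc_tower.encode([301, 777])])
scores = docs @ q                                   # cosine similarity per doc
best = int(np.argmax(scores))
```

Because the document tower never sees the query, all candidate embeddings can be computed once offline and stored in a vector index; only the query is encoded at request time.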
Technical Architecture: Training at Scale for Real-World Performance
The paper provides a rare look into the industrial-scale machine learning required for a global platform.
1. Data Curation and Training:
The model is fine-tuned using hundreds of millions of query-document interactions, which were aggregated and anonymized. This massive, real-world dataset is the fuel for teaching the model the nuanced relationships between search phrases and relevant results across different markets and verticals.
2. Advanced Loss Functions:
Training employs a sophisticated combination of loss functions to ensure robust learning:
- InfoNCE (information noise-contrastive estimation) on in-batch negatives: This is a standard but effective approach where, within a training batch, all items not paired with a given query are treated as negatives. It's computationally efficient because the negatives come for free from the batch itself.
- Triplet-NCE loss on hard negatives: This is a critical enhancement. Instead of random negatives, the system identifies "hard negatives"—items that are semantically similar to the positive target but are not correct matches (e.g., searching for "pad thai" and retrieving a similar-looking but different noodle dish). Explicitly training the model to separate these hard cases significantly improves precision.
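The two losses above can be sketched in a few lines of numpy. This is an illustrative reconstruction, not Uber's implementation: `info_nce` treats the diagonal of the query-document similarity matrix as the positives and every other column as an in-batch negative, while `triplet_loss` adds a margin penalty against explicitly mined hard negatives. The temperature and margin values are placeholder hyperparameters.

```python
import numpy as np

def info_nce(q, d, temperature=0.05):
    """In-batch InfoNCE: row i of q is paired with row i of d;
    all other rows in the batch serve as negatives."""
    logits = (q @ d.T) / temperature              # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_softmax))         # -log p(correct doc | query)

def triplet_loss(q, pos, hard_neg, margin=0.2):
    """Push each hard negative at least `margin` below its positive."""
    gap = margin - np.sum(q * pos, axis=1) + np.sum(q * hard_neg, axis=1)
    return np.maximum(0.0, gap).mean()

rng = np.random.default_rng(1)
norm = lambda x: x / np.linalg.norm(x, axis=1, keepdims=True)
q = norm(rng.normal(size=(4, 64)))                # 4 queries
d = norm(q + 0.1 * rng.normal(size=(4, 64)))      # positives, near their queries
hard = norm(d + 0.05 * rng.normal(size=(4, 64)))  # hard negatives, near positives

loss = info_nce(q, d) + triplet_loss(q, d, hard)
```

The combination matters: InfoNCE alone rarely sees negatives as confusable as the hard examples, so the triplet term supplies the fine-grained separation signal.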
3. Matryoshka Representation Learning (MRL):
A key innovation for production efficiency is the use of MRL. This technique allows a single model to produce embeddings of multiple dimensions (e.g., 768, 384, 192). The full-size embedding is used for maximum accuracy in offline indexing and candidate generation. For the ultra-low-latency online retrieval phase, a smaller, truncated version of the same embedding can be used, drastically reducing memory footprint and speeding up similarity searches without maintaining separate models.
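The operational trick behind MRL is that the first k coordinates of the full embedding are themselves trained to be a usable embedding, so serving can simply slice and re-normalize. A minimal sketch (the dimensions 768 and 192 follow the example sizes above; the vectors here are random placeholders, not MRL-trained):

```python
import numpy as np

def truncate_embedding(vec, dim):
    """With MRL, the first `dim` coordinates of a full embedding form a
    valid lower-dimensional embedding; re-normalize after truncating so
    cosine similarity remains meaningful."""
    small = vec[..., :dim]
    return small / np.linalg.norm(small, axis=-1, keepdims=True)

rng = np.random.default_rng(2)
full = rng.normal(size=(5, 768))
full /= np.linalg.norm(full, axis=1, keepdims=True)

offline_index = full                           # 768-d: max-accuracy offline use
online_index = truncate_embedding(full, 192)   # 192-d: low-latency serving

q = truncate_embedding(full[0], 192)           # query against the small index
scores = online_index @ q
```

One trained artifact thus serves both regimes: the same stored vectors back the high-accuracy offline index and, after slicing, the memory-light online index.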
4. Multilingual & Multi-Vertical Design:
The system is built from the ground up to handle multiple languages across Uber Eats' international markets. The Qwen2 base provides strong multilingual capabilities, which are then specialized by the fine-tuning data. The "unified" aspect means a single retrieval model understands that a search for "milk" could be relevant in the "grocery items" vertical, while "chocolate milk" might be a "dish" in a restaurant vertical, and both could appear from stores in the "convenience" category.
Evaluation and Results
The paper reports substantial recall gains over a strong baseline across six different markets and the three core verticals (stores, dishes, items). Recall—the ability to retrieve all relevant items—is paramount in the first stage of a search or recommendation system, as items not retrieved here can never be presented to the user later. Improving recall at this stage directly expands the pool of high-quality candidates for downstream ranking models.
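The recall metric behind these gains is simple to state precisely. A small self-contained example (the document IDs are invented for illustration):

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant items that appear in the top-k retrieved list."""
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

# Toy evaluation for one query with 3 relevant documents
retrieved = ["store_7", "dish_2", "item_9", "dish_4", "store_1"]
relevant = ["dish_2", "store_1", "item_33"]

r5 = recall_at_k(retrieved, relevant, 5)   # 2 of 3 relevant items in top-5
r2 = recall_at_k(retrieved, relevant, 2)   # 1 of 3 in top-2
```

Note that `item_33` is unrecoverable downstream once it misses the top-k cut, which is exactly why first-stage recall is the metric that matters here.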
Business Impact: Beyond Food Delivery
While implemented for Uber Eats, the architectural principles and lessons are directly transferable to any complex retail or luxury e-commerce environment facing similar challenges:
Unified Discovery Across Categories: A luxury conglomerate's platform might need to seamlessly search across fashion (apparel, handbags), beauty (skincare, makeup), jewelry, and home goods. A unified semantic model can understand that a search for "gold" could refer to jewelry, a clothing trim, or a home decor finish, retrieving relevant results from all appropriate categories.
Multilingual Search and Discovery: Global brands require search that works intuitively in Paris, Tokyo, and Dubai. A model trained on diverse, anonymized interaction data can capture the varied ways customers describe products in different languages and cultural contexts (e.g., "little black dress" vs. "petite robe noire").
Handling Ambiguous and Lifestyle Queries: Customers often search with intent or lifestyle phrases rather than precise product names (e.g., "outfit for a summer wedding," "gift for a new mother," "capsule wardrobe essentials"). A semantic system trained on real interactions learns to map these queries to relevant products, stores, or curated content.
Implementation Approach & Practical Lessons
The Uber Eats team shares key insights for practitioners:
- Data is Paramount: The scale and quality of the fine-tuning data (hundreds of millions of interactions) are non-negotiable for performance. Curating a similar dataset—from clickstreams, purchases, and search logs—would be the first major hurdle for a retail company.
- Hard Negatives are Critical: Investing in infrastructure to mine hard negative examples during training is a high-return effort for improving model discrimination.
- MRL for Operational Flexibility: Adopting Matryoshka Representation Learning is a strategic choice that simplifies model deployment and maintenance by allowing a trade-off between accuracy and latency from a single trained artifact.
- End-to-End System View: The paper emphasizes the entire pipeline—data curation, model architecture, large-scale training, and evaluation. Success depends on engineering excellence across all components, not just the core AI model.
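On the hard-negatives point above, a common mining pattern (sketched here with brute-force search; a production system would use an approximate nearest-neighbor index, and this is an assumed approach rather than the paper's exact pipeline) is to rank all documents with the current model and keep the top-scoring ones that are not labeled positive:

```python
import numpy as np

def mine_hard_negatives(query_vecs, doc_vecs, positives, n_neg=2):
    """For each query, take its highest-scoring documents (under the current
    model) that are NOT labeled positive: semantically close, but wrong."""
    scores = query_vecs @ doc_vecs.T                 # (Q, D) similarity matrix
    hard = []
    for qi, pos_ids in enumerate(positives):
        ranked = np.argsort(-scores[qi])             # doc indices, best first
        negs = [int(d) for d in ranked if int(d) not in pos_ids][:n_neg]
        hard.append(negs)
    return hard

rng = np.random.default_rng(3)
norm = lambda x: x / np.linalg.norm(x, axis=1, keepdims=True)
docs = norm(rng.normal(size=(10, 32)))
queries = norm(docs[[0, 3]] + 0.05 * rng.normal(size=(2, 32)))
positives = [{0}, {3}]                               # ground-truth doc per query

hard = mine_hard_negatives(queries, docs, positives)
```

Mining is typically iterated: as the model improves, it is re-run to surface fresh negatives the updated model still confuses.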
Governance & Risk Assessment
- Data Privacy & Anonymization: The paper notes the use of "aggregated and anonymized" data for fine-tuning. Any retail implementation must have robust governance to ensure customer interaction data is used ethically and in compliance with regulations like GDPR.
- Bias and Fairness: A model trained on historical interactions can inherit and amplify biases (e.g., favoring certain brands or product types). Continuous monitoring and evaluation across different user segments and markets are essential.
- Model Maturity: The architecture described is production-proven at massive scale. The core components (two-tower encoders, contrastive learning, MRL) are established techniques in industrial ML. The main challenge for other organizations is not the novelty of the tech, but the engineering effort required to collect the data and build the robust training and serving infrastructure.
- Computational Cost: Fine-tuning large foundation models on hundreds of millions of examples requires significant GPU resources. The inference cost, mitigated by MRL, must be calculated against the expected uplift in conversion and engagement.