New Research Automates Domain-Specific Query Expansion with Multi-LLM Ensembles

Researchers propose a fully automated framework for query expansion that constructs in-domain exemplars and refines outputs from multiple LLMs. This eliminates manual prompt engineering and improves retrieval performance across domains.

Ala SMITH & AI Research Desk · Mar 16, 2026 · 5 min read · AI-Generated

Source: arxiv.org via arxiv_ir (corroborated)

Automating Domain-Specific Query Expansion with Multi-LLM Ensembles

What Happened

A research team has published a paper on arXiv proposing a fully automated framework for query expansion (QE). It targets two limitations of current LLM-based approaches: dependence on hand-crafted prompts and manually chosen examples, and sensitivity to domain shift. The system, detailed in "Automatic In-Domain Exemplar Construction and LLM-Based Refinement of Multi-LLM Expansions for Query Expansion," is designed to make LLM-based QE scalable and practical for real-world applications without requiring labeled training data.

The core innovation lies in automating what has traditionally been a labor-intensive, expert-driven process. Query expansion—the technique of reformulating or augmenting a user's search query to improve retrieval results—has seen renewed interest with the advent of LLMs. However, existing methods often rely on carefully engineered prompts and manually selected "few-shot" examples that demonstrate the desired expansion behavior. This makes them brittle when applied to new domains (like moving from general web search to specialized retail product search) and difficult to scale.

Technical Details

The proposed framework operates in two main phases:

[Figure, panel (a): NDCG@10 on TREC DL20]

1. Automatic In-Domain Exemplar Construction

Instead of requiring a human to curate examples of good query expansions for a new domain, the system builds its own pool of demonstrations automatically:

Harvesting Pseudo-Relevant Passages: For a target domain (e.g., a corpus of product descriptions), the system uses an initial retrieval pipeline combining BM25 (a classic keyword-based ranking function) and MonoT5 (a neural re-ranker) to find passages that are likely relevant to a set of seed queries.
Cluster-Based Exemplar Selection: From this harvested pool, a training-free clustering strategy selects a diverse set of (query, relevant passage) pairs to serve as demonstrations. This diversity ensures the LLM sees varied examples of successful expansions, making the resulting in-context learning more robust and stable.
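The harvesting step can be illustrated with a minimal sketch. It assumes a tiny in-memory corpus and a whitespace tokenizer, scores passages with a standard Okapi BM25 formula, and leaves out the MonoT5 re-ranking stage entirely. The function names, parameter defaults, and sample product descriptions are hypothetical and do not come from the paper.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (a simplification for this sketch)."""
    return text.lower().split()


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document in `docs` against `query` with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in set(tokenize(query)):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def harvest_pseudo_relevant(seed_queries, docs, top_k=1):
    """Pair each seed query with its top_k BM25 passages (pseudo-relevant)."""
    pairs = []
    for query in seed_queries:
        scores = bm25_scores(query, docs)
        ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
        for i in ranked[:top_k]:
            if scores[i] > 0:  # skip passages with no term overlap
                pairs.append((query, docs[i]))
    return pairs


# Hypothetical mini-corpus of product descriptions.
corpus = [
    "black saffiano leather tote bag with gold hardware",
    "silk evening gown with open back",
    "automatic gmt watch with steel bracelet",
]
pairs = harvest_pseudo_relevant(["leather bag", "gmt watch"], corpus)
```

In a fuller pipeline, the BM25 candidates would be re-scored by a neural re-ranker before being accepted as pseudo-relevant; that step is omitted here to keep the sketch dependency-free.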

The result is a domain-adaptive QE system that requires zero human supervision to set up for a new corpus.
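The article does not say which clustering algorithm or text representation the paper uses for exemplar selection. As a stand-in, the sketch below uses greedy k-center selection (farthest-first traversal) with Jaccard distance over token sets, which is one simple training-free way to pick a diverse subset. All names and the sample pool are hypothetical.

```python
def jaccard_distance(a, b):
    """1 - |intersection| / |union| over lowercase token sets."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return 1.0 - len(sa & sb) / len(union) if union else 0.0


def select_diverse_exemplars(pairs, k):
    """Pick up to k (query, passage) pairs by farthest-first traversal.

    Each chosen pair acts as a cluster center; the next center is the pair
    farthest from all centers chosen so far.
    """
    if not pairs or k <= 0:
        return []
    texts = [q + " " + p for q, p in pairs]
    centers = [0]  # deterministic start: the first harvested pair
    # Distance from every pair to its nearest chosen center.
    nearest = [jaccard_distance(texts[0], t) for t in texts]
    while len(centers) < min(k, len(pairs)):
        nxt = max(range(len(pairs)), key=lambda i: nearest[i])
        if nearest[nxt] == 0.0:
            break  # every remaining pair duplicates an existing center
        centers.append(nxt)
        for i, t in enumerate(texts):
            nearest[i] = min(nearest[i], jaccard_distance(texts[nxt], t))
    return [pairs[i] for i in centers]


# Hypothetical harvested pool: two near-duplicate bag pairs and one watch pair.
pool = [
    ("leather bag", "black saffiano leather tote bag"),
    ("leather tote", "brown leather tote bag"),
    ("gmt watch", "automatic gmt watch with steel bracelet"),
]
exemplars = select_diverse_exemplars(pool, k=2)
```

With k=2, the selection skips the second, near-duplicate bag pair and picks the watch pair instead, which is the kind of diversity the exemplar pool is meant to provide.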

2. Multi-LLM Ensemble with Refinement

Recognizing that different LLMs have complementary strengths, the researchers employ an ensemble approach:

Dual-LLM Expansion: Two heterogeneous LLMs (e.g., one larger, more creative model and one smaller, more precise model) independently generate expansions for the same input query, using the automatically constructed exemplars for in-context learning.
LLM-Based Refinement: A third "refinement" LLM takes these two candidate expansions and consolidates them into a single, coherent expansion. This step synthesizes the strengths of each model's output, potentially correcting errors and combining insights.
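The two-stage flow described above can be sketched as follows. The LLMs are modeled as plain callables from prompt to text, so any client could be plugged in. The prompt wording, the function names, and the choice to append the refined expansion to the original query are all assumptions made for illustration, not the paper's actual prompts or interface.

```python
from typing import Callable, List, Tuple

LLM = Callable[[str], str]  # hypothetical: maps a prompt string to generated text


def build_expansion_prompt(query: str, exemplars: List[Tuple[str, str]]) -> str:
    """Few-shot prompt: each exemplar is a (query, relevant passage) pair."""
    shots = "\n\n".join(f"Query: {q}\nPassage: {p}" for q, p in exemplars)
    return f"{shots}\n\nQuery: {query}\nPassage:"


def build_refinement_prompt(query: str, candidates: List[str]) -> str:
    """Ask the refiner to merge the candidate expansions into one."""
    listed = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(candidates))
    return (
        f"Original query: {query}\n"
        f"Candidate expansions:\n{listed}\n"
        "Merge these into one coherent expansion:"
    )


def expand_query(
    query: str,
    exemplars: List[Tuple[str, str]],
    expanders: List[LLM],
    refiner: LLM,
) -> str:
    """Generate one expansion per expander, then consolidate with the refiner."""
    prompt = build_expansion_prompt(query, exemplars)
    candidates = [llm(prompt) for llm in expanders]
    refined = refiner(build_refinement_prompt(query, candidates))
    # Appending the expansion to the original query is an assumption here.
    return f"{query} {refined}"


# Stub models standing in for real LLM endpoints.
def stub_a(prompt: str) -> str:
    return "structured calfskin handbag"


def stub_b(prompt: str) -> str:
    return "saffiano tote"


def stub_refiner(prompt: str) -> str:
    return "saffiano calfskin tote handbag"


expanded = expand_query(
    "leather bag",
    [("gmt watch", "automatic gmt watch with steel bracelet")],
    [stub_a, stub_b],
    stub_refiner,
)
```

Passing the models in as callables keeps the orchestration testable without network access and makes it easy to swap model pairings when comparing ensembles.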

Experimental Results & Significance

The framework was evaluated across three diverse testbeds: TREC Deep Learning 2020 (general web search), DBPedia (structured knowledge base), and SciFact (scientific claim verification). The refined ensemble delivered consistent and statistically significant gains over strong baselines, including:

Traditional lexical methods (BM25)
Classical QE (Rocchio)
LLM-based zero-shot expansion
LLM-based fixed few-shot expansion (using static, non-domain-specific examples)
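For readers unfamiliar with the classical baseline, Rocchio-style pseudo-relevance feedback moves the query vector toward the centroid of the top-ranked documents: q' = alpha * q + beta * mean(feedback documents). The sketch below is a simplified version that uses raw term counts rather than TF-IDF weights and drops the negative-feedback term, since no documents are known to be non-relevant. The parameter defaults and sample texts are illustrative assumptions, not the configuration used in the paper.

```python
from collections import Counter


def rocchio_expand(query, feedback_docs, alpha=1.0, beta=0.75, n_terms=3):
    """Return the query terms plus the n_terms highest-weighted new terms."""
    q_vec = Counter(query.lower().split())
    centroid = Counter()
    for doc in feedback_docs:
        centroid.update(doc.lower().split())
    # q' = alpha * q + beta * (sum of feedback term counts / number of docs)
    weights = {t: alpha * c for t, c in q_vec.items()}
    for term, count in centroid.items():
        weights[term] = weights.get(term, 0.0) + beta * count / len(feedback_docs)
    new_terms = [t for t in weights if t not in q_vec]
    # Sort by weight (descending), then alphabetically for determinism.
    new_terms.sort(key=lambda t: (-weights[t], t))
    return query.lower().split() + new_terms[:n_terms]


expanded_terms = rocchio_expand(
    "leather bag",
    ["leather tote bag", "saffiano leather tote"],
    n_terms=2,
)
```

Unlike the LLM-based methods, this baseline can only add terms that already occur in the retrieved documents.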

Figure 1: Overview of our automated pipeline for constructing domain-adaptive few-shot candidate pools, selecting cluste…

The paper positions the work as both a reproducible testbed for studying exemplar selection and multi-LLM generation, and a practical, label-free solution for real-world QE deployment. By removing the need for manual prompt engineering and curated datasets, it lowers the barrier to implementing sophisticated, domain-aware search enhancement.

Retail & Luxury Implications

While the research is not applied directly to retail, the technical approach has clear implications for improving search and discovery in luxury and retail contexts. The core challenge the paper addresses—how to adapt powerful LLM capabilities to a specific domain without manual effort—is precisely the challenge faced by brands with specialized vocabularies, product attributes, and customer intent patterns.

Figure 2: Illustration of the two prompts used in our framework: (i) the expansion-generation prompt containing systemi…

Potential Application Pathways:

E-commerce Search Enhancement: A luxury retailer could apply this framework to its product catalog and historical search logs. The system would automatically learn what constitutes a "good" query expansion for finding "evening gowns," "saffiano leather bags," or "GMT watches" within its specific inventory, improving recall for nuanced customer queries.
Internal Knowledge Retrieval: For large retail groups, the method could automate the setup of semantic search over internal documents (style guides, material specifications, brand heritage archives) by self-constructing relevant examples from the corpus.
Cross-Lingual and Conceptual Search: The automated exemplar construction could help bridge vocabulary gaps between customer language ("office to evening bag") and official product taxonomy ("medium crossbody with chain strap").

The ensemble refinement step is particularly interesting for luxury, where brand voice and attribute accuracy are paramount. One could configure the ensemble to balance a creative LLM (to understand descriptive, aspirational language) with a factual LLM (to precisely map to product SKUs and specs), with a refinement step ensuring the final expansion is both effective and brand-appropriate.

The framework's "label-free" nature is a major practical advantage. Luxury brands often lack large, labeled datasets of query reformulations, making supervised approaches infeasible. This method only requires the raw corpus (product descriptions, past queries, web copy) to bootstrap itself.

Implementation Considerations

For a technical team evaluating this approach:

Infrastructure: Requires access to at least two, preferably three, LLM endpoints (could be different sizes of the same model family or different models entirely). The BM25-MonoT5 pipeline for exemplar harvesting is standard IR infrastructure.
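Latency: Exemplar construction runs once per corpus and can be done offline, but dual-LLM expansion and refinement add latency to every query. The two expansions are independent and can run in parallel, while refinement has to wait for both to finish. For real-time search, pre-computing expansions for common queries or using a distilled single-model version may be necessary.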
Cost: Multiple LLM calls per query increase inference cost. The value proposition hinges on the business impact of improved conversion from better search.
Control: The refinement LLM's prompt would need careful tuning in a luxury context to ensure expansions align with brand terminology and avoid hallucination of non-existent attributes or styles.
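The pre-computation idea from the latency point can be sketched as a simple offline cache: frequent queries are expanded ahead of time, and the online path falls back to the raw query on a cache miss. The function names and the fallback policy are assumptions made for illustration; a real deployment might instead expand misses asynchronously.

```python
from typing import Callable, Dict, Iterable


def normalize(query: str) -> str:
    """Canonical cache key: lowercase with collapsed whitespace."""
    return " ".join(query.lower().split())


def precompute_expansions(
    queries: Iterable[str], expand: Callable[[str], str]
) -> Dict[str, str]:
    """Run the (slow) expansion pipeline offline for frequent queries."""
    return {normalize(q): expand(q) for q in queries}


def lookup_or_fallback(query: str, cache: Dict[str, str]) -> str:
    """Serve a precomputed expansion if one exists, else the raw query."""
    return cache.get(normalize(query), query)


# Stub standing in for the multi-LLM expansion pipeline.
def slow_expand(query: str) -> str:
    return query + " calfskin tote"


cache = precompute_expansions(["Leather  Bag"], slow_expand)
hit = lookup_or_fallback("leather bag", cache)
miss = lookup_or_fallback("evening gown", cache)
```

This keeps all LLM calls off the request path, at the cost of serving unexpanded queries for anything outside the precomputed set.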

This research provides a compelling blueprint for moving beyond generic, one-size-fits-all LLM prompts toward automated, domain-optimized search intelligence—a relevant direction for any retailer seeking to make their digital discovery experience as nuanced as their product offering.

Source: gentic.news · Mar 16, 2026 · Ala SMITH

AI-assisted reporting. Generated by gentic.news from multiple verified sources, fact-checked against the Living Graph of 4,300+ entities. Edited by Ala SMITH.


AI Analysis

For AI practitioners in retail and luxury, this research is highly relevant to a persistent, high-value problem: bridging the semantic gap in search. Customer queries are often short, ambiguous, or expressed in natural language, while product catalogs are tagged with specific, sometimes internal, attributes. Traditional keyword matching fails, and manually tuning search rules or synonyms does not scale.

The paper's core contribution, automating the adaptation of LLMs to a specific domain, is the key unlock. A technical team at a luxury house could implement this framework using their product catalog, style guides, and historical search data as the target corpus. The system would autonomously learn how to expand a query like "red bag" to include specific shades (burgundy, scarlet, vermilion) and materials (calfskin, crocodile) present in their inventory, or understand that "summer dress" should prioritize linen, cotton, and specific silhouettes relevant to the season's collection.

The maturity of this approach is at the late research / early prototype stage. The paper demonstrates strong results on academic benchmarks, but a production deployment would require significant engineering for robustness, latency, and cost management. The most immediate application might be in offline search indexing or powering a "semantic search" backend for internal tools (e.g., design team searching for inspiration in past collections). The multi-LLM ensemble adds complexity, but the promise is a more reliable and context-aware expansion than a single model could provide, which is critical for preserving brand integrity in search results.
