What Happened
A new technical paper, "From BM25 to Corrective RAG: Benchmarking Retrieval Strategies for Text-and-Table Documents," was posted to the arXiv preprint server. The research addresses a critical gap: while Retrieval-Augmented Generation (RAG) systems are ubiquitous, there has been no systematic comparison of modern retrieval methods for the complex, heterogeneous documents common in business—those containing both free-form text and structured tabular data.
The authors constructed a challenging financial Question-Answering (QA) benchmark comprising 23,088 queries over 7,318 documents with mixed content. They then benchmarked ten distinct retrieval strategies, spanning the full modern arsenal:
- Sparse Retrieval: The classic lexical search algorithm, BM25.
- Dense Retrieval: Semantic search using state-of-the-art embedding models.
- Hybrid Fusion: Combining scores from sparse and dense retrievers.
- Cross-Encoder Reranking: Using a more computationally expensive neural model to reorder an initial set of candidate documents.
- Query Expansion: Techniques like HyDE (Hypothetical Document Embeddings) and multi-query generation.
- Adaptive & Contextual Retrieval: Methods that adjust the retrieval based on context or previous steps.
Performance was evaluated on both retrieval quality (Recall@k, MRR, nDCG) and end-to-end generation accuracy (via a "Number Match" metric suitable for financial data), with statistical significance testing.
Technical Details & Key Findings
The results deliver several actionable and somewhat counter-intuitive insights for AI practitioners:
The Two-Stage Pipeline is King: The most effective strategy was a two-stage pipeline that first uses a hybrid retrieval method (fusing BM25 and dense embedding scores) to fetch a broad set of candidates (e.g., top 100), then applies a neural cross-encoder reranker to select the final top passages. This pipeline achieved a Recall@5 of 0.816 and an MRR@3 of 0.605, significantly outperforming any single-stage method.
BM25 Challenges Semantic Dominance: In a major finding, the simple, decades-old BM25 algorithm consistently outperformed state-of-the-art dense retrieval models on this financial document corpus. This directly challenges the common assumption that semantic (vector) search universally dominates keyword-based search. The authors attribute this to the precise, often numeric nature of financial queries, where exact term matching remains highly effective.
Not All Advanced Techniques Pay Off: For this domain of precise numerical QA, query expansion methods (HyDE, multi-query) and adaptive retrieval provided limited to no benefit. However, contextual retrieval—where the system uses information from initially retrieved documents to refine a follow-up search—yielded consistent gains.
Cost-Accuracy Trade-offs are Explicit: The paper provides practical guidance. If maximum accuracy is critical, invest in the two-stage hybrid+reranking pipeline. If latency and cost are primary constraints, a well-tuned BM25 system might be the most efficient choice, outperforming more expensive dense models.
Retail & Luxury Implications
The benchmark uses financial documents, but its conclusions are directly transferable to core retail and luxury AI use cases that rely on RAG over complex, mixed-format data. The key insight is that the optimal retrieval architecture is not a one-size-fits-all semantic search but is dependent on your data and query profile.

Potential Applications & Architectural Guidance:
Product Information & Customer Service Chatbots: Knowledge bases contain product descriptions (text), technical specifications (tables), pricing histories, and inventory logs. A customer asking "What were the price changes for the Lady Dior bag in Q4 2025?" is making a precise, quasi-tabular query. This research suggests a hybrid BM25/dense first stage would likely retrieve the correct pricing table, which a reranker could then confirm.
Internal Enterprise Search: Merchandising plans, global sales reports, and supply chain documents are quintessential text-and-table documents. An analyst searching for "SKU 78945 sell-through in Paris in December" needs pinpoint accuracy. Relying solely on a semantic embedding might miss the specific SKU number; BM25 would catch it. The recommended two-stage pipeline is ideal for such internal intelligence systems.
Sustainability & Compliance Reporting: Generating reports from ESG data, material sourcing ledgers, and audit trails involves extracting precise figures from structured tables referenced in narrative text. The benchmark's finding that query expansion adds little value for numerical queries is crucial here—it prevents teams from over-engineering their RAG pipelines with ineffective complexity.
Implementation Consideration: For luxury houses, where product catalogs are smaller but richer in detail (heritage, craftsmanship notes, material provenance), dense semantic retrieval will still be vital for understanding conceptual queries like "bags suitable for gifting a diplomat." The lesson is not to abandon embeddings, but to default to a hybrid approach where BM25 ensures precision on key entities (product codes, names, numbers) and embeddings capture broader semantic intent. The reranking stage acts as a high-confidence arbiter.









