arXiv paper 2605.11254 introduces MIRA, a benchmark for multi-category retrieval across four scholarly data types. Built on real user queries from a social science search platform, it tests category-aware ranking of Publications, Research Data, Variables, and Instruments & Tools.
Key facts
- Benchmark covers 4 categories: Publications, Research Data, Variables, Instruments & Tools
- Built on real user queries from a social science search platform
- LLM generates topic descriptions and relevance assessments
- Submitted to arXiv on May 11, 2026
- No baseline model scores or dataset size disclosed
Most information retrieval benchmarks evaluate a single data type — web pages, academic papers, or product listings. MIRA (Multi-category Integrated Retrieval Assessment) breaks that pattern. The benchmark, described in a paper posted to arXiv on May 11, 2026, covers four distinct categories from a large-scale social science search platform: Publications, Research Data, Variables, and Instruments & Tools.
What MIRA tests
The benchmark uses real user queries rather than synthetic ones, a design choice that increases ecological validity [According to MIRA]. Systems must rank items from all four categories in a single unified list rather than producing separate per-category rankings. This mirrors the real-world expectation that a search for "income inequality" should return a mix of papers, datasets, variable definitions, and measurement tools.
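To make the unified-list setup concrete, here is a minimal sketch of scoring one mixed-category ranking with nDCG. The item IDs, categories, and relevance grades are invented for illustration and are not taken from the paper.

```python
import math

# Illustrative: a single unified ranking mixing all four MIRA categories.
# Item IDs, category labels, and relevance grades are made up for this sketch.
ranked = [
    ("pub_102", "Publication", 3),
    ("data_44", "Research Data", 2),
    ("var_9",   "Variable", 0),
    ("tool_3",  "Instruments & Tools", 1),
    ("pub_77",  "Publication", 2),
]

def dcg(grades):
    # rank is 0-based here, so the standard log2(rank + 1) becomes log2(rank + 2)
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(grades))

def ndcg(grades):
    ideal = dcg(sorted(grades, reverse=True))
    return dcg(grades) / ideal if ideal > 0 else 0.0

grades = [rel for _, _, rel in ranked]
print(f"unified-list nDCG@{len(grades)}: {ndcg(grades):.3f}")
```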
LLM-assisted construction
MIRA uses a Large Language Model to generate topic descriptions and narratives for each query, then performs relevance assessment relative to those topics. The authors report that this substantially reduces the labor and cost of building a test collection compared to traditional human-judged pooling. The paper does not specify which LLM was used, nor does it report inter-annotator agreement against human judges, an obvious limitation.
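Because the model and prompt are not disclosed, the following is only a generic sketch of what LLM-assisted relevance judging tends to look like; `call_llm` is a hypothetical stand-in for whichever chat-completion client is used, and the 0-3 grading scale is an assumption.

```python
# Generic sketch of LLM-assisted relevance judging; not the paper's prompt.
# call_llm() is a hypothetical placeholder for an actual LLM client.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def judge_relevance(topic_description: str, item_text: str) -> int:
    """Ask the LLM for a graded relevance label (0-3) of an item to a topic."""
    prompt = (
        "You are assessing search relevance for a social science portal.\n"
        f"Topic description:\n{topic_description}\n\n"
        f"Candidate item:\n{item_text}\n\n"
        "Answer with a single digit: 0 (not relevant), 1 (marginal), "
        "2 (relevant), or 3 (highly relevant)."
    )
    answer = call_llm(prompt).strip()
    # Fall back to 0 if the model does not answer with a leading digit.
    return int(answer[0]) if answer and answer[0].isdigit() else 0
```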
Relation to Retrieval-Augmented Generation
The benchmark arrives as RAG systems increasingly need to pull from heterogeneous sources. A May 1, 2026 study showed multi-step iterative retrieval achieves 15–20% accuracy gains on HotpotQA [per prior reporting]. MIRA provides a dedicated evaluation suite for the multi-source retrieval that RAG pipelines depend on. Current RAG benchmarks like KILT or BEIR don't test cross-category ranking.
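As a rough illustration of the multi-source retrieval step such pipelines rely on (this is not MIRA's method, and the data layout and function names are assumptions), the sketch below does a simple min-max score normalization and merge across heterogeneous sources before any generation stage.

```python
# Illustrative only: merging candidates from heterogeneous sources into one
# pooled ranking for a RAG prompt. Source names and scores are placeholders.

def min_max(scores):
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) if hi > lo else 0.0 for s in scores]

def merge_sources(results_by_source, k=5):
    """results_by_source: {source_name: [(doc_id, text, raw_score), ...]}"""
    pooled = []
    for source, hits in results_by_source.items():
        if not hits:
            continue
        normed = min_max([score for _, _, score in hits])
        pooled += [(n, source, doc_id, text)
                   for (doc_id, text, _), n in zip(hits, normed)]
    pooled.sort(key=lambda item: item[0], reverse=True)
    return pooled[:k]
```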
What's missing
The paper does not release baseline scores for existing retrieval models on MIRA. No BM25, no dense retriever, no re-ranker results. The authors frame the work as a "foundational testbed" — the community must supply the comparisons. The dataset size and query count are also not disclosed in the abstract.
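A baseline of the kind the paper omits is straightforward to run. The sketch below shows a toy BM25 pass over a four-category corpus using the rank_bm25 package; the document texts and query are placeholders, not MIRA data.

```python
# Toy BM25 baseline over a mixed-category corpus (requires the rank_bm25 package).
# Document texts and the query are placeholders, not items from MIRA.
from rank_bm25 import BM25Okapi

corpus = [
    "Survey publication on income inequality and education",  # Publication
    "Panel dataset on household income, waves 1-10",          # Research Data
    "Variable: equivalised disposable household income",      # Variable
    "Questionnaire module measuring perceived inequality",    # Instruments & Tools
]
tokenized = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized)

query = "income inequality".lower().split()
scores = bm25.get_scores(query)

# Print the unified, category-mixed ranking by BM25 score.
for score, doc in sorted(zip(scores, corpus), reverse=True):
    print(f"{score:6.3f}  {doc}")
```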
What to watch
Watch for baseline model scores on MIRA — likely from the authors or early adopters — which will reveal how current dense retrievers and cross-encoders handle cross-category ranking. Also track whether the benchmark gets adopted in the RAG evaluation community as an alternative to BEIR.