ReMMD-Agent scored 41.80% accuracy on the new ReMMDBench, a 500-sample multilingual misinformation benchmark with 2,756 images across five languages. The agentic system, using GPT-5.2, cut verification cost by 79.9% relative to the prior T2-Agent baseline.
Key facts
- ReMMDBench: 500 samples, 2,756 images, 5 languages.
- ReMMD-Agent accuracy: 41.80% using GPT-5.2.
- Cost reduction: 79.9% vs T2-Agent.
- Five-way veracity labels including 'manipulated'.
- Eight distortion labels for fine-grained analysis.
Most multimodal misinformation benchmarks test single-image, short-caption, binary-label scenarios that don't reflect how misinformation actually spreads online. According to the ReMMD paper, viral posts now combine long multilingual narratives, several images, mixed provenance, and subtle text–image framing errors. Existing methods remain poorly matched to this setting.
ReMMD addresses this gap with two components. The first, ReMMDBench, contains 500 real-world samples and 2,756 images, or about 5.5 images per sample on average. It covers five languages in monolingual settings plus two cross-lingual settings, three text-length tiers, and multi-image posts. Annotations include five-way veracity labels, eight distortion labels, evidence provenance, and rationales. The benchmark is designed to be refreshed yearly to reduce contamination.
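To make the annotation layers concrete, here is a minimal sketch of what one benchmark record could look like. The class name, field names, and file names are hypothetical and are not taken from the ReMMDBench release. Only the five veracity labels come from the article. The eight distortion labels and the five languages are not named here, so they are left as free-form strings.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# The five veracity labels named in the article.
VERACITY_LABELS = {"true", "false", "misleading", "unverifiable", "manipulated"}


@dataclass
class BenchSample:
    """Hypothetical record layout for one multimodal post (illustrative only)."""

    sample_id: str
    language: str                  # language code; the actual five languages are not listed here
    text: str                      # post text, which may be long
    image_paths: List[str]         # a post can carry several images
    veracity: str                  # one of VERACITY_LABELS
    distortions: List[str] = field(default_factory=list)       # fine-grained distortion tags
    evidence_sources: List[str] = field(default_factory=list)  # evidence provenance
    rationale: Optional[str] = None                            # annotator explanation

    def __post_init__(self) -> None:
        # Reject labels outside the five-way veracity scheme.
        if self.veracity not in VERACITY_LABELS:
            raise ValueError(f"unknown veracity label: {self.veracity!r}")


sample = BenchSample(
    sample_id="demo-1",
    language="en",
    text="A long narrative accompanying several photos.",
    image_paths=["photo_a.jpg", "photo_b.jpg"],
    veracity="misleading",
)
```

Validating the label at construction time keeps malformed annotations from silently reaching the evaluation step.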
ReMMD-Agent is a persistent-memory verifier that decomposes posts into atomic claims, builds a reusable evidence set, and predicts structured L1/L2/L3 outputs. Across proprietary systems, open LVLMs, MMD-Agent, and T2-Agent, ReMMD-Agent obtains the best five-way veracity performance, with 41.80% accuracy and 39.12% macro-F1 using GPT-5.2, while reducing cost by 17.5% relative to MMD-Agent and 79.9% relative to T2-Agent.
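The idea of decomposing posts into atomic claims and reusing evidence can be pictured as a cache keyed by claim. The sketch below is an illustration under stated assumptions, not the authors' implementation. `EvidenceMemory`, `split_into_claims`, and `fake_search` are hypothetical names. The sentence splitter is a naive stand-in for whatever claim extraction the real system uses, and whether ReMMD-Agent shares memory across posts or only across steps within one post is not stated here.

```python
from typing import Callable, Dict, List


class EvidenceMemory:
    """Hypothetical evidence cache keyed by normalized claim text."""

    def __init__(self, search: Callable[[str], List[str]]) -> None:
        self._search = search
        self._store: Dict[str, List[str]] = {}
        self.search_calls = 0  # counts how often the expensive search actually runs

    @staticmethod
    def _key(claim: str) -> str:
        # Normalize case and whitespace so trivially different phrasings share an entry.
        return " ".join(claim.lower().split())

    def lookup(self, claim: str) -> List[str]:
        key = self._key(claim)
        if key not in self._store:
            # Cache miss: pay for one search and remember the result.
            self.search_calls += 1
            self._store[key] = self._search(claim)
        return self._store[key]


def split_into_claims(post: str) -> List[str]:
    # Naive stand-in for claim decomposition: one claim per sentence.
    return [s.strip() for s in post.split(".") if s.strip()]


def fake_search(claim: str) -> List[str]:
    # Placeholder for a real web or image search backend.
    return [f"evidence for: {claim}"]


memory = EvidenceMemory(fake_search)
posts = [
    "The bridge collapsed in 2021. The photo shows the bridge.",
    "The photo shows the bridge. Officials denied it.",
]
for post in posts:
    for claim in split_into_claims(post):
        memory.lookup(claim)

print(memory.search_calls)  # 3 searches for 4 claims, since one claim repeats
```

In this toy run the repeated claim is served from memory, which is the kind of saving a reusable evidence set is meant to provide.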
41.8% accuracy sounds low, but the benchmark's five-way veracity labels (true, false, misleading, unverifiable, manipulated) and cross-lingual complexity make it significantly harder than binary classification tasks. Prior benchmarks like Fakeddit or Twitter-2016 pose a simpler task, typically pairing one short text with a single image. The cost reduction is the real signal: agentic verification under realistic evidence search has been prohibitively expensive, and ReMMD-Agent's persistent memory approach directly attacks that bottleneck.
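For context on the two reported metrics, the sketch below computes accuracy and macro-F1 for a five-way task on toy data. With five equally frequent labels, uniform guessing would land near 20% accuracy, though the benchmark's actual label distribution is not given here. The toy predictions are invented for illustration, and scoring a class with no true positives as zero is an assumption; the paper's exact averaging convention may differ.

```python
from typing import List, Sequence

LABELS: List[str] = ["true", "false", "misleading", "unverifiable", "manipulated"]


def accuracy(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Fraction of predictions that match the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold: Sequence[str], pred: Sequence[str], labels: Sequence[str] = LABELS) -> float:
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # Assumption: a class that is never correct or never seen contributes 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Invented toy data, not benchmark results.
gold = ["true", "false", "misleading", "unverifiable", "manipulated", "false"]
pred = ["true", "false", "false", "unverifiable", "true", "false"]

print(round(accuracy(gold, pred), 3))  # 0.667
print(round(macro_f1(gold, pred), 3))  # 0.493
```

The gap between the two numbers in the toy run comes from the two classes the predictor never gets right. Macro-F1 penalizes that kind of miss more than accuracy does.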
ReMMD-Agent uses GPT-5.2 (developed by OpenAI) as its backbone. The project is open-source, hosted on GitHub, and builds on prior agentic verification frameworks like MMD-Agent and T2-Agent. The paper includes ablation studies on memory persistence, evidence reuse, and structured output formats.
Limitations and open questions
The benchmark's 500 samples, while carefully curated, may not capture the full distribution of multilingual misinformation. The authors note that a yearly refresh is planned to mitigate contamination. The cost comparison assumes fixed API pricing, so enterprise deployments with volume discounts may see different savings. The system's performance on low-resource languages beyond the five tested remains unknown.
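The reported percentages can be turned into cost ratios with a little arithmetic. The sketch below assumes each reduction is measured against the named baseline on the same cost basis, which the article does not spell out. The function names and the 100-unit baseline cost are hypothetical. It also illustrates one pricing scenario: if a volume discount applied equally to both systems, the relative reduction would stay the same while the absolute savings shrank. Discounts that differ by model or token type would change the picture.

```python
def relative_cost(reduction_pct: float) -> float:
    """Cost of the cheaper system as a fraction of the baseline's cost."""
    return 1.0 - reduction_pct / 100.0


# Reductions reported in the article.
remmd_vs_t2 = relative_cost(79.9)   # ReMMD-Agent costs about 0.201x of T2-Agent
remmd_vs_mmd = relative_cost(17.5)  # ReMMD-Agent costs about 0.825x of MMD-Agent

# Equivalent view: how many times more expensive each baseline is.
t2_over_remmd = 1.0 / remmd_vs_t2    # roughly 4.98x
mmd_over_remmd = 1.0 / remmd_vs_mmd  # roughly 1.21x


def reduction_after_discount(baseline_cost: float, reduction_pct: float, discount: float) -> float:
    """Relative reduction (in percent) when the same discount applies to both systems."""
    discounted_baseline = baseline_cost * (1.0 - discount)
    discounted_new = baseline_cost * relative_cost(reduction_pct) * (1.0 - discount)
    return 100.0 * (discounted_baseline - discounted_new) / discounted_baseline


def absolute_savings(baseline_cost: float, reduction_pct: float, discount: float) -> float:
    """Absolute savings in cost units under a uniform discount."""
    return baseline_cost * (1.0 - discount) * reduction_pct / 100.0


print(round(t2_over_remmd, 2), round(mmd_over_remmd, 2))
# Hypothetical baseline of 100 cost units with a 30% uniform discount.
print(round(reduction_after_discount(100.0, 79.9, 0.3), 1))
print(round(absolute_savings(100.0, 79.9, 0.0), 2), round(absolute_savings(100.0, 79.9, 0.3), 2))
```

Under these assumptions, a 79.9% reduction means the T2-Agent baseline costs roughly five times as much per verification as ReMMD-Agent.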
What to watch
Watch for the yearly refresh of ReMMDBench in mid-2027, which will reveal whether contamination affects the 41.8% baseline. Also track adoption of persistent-memory verification in other agentic benchmarks like OSWorld or WebArena.

Source: arxiv.org