What Happened
Researchers have introduced RealChart2Code, a new large-scale benchmark designed to rigorously test the chart-to-code generation capabilities of Vision-Language Models (VLMs). Published on arXiv on March 26, 2026, the benchmark contains over 2,800 instances grounded in authentic datasets, moving beyond synthetic or simplified examples. Its key innovation is systematically evaluating a model's ability to generate code (e.g., in Python with libraries like Matplotlib or Plotly) that can replicate intricate, multi-panel visualizations from raw data, based on a clear analytical intent described in natural language.
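To make the task concrete, here is a minimal sketch of the kind of multi-panel Matplotlib code the benchmark expects a model to produce from data plus an analytical prompt; the categories and values are invented for illustration:

```python
import matplotlib
matplotlib.use("Agg")  # render headlessly, no display needed
import matplotlib.pyplot as plt

# Invented data standing in for a real dataset.
categories = {"bags": [3, 5, 4], "shoes": [2, 2, 6]}

# One subplot (facet) per category, sharing the y-axis.
fig, axes = plt.subplots(1, len(categories), sharey=True, figsize=(6, 3))
for ax, (name, values) in zip(axes, categories.items()):
    ax.plot(range(len(values)), values, marker="o")
    ax.set_title(name)
fig.suptitle("Monthly sales by category")
fig.savefig("facets.png")
```

Even this toy case requires the model to infer the facet structure, iterate over groups, and coordinate shared axes, which is where the benchmark reports models breaking down.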
Crucially, RealChart2Code is the first benchmark to assess two challenging dimensions:
- Generation from Large-Scale Raw Data: Can the model understand a dataset's structure and produce correct plotting code?
- Iterative Code Refinement in Conversation: Can the model correct its output based on multi-turn feedback, simulating a real developer's workflow?
The paper presents a comprehensive evaluation of 14 leading VLMs, including both proprietary (e.g., GPT-4V, Gemini) and open-weight models. The results are sobering: all models exhibited significant performance degradation compared to their scores on simpler, existing chart-to-code benchmarks. The research highlights specific struggles with complex plot structures (like faceted or layered charts) and the nuances of "authentic" data, which often contains missing values, inconsistent formatting, and real-world noise.
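The "authentic data" problem is easy to illustrate. A minimal sketch, with invented rows, of the normalization a model's generated code must perform before it can plot anything (a real pipeline would likely use pandas):

```python
# Hypothetical raw rows showing the kinds of issues the paper highlights:
# missing values and inconsistent number formatting.
raw = [
    {"month": "2023-01", "sales": "1,200"},
    {"month": "2023-02", "sales": "980"},
    {"month": "2023-03", "sales": None},
    {"month": "2023-04", "sales": "1,450"},
]

# Drop missing rows and strip thousands separators before charting.
clean = [
    {"month": r["month"], "sales": float(r["sales"].replace(",", ""))}
    for r in raw
    if r["sales"] is not None
]
print(len(clean))                        # 3
print(sum(r["sales"] for r in clean))    # 3630.0
```

A model that plots the raw column directly would either crash or silently chart garbage, which is one reason scores drop relative to clean synthetic benchmarks.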
The analysis uncovers a substantial performance gap between proprietary and open-weight models, with the former consistently outperforming the latter. Most critically, the study confirms that even state-of-the-art VLMs often fail to accurately replicate intricate, multi-panel charts, revealing a major limitation in their current practical utility for data visualization tasks.
Technical Details
The benchmark's strength lies in its data provenance and task design. Instead of using clean, toy datasets, RealChart2Code sources its instances from real-world domains (e.g., finance, science, public policy), preserving the complexity and messiness that data analysts face daily. Each instance includes:
- A natural language query defining the analytical goal (e.g., "Show the monthly sales trend for each product category, with a separate subplot for category and an overall trend line").
- The corresponding raw dataset.
- The target visualization (an image of the desired chart).
- The reference code that correctly generates that chart.
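Put together, an instance could be represented roughly as follows; the field names here are illustrative, not the paper's actual schema:

```python
from dataclasses import dataclass

@dataclass
class ChartInstance:
    query: str           # natural language analytical goal
    data_path: str       # raw dataset, e.g. a CSV file
    target_image: str    # rendered reference chart
    reference_code: str  # script that produces the target chart

# Hypothetical example mirroring the query above.
example = ChartInstance(
    query="Show the monthly sales trend for each product category, "
          "with a separate subplot per category.",
    data_path="sales.csv",
    target_image="target.png",
    reference_code="plot_sales.py",
)
```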
Evaluation is multi-faceted, assessing not only whether the generated code executes but also visual faithfulness—how closely the chart produced by the model's code matches the target image in data encoding, aesthetics, and layout.
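The paper's exact scoring method is not reproduced here, but one crude proxy for visual faithfulness is pixel-level agreement between the rendered and target images, sketched below with NumPy; real scoring would additionally check data encoding and layout:

```python
import numpy as np

def pixel_similarity(img_a: np.ndarray, img_b: np.ndarray) -> float:
    """Toy faithfulness proxy: 1 minus the mean absolute pixel difference.

    This only illustrates the comparison step; it cannot distinguish
    a mislabeled axis from a cosmetic color shift.
    """
    if img_a.shape != img_b.shape:
        return 0.0
    diff = np.abs(img_a.astype(float) - img_b.astype(float)) / 255.0
    return 1.0 - diff.mean()

a = np.zeros((4, 4, 3), dtype=np.uint8)        # all-black image
b = np.full((4, 4, 3), 255, dtype=np.uint8)    # all-white image
print(pixel_similarity(a, a))  # 1.0 (identical)
print(pixel_similarity(a, b))  # 0.0 (maximally different)
```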
The multi-turn conversational evaluation is particularly novel. It tests if a model can act like a helpful data science assistant: a user can provide feedback like "the legend is in the wrong place" or "this line should be dashed," and the model must understand the visual critique and adjust its code accordingly.
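A toy sketch of a single refinement turn, with a hard-coded rule standing in for the model's understanding of the critique; a real system would have the VLM regenerate code from the full conversation:

```python
# Previously generated plotting code (a string, as a VLM would emit it).
generated_code = 'ax.plot(x, y, linestyle="-", label="trend")'

def refine(code: str, feedback: str) -> str:
    # Hypothetical rule mapping a visual critique onto a code edit.
    if "dashed" in feedback:
        code = code.replace('linestyle="-"', 'linestyle="--"')
    return code

revised = refine(generated_code, "this line should be dashed")
print(revised)  # ax.plot(x, y, linestyle="--", label="trend")
```

The hard part, which this sketch elides entirely, is grounding the critique visually: the model must locate "this line" in the rendered chart before it can know which code to change.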
Retail & Luxury Implications
While the benchmark itself is domain-agnostic, its findings have direct implications for any data-driven retail or luxury enterprise exploring AI-assisted analytics and reporting.

Potential Application Areas:
- Automated Business Intelligence (BI) Reporting: Imagine a merchant or planner asking a VLM, "Create a dashboard showing weekly sell-through rates by region for our new handbag line, with YoY comparison." The model would need to access sales data, understand the required chart types (likely a multi-panel layout), and generate the correct plotting code. RealChart2Code shows this is still a frontier, not a solved problem.
- Dynamic Data Storytelling for Leadership: Generating a suite of coherent, publication-ready charts for quarterly board presentations from a single narrative prompt is a complex, multi-step task. On this evidence, current VLMs would likely produce inconsistent or incorrect visualizations.
- Rapid Prototyping for Data Teams: Data scientists could use VLMs to quickly draft visualization code, but the benchmark suggests they will need significant human oversight and refinement, especially for complex charts.
The Reality Check: This research is a crucial temperature check for AI leaders. The hype around "conversational analytics" powered by VLMs must be tempered by the understanding that generating accurate, complex visualizations from raw data is a hard problem. The performance gap between proprietary and open-weight models also informs build-vs.-buy decisions: based on these findings, a luxury house building an internal AI analytics co-pilot on open-weight VLMs would face greater technical hurdles.
The iterative refinement task is perhaps the most relevant for a production setting. An effective AI assistant wouldn't get the chart perfect on the first try but would learn from feedback. The benchmark shows this capability is in its infancy, indicating that robust, multi-turn charting agents are still a research challenge, not an off-the-shelf product.