Visual-SDPO, detailed in a June 2026 arXiv paper, raises the quality of code-generated visual artifacts by more than 10 absolute points over the zero-shot base model on ChartMimic, Design2Code, and AeSlides. The framework uses rendered visual feedback as privileged context for a weight-sharing teacher, which distills corrections into a coding LLM student.
Key facts
- Visual-SDPO improves over zero-shot base by >10 absolute points on three benchmarks.
- Outperforms GRPO by at least 2.4 points with fewer training steps.
- Spatially-targeted distillation traces defects to specific code statements.
- No added inference-time cost: the weight-shared teacher is used only during training.
- Unified backbone: Qwen3-VL-8B-Instruct for chart, UI, and slide generation.
The Problem: Code Before Sight
Code-generating LLMs increasingly produce visual artifacts — charts, web pages, slides — by writing programs executed by non-differentiable renderers. The model commits to code before seeing the render, leading to overlapping elements, clipped text, broken alignment, low contrast, and overflow. Existing reinforcement learning methods like GRPO reward executable outputs but lack spatially targeted supervision for visual defects.
Visual-SDPO: Spatially-Targeted Distillation
The paper introduces Visual-SDPO, a self-distillation policy-optimization framework that treats rendered visual feedback as privileged context for a weight-sharing teacher. The teacher sees the rendered artifact and passes defect information to the student. A key innovation is Visual-Grounded Code Credit Weighting, which traces each detected visual defect back to the specific code statements responsible for the affected elements and amplifies the distillation signal on those statements. This makes supervision spatially targeted rather than uniform across all tokens.
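The credit-weighting idea can be illustrated with a minimal sketch. The paper's actual formulation is not reproduced in this article, so everything below is an assumption made for illustration: the `Defect` record, the `element_to_lines` mapping from rendered elements to code line numbers, the additive `boost` factor, and the normalized weighted average are all hypothetical names and choices.

```python
from dataclasses import dataclass


@dataclass
class Defect:
    element_id: str  # hypothetical id of the rendered element showing the defect
    kind: str        # e.g. "clipped_text" or "overlap"


def credit_weights(token_lines, element_to_lines, defects, boost=2.0):
    """Return one weight per token.

    Tokens on code lines that drew a defective element get 1.0 + boost;
    every other token keeps the default weight of 1.0.
    """
    implicated = set()
    for defect in defects:
        implicated.update(element_to_lines.get(defect.element_id, ()))
    return [1.0 + boost if line in implicated else 1.0 for line in token_lines]


def weighted_distill_loss(per_token_divergence, weights):
    """Weighted mean of per-token teacher-student divergences."""
    total = sum(w * d for w, d in zip(weights, per_token_divergence))
    return total / sum(weights)


# Toy example: five tokens spread over three code lines.
token_lines = [1, 1, 2, 2, 3]
element_to_lines = {"title": {2}, "legend": {3}}  # assumed element-to-code map
defects = [Defect("title", "clipped_text")]

weights = credit_weights(token_lines, element_to_lines, defects)
print(weights)  # [1.0, 1.0, 3.0, 3.0, 1.0]

loss = weighted_distill_loss([0.1, 0.1, 0.5, 0.5, 0.1], weights)
print(loss)
```

In this toy setup, only the tokens on line 2 (the line that drew the clipped title) receive the amplified signal, while a uniform scheme would weight all five tokens equally.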

A sequence-level GRPO term complements the dense token-level objective by rewarding executable, visually high-quality rollouts. Failed executions remain learnable: execution errors are passed as privileged context to the teacher, which then distills the fix to the student.
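The two ingredients in this paragraph can be sketched as follows. This is not the authors' code: `group_relative_advantages` uses the standard GRPO normalization (reward minus group mean, divided by group standard deviation), while `teacher_context`, the rollout dictionary layout, and the feedback strings are hypothetical stand-ins for however the paper actually formats privileged context.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Standard GRPO-style advantages: normalize rewards within one rollout group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def teacher_context(prompt, rollout):
    """Build the extra context that only the teacher sees (assumed format).

    A failed execution contributes its error message; a successful one
    contributes the defects detected in the render.
    """
    if rollout["error"] is not None:
        feedback = "Execution error:\n" + rollout["error"]
    else:
        feedback = "Detected visual defects:\n" + "\n".join(rollout["defects"])
    return prompt + "\n\n" + feedback


# Toy group of three rollouts: one failed (reward 0.0), two rendered.
rewards = [0.0, 0.5, 1.0]
advantages = group_relative_advantages(rewards)

failed = {"error": "NameError: name 'plt' is not defined", "defects": []}
rendered = {"error": None, "defects": ["title clipped at top edge"]}

failed_ctx = teacher_context("Reproduce this chart.", failed)
rendered_ctx = teacher_context("Reproduce this chart.", rendered)
```

The failed rollout receives a negative sequence-level advantage, yet it still produces a teacher context, which is what lets the token-level distillation term learn from it.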
Benchmarks and Results
The authors instantiate Visual-SDPO with a unified Qwen3-VL-8B-Instruct backbone. Across chart-to-code (ChartMimic), UI-to-code (Design2Code), and slide-generation (AeSlides) benchmarks, Visual-SDPO improves over the zero-shot base by more than 10 absolute points on the primary metric and over GRPO by at least 2.4 points. Critically, these gains come with fewer training steps than GRPO and no added inference-time cost, because the teacher shares weights with the student and is used only during training.
Why This Matters
Most visual code generation work treats the problem as a language modeling task, ignoring the non-differentiable rendering step. Visual-SDPO bridges this gap by making the visual feedback loop explicit during training without requiring a differentiable renderer. The spatially-targeted credit weighting is a practical advance: instead of punishing all tokens equally for a defect, it isolates the responsible code lines. This mirrors how a human developer would debug — inspect the render, find the broken element, trace to the code that drew it.
The paper, titled "Self-Distillation Policy Optimization via Visual Feedback: Bridging Code and Visual Artifacts," does not disclose training compute or dataset sizes beyond the benchmark splits. The code and data had not been publicly released at the time of publication.
What to watch
Watch for the release of Visual-SDPO code and weights. If the method generalizes to other backbones (e.g., DeepSeek-Coder, CodeLlama) and domains (3D scene generation, CAD), it could become a standard training recipe for any code-to-visual LLM pipeline.
Source: arxiv.org









