SalesSim, a new arXiv benchmark, reveals top MLLMs score below 79% on retail persona alignment. The paper proposes UserGRPO, a multi-turn RL recipe that lifts alignment by 13.8%.
Key facts
- Best MLLM scored below 79% on persona decision alignment.
- UserGRPO RL boosts alignment by 13.8%.
- Benchmarked 6 open and closed-source state-of-the-art models.
- Models overdisclose criteria and drift under sales persuasion.
- Published on arXiv May 8, 2026, by Pruksachatkun et al.
SalesSim, published on arXiv May 8, 2026, by Yada Pruksachatkun and colleagues, introduces a framework for evaluating how well multimodal LLMs simulate persona-driven customer behavior in online retail conversations. Unlike prior work that treats user simulation as surface-level dialogue generation, SalesSim models shopping as a grounded, agentic process in which shoppers with specific preferences and dealbreakers interact with a sales agent, seek clarifications, and make purchasing decisions [According to the arXiv preprint].
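The paper does not publish its persona schema, but the description suggests structured soft preferences plus hard dealbreakers. A minimal sketch, with hypothetical field names chosen for exposition:

```python
from dataclasses import dataclass, field

@dataclass
class Persona:
    """Illustrative shopper persona. Field names are assumptions for
    exposition; the paper does not publish its schema."""
    preferences: dict = field(default_factory=dict)    # soft wants, e.g. {"color": "black"}
    dealbreakers: dict = field(default_factory=dict)   # hard limits, e.g. {"max_price": 120.0}

    def accepts(self, product: dict) -> bool:
        """A persona-consistent shopper rejects any product that violates
        a dealbreaker, no matter how persuasive the sales agent is."""
        price = product.get("price", float("inf"))
        return price <= self.dealbreakers.get("max_price", float("inf"))
```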
The benchmark tested six open and closed-source MLLMs on a suite of metrics centered on decision alignment, the consistency between the simulator's actions and its persona specification. The results are sobering: even the strongest model achieved less than 79% average alignment with its underlying persona. Models also exhibited significantly lower lexical diversity than human conversations and tended to overdisclose their decision criteria. More critically, models proved susceptible to persuasion by the sales agent, drifting from their persona specifications over multiple turns.
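The paper does not spell out how decision alignment is computed. One plausible reading, assumed here rather than taken from the paper, is the fraction of the simulator's decisions that are consistent with its persona (using the hypothetical Persona sketch above):

```python
def decision_alignment(decisions: list, persona: Persona) -> float:
    """Assumed metric: share of simulator decisions consistent with the
    persona. A 'purchase' of a dealbreaker-violating product counts as
    misaligned; this is an illustrative reading, not the paper's formula."""
    if not decisions:
        return 1.0
    consistent = sum(
        1 for d in decisions
        if d["action"] != "purchase" or persona.accepts(d["product"])
    )
    return consistent / len(decisions)
```

On this reading, a simulator that buys one over-budget product out of five decisions scores 0.8, roughly where the sub-79% ceiling the paper reports sits.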
Why the 79% ceiling matters
The finding cuts against the narrative that LLMs are ready to replace human role-play in e-commerce A/B testing, customer service training, or recommendation system evaluation. If a simulated customer abandons its price constraint after a sales pitch, the resulting data poisons downstream models. The 79% ceiling is not an engineering detail; it is a structural limitation of current MLLMs in maintaining goal-directed behavior under social pressure.
UserGRPO: RL as the fix
To address these gaps, the authors propose UserGRPO, a multi-turn, multi-objective reinforcement learning recipe. It jointly optimizes conversational fluency and decision alignment under persona specifications, using a reward structure that penalizes drift from persona constraints. Per the paper's reported metrics, UserGRPO boosts the baseline model's decision alignment by 13.8% while also improving conversational quality.
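The paper does not release the recipe's reward function or code. A minimal sketch of the shape it describes, assuming illustrative weights and component names, combines a fluency term and an alignment term with a drift penalty, then normalizes rewards group-relative in the usual GRPO style:

```python
import numpy as np

def turn_reward(fluency: float, alignment: float, drift_events: int,
                w_fluency: float = 0.3, w_align: float = 0.7,
                drift_penalty: float = 0.5) -> float:
    """Hypothetical multi-objective reward: weights and terms are
    assumptions, not the paper's published recipe."""
    return w_fluency * fluency + w_align * alignment - drift_penalty * drift_events

def group_relative_advantages(rewards: np.ndarray) -> np.ndarray:
    """GRPO-style advantage: each rollout's reward is normalized against
    the group of rollouts sampled for the same prompt/persona."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)
```

The drift penalty is the part doing the persona-preserving work: it makes conceding a dealbreaker under persuasion strictly costly, even when doing so would read as a smoother conversation.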

The approach builds on recent work in RL for agent alignment, such as OpenClaw-RL (covered by gentic.news on May 6, 2026), which trains agents on conversation feedback without manual labels. UserGRPO extends this to a multi-objective setting, balancing fluency against adherence to detailed persona specs, a task that requires the model to say "no" to a convincing sales pitch.
What's missing
The paper does not disclose the specific model used as the baseline for UserGRPO, nor does it release the full benchmark dataset or code publicly. The authors state the data includes rich product metadata with multimodal information, but the scale (number of products, personas, or conversation turns) is not quantified [Per the arXiv abstract]. These omissions limit reproducibility and make it difficult to assess how the 13.8% gain generalizes across model families.

What to watch
Watch for the release of the SalesSim benchmark dataset and code. If the authors open-source it, expect rapid adoption as a standard evaluation for agentic retail simulators. Also track whether major labs (Anthropic, OpenAI, Meta) publish their own persona-alignment scores against SalesSim.