PerfectSquashBench Tests Image Model Anchoring Bias vs. Text Models

Wharton professor Ethan Mollick released PerfectSquashBench, a test showing image generation models exhibit stronger anchoring bias than text models, getting 'stuck' on initial directions and requiring context window clearing.

AAAla SMITH & AI Research Desk·Apr 22, 2026·5 min read··114 views·AI-Generated·Report error

Source: x.comvia @emollickSingle Source

TL;DR

Ethan Mollick introduces PerfectSquashBench, a new benchmark revealing image models anchor more stubbornly to initial prompts than text models, requiring frequent context resets.

PerfectSquashBench Reveals Image Models' Stubborn Anchoring Bias Compared to Text Models

Ethan Mollick, a professor at The Wharton School and prominent AI researcher, has introduced a new informal benchmark called PerfectSquashBench designed to measure how stubbornly image generation models anchor to initial prompts. The test reveals a significant behavioral difference between multimodal modalities: image models tend to get "much more stuck" on a particular direction than text models, often requiring users to clear the context window entirely to achieve meaningful variation.

The benchmark is simple in concept but revealing in practice. Users repeatedly prompt an image model to generate variations on a theme—in this case, creating "the perfect squash." Mollick's observation, shared on social media platform X, notes that "the squash remains merely fine after many attempts," indicating that the model struggles to break away from its initial interpretation or stylistic direction, even with iterative prompting.

What the Test Reveals

Can skills-based tests still be biased? - TG

Anchoring bias—where an initial piece of information (the "anchor") disproportionately influences subsequent decisions—is a well-documented cognitive phenomenon in humans. Mollick's work suggests it's also a measurable characteristic in AI models, particularly in their generative behavior.

Key Finding: Image models demonstrate a stronger and more persistent anchoring effect than their text-only counterparts. When a text model receives a series of related prompts, it can more easily explore semantic variations, adjust tone, or shift perspective. An image model, however, often latches onto specific visual attributes from the first prompt—composition, color palette, art style, or object representation—and resists significant deviation in subsequent generations within the same session.

This has a direct impact on practical workflow. As Mollick notes, it often "requires clearing the context window fairly often" to get an image model to try a genuinely new approach. Simply asking for "a different version" or "more realistic" may produce only minor tweaks rather than a conceptual reset.

Context and Implications for Practitioners

This behavioral quirk isn't merely academic. For developers, designers, and researchers using image generation APIs (like those from OpenAI, Midjourney, or Stability AI), understanding this anchoring tendency is crucial for efficient prompting.

Workflow Impact: Users may be wasting time and computational resources (and API credits) on long chains of iterative image prompts that yield diminishing returns. The more effective strategy, according to this finding, might be to start fresh conversations or sessions when seeking a substantially different visual direction.
Benchmarking Model "Creativity": PerfectSquashBench could become a quick, qualitative tool for comparing different image models. A model that shows less anchoring and more flexibility across a prompt series might be considered more creatively responsive, all else being equal.
Prompt Engineering: This highlights a difference in optimal prompt strategy between text and image generation. While text models can benefit from long, nuanced conversations building on context, image models might require a "reset and restart" approach for divergent ideas.

gentic.news Analysis

Understanding Polynomial Regression Model - Analytics Vidhya

Mollick's informal benchmark touches on a growing area of research: quantifying the behavioral and psychological characteristics of AI models, not just their raw accuracy or fidelity. This aligns with trends we've covered, such as the push for more nuanced LLM evaluations beyond multiple-choice question benchmarks. It's less about "can the model draw a squash?" and more about "how does the model behave when asked to rethink the squash?"

The finding that image models anchor more strongly than text models is technically plausible. Text generation operates in a high-dimensional, discrete token space where synonyms, sentence structures, and narrative pivots offer many paths away from an anchor. Image generation, while also high-dimensional, often relies on latent spaces where initial noise seeds and early diffusion steps heavily constrain the final output. A model's internal representation of "squash" may be a narrower, more rigid cluster than its representation of the concept of a squash described in text.

For practitioners, this is a useful, practical heuristic. It suggests that the common practice of using a single, high-quality image as a "style reference" or starting point for a series (common in Midjourney's --sref or DALL-E 3's style seeds) is essentially formalizing and leveraging this anchoring bias. The challenge, as Mollick points out, is when the bias works against the user's goal of exploration.

Looking forward, we expect to see more research into "prompt adherence" versus "prompt flexibility" as desirable but competing model traits. Future model cards might even report metrics on anchoring bias, giving users a clearer expectation of how a model will behave in an extended creative session.

Frequently Asked Questions

What is anchoring bias in AI?

Anchoring bias in AI refers to a model's tendency to be overly influenced by the initial information or prompt it receives in a session, making it difficult to steer its outputs in a significantly different direction through follow-up prompts alone. It's an analogy to the cognitive bias of the same name in human psychology.

How can I get an image model to break its anchoring?

Based on Mollick's observation, the most reliable method is to clear the context window. This means starting a brand new chat session or generation thread. Simply adding negative prompts or asking for "a completely different style" within the same session is often less effective, as the model's internal state remains conditioned on the initial anchor.

Does this mean text models don't have anchoring bias?

No, text models can also exhibit anchoring, but Mollick's finding suggests the effect is quantitatively stronger in image models. Text models may fixate on a writing style, genre, or perspective, but they generally offer more degrees of freedom for semantic change without requiring a full reset.

Is PerfectSquashBench a formal, quantitative benchmark?

Not currently. As introduced, it is an informal, qualitative test and observation. However, it provides a clear framework that could be developed into a quantitative benchmark by, for example, measuring the visual similarity (via CLIP scores or other metrics) across a series of iterative prompts compared to a reset baseline.

Source: gentic.news · Apr 22, 2026 · author=Ala SMITH · citation.json

AI-assisted reporting. Generated by gentic.news from multiple verified sources, fact-checked against the Living Graph of 4,300+ entities. Edited by Ala SMITH.

Following this story?

Get a weekly digest with AI predictions, trends, and analysis — free.

AI Analysis

Mollick's PerfectSquashBench observation, while anecdotal, points to a tangible difference in the generative "psychology" of text versus image models. This isn't about capability, but about behavioral rigidity in the creative process. The technical root likely lies in the differing nature of their output spaces and sampling processes. Text generation's sequential, autoregressive nature across a vast vocabulary may allow for more branching exploration. In contrast, image generation via diffusion models is heavily guided by early noise patterns and conditioning; the model commits to a visual trajectory early in the denoising process, making it harder to course-correct mid-generation. This has direct implications for human-AI collaboration. It suggests that co-creating with an image model is less like an open-ended conversation (as it can be with text) and more like providing a detailed brief, reviewing a draft, and then, if the direction is wrong, issuing a completely new brief rather than attempting line edits. For AI developers, mitigating excessive anchoring could become a design goal, perhaps through techniques that encourage greater latent space exploration within a session or by offering explicit "conceptual reset" controls within interfaces. Ultimately, benchmarks like this move us toward a more mature understanding of AI tools. We're progressing from asking "what can it make?" to "how does it work with us?" Understanding a model's tendencies—its stubbornness, its flexibility, its interpretative leaps—is as important as knowing its resolution or accuracy for integrating it into real creative and professional workflows.

#generative-ai #research #benchmark #computer-vision

Mentioned in this article

Ethan Mollick PerfectSquashBench

Enjoyed this article?