The GPQA Diamond Benchmark: Mapping the Turbulent AI Race
A recent visualization of performance on the GPQA Diamond benchmark has provided a compelling snapshot of the rapidly shifting dynamics in the race to develop advanced artificial intelligence. Shared by researcher Ethan Mollick, the chart tracks the progress of major AI labs over time on this challenging benchmark, revealing a story of fleeting leads, surprising surges, and unexpected plateaus.
The GPQA (Graduate-Level Google-Proof Q&A) benchmark, and particularly its "Diamond" subset, is designed to be exceptionally difficult. It consists of multiple-choice questions written by domain experts in fields like biology, physics, and chemistry, crafted to require graduate-level understanding and to resist simple web searches. Success on this benchmark indicates an AI model's deep reasoning capabilities and mastery of complex, specialized knowledge.
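Scoring on a multiple-choice benchmark like this reduces to comparing a model's chosen option against the answer key. As a minimal illustrative sketch (the questions and answers below are invented placeholders, not actual GPQA items, which are kept private precisely to avoid contamination):

```python
def score_multiple_choice(predictions, gold_answers):
    """Return the fraction of questions where the predicted letter matches the key."""
    if len(predictions) != len(gold_answers):
        raise ValueError("prediction/answer length mismatch")
    correct = sum(p == g for p, g in zip(predictions, gold_answers))
    return correct / len(gold_answers)

# Hypothetical model outputs vs. an answer key for four questions (options A-D).
preds = ["B", "D", "A", "C"]
key   = ["B", "D", "C", "C"]
print(score_multiple_choice(preds, key))  # 3 of 4 correct -> 0.75
```

With four answer options per question, random guessing lands around 25%, which is why frontier-model scores well above that on GPQA Diamond are treated as evidence of genuine reasoning rather than chance.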
OpenAI's Early Dominance and the Opening of the Field
The visualization clearly shows that for a significant period, OpenAI operated in a league of its own. Following the release of GPT-4, the company maintained a substantial lead on the GPQA Diamond benchmark, with no other publicly available model coming close. This period of solitary dominance underscored the technical leap represented by OpenAI's flagship model and allowed the company to set the pace and direction for the entire industry. Competitors were left with a clear target, but a daunting one.
The Meteoric Rise and Subsequent Stall of Meta's Llama
The narrative then shifts with the entry of Meta's Llama models, particularly Llama 3. The chart illustrates a dramatic "rise" as Meta's open-weight models rapidly closed the gap with OpenAI's performance. The release of Llama 3 405B in particular was a landmark moment for the open-source and open-weight community, proving that a model outside OpenAI could achieve near-state-of-the-art results on elite benchmarks. However, the visualization also notes a subsequent "collapse" or stall in Meta's progress. After its rapid ascent, Meta's trajectory on this benchmark flattened, suggesting a period of consolidation or a challenge in making the next leap in capability.
xAI's Sudden Ascent and Unexpected Plateau
One of the most striking sequences in the chart is the story of xAI, Elon Musk's AI company. Grok, xAI's model, is shown making a "sudden catch-up," rapidly improving its score to join the top tier of performers. This indicates a period of intense and effective development, likely around the release of Grok-1.5 or Grok-2. However, this surge was followed by what Mollick describes as "stagnation." After its rapid climb, xAI's progress on this benchmark appears to have halted, leaving it in the pack rather than breaking away. This pattern highlights the nonlinear nature of AI advancement, where breakthroughs can be followed by difficult plateaus.
The New Entrants: Chinese Open-Weight LLMs
The final act in this visualization is the "entry of open weights Chinese LLMs." Models like Qwen from Alibaba and DeepSeek from DeepSeek AI have appeared on the leaderboard, demonstrating strong performance. Their presence marks a significant geographical and strategic expansion of the high-stakes AI race. The fact that they are open-weight models is particularly notable, as it contributes to the growing ecosystem of powerful, accessible AI tools outside the control of a few U.S.-based corporations. This development globalizes the competition and increases the diffusion of cutting-edge AI capabilities.
What the Benchmark Race Means for the Future
While benchmark scores are an imperfect measure of real-world utility, they serve as critical mile markers in the development race. The GPQA Diamond benchmark, with its high difficulty, acts as a proxy for reasoning depth and technical knowledge. The fluctuations seen in this chart reflect the immense resources being deployed, the different strategic approaches (closed vs. open-weight), and the inherent unpredictability of research breakthroughs.
The visualization suggests that the era of a single dominant leader may be over, replaced by a crowded field of capable contenders. However, it also shows that maintaining a continuous upward trajectory is exceptionally difficult. The "stagnation" phases experienced by Meta and xAI reveal the steepening challenge of improvement as models approach the frontier of known capabilities. The next phase of the race may depend less on scaling existing methods and more on fundamental architectural or algorithmic innovations.
Source: Analysis based on a visualization shared by Ethan Mollick (@emollick) tracking performance on the GPQA Diamond benchmark.