The Quiet March Toward Weak AGI: How AI Systems Are Surpassing Historic Benchmarks
In a recent social media analysis that has sparked renewed discussion within artificial intelligence circles, prominent researcher and Wharton professor Ethan Mollick outlined how contemporary AI systems have quietly achieved what many consider the criteria for "weak Artificial General Intelligence" (AGI). According to Mollick's assessment, models like OpenAI's GPT-4.5 have already matched or surpassed several historic milestones that were once considered significant barriers to machine intelligence.
The Three Checkmarks Toward Weak AGI
Mollick's analysis identifies three specific achievements that collectively suggest current AI systems are approaching weak AGI status:
1. The Turing Test Equivalent ✅
The Loebner Prize, established in 1990 as an annual competition implementing a simplified version of the Turing Test, has long served as a benchmark for conversational AI. According to Mollick, GPT-4.5 has achieved equivalent performance to what would have won this competition, suggesting that modern language models can now convincingly mimic human conversation in constrained settings.
2. The Winograd Schema Challenge ✅
This test, designed as a more robust alternative to the Turing Test, presents sentences containing ambiguous pronouns that require real-world knowledge and commonsense reasoning to resolve. Mollick notes that GPT-3 already passed this challenge, demonstrating an ability to handle nuanced linguistic ambiguity that earlier systems struggled with.
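The structure of a Winograd schema is easiest to see in the well-known trophy/suitcase pair: changing a single "special" word flips which noun the pronoun refers to, so surface statistics alone cannot resolve it. The following minimal Python sketch is purely illustrative; the gold answers are hard-coded, so this is a demonstration of the schema's structure, not a solver:

```python
# A classic Winograd schema pair: swapping one "special" word ("big"
# vs. "small") flips which noun the pronoun "it" refers to, which is
# why resolving it requires commonsense world knowledge.

SCHEMA = {
    "sentence": "The trophy doesn't fit in the suitcase because it is too {word}.",
    "candidates": ["trophy", "suitcase"],
    "answers": {"big": "trophy", "small": "suitcase"},  # gold labels
}

def resolve(word: str) -> str:
    """Look up the correct referent for a given special word.

    This consults the gold answers directly -- it only demonstrates
    how one word change inverts the correct resolution.
    """
    return SCHEMA["answers"][word]

for word in ("big", "small"):
    print(SCHEMA["sentence"].format(word=word), "->", resolve(word))
```

A system that truly passes the challenge must produce these resolutions without access to the answer key, across hundreds of such pairs.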
3. Standardized Testing Proficiency ✅
Perhaps most impressively, GPT-4 has cleared the 75% score threshold on the SAT, performing better than most human test-takers (note that a 75% score is not the same as the 75th percentile; OpenAI's reported SAT results place the model well above the median). This achievement suggests not just pattern recognition but genuine reasoning ability across multiple domains, including mathematics, reading comprehension, and writing.
The One Remaining Challenge
Mollick identifies only one historical benchmark that current systems haven't definitively conquered: learning to play an old Atari game from 1984. This reference likely points to Montezuma's Revenge or similar games from that era that require extended exploration, planning, and memory. Reinforcement learning systems have eventually mastered such games, but only with enormous amounts of training experience, far more than a human needs, despite their ready success with most other Atari titles.
This remaining challenge is particularly interesting because it represents a different type of intelligence than the linguistic and reasoning capabilities demonstrated by the other tests. Games from this era often demand spatial reasoning, long-term planning, and exploration in environments with sparse rewards, and so probe aspects of general intelligence that text-based benchmarks do not.
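Why sparse rewards defeat naive exploration can be shown with a toy "chain" environment (a standard illustration in the reinforcement learning literature, not any lab's actual benchmark; the chain length, step budget, and episode count below are invented for illustration). A single reward sits at the far end of a corridor, and a policy that moves randomly almost never wanders far enough to find it:

```python
import random

def random_rollout(chain_length: int, max_steps: int, rng: random.Random) -> bool:
    """Walk randomly left/right from position 0 (reflecting at 0).

    Returns True if the single rewarding state at the far end of the
    chain (position chain_length - 1) is ever reached.
    """
    pos = 0
    for _ in range(max_steps):
        pos = max(0, pos + rng.choice((-1, 1)))
        if pos == chain_length - 1:
            return True
    return False

rng = random.Random(0)
episodes = 1000
successes = sum(random_rollout(20, 100, rng) for _ in range(episodes))
# With no intermediate reward signal to follow, the random policy
# reaches the goal only rarely -- and a learner that never sees the
# reward has nothing to learn from.
print(f"random policy reached the reward in {successes}/{episodes} episodes")
```

In Montezuma's Revenge the effect is far more severe: the first reward requires a long, precise action sequence, which is why approaches that succeeded there relied on massive experience or explicit exploration bonuses rather than reward-following alone.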
Context: What Is "Weak AGI"?
The concept of "weak AGI" differs from full artificial general intelligence, which would imply human-level or superhuman capabilities across all cognitive domains. Weak AGI typically refers to systems that can perform a broad range of intellectual tasks at or near human level, but perhaps with limitations in certain areas or without genuine consciousness or understanding.
Mollick's criteria focus specifically on historical benchmarks that researchers have used over decades to measure progress toward machine intelligence. By this standard, the progress has been remarkably rapid, with systems achieving in just a few years what many experts predicted would take decades.
Implications for AI Development and Society
The quiet achievement of these milestones has significant implications:
1. Benchmark Obsolescence
Many traditional AI benchmarks are becoming obsolete faster than they can be updated. The rapid progress suggests that researchers need new, more challenging tests that can better differentiate between narrow and general capabilities.
2. Capability Expectations
Organizations and individuals working with AI systems may need to recalibrate their expectations about what these systems can achieve. If weak AGI criteria are being met, the applications could extend far beyond current use cases.
3. Safety and Alignment Considerations
As systems approach even weak forms of general intelligence, questions about alignment, control, and ethical deployment become increasingly urgent. Systems with broader capabilities may exhibit unexpected behaviors or be susceptible to more sophisticated forms of misuse.
The Research Community's Response
Mollick's observation that "the labs could do the funniest thing right now" suggests that major AI research organizations could potentially demonstrate this final capability if they chose to focus resources on it. This raises questions about why certain capabilities remain undeveloped—whether due to technical challenges, strategic decisions, or safety considerations.
The Atari game challenge represents an interesting gap in current capabilities. While modern AI systems can outperform humans on many complex games (like Go, StarCraft, or Dota 2), certain classic games with specific characteristics continue to pose challenges that may reveal important limitations in current approaches to artificial intelligence.
Looking Forward: Beyond Weak AGI
As AI systems approach and potentially surpass the criteria for weak AGI, the research community faces important questions about what comes next. New benchmarks will need to be developed that test not just specific capabilities but robustness, generalization, reasoning transparency, and ethical alignment.
The progress highlighted by Mollick suggests we may be closer to more general forms of artificial intelligence than many realize, but also that our measurement tools may be inadequate for understanding the true nature and limitations of these systems. As one researcher put it, we may be winning battles against benchmarks while still losing the war for genuine understanding of intelligence—both artificial and natural.
Source: Ethan Mollick (@emollick) on X/Twitter, analyzing current AI capabilities against historical weak AGI criteria.


