Google's Gemini 3.1 Pro has taken the lead on the METR (Model Evaluation and Threat Research) time horizon benchmark, a key metric for evaluating how long an AI agent can successfully operate on a software task without human intervention. The model now handles tasks that take human programmers an average of 1 hour and 30 minutes to complete, surpassing all current GPT-5 variants.
The Benchmark Results
The METR time horizon benchmark measures an AI model's ability to complete software engineering tasks autonomously; the reported horizon is the longest task length at which the model still succeeds 80% of the time. According to the results shared by independent researcher @kimmonismus (a sketch of how such a horizon is estimated follows the list):
- Average Task Duration: 1 hour 30 minutes (human-equivalent time)
- 95% Confidence Interval: 52 minutes to 2 hours 39 minutes
- Average Score: 77%
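For intuition, here is a minimal sketch of how a time horizon like this can be estimated, assuming the methodology METR has published: fit a logistic curve of task success against log task length, then solve for the length at which the predicted success rate crosses 80%. The per-task data below is invented for illustration.

```python
# Minimal sketch of a METR-style time-horizon estimate, assuming a
# logistic fit of task success against log2 task length. The per-task
# data is invented for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical results: (human-equivalent minutes, agent succeeded?)
tasks = [(2, 1), (5, 1), (15, 1), (30, 1), (45, 1), (60, 1),
         (90, 1), (90, 0), (120, 0), (180, 0), (240, 0)]

X = np.log2([[minutes] for minutes, _ in tasks])   # log2 task length
y = np.array([success for _, success in tasks])    # 1 = success, 0 = failure

model = LogisticRegression().fit(X, y)
b0, b1 = model.intercept_[0], model.coef_[0][0]

# p = sigmoid(b0 + b1 * log2(minutes)); solve for p = 0.8,
# i.e. b0 + b1 * log2(minutes) = logit(0.8) = ln(0.8 / 0.2).
horizon_80 = 2 ** ((np.log(0.8 / 0.2) - b0) / b1)
print(f"Estimated 80% time horizon: {horizon_80:.0f} human-minutes")
```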
This performance places Gemini 3.1 Pro ahead of:
- GPT-5.2 (high)
- GPT-5.1-Codex-Max
- GPT-5
- All other tested models
Historical Context: The Curve is Bending
The progress on this benchmark reveals the accelerating pace of AI agent development. In 2023, GPT-4 scored near zero on this same metric—effectively incapable of handling extended autonomous tasks. Just three years later, the conversation has shifted from tasks measured in seconds to tasks measured in hours.
This represents a fundamental shift in what's possible with AI agents. As @kimmonismus notes: "The doubling time on autonomous task length is the number to watch. If it holds, multi-hour agentic work stops being a demo and starts being a workflow."
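To see why the doubling time is the number to watch, consider a quick back-of-the-envelope projection. The 7-month doubling period below is an illustrative assumption, not a figure reported in these benchmark results.

```python
# Back-of-the-envelope projection of the doubling dynamic @kimmonismus
# describes. The 7-month doubling period is an illustrative assumption,
# not a figure from the results above.
current_horizon_min = 90           # Gemini 3.1 Pro's reported horizon
doubling_time_months = 7           # hypothetical doubling period

for months_ahead in (7, 14, 21, 28):
    doublings = months_ahead / doubling_time_months
    projected_min = current_horizon_min * 2 ** doublings
    print(f"+{months_ahead:2d} months: ~{projected_min / 60:.0f}-hour tasks")
```

Under that assumption, 90-minute tasks become roughly 3-hour tasks in seven months and full-day tasks within about two years.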
Competitive Landscape Shift
Perhaps more significant than the raw numbers is the competitive shift. The METR time horizon benchmark has been dominated by OpenAI and Anthropic models for most of its existence. Google's quiet ascent to the top spot represents the first time a non-OpenAI/Anthropic model has led this particular benchmark.
This comes at a critical time in the AI landscape, with Google I/O scheduled for May 2026. The performance suggests Google may be closing the gap in areas where OpenAI has traditionally held strong advantages.
What This Means for Developers
For AI engineers and technical leaders, this benchmark has practical implications:
- Agentic Workflows Become Practical: Tasks that previously required constant human supervision can now be delegated to AI agents for extended periods
- Development Cost Reduction: The ability to handle 90-minute tasks autonomously could significantly reduce the human time required for software development and testing
- New Architecture Possibilities: Systems can be designed with longer feedback loops, allowing agents to tackle more complex, multi-step problems (see the loop sketch after this list)
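As a concrete illustration of that last point, here is a minimal sketch of a long-feedback-loop agent design. `call_model()` and `run_tests()` are placeholders for whatever model API and test harness a team actually uses; the structure, not the specific calls, is the point: the agent iterates on the whole task between checkpoints instead of pausing for human review after every step.

```python
# Minimal sketch of a long-feedback-loop agent design. call_model() and
# run_tests() are placeholders, not a real API.
import time

MAX_WALL_CLOCK_S = 90 * 60   # budget matching the ~90-minute horizon
MAX_ITERATIONS = 50

def call_model(prompt: str) -> str:
    """Placeholder: send the prompt to a model, return a proposed patch."""
    raise NotImplementedError

def run_tests(patch: str) -> tuple[bool, str]:
    """Placeholder: apply the patch, run tests, return (passed, log)."""
    raise NotImplementedError

def autonomous_task(spec: str) -> str | None:
    deadline = time.monotonic() + MAX_WALL_CLOCK_S
    feedback = ""
    for _ in range(MAX_ITERATIONS):
        if time.monotonic() > deadline:
            break                        # budget exhausted: escalate to a human
        patch = call_model(f"Task:\n{spec}\n\nPrior feedback:\n{feedback}")
        passed, log = run_tests(patch)
        if passed:
            return patch                 # task completed autonomously
        feedback = log                   # feed failures into the next attempt
    return None                          # human intervention required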
Technical Implications
While the source doesn't provide technical details about how Gemini 3.1 Pro achieves this performance, the results suggest several possible improvements:
- Better Long-Context Understanding: Handling 90-minute tasks requires maintaining coherence over extended sequences
- Improved Error Recovery: Autonomous agents need robust mechanisms to recover from mistakes without human intervention (sketched after this list)
- Enhanced Planning Capabilities: Longer tasks require more sophisticated planning and decomposition of complex problems
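None of this is confirmed for Gemini 3.1 Pro, but the error-recovery point in particular lends itself to a concrete pattern: checkpoint the workspace before each risky step and roll back on failure rather than halting for a human. A minimal sketch, with `execute_step()` as a placeholder:

```python
# Sketch of one plausible error-recovery pattern for long-running agents:
# checkpoint before each step, roll back and retry on failure, and only
# escalate to a human when retries are exhausted. execute_step() is a
# placeholder, not a real API.
import shutil
import tempfile
from pathlib import Path

def execute_step(workdir: Path, step: str) -> None:
    """Placeholder: have the agent perform one step inside workdir."""
    raise NotImplementedError

def run_with_rollback(workdir: Path, steps: list[str], retries: int = 2) -> None:
    for step in steps:
        for attempt in range(retries + 1):
            snapshot = Path(tempfile.mkdtemp())              # checkpoint dir
            shutil.copytree(workdir, snapshot, dirs_exist_ok=True)
            try:
                execute_step(workdir, step)
                break                                        # step succeeded
            except Exception:
                shutil.rmtree(workdir)                       # discard bad state
                shutil.copytree(snapshot, workdir)           # restore checkpoint
                if attempt == retries:
                    raise                                    # out of retries: escalate
            finally:
                shutil.rmtree(snapshot, ignore_errors=True)  # clean up checkpoint
```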
Limitations and Caveats
It's important to note that benchmark results don't always translate directly to real-world performance. The METR time horizon benchmark uses specific software tasks that may not represent all types of work. Additionally:
- The 95% confidence interval is wide (52 minutes to 2 hours 39 minutes), suggesting significant variability
- The 77% average score indicates room for improvement in consistency
- Real-world deployment introduces complexities not captured in controlled benchmarks
Agentic.news Analysis
This development represents a significant inflection point in the AI agent race. For the past two years, OpenAI has consistently led on autonomous task benchmarks, with Anthropic maintaining strong positions in safety-focused evaluations. Google's breakthrough here suggests their investment in agentic capabilities—particularly following their 2025 acquisition of robotics firm Everyday Robots and their integration of DeepMind's AlphaCode 2 technology—is paying dividends.
The timing is particularly strategic. With Google I/O approaching in May 2026, this benchmark leadership positions Google to make a strong case for Gemini's enterprise capabilities. It also aligns with the broader industry trend we've been tracking: the shift from conversational AI to agentic AI. As we noted in our 2025 year-end analysis, "autonomous task completion measured in hours, not minutes" was identified as the key threshold for practical agent deployment.
What's most telling is the competitive dynamic. OpenAI's GPT-5 series, launched throughout 2025, established new standards for coding assistance but apparently hasn't maintained dominance on extended autonomous tasks. This suggests Google may have made architectural choices specifically optimized for long-horizon planning—an area where transformer models traditionally struggle. If Google can maintain this lead, it could reshape the competitive landscape ahead of the expected GPT-6 announcements later this year.
Frequently Asked Questions
What is the METR time horizon benchmark?
The METR (Model Evaluation and Threat Research) time horizon benchmark measures how long an AI agent can successfully work on a software engineering task without human intervention. It reports the horizon as the task length at which a model still succeeds 80% of the time, expressed in human-equivalent terms: how long it would take a human programmer to complete the same task.
How significant is handling 90-minute tasks compared to previous models?
Extremely significant. In 2023, state-of-the-art models like GPT-4 scored near zero on this benchmark, meaning they couldn't handle extended autonomous work. Moving from tasks measured in seconds to a 90-minute (5,400-second) horizon is a gain of roughly three orders of magnitude in autonomous capability, making practical agentic workflows possible for the first time.
Does this mean Gemini 3.1 Pro is better than GPT-5 for all coding tasks?
Not necessarily. This benchmark specifically measures autonomous task length, not overall coding quality, accuracy, or speed. GPT-5 variants may still excel in other areas like code generation quality, understanding complex requirements, or working with specific programming languages. Different benchmarks measure different capabilities.
When will developers be able to use Gemini 3.1 Pro with these capabilities?
The benchmark results suggest these capabilities exist in Google's research environment. Typically, such advancements reach public APIs within months. Given the timing before Google I/O in May 2026, it's likely Google will announce broader availability of these agentic capabilities at or shortly after their developer conference.