Google's Gemini 3.1 Pro has taken the lead on the METR (Model Evaluation and Threat Research) time horizon benchmark, a key metric for evaluating how long an AI agent can successfully operate on a software task without human intervention. The model now handles tasks that take human programmers an average of 1 hour and 30 minutes to complete, surpassing all current GPT-5 variants.
The Benchmark Results
The METR time horizon benchmark measures an AI model's ability to complete software engineering tasks autonomously; the reported horizon is the longest task length at which the model still succeeds 80% of the time. According to the results shared by independent researcher @kimmonismus (a sketch of how such a horizon is estimated follows the list):
- Average Task Duration: 1 hour 30 minutes (human-equivalent time)
- 95% Confidence Interval: 52 minutes to 2 hours 39 minutes
- Average Score: 77%
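For intuition, here is a minimal sketch of how a time horizon like this can be estimated, assuming the methodology METR has published: fit a logistic curve of task success against log task length, then solve for the length at which the predicted success rate crosses 80%. The per-task data below is invented for illustration.

```python
# Minimal sketch of a METR-style time-horizon estimate, assuming a
# logistic fit of task success against log2 task length. The per-task
# data is invented for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical results: (human-equivalent minutes, agent succeeded?)
tasks = [(2, 1), (5, 1), (15, 1), (30, 1), (45, 1), (60, 1),
         (90, 1), (90, 0), (120, 0), (180, 0), (240, 0)]

X = np.log2([[minutes] for minutes, _ in tasks])   # log2 task length
y = np.array([success for _, success in tasks])    # 1 = success, 0 = failure

model = LogisticRegression().fit(X, y)
b0, b1 = model.intercept_[0], model.coef_[0][0]

# p = sigmoid(b0 + b1 * log2(minutes)); solve for p = 0.8,
# i.e. b0 + b1 * log2(minutes) = logit(0.8) = ln(0.8 / 0.2).
horizon_80 = 2 ** ((np.log(0.8 / 0.2) - b0) / b1)
print(f"Estimated 80% time horizon: {horizon_80:.0f} human-minutes")
```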
This performance places Gemini 3.1 Pro ahead of:
- GPT-5.2 (high)
- GPT-5.1-Codex-Max
- GPT-5
- All other tested models
Historical Context: The Curve is Bending
The progress on this benchmark reveals the accelerating pace of AI agent development. In 2023, GPT-4 scored near zero on this same metric—effectively incapable of handling extended autonomous tasks. Just three years later, the conversation has shifted from tasks measured in seconds to tasks measured in hours.
This represents a fundamental shift in what's possible with AI agents. As @kimmonismus notes: "The doubling time on autonomous task length is the number to watch. If it holds, multi-hour agentic work stops being a demo and starts being a workflow."
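To see why the doubling time is the number to watch, consider a quick back-of-the-envelope projection. The 7-month doubling period below is an illustrative assumption, not a figure reported in these benchmark results.

```python
# Back-of-the-envelope projection of the doubling dynamic @kimmonismus
# describes. The 7-month doubling period is an illustrative assumption,
# not a figure from the results above.
current_horizon_min = 90           # Gemini 3.1 Pro's reported horizon
doubling_time_months = 7           # hypothetical doubling period

for months_ahead in (7, 14, 21, 28):
    doublings = months_ahead / doubling_time_months
    projected_min = current_horizon_min * 2 ** doublings
    print(f"+{months_ahead:2d} months: ~{projected_min / 60:.0f}-hour tasks")
```

Under that assumption, 90-minute tasks become roughly 3-hour tasks in seven months and full-day tasks within about two years.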
Competitive Landscape Shift
Perhaps more significant than the raw numbers is the competitive shift. The METR time horizon benchmark has been dominated by OpenAI and Anthropic models for most of its existence. Google's quiet ascent to the top spot represents the first time a non-OpenAI/Anthropic model has led this particular benchmark.
This comes at a critical time in the AI landscape, with Google I/O scheduled for May 2026. The performance suggests Google may be closing the gap in areas where OpenAI has traditionally held strong advantages.
What This Means for Developers
For AI engineers and technical leaders, this benchmark has practical implications:
- Agentic Workflows Become Practical: Tasks that previously required constant human supervision can now be delegated to AI agents for extended periods
- Development Cost Reduction: The ability to handle 90-minute tasks autonomously could significantly reduce the human time required for software development and testing
- New Architecture Possibilities: Systems can be designed with longer feedback loops, allowing agents to tackle more complex, multi-step problems (see the loop sketch after this list)
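As a concrete illustration of that last point, here is a minimal sketch of a long-feedback-loop agent design. `call_model()` and `run_tests()` are placeholders for whatever model API and test harness a team actually uses; the structure, not the specific calls, is the point: the agent iterates on the whole task between checkpoints instead of pausing for human review after every step.

```python
# Minimal sketch of a long-feedback-loop agent design. call_model() and
# run_tests() are placeholders, not a real API.
import time

MAX_WALL_CLOCK_S = 90 * 60   # budget matching the ~90-minute horizon
MAX_ITERATIONS = 50

def call_model(prompt: str) -> str:
    """Placeholder: send the prompt to a model, return a proposed patch."""
    raise NotImplementedError

def run_tests(patch: str) -> tuple[bool, str]:
    """Placeholder: apply the patch, run tests, return (passed, log)."""
    raise NotImplementedError

def autonomous_task(spec: str) -> str | None:
    deadline = time.monotonic() + MAX_WALL_CLOCK_S
    feedback = ""
    for _ in range(MAX_ITERATIONS):
        if time.monotonic() > deadline:
            break                        # budget exhausted: escalate to a human
        patch = call_model(f"Task:\n{spec}\n\nPrior feedback:\n{feedback}")
        passed, log = run_tests(patch)
        if passed:
            return patch                 # task completed autonomously
        feedback = log                   # feed failures into the next attempt
    return None                          # human intervention required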
Technical Implications
While the source doesn't provide technical details about how Gemini 3.1 Pro achieves this performance, the results suggest several possible improvements:
- Better Long-Context Understanding: Handling 90-minute tasks requires maintaining coherence over extended sequences
- Improved Error Recovery: Autonomous agents need robust mechanisms to recover from mistakes without human intervention (sketched after this list)
- Enhanced Planning Capabilities: Longer tasks require more sophisticated planning and decomposition of complex problems
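None of this is confirmed for Gemini 3.1 Pro, but the error-recovery point in particular lends itself to a concrete pattern: checkpoint the workspace before each risky step and roll back on failure rather than halting for a human. A minimal sketch, with `execute_step()` as a placeholder:

```python
# Sketch of one plausible error-recovery pattern for long-running agents:
# checkpoint before each step, roll back and retry on failure, and only
# escalate to a human when retries are exhausted. execute_step() is a
# placeholder, not a real API.
import shutil
import tempfile
from pathlib import Path

def execute_step(workdir: Path, step: str) -> None:
    """Placeholder: have the agent perform one step inside workdir."""
    raise NotImplementedError

def run_with_rollback(workdir: Path, steps: list[str], retries: int = 2) -> None:
    for step in steps:
        for attempt in range(retries + 1):
            snapshot = Path(tempfile.mkdtemp())              # checkpoint dir
            shutil.copytree(workdir, snapshot, dirs_exist_ok=True)
            try:
                execute_step(workdir, step)
                break                                        # step succeeded
            except Exception:
                shutil.rmtree(workdir)                       # discard bad state
                shutil.copytree(snapshot, workdir)           # restore checkpoint
                if attempt == retries:
                    raise                                    # out of retries: escalate
            finally:
                shutil.rmtree(snapshot, ignore_errors=True)  # clean up checkpoint
```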
Limitations and Caveats
It's important to note that benchmark results don't always translate directly to real-world performance. The METR time horizon benchmark uses specific software tasks that may not represent all types of work. Additionally:
- The 95% confidence interval is wide (52 minutes to 2 hours 39 minutes), suggesting significant variability
- The 77% average score indicates room for improvement in consistency
- Real-world deployment introduces complexities not captured in controlled benchmarks
Agentic.news Analysis
This development represents a significant inflection point in the AI agent race. For the past two years, OpenAI has consistently led on autonomous task benchmarks, with Anthropic maintaining strong positions in safety-focused evaluations. Google's breakthrough here suggests their investment in agentic capabilities—particularly following their 2025 acquisition of robotics firm Everyday Robots and their integration of DeepMind's AlphaCode 2 technology—is paying dividends.
The timing is particularly strategic. With Google I/O approaching in May 2026, this benchmark leadership positions Google to make a strong case for Gemini's enterprise capabilities. It also aligns with the broader industry trend we've been tracking: the shift from conversational AI to agentic AI. As we noted in our 2025 year-end analysis, "autonomous task completion measured in hours, not minutes" was identified as the key threshold for practical agent deployment.
What's most telling is the competitive dynamic. OpenAI's GPT-5 series, launched throughout 2025, established new standards for coding assistance but apparently hasn't maintained dominance on extended autonomous tasks. This suggests Google may have made architectural choices specifically optimized for long-horizon planning—an area where transformer models traditionally struggle. If Google can maintain this lead, it could reshape the competitive landscape ahead of the expected GPT-6 announcements later this year.
Frequently Asked Questions
What is the METR time horizon benchmark?
The METR (Model Evaluation and Threat Research) time horizon benchmark measures how long an AI agent can successfully work on a software engineering task without human intervention. It reports the horizon as the task length at which a model still succeeds 80% of the time, expressed in human-equivalent terms: how long it would take a human programmer to complete the same task.
How significant is handling 90-minute tasks compared to previous models?
Extremely significant. In 2023, state-of-the-art models like GPT-4 scored near zero on this benchmark, meaning they couldn't handle extended autonomous work. Moving from tasks measured in seconds to a 90-minute (5,400-second) horizon is a gain of roughly three orders of magnitude in autonomous capability, making practical agentic workflows possible for the first time.
Does this mean Gemini 3.1 Pro is better than GPT-5 for all coding tasks?
Not necessarily. This benchmark specifically measures autonomous task length, not overall coding quality, accuracy, or speed. GPT-5 variants may still excel in other areas like code generation quality, understanding complex requirements, or working with specific programming languages. Different benchmarks measure different capabilities.
When will developers be able to use Gemini 3.1 Pro with these capabilities?
The benchmark results suggest these capabilities exist in Google's research environment. Typically, such advancements reach public APIs within months. Given the timing before Google I/O in May 2026, it's likely Google will announce broader availability of these agentic capabilities at or shortly after their developer conference.