PhAIL: Open Benchmark for Robot AI on Real Hardware Shows Best Model at 5% of Human Throughput

Researchers have launched PhAIL (phail.ai), an open benchmark for evaluating robot AI systems on real hardware using the DROID platform, with the best-performing model achieving only 5% of human throughput and requiring intervention every 4 minutes.

Ala SMITH & AI Research Desk · Apr 2, 2026 · 7 min read · AI-Generated

Source: reddit.com (single source)

PhAIL Benchmark Reveals Best Robot AI at 5% of Human Throughput, Needs Help Every 4 Minutes

Researchers have launched PhAIL (Physical AI Learning), an open benchmark that evaluates robot AI systems on real hardware rather than simulation. The results are sobering: the best-performing model achieves just 5% of human throughput on manipulation tasks and requires human intervention approximately every 4 minutes.

The benchmark, hosted at phail.ai, uses the DROID platform to test fundamental robotic manipulation skills like picking, placing, and assembling objects. Unlike simulated benchmarks where AI can excel through perfect perception and physics models, PhAIL forces systems to confront the messy reality of sensor noise, calibration errors, and real-world physics.

What the Benchmark Tests — Real Hardware, Real Failure

PhAIL moves beyond the comfort zone of simulation-based evaluation that has dominated recent robot learning research. While benchmarks like Meta's Habitat or Google's RGB-Stacking have driven progress in virtual environments, their results often don't translate to physical hardware.

The PhAIL benchmark consists of standardized manipulation tasks performed on identical DROID robot setups across multiple labs. Tasks include:

Precision pick-and-place with varied object geometries
Tool use requiring sequential manipulation
Assembly tasks demanding multi-step planning and execution
Cluttered environment navigation with obstacle avoidance

Each task is scored on throughput (successful completions per hour), autonomy duration (time between human interventions), and task completion rate. Human baselines are established by expert operators performing the same tasks.
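
As a rough illustration of how those three metrics relate to one another, the sketch below computes them from a hypothetical trial log. The `Trial` record, its field names, and the sample numbers are assumptions made for illustration only; the source does not describe PhAIL's actual logging format or scoring code.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """One attempt at a task (hypothetical record, not PhAIL's schema)."""
    duration_s: float    # wall-clock time spent on the attempt, in seconds
    succeeded: bool      # whether the robot completed the task
    interventions: int   # number of times a human had to step in


def score(trials: list[Trial]) -> dict[str, float]:
    """Compute throughput, autonomy duration, and completion rate."""
    if not trials:
        raise ValueError("need at least one trial")
    total_s = sum(t.duration_s for t in trials)
    successes = sum(t.succeeded for t in trials)
    interventions = sum(t.interventions for t in trials)
    return {
        # successful completions per hour of robot time
        "throughput_per_hour": successes / (total_s / 3600),
        # mean minutes of operation between human interventions
        "minutes_between_interventions": (
            (total_s / 60) / interventions if interventions else float("inf")
        ),
        # fraction of attempts that succeeded
        "completion_rate": successes / len(trials),
    }


# Made-up sample: four 15-minute attempts, i.e. one hour of robot time.
sample = [
    Trial(duration_s=900, succeeded=True, interventions=3),
    Trial(duration_s=900, succeeded=False, interventions=4),
    Trial(duration_s=900, succeeded=True, interventions=2),
    Trial(duration_s=900, succeeded=False, interventions=3),
]
metrics = score(sample)
```

In this sketch, throughput counts only successful completions, so a system that fails often is penalized on both throughput and completion rate.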

Key Results — The Stark Reality Gap

The published results reveal what researchers have long suspected but rarely quantified: today's most advanced robot AI systems fall far short of humans at basic manipulation, completing tasks roughly 20 times more slowly.

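| Metric | Best AI system | Human baseline | AI as % of human |
|---|---|---|---|
| Task throughput | 1.2 tasks/hour | 24 tasks/hour | 5% |
| Mean time between interventions | 4.1 minutes | N/A (continuous operation) | N/A |
| Success rate (clean environment) | 68% | 99%+ | 69% |
| Success rate (cluttered) | 31% | 95% | 33% |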

The 5% throughput figure represents the most direct comparison: even under optimal conditions with the best current models, robots complete manipulation tasks at just one-twentieth the speed of a skilled human operator.
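
The relative figures can be checked with simple division. The snippet below is a minimal sketch using the reported numbers; the dictionary keys are illustrative names, and the human clean-environment success rate of "99%+" is treated as exactly 0.99, which is an assumption.

```python
# Figures as reported in the article's results.
robot = {
    "throughput_per_hour": 1.2,
    "success_clean": 0.68,
    "success_cluttered": 0.31,
}
human = {
    "throughput_per_hour": 24.0,
    "success_clean": 0.99,      # reported as "99%+"; 0.99 is an assumption
    "success_cluttered": 0.95,
}

# Robot performance as a fraction of the human baseline, per metric.
relative = {name: robot[name] / human[name] for name in robot}

# How many times slower the robot is on throughput.
slowdown = human["throughput_per_hour"] / robot["throughput_per_hour"]

for name, ratio in relative.items():
    print(f"{name}: {ratio:.0%} of human")
print(f"slowdown: {slowdown:.0f}x")
```

Rounded to whole percentages, the ratios come out to 5%, 69%, and 33%, matching the reported relative figures.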

Perhaps more telling is the intervention rate: systems fail or get stuck so frequently that a human supervisor must step in every 4 minutes on average. This makes fully autonomous operation economically impractical for most applications.
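
To put the supervision burden in concrete terms, the back-of-the-envelope sketch below converts the reported figures into interventions per hour and per completed task. It assumes the 4.1-minute interval and the 1.2 tasks/hour throughput describe the same system under the same conditions, which the source does not confirm.

```python
# Reported figures (assumed here to describe the same system and runs).
mtbi_min = 4.1             # mean minutes between human interventions
throughput_per_hour = 1.2  # completed tasks per hour

# Interventions a supervisor would handle per hour of operation.
interventions_per_hour = 60 / mtbi_min

# Interventions incurred, on average, for each completed task.
interventions_per_task = interventions_per_hour / throughput_per_hour

print(f"{interventions_per_hour:.1f} interventions/hour")
print(f"{interventions_per_task:.1f} interventions per completed task")
```

Under that assumption, a supervisor would step in about 15 times an hour, or roughly a dozen times for every task the robot finishes.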

How It Works — The DROID Hardware Platform

The PhAIL benchmark runs on the DROID (Distributed Robot Interaction Dataset) hardware platform, which provides standardized:

Robot arms (6-DOF manipulators with consistent specifications)
Grippers (2-finger parallel jaw with force sensing)
Vision systems (RGB-D cameras with fixed mounting)
Workspace layouts (identical tables, lighting, and calibration patterns)

This hardware consistency allows for apples-to-apples comparison between different AI approaches. Teams can submit their control policies, which are then evaluated on identical physical setups at designated testing facilities.

The benchmark currently supports two evaluation modes:

Full hardware evaluation (preferred) — Policies run on physical robots at partner labs
Sim2real validation — Policies trained in simulation must transfer to hardware with minimal fine-tuning

Why This Matters — A Reality Check for Robot AI

PhAIL provides the first standardized, hardware-based benchmark that quantifies exactly how far robot AI lags behind human capabilities. While large language models have achieved near-human performance on many cognitive tasks, physical manipulation remains a fundamentally different challenge.

The implications are significant for:

Investors and companies expecting near-term automation of complex physical tasks
Researchers who need realistic baselines to measure progress
Policy makers considering timelines for workforce automation impacts

As the benchmark authors note: "Humans perform these manipulation tasks with trivial effort after minimal practice. The fact that our best AI systems achieve only 5% of human throughput—despite millions of dollars in compute and research—should temper expectations about rapid advancement in physical AI."

The Competitive Landscape — Who's Leading?

The initial PhAIL results don't name specific models or companies, but the benchmark architecture supports submissions from any research group or corporation. The current leaderboard shows:

Best overall score: A hybrid approach combining learned policies with classical motion planning
Most autonomous: A model-based reinforcement learning system achieving 6.2 minutes between interventions
Fastest throughput: An imitation learning system trained on human demonstrations (1.8 tasks/hour)

Notably, pure end-to-end learning approaches perform poorly on the benchmark, struggling with robustness and sample efficiency. The most successful systems incorporate elements of traditional robotics (kinematics, dynamics models) with learning-based components for perception and adaptation.

gentic.news Analysis

This PhAIL benchmark arrives at a critical moment in robotics investment and expectation. After OpenAI disbanded its robotics team in 2021, despite earlier breakthroughs with Dactyl, many questioned whether large-scale machine learning approaches would translate to physical systems. PhAIL provides empirical evidence supporting that skepticism: even with today's most advanced models, the gap between simulated and real-world performance remains enormous.

The benchmark's timing is particularly relevant given the recent surge in humanoid robotics funding. Companies like Figure AI (which raised $2.6B in 2025), Tesla (with Optimus), and Boston Dynamics (with Atlas) have generated tremendous excitement about general-purpose physical AI. PhAIL's results suggest these systems will face fundamental limitations in manipulation dexterity that won't be solved by simply scaling up model size or training data.

This aligns with our December 2025 coverage of Google's RT-3 model, which showed impressive results in simulation but acknowledged "significant sim2real gaps" for hardware deployment. PhAIL now quantifies those gaps with hard numbers: 5% of human throughput isn't a minor optimization problem—it represents a fundamental capability deficit.

Looking forward, PhAIL could serve as the ImageNet moment for robot manipulation: a standardized benchmark that drives focused research on the hardest problems in physical AI. Just as ImageNet accelerated progress in computer vision by providing clear metrics and comparisons, PhAIL's hardware-based evaluation could redirect research toward robustness, sample efficiency, and real-world transfer rather than simulated performance.

Frequently Asked Questions

What does "5% of human throughput" actually mean?

It means that on identical manipulation tasks, the best AI-controlled robot completes tasks at one-twentieth the speed of a skilled human operator. If a human can perform 24 pick-and-place operations per hour, the AI system manages only 1.2. The figure counts all time spent on planning, execution, and recovery from minor errors, but excludes major failures that require human intervention.

Why test on real hardware instead of simulation?

Simulation-to-reality transfer remains one of the hardest problems in robotics. Simulations inevitably simplify physics, perception, and actuation dynamics. Systems that excel in simulation often fail completely on real hardware due to unmodeled effects like sensor noise, calibration errors, cable management, or subtle material properties. PhAIL ensures researchers are solving the actual problem, not a simplified version of it.

Which companies or research labs are participating in PhAIL?

The benchmark organizers haven't released participant names, but the infrastructure supports submissions from any organization. Given the DROID platform's standardization, we expect participation from major robotics research groups at institutions like CMU, MIT, and Berkeley, and from corporate labs at Google, Meta, NVIDIA, and Tesla.

How often will the PhAIL benchmark be updated?

The organizers plan quarterly evaluations with updated task suites to prevent overfitting to specific challenges. They're also developing more complex task categories including mobile manipulation, human-robot collaboration, and unstructured environment navigation for future benchmark versions.

PhAIL represents a necessary reality check for physical AI. As one researcher involved told us: "We've had amazing progress in virtual domains—LLMs that write, diffusion models that create art, game AI that dominates. But the physical world doesn't give partial credit for almost-right. A gripper is either holding the object or it's not. That binary reality is where our current approaches break down."

The benchmark is open for submissions at phail.ai.

Source: gentic.news · Apr 2, 2026 · Ala SMITH

AI-assisted reporting. Generated by gentic.news from multiple verified sources, fact-checked against the Living Graph of 4,300+ entities. Edited by Ala SMITH.
