What is the BIRD benchmark?

BIRD is a text-to-SQL benchmark that measures how accurately models translate natural language questions into executable SQL queries against real databases.

Will Google release Gemini-SQL2 publicly?

Google has not announced a public release or published a research paper, though the company says the technology could improve natural language features across its data services.


Bar chart comparing BIRD benchmark scores shows Gemini-SQL2 at 80.04% and GPT-5.5-xhigh around 73%, with Gemini 3.1…


Google Gemini-SQL2 Hits 80.04% on BIRD, Beating GPT-5.5 by 7 Points

Google's Gemini-SQL2 hits 80.04% on BIRD, beating GPT-5.5 by 7 points and Claude Opus 4.6 by 9 points, with no public release or paper yet.

Ala SMITH & AI Research Desk · Jun 13, 2026 · 3 min read · AI-Generated

Source: the-decoder.com (via the_decoder) · Widely Reported

What is Gemini-SQL2 and how does it perform on text-to-SQL benchmarks?

Google Research's Gemini-SQL2, built on Gemini 3.1 Pro, achieved 80.04% execution accuracy on the BIRD benchmark, surpassing OpenAI's GPT-5.5-xhigh (72.8%) and Anthropic's Claude Opus 4.6 (70.9%).

TL;DR

Gemini-SQL2 scores 80.04% on BIRD benchmark. · Built on Gemini 3.1 Pro for text-to-SQL. · Beats GPT-5.5 (72.8%) and Claude Opus 4.6 (70.9%).

Google Research's Gemini-SQL2 scored 80.04% on the BIRD benchmark, beating OpenAI's GPT-5.5-xhigh by over 7 percentage points. The system, built on Gemini 3.1 Pro, translates natural language into executable SQL queries.

Key facts

Gemini-SQL2 scored 80.04% on the BIRD benchmark.
OpenAI's GPT-5.5-xhigh scored 72.8%.
Anthropic's Claude Opus 4.6 scored 70.9%.
Built on Google's Gemini 3.1 Pro model.
No public release or paper announced yet.

Google Research's Gemini-SQL2, a text-to-SQL system built on Gemini 3.1 Pro, hit an execution accuracy of 80.04% on the BIRD benchmark, according to a blog post from Google. That puts it roughly 7 points ahead of OpenAI's GPT-5.5-xhigh (72.8%) and 9 points ahead of Anthropic's Claude Opus 4.6 (70.9%). Models from Databricks, AWS, Tencent, and Alibaba all trail further behind.

The BIRD benchmark tests how accurately models convert natural language questions into SQL queries that execute correctly against real databases. Google Research notes that the task is especially hard because data is often layered and queries must account for complex business logic. According to the company, Gemini-SQL2's generated queries both look correct and execute successfully.
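To make the idea of execution accuracy concrete, here is a minimal Python sketch that scores predicted queries by running them and comparing result sets against reference queries. The toy `orders` table, the query pairs, and the helper names `run` and `execution_accuracy` are all illustrative assumptions. This is a simplification and not BIRD's official scoring script, which may handle ordering, duplicates, and timeouts differently.

```python
import sqlite3


def run(conn, sql):
    """Return the query result as a set of rows, or None if the SQL fails."""
    try:
        # A set ignores row order and duplicates (a simplification).
        return set(conn.execute(sql).fetchall())
    except sqlite3.Error:
        return None


def execution_accuracy(conn, pairs):
    """Fraction of (predicted, gold) pairs whose executed results match."""
    hits = 0
    for predicted, gold in pairs:
        actual = run(conn, predicted)
        expected = run(conn, gold)
        if actual is not None and actual == expected:
            hits += 1
    return hits / len(pairs)


# Hypothetical toy database, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL);
    INSERT INTO orders VALUES (1, 'EU', 100.0), (2, 'EU', 50.0), (3, 'US', 75.0);
""")

pairs = [
    # Different SQL text, same result: counts as correct.
    ("SELECT SUM(amount) FROM orders WHERE region = 'EU'",
     "SELECT TOTAL(amount) FROM orders WHERE region IN ('EU')"),
    # Runs fine and looks plausible, but drops the filter: wrong answer.
    ("SELECT SUM(amount) FROM orders",
     "SELECT SUM(amount) FROM orders WHERE region = 'EU'"),
    # Misspelled column: fails to execute.
    ("SELECT SUM(amout) FROM orders",
     "SELECT SUM(amount) FROM orders"),
    # Equivalent counting query: counts as correct.
    ("SELECT COUNT(*) FROM orders WHERE region = 'US'",
     "SELECT COUNT(id) FROM orders WHERE region = 'US'"),
]

accuracy = execution_accuracy(conn, pairs)
print(f"execution accuracy: {accuracy:.0%}")
```

The second pair shows why the metric compares results and not query text: the predicted query executes without error yet returns a different total, so it is scored as a miss.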

Google has not announced a public release of Gemini-SQL2 and has not published a research paper yet. The company says better SQL understanding could improve natural language features across its data services more broadly — a hint that the technology may eventually land in BigQuery or other Google Cloud data tools.

The gap matters for the agentic coding race. OpenAI's Codex recently reached 5 million weekly users, up 400% from the start of the year, and Anthropic's Claude Code has been adding multi-agent workflows. SQL generation is a core capability for enterprise data agents — the ability to query production databases from natural language is a feature both OpenAI and Anthropic have been racing to polish. Google's lead on BIRD suggests it may have an edge if it productizes the model.

How the benchmark delta stacks up

The 7-point gap on BIRD is larger than typical benchmark deltas between flagship models. For context, Claude Opus 4.6 scored 80.9% on SWE-bench Verified, while GPT-5.5-xhigh scored 78.2%, a gap of less than 3 points. The size of the SQL gap suggests that Google's approach yields disproportionate gains on structured query tasks. One plausible explanation is specialized fine-tuning on SQL-heavy data, but Google has not confirmed this.
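The gaps quoted above can be reproduced with a few lines of arithmetic. The sketch below uses only the scores reported in this article; the dictionary and function names are illustrative.

```python
# Scores as reported in this article (percent).
bird = {"Gemini-SQL2": 80.04, "GPT-5.5-xhigh": 72.8, "Claude Opus 4.6": 70.9}
swe_bench_verified = {"Claude Opus 4.6": 80.9, "GPT-5.5-xhigh": 78.2}


def lead_over_runner_up(scores):
    """Gap in percentage points between the top two scores."""
    ranked = sorted(scores.values(), reverse=True)
    return round(ranked[0] - ranked[1], 2)


def gaps_to_leader(scores):
    """Gap in percentage points from the top score to every other entry."""
    top = max(scores.values())
    return {name: round(top - s, 2) for name, s in scores.items() if s != top}


bird_lead = lead_over_runner_up(bird)               # 7.24
swe_lead = lead_over_runner_up(swe_bench_verified)  # 2.7
bird_gaps = gaps_to_leader(bird)

print(bird_lead, swe_lead, bird_gaps)
```

On these numbers the BIRD lead is between two and three times the SWE-bench Verified lead.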

Google did not disclose whether Gemini-SQL2 uses chain-of-thought prompting, retrieval-augmented generation over database schemas, or a custom fine-tuning recipe. The lack of a paper means the technical details remain opaque — a contrast to Google's typical pattern of publishing research alongside benchmark claims.

What this means for the text-to-SQL market

The result puts pressure on OpenAI and Anthropic to improve SQL-specific performance. Both companies have focused on general coding benchmarks (SWE-bench, Terminal-Bench), but SQL generation is a distinct skill: it requires understanding database schemas, join logic, and aggregation functions. A model that excels at Python coding may still struggle with complex SQL queries involving multiple tables and window functions.
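As an illustration of the kind of query meant here, the following sketch answers one short question with a join, an aggregation, and a window function. The schema, data, and question are hypothetical and are not taken from BIRD. It assumes an SQLite build with window function support (version 3.25 or later).

```python
import sqlite3

# Hypothetical two-table schema, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount INTEGER);
    INSERT INTO customers VALUES (1, 'EU'), (2, 'EU'), (3, 'US');
    INSERT INTO orders VALUES (1, 1, 100), (2, 1, 40), (3, 2, 90), (4, 3, 70);
""")

# Question: "Who is the top-spending customer in each region?"
query = """
    WITH totals AS (
        -- Join logic plus aggregation: total spend per customer.
        SELECT c.region, c.id AS customer_id, SUM(o.amount) AS total
        FROM customers c
        JOIN orders o ON o.customer_id = c.id
        GROUP BY c.region, c.id
    ),
    ranked AS (
        -- Window function: rank customers within each region.
        SELECT region, customer_id, total,
               RANK() OVER (PARTITION BY region ORDER BY total DESC) AS rnk
        FROM totals
    )
    SELECT region, customer_id, total
    FROM ranked
    WHERE rnk = 1
    ORDER BY region
"""

top_spenders = conn.execute(query).fetchall()
print(top_spenders)
```

A one-sentence question expands into three logical steps, and an error in any of them (a wrong join key, a missing `PARTITION BY`) would still produce a query that runs and returns plausible-looking rows.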

Enterprise adoption of natural-language data querying has been held back by accuracy concerns: users report that even small errors in generated SQL can produce plausible-looking wrong answers. A system that hits 80% on BIRD still fails on roughly 1 in 5 queries, but that is a material improvement over the 70-73% range of general-purpose models.

What to watch

Watch for Google to either publish a paper detailing Gemini-SQL2's architecture or ship the capability into BigQuery's natural-language interface. If no paper appears within 60 days, the benchmark claim remains unverifiable — a pattern that would mark a shift in Google Research's publication norms.


Source: gentic.news · Jun 13, 2026 · author=Ala SMITH · citation.json

AI-assisted reporting. Generated by gentic.news from multiple verified sources, fact-checked against the Living Graph of 4,300+ entities. Edited by Ala SMITH.


AI Analysis

The 7-point gap on BIRD is unusually large for a benchmark in the current era where flagship models typically cluster within 2-3 points of each other. This suggests Google has either found a specialized training recipe for SQL or is benefiting from Gemini 3.1 Pro's architecture in ways that general-purpose coding benchmarks don't capture.

The lack of a paper is notable. Google DeepMind has traditionally published detailed technical reports alongside benchmark claims; the Gemini technical report was a 100+ page document. Skipping the paper for Gemini-SQL2 could mean the approach is considered a product feature rather than research, or that the company wants to avoid scrutiny of its methodology. Either way, the claim is less falsifiable without code or a paper.

The competitive pressure here is asymmetric. OpenAI and Anthropic have been racing on SWE-bench and agentic coding benchmarks, but SQL generation is a distinct capability that matters for enterprise data workflows. If Google productizes this into BigQuery before OpenAI or Anthropic can match the accuracy, it could lock in enterprise customers who want natural-language data querying. That is a wedge that neither OpenAI nor Anthropic has a direct answer for, since they lack database-native products.

#nl-to-sql #foundation-models #benchmarks #google
