gentic.news — AI News Intelligence Platform
Bar chart comparing BIRD benchmark scores shows Gemini-SQL2 at 80.04% and GPT-5.5-xhigh around 73%, with Gemini 3.1…
AI Research · Score: 77

Google Gemini-SQL2 Hits 80.04% on BIRD, Beating GPT-5.5 by 7 Points

Google's Gemini-SQL2 hits 80.04% on BIRD, beating GPT-5.5 by 7 points and Claude Opus 4.6 by 9 points, with no public release or paper yet.

10h ago · 3 min read · AI-Generated
Source: the-decoder.com (via the_decoder) · Corroborated
What is Gemini-SQL2 and how does it perform on text-to-SQL benchmarks?

Google Research's Gemini-SQL2, built on Gemini 3.1 Pro, achieved 80.04% execution accuracy on the BIRD benchmark, surpassing OpenAI's GPT-5.5-xhigh (72.8%) and Anthropic's Claude Opus 4.6 (70.9%).

TL;DR

  • Gemini-SQL2 scores 80.04% on BIRD benchmark.
  • Built on Gemini 3.1 Pro for text-to-SQL.
  • Beats GPT-5.5 (72.8%) and Claude Opus 4.6 (70.9%).

Google Research's Gemini-SQL2 scored 80.04% on the BIRD benchmark, beating OpenAI's GPT-5.5-xhigh by over 7 percentage points. The system, built on Gemini 3.1 Pro, translates natural language into executable SQL queries.

Key facts

  • Gemini-SQL2 scored 80.04% on the BIRD benchmark.
  • OpenAI's GPT-5.5-xhigh scored 72.8%.
  • Anthropic's Claude Opus 4.6 scored 70.9%.
  • Built on Google's Gemini 3.1 Pro model.
  • No public release or paper announced yet.

Google Research's Gemini-SQL2, a text-to-SQL system built on Gemini 3.1 Pro, hit an execution accuracy of 80.04% on the BIRD benchmark, according to a blog post from Google. That puts it roughly 7 points ahead of OpenAI's GPT-5.5-xhigh (72.8%) and 9 points ahead of Anthropic's Claude Opus 4.6 (70.9%). Models from Databricks, AWS, Tencent, and Alibaba all trail further behind.

The BIRD benchmark tests how accurately models convert natural language questions into SQL queries that execute correctly against real databases. Google Research notes that the task is especially hard because data is often layered and queries must account for complex business logic. The generated SQL queries both look correct and execute successfully, the company says.
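
The idea of execution accuracy can be sketched in a few lines of Python. This is a minimal illustration using the standard-library sqlite3 module, not BIRD's official evaluation script; the toy table, its rows, and the helper names `run_query` and `execution_accuracy` are hypothetical. In this sketch, a prediction counts as correct only when it runs without error and returns the same set of rows as the reference query.

```python
import sqlite3


def run_query(conn, sql):
    """Run a query and return its rows as a set, or None if it fails."""
    try:
        return set(conn.execute(sql).fetchall())
    except sqlite3.Error:
        return None


def execution_accuracy(conn, pairs):
    """Fraction of (predicted_sql, gold_sql) pairs whose result sets match."""
    correct = 0
    for predicted_sql, gold_sql in pairs:
        predicted_rows = run_query(conn, predicted_sql)
        gold_rows = run_query(conn, gold_sql)
        if predicted_rows is not None and predicted_rows == gold_rows:
            correct += 1
    return correct / len(pairs)


# Hypothetical toy database, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL);
    INSERT INTO orders VALUES (1, 'ada', 120.0), (2, 'bob', 35.5), (3, 'ada', 60.0);
    """
)

pairs = [
    # Different SQL text, same result: counts as correct.
    ("SELECT customer FROM orders WHERE amount > 100",
     "SELECT customer FROM orders WHERE amount >= 100.01"),
    # Runs fine but returns the wrong rows: counts as wrong.
    ("SELECT SUM(amount) FROM orders WHERE customer = 'bob'",
     "SELECT SUM(amount) FROM orders WHERE customer = 'ada'"),
    # Invalid SQL (unknown column): counts as wrong.
    ("SELECT total FROM orders", "SELECT amount FROM orders"),
]

accuracy = execution_accuracy(conn, pairs)
print(f"execution accuracy: {accuracy:.2%}")
```

The second pair is the interesting one: the predicted query is valid SQL and returns a plausible number, yet it is scored as a miss because the result differs from the reference.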

Google has not announced a public release of Gemini-SQL2 and has not published a research paper yet. The company says better SQL understanding could improve natural language features across its data services more broadly — a hint that the technology may eventually land in BigQuery or other Google Cloud data tools.

The gap matters for the agentic coding race. OpenAI's Codex recently reached 5 million weekly users, up 400% from the start of the year, and Anthropic's Claude Code has been adding multi-agent workflows. SQL generation is a core capability for enterprise data agents — the ability to query production databases from natural language is a feature both OpenAI and Anthropic have been racing to polish. Google's lead on BIRD suggests it may have an edge if it productizes the model.

How the benchmark delta stacks up

The roughly 7-point gap on BIRD is larger than typical benchmark deltas between flagship models. For context, Claude Opus 4.6 scored 80.9% on SWE-bench Verified, while GPT-5.5-xhigh scored 78.2%, a gap of less than 3 points. The SQL gap suggests that Google's approach, likely involving specialized fine-tuning on SQL-heavy data rather than a general-purpose model used as-is, yields disproportionate gains on structured query tasks.

Google did not disclose whether Gemini-SQL2 uses chain-of-thought prompting, retrieval-augmented generation over database schemas, or a custom fine-tuning recipe. The lack of a paper means the technical details remain opaque — a contrast to Google's typical pattern of publishing research alongside benchmark claims.

What this means for the text-to-SQL market

The result puts pressure on OpenAI and Anthropic to improve SQL-specific performance. Both companies have focused on general coding benchmarks (SWE-bench, Terminal-Bench) but SQL generation is a distinct skill — it requires understanding database schemas, join logic, and aggregation functions. A model that excels on Python coding may still struggle with complex SQL queries involving multiple tables and window functions.
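
To make the distinction concrete, here is the kind of query that paragraph describes: a join across two tables, an aggregation, and a window function that ranks the aggregated rows. It is a hand-written illustration run through Python's sqlite3 module (window functions need SQLite 3.25 or newer); the schema, the data, and the question ("which customer spent the most in each region?") are invented for this sketch and are not taken from BIRD.

```python
import sqlite3

# Hypothetical schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, region TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES
        (1, 'Ada', 'EU'), (2, 'Bob', 'EU'), (3, 'Cy', 'US'), (4, 'Di', 'US');
    INSERT INTO orders VALUES
        (1, 1, 100.0), (2, 1, 50.0), (3, 2, 120.0),
        (4, 3, 30.0), (5, 4, 80.0), (6, 4, 10.0);
    """
)

QUERY = """
WITH totals AS (
    -- Join and aggregate: total spend per customer.
    SELECT c.region, c.name, SUM(o.amount) AS total_spend
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.region, c.name
),
ranked AS (
    -- Window function: rank customers within each region.
    SELECT region, name, total_spend,
           RANK() OVER (PARTITION BY region ORDER BY total_spend DESC) AS spend_rank
    FROM totals
)
SELECT region, name, total_spend
FROM ranked
WHERE spend_rank = 1
ORDER BY region
"""

top_spenders = conn.execute(QUERY).fetchall()
for region, name, total_spend in top_spenders:
    print(region, name, total_spend)
```

Getting this right requires knowing which column links the two tables, aggregating before ranking, and partitioning the window by region. None of those steps has a close analogue in a typical Python coding task.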

Enterprise adoption of natural-language data querying has been held back by accuracy concerns — users report that even small errors in generated SQL can produce plausible-looking wrong answers. A system that hits 80% on BIRD still fails on 1 in 5 queries, but that's a material improvement over the 70-73% range of general-purpose models.
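
The arithmetic behind that comparison is easy to reproduce. The sketch below uses only the scores reported in this article; the function and variable names are illustrative. It converts each accuracy into a failure rate and estimates what share of a trailing system's failures the leader avoids in aggregate.

```python
# Execution-accuracy scores (percent) as reported in this article.
scores = {
    "Gemini-SQL2": 80.04,
    "GPT-5.5-xhigh": 72.8,
    "Claude Opus 4.6": 70.9,
}


def failure_rate(accuracy_pct):
    """Share of queries answered incorrectly, in percent."""
    return 100.0 - accuracy_pct


def relative_error_reduction(better_pct, worse_pct):
    """Fraction of the weaker system's failures that the stronger one avoids."""
    worse_failures = failure_rate(worse_pct)
    return (worse_failures - failure_rate(better_pct)) / worse_failures


leader = scores["Gemini-SQL2"]
for name, accuracy in scores.items():
    line = f"{name}: fails {failure_rate(accuracy):.2f}% of queries"
    if name != "Gemini-SQL2":
        gap = leader - accuracy
        cut = relative_error_reduction(leader, accuracy)
        line += f" (trails by {gap:.2f} points; leader cuts its errors by {cut:.1%})"
    print(line)
```

On these figures, Gemini-SQL2's failure rate is just under 20%, and it avoids roughly a quarter to a third of the errors made by the other two systems.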

What to watch

Watch for Google to either publish a paper detailing Gemini-SQL2's architecture or ship the capability into BigQuery's natural-language interface. Until a paper or code appears, the method behind the benchmark claim cannot be independently examined; if nothing is published within 60 days, that would mark a shift in Google Research's publication norms.


Source: the-decoder.com



AI-assisted reporting. Generated by gentic.news from multiple verified sources, fact-checked against the Living Graph of 4,300+ entities. Edited by Ala SMITH.

AI Analysis

The 7-point gap on BIRD is unusually large for a benchmark in the current era where flagship models typically cluster within 2-3 points of each other. This suggests Google has either found a specialized training recipe for SQL or is benefiting from Gemini 3.1 Pro's architecture in ways that general-purpose coding benchmarks don't capture.

The lack of a paper is notable. Google DeepMind has traditionally published detailed technical reports alongside benchmark claims; the Gemini technical report was a 100+ page document. Skipping the paper for Gemini-SQL2 could mean the approach is considered a product feature rather than research, or that the company wants to avoid scrutiny of its methodology. Either way, the claim is less falsifiable without code or a paper.

The competitive pressure here is asymmetric. OpenAI and Anthropic have been racing on SWE-bench and agentic coding benchmarks, but SQL generation is a distinct capability that matters for enterprise data workflows. If Google productizes this into BigQuery before OpenAI or Anthropic can match the accuracy, it could lock in enterprise customers who want natural-language data querying, a wedge that neither OpenAI nor Anthropic has a direct answer for, since they lack database-native products.