Google Research's Gemini-SQL2 scored 80.04% on the BIRD benchmark, beating OpenAI's GPT-5.5-xhigh by over 7 percentage points. The system, built on Gemini 3.1 Pro, translates natural language into executable SQL queries.
Key facts
- Gemini-SQL2 scored 80.04% on the BIRD benchmark.
- OpenAI's GPT-5.5-xhigh scored 72.8%.
- Anthropic's Claude Opus 4.6 scored 70.9%.
- Built on Google's Gemini 3.1 Pro model.
- No public release or paper announced yet.
Google Research's Gemini-SQL2, a text-to-SQL system built on Gemini 3.1 Pro, hit an execution accuracy of 80.04% on the BIRD benchmark, according to a blog post from Google. That puts it roughly 7 points ahead of OpenAI's GPT-5.5-xhigh (72.8%) and 9 points ahead of Anthropic's Claude Opus 4.6 (70.9%). Models from Databricks, AWS, Tencent, and Alibaba all trail further behind.
The BIRD benchmark tests how accurately models convert natural language questions into SQL queries that execute correctly against real databases. Google Research notes that the task is especially hard because data is often layered and queries must account for complex business logic. According to the company, the queries Gemini-SQL2 generates not only look correct but also execute successfully.
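The idea behind an execution-accuracy score can be sketched in a few lines of Python. This is a simplified illustration, not BIRD's official evaluation script: the `orders` table, the example queries, and the helper names `run_query` and `execution_accuracy` are all hypothetical, and rows are compared as unordered multisets.

```python
import sqlite3
from collections import Counter


def run_query(conn, sql):
    """Return the query's rows as a multiset, or None if the SQL fails to execute."""
    try:
        return Counter(conn.execute(sql).fetchall())
    except sqlite3.Error:
        return None


def execution_accuracy(conn, pairs):
    """Fraction of (predicted_sql, gold_sql) pairs whose result rows match."""
    correct = 0
    for predicted_sql, gold_sql in pairs:
        gold_rows = run_query(conn, gold_sql)
        predicted_rows = run_query(conn, predicted_sql)
        if predicted_rows is not None and predicted_rows == gold_rows:
            correct += 1
    return correct / len(pairs)


# Hypothetical toy database standing in for a benchmark database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL);
    INSERT INTO orders VALUES (1, 'acme', 120.0), (2, 'acme', 80.0), (3, 'globex', 50.0);
""")

pairs = [
    # Different SQL text, same result rows: counts as correct.
    ("SELECT customer FROM orders GROUP BY customer HAVING SUM(amount) > 100",
     "SELECT customer FROM orders GROUP BY customer HAVING SUM(amount) > 100.0"),
    # Executes fine but returns different rows: counts as incorrect.
    ("SELECT customer FROM orders WHERE amount > 100",
     "SELECT customer FROM orders GROUP BY customer HAVING SUM(amount) >= 50"),
    # Fails to execute (unknown column): counts as incorrect.
    ("SELECT client FROM orders", "SELECT customer FROM orders"),
]

accuracy = execution_accuracy(conn, pairs)
print(f"execution accuracy: {accuracy:.2%}")
```

The second pair is the interesting case: the predicted query runs without error, so only a comparison of result rows reveals that it answers a different question.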
Google has not announced a public release of Gemini-SQL2 and has not published a research paper yet. The company says better SQL understanding could improve natural language features across its data services more broadly — a hint that the technology may eventually land in BigQuery or other Google Cloud data tools.
The gap matters for the agentic coding race. OpenAI's Codex recently reached 5 million weekly users, up 400% from the start of the year, and Anthropic's Claude Code has been adding multi-agent workflows. SQL generation is a core capability for enterprise data agents — the ability to query production databases from natural language is a feature both OpenAI and Anthropic have been racing to polish. Google's lead on BIRD suggests it may have an edge if it productizes the model.
How the benchmark delta stacks up
The gap of more than 7 points on BIRD is larger than the deltas flagship models typically post against each other. For context, Claude Opus 4.6 scored 80.9% on SWE-bench Verified, while GPT-5.5-xhigh scored 78.2%, a gap of less than 3 points. The SQL gap suggests that Google's approach, which likely involves specialized fine-tuning on SQL-heavy data rather than reliance on a general-purpose model alone, yields disproportionate gains on structured query tasks.
Google did not disclose whether Gemini-SQL2 uses chain-of-thought prompting, retrieval-augmented generation over database schemas, or a custom fine-tuning recipe. The lack of a paper means the technical details remain opaque — a contrast to Google's typical pattern of publishing research alongside benchmark claims.
What this means for the text-to-SQL market
The result puts pressure on OpenAI and Anthropic to improve SQL-specific performance. Both companies have focused on general coding benchmarks (SWE-bench, Terminal-Bench), but SQL generation is a distinct skill: it requires understanding database schemas, join logic, and aggregation functions. A model that excels at Python coding may still struggle with complex SQL queries involving multiple tables and window functions.
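One way join logic and aggregation interact badly is the join "fan-out", sketched below with Python's built-in `sqlite3` module. The `orders` and `shipments` tables and both queries are hypothetical, made up for illustration; they are not taken from BIRD or from any model's output. The question being answered is "what is the total value of orders that have shipped?"

```python
import sqlite3

# Hypothetical schema: one order can have several shipments.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL);
    CREATE TABLE shipments (id INTEGER PRIMARY KEY, order_id INTEGER);
    INSERT INTO orders VALUES (1, 100.0), (2, 50.0);
    -- Order 1 shipped in two parcels, order 2 in one.
    INSERT INTO shipments VALUES (10, 1), (11, 1), (12, 2);
""")

# Plausible but wrong: the join repeats order 1 once per shipment,
# so its amount is summed twice.
naive_sql = """
    SELECT SUM(o.amount)
    FROM orders o JOIN shipments s ON s.order_id = o.id
"""

# Counts each shipped order once by filtering instead of joining.
careful_sql = """
    SELECT SUM(amount)
    FROM orders
    WHERE id IN (SELECT order_id FROM shipments)
"""

naive_total = conn.execute(naive_sql).fetchone()[0]
careful_total = conn.execute(careful_sql).fetchone()[0]
print(f"naive join total: {naive_total}, de-duplicated total: {careful_total}")
```

Both queries are valid SQL and both return a single number, which is why this kind of mistake is easy to miss without checking results against the underlying data.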
Enterprise adoption of natural-language data querying has been held back by accuracy concerns — users report that even small errors in generated SQL can produce plausible-looking wrong answers. A system that hits 80% on BIRD still fails on 1 in 5 queries, but that's a material improvement over the 70-73% range of general-purpose models.
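The size of that improvement is easier to see as a reduction in failures than as a gain in accuracy. The sketch below uses the three reported scores and assumes, for simplicity, that every query not counted as correct is a failure; the helper names are illustrative.

```python
# Execution accuracy on BIRD as reported in the article, in percent.
scores = {
    "Gemini-SQL2": 80.04,
    "GPT-5.5-xhigh": 72.8,
    "Claude Opus 4.6": 70.9,
}


def error_rate(accuracy_pct):
    """Share of queries not counted as correct, in percent (assumes failure = 100 - accuracy)."""
    return 100.0 - accuracy_pct


def relative_error_reduction(better_pct, baseline_pct):
    """Fraction of the baseline's failures that the better system avoids."""
    baseline_err = error_rate(baseline_pct)
    return (baseline_err - error_rate(better_pct)) / baseline_err


best = scores["Gemini-SQL2"]
for name, score in scores.items():
    if name == "Gemini-SQL2":
        continue
    gap = best - score
    reduction = relative_error_reduction(best, score)
    print(f"vs {name}: +{gap:.2f} points, {reduction:.1%} fewer failed queries")
```

Under that assumption, the lead over GPT-5.5-xhigh works out to roughly a quarter fewer failed queries, and the lead over Claude Opus 4.6 to a little under a third.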
What to watch
Watch for Google to either publish a paper detailing Gemini-SQL2's architecture or ship the capability into BigQuery's natural-language interface. If no paper appears within 60 days, the benchmark claim remains unverifiable — a pattern that would mark a shift in Google Research's publication norms.

Source: the-decoder.com