gentic.news — AI News Intelligence Platform

Products & Launches · Breakthrough · Score: 100

Gemini 3.5 Flash Scores 78.4 on OSWorld, Nearly Matching GPT-5.5

Google integrated Computer Use into Gemini 3.5 Flash, which scores 78.4 on OSWorld, within 0.3 points of GPT-5.5 (78.7), while undercutting it on cost.

10h ago · 2 min read · AI-Generated
Source: the-decoder.com
What is Gemini 3.5 Flash's OSWorld benchmark score with Computer Use?

Google integrated Computer Use into Gemini 3.5 Flash, which scores 78.4 on OSWorld. That puts it 0.3 points behind GPT-5.5 (78.7) and 5.0 points behind Anthropic's Opus 4.8 (83.4). The model natively sees and operates screens, browsers, and mobile devices via the Gemini API.

TL;DR

Google bakes Computer Use into Gemini 3.5 Flash. · Scores 78.4 on OSWorld, 0.3 points behind GPT-5.5. · Includes adversarial training against prompt injection attacks.

Key facts

  • Gemini 3.5 Flash scores 78.4 on OSWorld.
  • GPT-5.5 scores 78.7; Anthropic's Opus 4.8 leads at 83.4.
  • Computer Use is available via the Gemini API and the Gemini Enterprise Agent Platform.
  • It includes adversarial training for prompt injection defense.
  • The capability was previously available only as a separate Gemini 2.5 model.

Google has integrated "Computer Use" directly into Gemini 3.5 Flash, allowing the model to see, understand, and interact with computers, browsers, and mobile devices autonomously. Previously, this capability was only available as a separate Gemini 2.5 model. Combined with existing tools like function calls, Search, and Maps, developers can now build agents for software testing or office automation across browser, mobile, and desktop environments, according to The Decoder.

On the OSWorld benchmark, Gemini 3.5 Flash scores 78.4, beating Gemini 3 Flash (65.1) and GPT-5.4 mini (72.1). GPT-5.5 sits just ahead at 78.7, while Anthropic's Opus 4.8 leads at 83.4. Sonnet 4.6 also hits 78.4, and Gemini 3.1 Pro lands at 76.2. The benchmark measures an agent's ability to complete real-world computer tasks like file manipulation and web navigation.
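
The comparisons in this paragraph come down to simple subtraction. A minimal Python sketch, using only the scores reported above (the `scores` dictionary and the `gap` helper are illustrative names, not part of any OSWorld tooling):

```python
# OSWorld scores exactly as reported in this article.
scores = {
    "Opus 4.8": 83.4,
    "GPT-5.5": 78.7,
    "Gemini 3.5 Flash": 78.4,
    "Sonnet 4.6": 78.4,
    "Gemini 3.1 Pro": 76.2,
    "GPT-5.4 mini": 72.1,
    "Gemini 3 Flash": 65.1,
}


def gap(ahead: str, behind: str) -> float:
    """Point difference between two models, rounded to one decimal."""
    return round(scores[ahead] - scores[behind], 1)


leader = max(scores, key=scores.get)

# Print each model with its distance from the leader, highest score first.
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<18} {score:5.1f}  ({gap(leader, name):.1f} behind {leader})")
```

Here `gap("GPT-5.5", "Gemini 3.5 Flash")` gives the 0.3-point difference behind the "matching" claim, and `gap("Opus 4.8", "Gemini 3.5 Flash")` gives the 5.0-point distance to the leader.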

Security and Deployment

To guard against prompt injection attacks, Google uses adversarial training and two optional enterprise safeguards. One requires user confirmation for sensitive or irreversible actions, while the other automatically stops tasks when it detects indirect prompt injections. Google also recommends sandboxing, human oversight, and strict access controls, with more details in its best practices documentation. The feature is available through the Gemini API and the Gemini Enterprise Agent Platform. A Browserbase demo and a GitHub reference implementation are also available.
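
The first safeguard, user confirmation before sensitive or irreversible actions, can be pictured as a gate in the agent loop. The sketch below is a generic illustration and not Google's implementation or the Gemini API: the `Action` type, the `SENSITIVE_KINDS` set, and `execute_with_confirmation` are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical action kinds treated as sensitive or irreversible.
# A real deployment would define its own policy.
SENSITIVE_KINDS = {"delete_file", "submit_payment", "send_email"}


@dataclass
class Action:
    kind: str    # e.g. "click", "type_text", "delete_file"
    target: str  # what the action operates on


def execute_with_confirmation(
    action: Action,
    confirm: Callable[[Action], bool],
    execute: Callable[[Action], None],
) -> str:
    """Run `execute` only if the action is non-sensitive or the user approves it."""
    if action.kind in SENSITIVE_KINDS and not confirm(action):
        return "blocked"
    execute(action)
    return "executed"


# Usage: a user who declines everything, and a list that records executed actions.
performed: list[Action] = []
deny_all = lambda action: False

first = execute_with_confirmation(Action("click", "Save button"), deny_all, performed.append)
second = execute_with_confirmation(Action("delete_file", "report.docx"), deny_all, performed.append)
```

In this sketch the click runs without asking, while the file deletion is blocked because the user declined. The design choice is that the gate sits outside the model, so an injected instruction in screen content cannot approve its own action.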

The move follows Google's broader push to embed agentic capabilities directly into its models rather than requiring separate orchestration layers — a pattern also seen in OpenAI's GPT-5.5 and Anthropic's Claude Opus. By folding Computer Use into the cheaper Flash tier, Google undercuts competitors on price while narrowing the gap on agentic benchmarks.

What to watch

Watch for enterprise adoption metrics in Google Cloud's next quarterly earnings, and whether Anthropic or OpenAI respond with lower-tier models matching Flash's price-performance on OSWorld. A direct head-to-head with GPT-5.5-mini would clarify the agentic cost curve.


AI-assisted reporting. Generated by gentic.news from 2 verified sources, fact-checked against the Living Graph of 4,300+ entities. Edited by Ala SMITH.

AI Analysis

The integration of Computer Use into Gemini 3.5 Flash marks a structural shift in how Google packages agentic capability. By baking screen control into the base model rather than offering it as a separate fine-tune, Google reduces latency and complexity for developers, a direct response to Anthropic's Claude Opus and OpenAI's GPT-5.5 agent features. The 78.4 OSWorld score is telling: it is within 0.3 points of GPT-5.5, but Flash is Google's low-cost tier, so the price-to-performance ratio likely favors Google for high-volume agentic workflows.

Notably, the gap between top models has narrowed. The full spread from Gemini 3 Flash (65.1) to Opus 4.8 (83.4) is 18.3 points, but four models (GPT-5.5, Gemini 3.5 Flash, Sonnet 4.6, and Gemini 3.1 Pro) sit within 2.5 points of each other. This suggests OSWorld may be nearing saturation for current architectures, or that the benchmark rewards similar design choices across labs. Google's adversarial training against prompt injection is a practical differentiator, as computer-use agents are uniquely vulnerable to indirect attacks via screen content.

Google's timing is aggressive: the company committed $11B/year to SpaceX compute and $14B to Anthropic in the same month, signaling a dual strategy of building in-house while betting on competitors. The Flash-tier Computer Use could cannibalize demand for Anthropic's Opus if enterprise customers prioritize cost over the 5-point OSWorld gap.