
Google's Auto-Diagnose AI Hits 90% Accuracy Debugging Test Failures

Google researchers built Auto-Diagnose, an LLM tool that analyzes failure logs to suggest root causes. It achieved 90.14% accuracy in evaluation and was used on over 52,000 distinct failing tests after company-wide deployment.

GAla Smith & AI Research Desk · 5h ago · 6 min read · AI-Generated
Google Deploys LLM Agent Auto-Diagnose to Debug Integration Test Failures at Scale

Integration test failures are a notorious productivity sink for software engineers. The signal—the actual root cause of a failure—is buried in massive, messy logs from heterogeneous systems, creating a low signal-to-noise ratio that makes manual diagnosis slow and frustrating. New research from Google, detailed in a paper titled "Auto-Diagnose: Automatic Root-Cause Diagnosis of Integration Test Failures Using Large Language Models," presents a deployed solution to this problem.

The team built Auto-Diagnose, an LLM-based tool integrated directly into Google's internal Critique code review system. When a test fails, the tool automatically analyzes the failure logs, summarizes the most relevant lines, and suggests a probable root cause. This places the diagnostic aid directly in the developer's workflow where the failure is already being reviewed, eliminating context switching.

What the Researchers Built

Auto-Diagnose is an AI agent designed for a specific, high-value software engineering task: post-failure analysis. Its core function is to parse the often-gigabyte-scale output of failed integration tests—which can include logs from multiple services, infrastructure systems, and application code—and produce a concise, actionable diagnosis.
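Parsing gigabyte-scale output means the log must be split before any model ever sees it. As a minimal sketch of that first step (the token count here is approximated by whitespace-delimited words; the paper does not specify a chunking scheme, so the sizes and strategy below are assumptions):

```python
# Hedged sketch: splitting a large failure log into token-bounded chunks so
# each piece fits in an LLM context window. A production system would use the
# model's real tokenizer and stream from disk rather than hold lines in memory.

def chunk_log(lines, max_tokens=2000):
    """Group consecutive log lines into chunks of at most ~max_tokens tokens."""
    chunks, current, count = [], [], 0
    for line in lines:
        n = max(1, len(line.split()))  # crude token estimate
        if current and count + n > max_tokens:
            chunks.append("\n".join(current))
            current, count = [], 0
        current.append(line)
        count += n
    if current:
        chunks.append("\n".join(current))
    return chunks

# Toy log: thousands of routine lines hiding one failure signal.
log = ["INFO start"] * 3000 + ["ERROR: connection refused to backend-rpc"]
chunks = chunk_log(log, max_tokens=500)
```

Chunking by itself preserves everything, including the noise; the filtering stage described below is what decides which of these chunks are worth a model's attention.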

The system was built and deployed within Google's massive internal codebase and development infrastructure. Its integration into Critique is key; Critique is Google's primary code review tool, meaning developers encounter Auto-Diagnose's suggestions precisely when they are investigating a test failure, maximizing relevance and adoption.

Key Results: Accuracy and Scale

The paper reports results from both a controlled evaluation and large-scale deployment.

| Metric | Value | Notes |
| --- | --- | --- |
| Root-cause diagnosis accuracy | 90.14% | Manual evaluation of 71 sampled real-world failure cases |
| Distinct failing tests analyzed | 52,635 | After Google-wide deployment |
| User feedback: "Not helpful" | 5.8% | Cases where users explicitly marked the suggestion as unhelpful |
| Helpfulness rank among Critique tools | #14 of 370 | Based on user feedback across all tools in the Critique system |

The 90.14% accuracy on a manually evaluated set is a strong technical result. However, the deployment metrics are arguably more significant: scaling to over 52,000 distinct test failures demonstrates real-world utility and robustness. A "Not helpful" rate of only 5.8% and a top-15 ranking among hundreds of tools indicate strong user acceptance.

How It Works: The LLM Agent Pipeline

The Auto-Diagnose pipeline is a multi-stage process designed for efficiency and accuracy under production constraints:

  1. Log Collection & Chunking: The system gathers all logs related to a failing test. Due to context window limits, it doesn't feed the entire log (which can be millions of tokens) to the LLM. Instead, it performs smart chunking.
  2. Relevance Filtering & Summarization: This is the core preprocessing step. The system uses a combination of heuristics and a smaller, faster "scorer" LLM to filter and rank log lines by their likelihood of containing failure signals. It selects the most relevant chunks to form a condensed context.
  3. Root-Cause Generation: The filtered and condensed log context is fed to a larger, more capable LLM (the paper notes the use of Google's PaLM 2 family of models) with a carefully engineered prompt. The prompt instructs the model to act as a senior engineer, analyze the provided logs, and output a diagnosis following a specific template: a brief summary, the identified root cause, and relevant log excerpts as evidence.
  4. Integration & Presentation: The final diagnosis is formatted and injected as a comment into the Critique code review associated with the failing test, placing the answer directly in the developer's line of sight.

This architecture highlights a pragmatic approach to using LLMs in production: using smaller, cheaper models for preprocessing (filtering) to manage cost and latency, while reserving the large, expensive model for the final, high-value reasoning task.
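The tiered design above can be sketched roughly as follows. Note that `score_chunk` stands in for the small "scorer" LLM (here stubbed with a keyword heuristic), the prompt wording and `top_k` value are illustrative assumptions, and the final large-model call is elided:

```python
# Hedged sketch of the two-tier pipeline: a cheap scorer ranks log chunks,
# and only the top-ranked chunks reach the expensive diagnosis model.
# score_chunk is a stand-in for a small scorer LLM, stubbed with keywords.

FAILURE_KEYWORDS = ("ERROR", "FATAL", "Exception", "Traceback", "FAILED")

def score_chunk(chunk: str) -> float:
    """Fraction of lines in the chunk that carry a failure signal."""
    lines = chunk.splitlines() or [chunk]
    hits = sum(any(k in line for k in FAILURE_KEYWORDS) for line in lines)
    return hits / len(lines)

def diagnose(chunks: list[str], top_k: int = 3) -> str:
    """Rank chunks, keep the top_k most relevant, and build the prompt for
    the large diagnosis model (the actual LLM call is elided here)."""
    ranked = sorted(chunks, key=score_chunk, reverse=True)[:top_k]
    context = "\n---\n".join(ranked)
    prompt = (
        "You are a senior engineer. Analyze these log excerpts and report:\n"
        "1) a brief summary, 2) the root cause, 3) supporting log lines.\n\n"
        + context
    )
    return prompt  # in production: large_llm.generate(prompt)

chunks = [
    "INFO server started\nINFO handling request",
    "ERROR: deadline exceeded calling backend-rpc\nTraceback (most recent call last):",
    "INFO cache warmed",
]
prompt = diagnose(chunks, top_k=1)
```

The point of the split is economic: the scorer runs over every chunk, so it must be cheap; the generator runs once per failure over a small curated context, so it can be large.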

What This Means in Practice

For a Google developer, this turns a potentially hour-long log-diving session into a process of reviewing a suggested diagnosis in seconds. The tool doesn't fix the bug, but it dramatically accelerates the "what broke?" phase of debugging, which is often the most time-consuming part.

gentic.news Analysis

This deployment is a textbook example of the shift from LLMs as chatbots to LLMs as integrated workflow agents. Google isn't just testing a prototype; it has productized an AI agent for a critical internal customer—its own engineers—and measured its success by both accuracy and user satisfaction metrics. This follows a clear trend we've covered, such as in "GitHub Copilot Workspace Goes Beta, Framing AI as a Full Coding Workflow Partner", where AI is moving from autocomplete to being embedded in the entire development lifecycle.

The choice to integrate into the existing Critique system, rather than creating a standalone tool, is a masterstroke in adoption strategy. It reduces friction to near-zero. This aligns with the broader industry movement towards AI-powered developer productivity platforms, a space where Google (with its Gemini Code Assist), GitHub (Copilot), and Amazon (CodeWhisperer) are in direct competition. Google's internal deployment of Auto-Diagnose serves as a large-scale R&D lab for features that could eventually filter into its external offerings.

Technically, the paper validates the retrieval-augmented generation (RAG) pattern for log analysis. The system's "scorer" model for relevance filtering is essentially performing retrieval over a massive log document. This is a more specialized and impactful application of RAG than many consumer-facing examples, addressing a real pain point with measurable efficiency gains. The reported 90%+ accuracy suggests that for structured, technical domains like log analysis, LLM agents can achieve reliability that makes them viable for daily use.
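To make the retrieval framing concrete, here is a minimal sketch that treats the failing test's error message as the query and scores log chunks by bag-of-words cosine similarity. This is an illustrative stand-in, not the paper's method, which uses heuristics plus an LLM scorer:

```python
# Hedged sketch: relevance filtering over logs viewed as retrieval. The query
# is the test's error message; chunks are ranked by bag-of-words cosine
# similarity. A real system would use learned embeddings or an LLM scorer.

import math
from collections import Counter

def bow(text: str) -> Counter:
    """Bag-of-words vector as a token-count multiset."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    q = bow(query)
    return sorted(chunks, key=lambda c: cosine(q, bow(c)), reverse=True)[:k]

failure = "assertion failed: rpc deadline exceeded"
chunks = [
    "INFO scheduler tick",
    "ERROR rpc deadline exceeded while calling storage service",
    "DEBUG cache hit ratio 0.93",
]
top = retrieve(failure, chunks, k=1)
```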

Frequently Asked Questions

What is Auto-Diagnose?

Auto-Diagnose is an internal Google tool that uses large language models (LLMs) to automatically analyze logs from failed integration tests, summarize the key issues, and suggest the root cause to developers within their code review interface.

How accurate is Google's Auto-Diagnose AI?

In a manual evaluation of 71 real-world failures, Auto-Diagnose correctly identified the root cause 90.14% of the time. In broad deployment across over 52,000 tests, users found it unhelpful in only 5.8% of cases.

What LLM does Auto-Diagnose use?

The research paper indicates the system utilizes Google's PaLM 2 family of models. It employs a two-stage process: a smaller, faster model for initial log filtering and relevance scoring, and a larger PaLM 2 model for the final reasoning and diagnosis generation.

Is Auto-Diagnose available to the public?

No, Auto-Diagnose is an internal tool deployed within Google's development infrastructure. However, the research paper is publicly available, and the techniques described could be implemented by other organizations or may influence future features in Google's external developer products like Gemini Code Assist.


AI Analysis

This work is significant not for a novel model architecture, but for demonstrating a **successful, large-scale productionization of an LLM agent** for a specific, high-value engineering task. The 90.14% accuracy is impressive, but the deployment metrics (52,635 distinct tests analyzed, a 5.8% "not helpful" rate) are what prove real utility. This moves beyond the typical research-paper benchmark to show operational success.

Technically, it reinforces the effectiveness of a **hybrid RAG/agent pipeline** for massive-context problems. The system smartly avoids the intractable cost of feeding entire test logs to an LLM by using a cheaper model as a "filter" or "retriever" to find relevant snippets first. This is a pattern practitioners should note for any application involving large, noisy documents (logs, codebases, tickets).

The integration into the existing **Critique** workflow is the critical adoption lever. It highlights a key principle for AI tool builders: the best agent is often an invisible one that surfaces answers exactly where the user already is, minimizing friction. This deployment also acts as a massive, ongoing experiment for Google, providing invaluable data on how engineers interact with and trust AI suggestions, which will directly inform its commercial offerings in the competitive AI-powered IDE space.
