Testing & QA

30 articles about Testing & QA in AI news

ThermoQA Benchmark Reveals LLM Reasoning Gaps: Claude Opus Leads at 94.1%

Researchers released ThermoQA, a 293-question benchmark testing thermodynamic reasoning. Claude Opus 4.6 scored 94.1% overall, but models showed significant degradation on complex cycle analysis versus simple property lookups.

Apr 23, 2026 · 78% relevant

Avoko Launches 'Behavioral Lab' for AI Agent Testing & Development

Avoko AI announced 'Avoko,' a platform described as a behavioral lab for AI agents. It aims to provide structured environments for testing, evaluating, and improving agent performance and reliability.

Apr 16, 2026 · 89% relevant

QAsk-Nav Benchmark Enables Separate Scoring of Navigation and Dialogue for Collaborative AI Agents

A new benchmark called QAsk-Nav enables separate evaluation of navigation and question-asking for collaborative embodied AI agents. The accompanying Light-CoNav model outperforms state-of-the-art methods while being significantly more efficient.

Apr 2, 2026 · 75% relevant

DriveXQA: New AI Framework Helps Autonomous Vehicles See Through Fog and Sensor Failures

Researchers introduce DriveXQA, a multimodal dataset and MVX-LLM architecture that enables autonomous vehicles to answer complex questions about adverse driving conditions by fusing data from multiple visual sensors, significantly improving performance in challenging scenarios like fog.

Mar 13, 2026 · 75% relevant

Cekura's Simulation Platform Solves the Critical QA Challenge for AI Agents

YC-backed startup Cekura launches a testing platform that uses synthetic users and LLM judges to simulate thousands of conversational paths for voice and chat AI agents, addressing the fundamental challenge of scaling quality assurance for stochastic AI systems.

Mar 3, 2026 · 80% relevant

Google Gemma 4 Model Reportedly in Testing, Signaling Next-Gen Open-Weight LLM Release

A developer reports that Google's Gemma 4 model is 'incoming' and currently being tested. This suggests the next iteration of Google's open-weight language model family is nearing release.

Mar 28, 2026 · 87% relevant

GPT-5.5 Stealth Test Reports Emerge, Claiming Performance Over Opus 4.7

Social media reports suggest OpenAI may be conducting limited, unannounced testing of GPT-5.5. Initial, unverified claims from testers indicate it outperforms Anthropic's Claude Opus 4.7 model.

Apr 19, 2026 · 85% relevant

DeepSeek V4 Begins Limited Rollout with Fast, Expert, Vision Modes

DeepSeek V4 is reportedly in limited gray-scale testing with a new interface offering Fast, Expert, and Vision modes. This mirrors competitor Kimi's tiered system and suggests a move towards performance-based rate limiting.

Apr 7, 2026 · 85% relevant

From BM25 to Corrective RAG: A Benchmark Study Challenges the Dominance of Semantic Search for Tabular Data

A systematic benchmark of 10 RAG retrieval strategies on a financial QA dataset reveals that a two-stage hybrid + reranking pipeline performs best. Crucially, the classic BM25 algorithm outperformed modern dense retrieval models, challenging a core assumption in semantic search. The findings provide actionable, cost-aware guidance for building retrieval systems over heterogeneous documents.

Apr 3, 2026 · 82% relevant

How to Use Claude Code to Build Game Bots and Test Real-Time Systems

A developer used Claude Code to build a bot for Ultima Online, revealing a powerful workflow for testing complex, stateful systems.

Mar 17, 2026 · 95% relevant

How to Build Complete Godot Games with Claude Code Using the Godogen Pipeline

A new open-source pipeline called Godogen uses Claude Code to generate complete Godot games—including GDScript, assets, and bug-finding QA—from a single prompt.

Mar 12, 2026 · 91% relevant

From Prototype to Production: Streamlining LLM Evaluation for Luxury Clienteling & Chatbots

NVIDIA's new NeMo Evaluator Agent Skills dramatically simplifies testing and monitoring of conversational AI agents. For luxury retail, this means faster, more reliable deployment of high-quality clienteling assistants and customer service chatbots.

Mar 6, 2026 · 60% relevant

AttriBench Reveals LLM Attribution Bias: Accuracy Varies by Race, Gender

Researchers introduced AttriBench, a demographically balanced dataset for quote attribution. Testing 11 LLMs revealed significant, systematic accuracy disparities across race, gender, and intersectional groups, positioning the dataset as a new fairness benchmark.

Apr 8, 2026 · 92% relevant

ViGoR-Bench Exposes 'Logical Desert' in SOTA Visual AI: 20+ Models Fail Physical, Causal Reasoning Tasks

Researchers introduce ViGoR-Bench, a unified benchmark testing visual generative models on physical, causal, and spatial reasoning. It reveals significant deficits in over 20 leading models, challenging the 'performance mirage' of current evaluations.

Mar 30, 2026 · 94% relevant

ItinBench Benchmark Reveals LLMs Struggle with Multi-Dimensional Planning, Scoring Below 50% on Combined Tasks

Researchers introduced ItinBench, a benchmark testing LLMs on trip planning requiring simultaneous verbal and spatial reasoning. Models like GPT-4o and Gemini 1.5 Pro showed inconsistent performance, highlighting a gap in integrated cognitive capabilities.

Mar 23, 2026 · 95% relevant

AWS Unveils Production Blueprint for Evaluating AI Agents with Strands and

AWS released Strands and AgentCore, a production blueprint for evaluating AI agents. It generates realistic scenarios and tracks metrics like completion rate and cost, addressing the gap between lab benchmarks and real-world performance—critical for retail AI deployments.

Jul 23, 2026 · 88% relevant

Schnucks and VitalityIP Launch Agentic Commerce Shopping Assistant Powered

Schnuck Markets and VitalityIP launched the first agentic commerce shopping assistant in grocery, powered by Google Cloud. It autonomously handles multi-step tasks like reordering and meal planning, moving beyond simple chatbots.

Jul 22, 2026 · 98% relevant

Octen Deep Research Bench Scores Beat OpenAI, Gemini by 17 Points

Octen's deep research tool beat OpenAI, Gemini, Grok, and Perplexity by 10–17 points on DeepResearch Bench, returning reports in under 3 minutes.

Jul 21, 2026 · 75% relevant

gdb: Benchmarks Saturate Too Fast for Reliable AI Progress Tracking

@gdb notes that benchmarks saturate quickly. This undermines AI progress tracking and may force a shift to dynamic evaluations.

Jul 16, 2026 · 75% relevant

Claude Code Tops JetBrains' New Kotlin Benchmark with 85.7% Resolution

Claude Code with Opus 4.7 xhigh tops JetBrains' Kotlin Benchmark at 85.7%. Configure your CLAUDE.md with Kotlin conventions and use `--model opus-4.7-xhigh` to match this performance.

Jul 8, 2026 · 98% relevant

Apple's Safari 247 Ships Official MCP Server: Debug Websites from Claude Code

Apple's Safari 247 MCP server lets Claude Code inspect and debug live web pages. Install it via Homebrew and connect to debug rendering or JavaScript issues.

Jul 1, 2026 · 75% relevant

You Deployed AI Search and Relevance Got Worse. Here’s Why It Happens

Retail TouchPoints reports that AI search deployments often worsen relevance due to poor embeddings, lack of fine-tuning, and misaligned ranking. This matters because retailers investing in AI search must address these pitfalls to avoid customer frustration and revenue loss.

Jun 26, 2026 · 94% relevant

Shopify Details Generative AI Use Cases for Ecommerce (2026)

Shopify's 2026 guide details generative AI use cases for ecommerce, including conversational AI for sales and product catalog management via the Storefront API. This matters as retailers seek practical AI integrations to enhance operations and customer engagement.

Jun 7, 2026 · 98% relevant

MCP Crosses 9,400 Servers; Build Your Own in TypeScript

MCP has crossed 9,400 servers. This guide builds a database introspection server in TypeScript, with the SDK handling protocol framing and capability negotiation.

May 21, 2026 · 90% relevant

Pichai: Frontier Models Can Break 'Pretty Much All Software'

Pichai says frontier models can break pretty much all software, and may already be able to. That poses a systemic risk to enterprise stacks.

May 17, 2026 · 87% relevant

ARMOR 2025: Military Safety Benchmark Exposes LLM Gaps Across 21 Models

The ARMOR 2025 benchmark tests 21 LLMs against military legal doctrines, revealing critical safety gaps that civilian benchmarks miss.

May 5, 2026 · 92% relevant

Codex Update Cuts GUI Workflow Latency 42%

Codex app update cuts GUI workflow latency 42%, enabling near-human-speed interface operation for autonomous app building and debugging.

May 1, 2026 · 84% relevant

GPT-5.5 + Codex Combines App Building, Browser Use, Image Gen

@intheworldofai claims GPT-5.5 + Codex is a super app better than Claude Code, with 7 capabilities including app building, debugging, browser use, and image generation.

Apr 30, 2026 · 100% relevant

The Agency: 147 Open Source AI Agents Hit 50K GitHub Stars in 2 Weeks

The Agency is an open source repository with 147 specialized AI agents across 12 divisions (engineering, design, marketing, etc.) that hit 50K GitHub stars in under two weeks. It provides one-command install for tools like Claude Code and Cursor, with full modding support.

Apr 28, 2026 · 86% relevant

Pinterest Builds Dedicated Conversion Candidate Generation Model

Pinterest details the design and deployment of a dedicated shopping conversion candidate generation model, replacing engagement-based retrieval. Key innovations include a parallel DCN v2 and MLP architecture (+11% recall) and a unified multi-task approach that raised conversion recall by 42% over their 2023 model.

Apr 27, 2026 · 100% relevant
