testing & qa
30 articles about testing & qa in AI news
ThermoQA Benchmark Reveals LLM Reasoning Gaps: Claude Opus Leads at 94.1%
Researchers released ThermoQA, a 293-question benchmark testing thermodynamic reasoning. Claude Opus 4.6 scored 94.1% overall, but models showed significant degradation on complex cycle analysis versus simple property lookups.
Avoko Launches 'Behavioral Lab' for AI Agent Testing & Development
Avoko AI announced 'Avoko,' a platform described as a behavioral lab for AI agents. It aims to provide structured environments for testing, evaluating, and improving agent performance and reliability.
QAsk-Nav Benchmark Enables Separate Scoring of Navigation and Dialogue for Collaborative AI Agents
A new benchmark called QAsk-Nav enables separate evaluation of navigation and question-asking for collaborative embodied AI agents. The accompanying Light-CoNav model outperforms state-of-the-art methods while being significantly more efficient.
DriveXQA: New AI Framework Helps Autonomous Vehicles See Through Fog and Sensor Failures
Researchers introduce DriveXQA, a multimodal dataset and MVX-LLM architecture that enables autonomous vehicles to answer complex questions about adverse driving conditions by fusing data from multiple visual sensors, significantly improving performance in challenging scenarios like fog.
Cekura's Simulation Platform Solves the Critical QA Challenge for AI Agents
YC-backed startup Cekura launches a testing platform that uses synthetic users and LLM judges to simulate thousands of conversational paths for voice and chat AI agents, addressing the fundamental challenge of scaling quality assurance for stochastic AI systems.
Google Gemma 4 Model Reportedly in Testing, Signaling Next-Gen Open-Weight LLM Release
A developer reports that Google's Gemma 4 model is 'incoming' and currently being tested. This suggests the next iteration of Google's open-weight language model family is nearing release.
GPT-5.5 Stealth Test Reports Emerge, Claiming Performance Over Opus 4.7
Social media reports suggest OpenAI may be conducting limited, unannounced testing of GPT-5.5. Initial, unverified claims from testers indicate it outperforms Anthropic's Claude 3.5 Opus 4.7 model.
DeepSeek V4 Begins Limited Rollout with Fast, Expert, Vision Modes
DeepSeek V4 is reportedly in limited gray-scale testing with a new interface offering Fast, Expert, and Vision modes. This mirrors competitor Kimi's tiered system and suggests a move towards performance-based rate limiting.
From BM25 to Corrective RAG: A Benchmark Study Challenges the Dominance of Semantic Search for Tabular Data
A systematic benchmark of 10 RAG retrieval strategies on a financial QA dataset reveals that a two-stage hybrid + reranking pipeline performs best. Crucially, the classic BM25 algorithm outperformed modern dense retrieval models, challenging a core assumption in semantic search. The findings provide actionable, cost-aware guidance for building retrieval systems over heterogeneous documents.
How to Use Claude Code to Build Game Bots and Test Real-Time Systems
A developer used Claude Code to build a bot for Ultima Online, revealing a powerful workflow for testing complex, stateful systems.
How to Build Complete Godot Games with Claude Code Using the Godogen Pipeline
A new open-source pipeline called Godogen uses Claude Code to generate complete Godot games—including GDScript, assets, and bug-finding QA—from a single prompt.
From Prototype to Production: Streamlining LLM Evaluation for Luxury Clienteling & Chatbots
NVIDIA's new NeMo Evaluator Agent Skills dramatically simplifies testing and monitoring of conversational AI agents. For luxury retail, this means faster, more reliable deployment of high-quality clienteling assistants and customer service chatbots.
AttriBench Reveals LLM Attribution Bias: Accuracy Varies by Race, Gender
Researchers introduced AttriBench, a demographically-balanced dataset for quote attribution. Testing 11 LLMs revealed significant, systematic accuracy disparities across race, gender, and intersectional groups, exposing a new fairness benchmark.
ViGoR-Bench Exposes 'Logical Desert' in SOTA Visual AI: 20+ Models Fail Physical, Causal Reasoning Tasks
Researchers introduce ViGoR-Bench, a unified benchmark testing visual generative models on physical, causal, and spatial reasoning. It reveals significant deficits in over 20 leading models, challenging the 'performance mirage' of current evaluations.
ItinBench Benchmark Reveals LLMs Struggle with Multi-Dimensional Planning, Scoring Below 50% on Combined Tasks
Researchers introduced ItinBench, a benchmark testing LLMs on trip planning requiring simultaneous verbal and spatial reasoning. Models like GPT-4o and Gemini 1.5 Pro showed inconsistent performance, highlighting a gap in integrated cognitive capabilities.
Shopify Details Generative AI Use Cases for Ecommerce (2026)
Shopify's 2026 guide details generative AI use cases for ecommerce, including conversational AI for sales and product catalog management via the Storefront API. This matters as retailers seek practical AI integrations to enhance operations and customer engagement.
MCP Crosses 9,400 Servers; Build Your Own in TypeScript
MCP crossed 9,400 servers. Build a database introspection server in TypeScript. SDK handles protocol framing and capability negotiation.
Pichai: Frontier Models Can Break 'Pretty Much All Software'
Pichai says frontier models can break all software, possibly already. Systemic risk to enterprise stacks.
ARMOR 2025: Military Safety Benchmark Exposes LLM Gaps Across 21 Models
ARMOR 2025 benchmark tests 21 LLMs against military legal doctrines, revealing critical safety gaps that civilian benchmarks miss.
Codex Update Cuts GUI Workflow Latency 42%
Codex app update cuts GUI workflow latency 42%, enabling near-human-speed interface operation for autonomous app building and debugging.
GPT-5.5 + Codex Combines App Building, Browser Use, Image Gen
@intheworldofai claims GPT-5.5 + Codex is a super app better than Claude Code, with 7 capabilities including app building, debugging, browser use, and image generation.
The Agency: 147 Open Source AI Agents Hit 50K GitHub Stars in 2 Weeks
The Agency is an open source repository with 147 specialized AI agents across 12 divisions (engineering, design, marketing, etc.) that hit 50K GitHub stars in under two weeks. It provides one-command install for tools like Claude Code and Cursor, with full modding support.
Pinterest Builds Dedicated Conversion Candidate Generation Model
Pinterest details the design and deployment of a dedicated shopping conversion candidate generation model, replacing engagement-based retrieval. Key innovations include a parallel DCN v2 and MLP architecture (+11% recall) and a unified multi-task approach that boosted conversion recall by +42% over their 2023 model.
Run Claude Code in Any Sandbox with One API: AgentBox SDK
Swap coding agents and sandbox providers without changing code. Preserves full interactive capabilities (approval flows, streaming).
OpenCLAW-P2P v6.0 Cuts Paper Lookup Latency to <50ms
OpenCLAW-P2P v6.0 introduces a multi-layer persistence architecture and live reference verification, reducing paper retrieval latency from >3s to <50ms and operating with 14 autonomous agents that scored 50+ papers.
GPT ImageGen-2 Passes 'Otter Test', Generates Academic Papers
Wharton professor Ethan Mollick reports OpenAI's GPT ImageGen-2 now reliably generates complex text within images, including academic papers and slides, marking a significant leap in multimodal AI capability.
AI Agents Show Consistent Economic Analysis, Reducing Human Disagreement
A new study finds AI agents like Claude Code and Codex produce economic analyses with far less disagreement than human teams, landing near the human median but with no extreme outliers. This indicates AI's potential for scalable, consistent research support.
NATO Tests SWARM Biotactics' AI-Guided Cyborg Cockroaches for Recon
NATO is evaluating a biohybrid system from German defense startup SWARM Biotactics, which uses AI to guide live cockroaches fitted with sensor backpacks through complex environments for military reconnaissance.
Building a Semantic Recommendation System from Scratch
An engineer documents the process of building a semantic recommender using embeddings and vector search, focusing on the practical challenges and failures encountered. This is a crucial reality check for teams moving beyond collaborative filtering.
ML-Master 2.0 Hits 56.44% on MLE-Bench in 24-Hour Agentic Science Run
Researchers from Shanghai Jiao Tong University demonstrated ML-Master 2.0, an autonomous research agent that operated continuously for 24 hours on the MLE-Bench, achieving a 56.44% medal rate. The breakthrough centers on Hierarchical Cognitive Caching for state management, not reasoning, enabling long-horizon scientific workflows.