fact checking

30 articles about fact checking in AI news

GPT-5.2 Pro Emerges as Powerful Fact-Checking Assistant, Transforming Verification Workflows

OpenAI's GPT-5.2 Pro demonstrates remarkable fact-checking capabilities, automatically identifying objections, caveats, and mathematical errors in written content. This represents a significant advancement in AI-assisted verification previously limited to specialized domains.

Mar 4, 202685% relevant

AI Fact-Checks Rated More Helpful, Less Ideological Than Human Ones

A new experiment found LLM-generated fact-checks are rated as more helpful and less ideological than human ones, achieving broader acceptance across political lines. This suggests AI could reduce polarization in online information verification.

Apr 11, 202685% relevant

Truth AnChoring (TAC): New Post-Hoc Calibration Method Aligns LLM Uncertainty Scores with Factual Correctness

A new arXiv paper introduces Truth AnChoring (TAC), a post-hoc calibration protocol that aligns heuristic uncertainty estimation metrics with factual correctness. The method addresses 'proxy failure,' where standard metrics become non-discriminative when confidence is low.

Apr 2, 202676% relevant

MemFactory Framework Unifies Agent Memory Training & Inference, Reports 14.8% Gains Over Baselines

Researchers introduced MemFactory, a unified framework treating agent memory as a trainable component. It supports multiple memory paradigms and shows up to 14.8% relative improvement over baseline methods.

Apr 1, 202697% relevant

MASFactory: A Graph-Centric Framework for Orchestrating LLM-Based Multi-Agent Systems

Researchers introduce MASFactory, a framework that uses 'Vibe Graphing' to compile natural-language intent into executable multi-agent workflows. This addresses implementation complexity and reuse challenges in LLM-based agent systems.

Mar 9, 202675% relevant

From Agentic Coding to Autonomous Factories: How Cursor Automations Is Redefining Software Engineering

Cursor's new Automations feature transforms AI-assisted coding from a manual, agent-babysitting model to an event-driven system where AI agents trigger automatically based on workflows. This addresses the human attention bottleneck in managing multiple coding agents simultaneously.

Mar 5, 202685% relevant

The Benchmarking Revolution: How AI Systems Are Now Co-Evolving With Their Own Tests

Researchers introduce DeepFact, a novel framework where AI fact-checking agents and their evaluation benchmarks evolve together through an 'audit-then-score' process, dramatically improving expert accuracy from 61% to 91% and creating more reliable verification systems.

Mar 9, 202675% relevant

Hinton Rebrands AI Hallucinations as 'Confabulations'

Geoffrey Hinton redefines AI hallucinations as 'confabulations,' arguing that intelligence reconstructs reality into plausible stories rather than storing facts like a database.

Apr 26, 202687% relevant

FORGE Benchmark Reveals Domain Knowledge

Researchers introduced FORGE, a multimodal dataset with 2D/3D data and fine-grained annotations for manufacturing. Evaluating 18 MLLMs revealed domain knowledge, not visual grounding, is the key bottleneck, with fine-tuning offering a clear path forward.

Apr 10, 202672% relevant

Claude Code Rate Limits Just Doubled: How to Use the New Capacity Starting Today

Claude Code's doubled rate limits and removed peak-hour throttling on Pro, Max, Team, and Enterprise plans let you stop conserving Opus quota and run parallel agent sessions without limit anxiety.

Jul 2, 202660% relevant

Claude Code Steganography Flagged Chinese Users; Anthropic Rolls Back

Anthropic's Claude Code 2.1.91 used steganography to detect Chinese users. After Reddit exposure, Anthropic rolled back the feature, calling it an experiment against model distillation.

Jul 1, 2026100% relevant

Amazon’s Alexa Now Shows 365-Day Price History for Shopping

Amazon expanded Alexa for Shopping to show 30, 90, and 365 days of price history. Over 50 million customers have used the feature since 2024, enhancing deal confidence.

Jun 30, 202678% relevant

Build a Fake Tool-Result Detector for Claude Code

Claude Code can hallucinate tool results. Add a `zen_stop_hook` detector that greps for `<result>` blocks and 'written: N bytes' claims to catch fake outputs every turn.

Jun 25, 2026100% relevant

GLM-5.2 matches Opus 4.7 at 1/5 the price in Snowflake coding test

Zhipu AI's GLM-5.2 matched Claude Opus 4.7 on a Snowflake coding benchmark at one-fifth the cost, threatening Western AI lab pricing and IPO valuations.

Jun 24, 202685% relevant

MCP Agents Log 'Success: True' While Tasks Go Nowhere — Protocol Bug

MCP returns null results inside HTTP 200 responses, causing agents to log success while tasks never run. Vouqis proxy catches this with structured audit logs.

Jun 15, 202695% relevant

Claude Code's June 15 Agentic Credit Split: How to Avoid Hitting the $20 Wall

Claude Code's June 15 agentic credit split moves `claude -p` and CI workflows to a separate $20/month bucket on Pro. Upgrade to Max 5x or switch to direct API for production pipelines.

Jun 10, 2026100% relevant

/loop in Claude Code: How to Build Multi-Agent Workflows Without Leaving

The /loop command in Claude Code enables autonomous multi-agent workflows, cycling through coding tasks until completion. Developers should use it to automate iterative processes like TDD cycles.

Jun 9, 202690% relevant

Dynamic Workflows: A New Agent Primitive Emerges

Dynamic workflows generate harnesses on the fly for agent orchestrators, enabling branching and verified tasks across coding agents like Claude Code and Codex.

Jun 4, 202675% relevant

Anthropic's 80% Code Stat: What It Means for Your CLAUDE.md and Workflow Design

Anthropic's 80% code stat reveals a recursive self-improvement loop. For Claude Code users, invest in CLAUDE.md, MCP servers, and task decomposition to replicate this.

Jun 3, 202675% relevant

Microsoft RAMPART Brings Pytest-Based Safety Testing to AI Agents

Microsoft's RAMPART brings pytest-native safety testing to AI agents, covering adversarial attacks and benign failures, addressing a critical gap in agent development.

May 27, 202689% relevant

Amazon's SageMaker Agentic Fine-Tuning Supports Llama, Qwen, DeepSeek, Nova

Amazon launched an AI agent on SageMaker that automates fine-tuning of Llama, Qwen, DeepSeek, and Nova models via plain-language instructions, abstracting API fragmentation.

May 5, 202690% relevant

OpenAI Agents Now Ask Questions Good Enough for Research Papers

Sébastien Bubeck revealed on the OpenAI Podcast that internal AI agents now ask research questions so insightful they're inspiring papers and correcting published mistakes, with a 1-2 year timeline for full researcher-level capabilities.

Apr 28, 202685% relevant

Agent Harnessing: The Infrastructure That Makes AI Agents Work

A detailed technical guide argues that the model is not the hard part of building AI agents. The six-component harness — context management, memory, tools, control flow, verification, and coordination — is what separates production-grade agents from those that fail silently.

Apr 25, 202688% relevant

OpenCLAW-P2P v6.0 Cuts Paper Lookup Latency to <50ms

OpenCLAW-P2P v6.0 introduces a multi-layer persistence architecture and live reference verification, reducing paper retrieval latency from >3s to <50ms and operating with 14 autonomous agents that scored 50+ papers.

Apr 23, 202677% relevant

AutoZone, Home Depot, Macy’s, and Ulta Partner with Google for Agentic AI

AutoZone, Home Depot, Macy’s, and Ulta Beauty have entered into partnerships with Google Cloud to implement agentic AI solutions. These systems, built on Google's Gemini models, aim to handle complex, multi-step customer interactions. The move signals a shift from experimental chatbots to more autonomous, task-completing AI agents in retail.

Apr 22, 2026100% relevant

Semantic Needles in Document Haystacks

Researchers developed a framework to test how LLMs score similarity between documents with subtle semantic changes. They found models exhibit positional bias, are sensitive to topical context, and produce unique scoring 'fingerprints'. This matters for any application relying on LLM-as-a-Judge for document comparison.

Apr 22, 202674% relevant

POTEMKIN Framework Exposes Critical Trust Gap in Agentic AI Tools

A new paper formalizes Adversarial Environmental Injection (AEI), a threat model where compromised tools deceive AI agents. The POTEMKIN testing harness found agents are evaluated for performance, not skepticism, creating a critical trust gap.

Apr 22, 202675% relevant

PoisonedRAG Attack Hijacks LLM Answers 97% of Time with 5 Documents

Researchers demonstrated that inserting only 5 poisoned documents into a 2.6 million document database can hijack a RAG system's answers 97% of the time, exposing critical vulnerabilities in 'hallucination-free' retrieval systems.

Apr 20, 202695% relevant

Anthropic's Claude Promoted for Stock Picking with 12-Prompt Guide

A viral X thread promotes using Anthropic's Claude AI to identify potential '100-bagger' stocks with a set of 12 prompts. This highlights growing experimentation with general-purpose LLMs for specialized financial analysis, despite inherent risks.

Apr 16, 202689% relevant

Google Launches PaperBanana AI to Format Raw Methods into Publication Text

Google has launched PaperBanana, an AI tool designed to transform unstructured methodology notes into polished, publication-ready text. This targets a key bottleneck in academic writing, automating the formatting and structuring of methods sections.

Apr 16, 202687% relevant

Explore More

AI Agents Large Language Models Claude Code OpenAI RAG MCP Fine-tuning Benchmarks Open Source AI AI Safety