fact checking
30 articles about fact checking in AI news
GPT-5.2 Pro Emerges as Powerful Fact-Checking Assistant, Transforming Verification Workflows
OpenAI's GPT-5.2 Pro demonstrates remarkable fact-checking capabilities, automatically identifying objections, caveats, and mathematical errors in written content. This represents a significant advancement in AI-assisted verification, which was previously limited to specialized domains.
AI Fact-Checks Rated More Helpful, Less Ideological Than Human Ones
A new experiment found LLM-generated fact-checks are rated as more helpful and less ideological than human ones, achieving broader acceptance across political lines. This suggests AI could reduce polarization in online information verification.
Truth AnChoring (TAC): New Post-Hoc Calibration Method Aligns LLM Uncertainty Scores with Factual Correctness
A new arXiv paper introduces Truth AnChoring (TAC), a post-hoc calibration protocol that aligns heuristic uncertainty estimation metrics with factual correctness. The method addresses 'proxy failure,' where standard metrics become non-discriminative when confidence is low.
MemFactory Framework Unifies Agent Memory Training & Inference, Reports 14.8% Gains Over Baselines
Researchers introduced MemFactory, a unified framework treating agent memory as a trainable component. It supports multiple memory paradigms and shows up to 14.8% relative improvement over baseline methods.
MASFactory: A Graph-Centric Framework for Orchestrating LLM-Based Multi-Agent Systems
Researchers introduce MASFactory, a framework that uses 'Vibe Graphing' to compile natural-language intent into executable multi-agent workflows. This addresses implementation complexity and reuse challenges in LLM-based agent systems.
From Agentic Coding to Autonomous Factories: How Cursor Automations Is Redefining Software Engineering
Cursor's new Automations feature transforms AI-assisted coding from a manual, agent-babysitting model to an event-driven system where AI agents trigger automatically based on workflows. This addresses the human attention bottleneck in managing multiple coding agents simultaneously.
The Benchmarking Revolution: How AI Systems Are Now Co-Evolving With Their Own Tests
Researchers introduce DeepFact, a novel framework where AI fact-checking agents and their evaluation benchmarks evolve together through an 'audit-then-score' process, dramatically improving expert accuracy from 61% to 91% and creating more reliable verification systems.
Hinton Rebrands AI Hallucinations as 'Confabulations'
Geoffrey Hinton redefines AI hallucinations as 'confabulations,' arguing that intelligence reconstructs reality into plausible stories rather than storing facts like a database.
FORGE Benchmark Reveals Domain Knowledge Is the Key Bottleneck
Researchers introduced FORGE, a multimodal dataset with 2D/3D data and fine-grained annotations for manufacturing. Evaluating 18 MLLMs revealed domain knowledge, not visual grounding, is the key bottleneck, with fine-tuning offering a clear path forward.
Claude Code Rate Limits Just Doubled: How to Use the New Capacity Starting Today
Claude Code's doubled rate limits and removed peak-hour throttling on Pro, Max, Team, and Enterprise plans let you stop conserving Opus quota and run parallel agent sessions without limit anxiety.
Claude Code Steganography Flagged Chinese Users; Anthropic Rolls Back
Anthropic's Claude Code 2.1.91 used steganography to detect Chinese users. After Reddit exposure, Anthropic rolled back the feature, calling it an experiment against model distillation.
Amazon’s Alexa Now Shows 365-Day Price History for Shopping
Amazon expanded Alexa for Shopping to show 30, 90, and 365 days of price history. Over 50 million customers have used the feature since 2024, enhancing deal confidence.
Build a Fake Tool-Result Detector for Claude Code
Claude Code can hallucinate tool results. Add a `zen_stop_hook` detector that greps for `<result>` blocks and 'written: N bytes' claims to catch fake outputs every turn.
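A minimal sketch of the kind of check described above, assuming the detector only needs to scan a turn's assistant text for the two patterns the summary names. The function name `find_fake_tool_claims` and the exact regular expressions are illustrative assumptions; the real `zen_stop_hook` interface is not shown in the summary, so wiring this into a hook is left out.

```python
import re

# Hypothetical patterns: the summary only says the detector looks for
# `<result>` blocks and 'written: N bytes' claims in the model's own text.
RESULT_BLOCK = re.compile(r"<result>.*?</result>", re.DOTALL)
WRITTEN_CLAIM = re.compile(r"written:\s*\d+\s*bytes", re.IGNORECASE)


def find_fake_tool_claims(assistant_text: str) -> list[str]:
    """Return every suspicious span found in the assistant's prose.

    A genuine tool result arrives through the tool channel, so a
    `<result>` block or a byte-count claim typed by the model itself
    is worth flagging for review.
    """
    hits = [m.group(0) for m in RESULT_BLOCK.finditer(assistant_text)]
    hits.extend(m.group(0) for m in WRITTEN_CLAIM.finditer(assistant_text))
    return hits


if __name__ == "__main__":
    sample = "Saved the file.\n<result>ok</result>\nwritten: 512 bytes"
    for hit in find_fake_tool_claims(sample):
        print("suspicious:", hit)
```

A match is only a signal, not proof of a fabricated result: the same strings can appear legitimately when the model quotes real output, so a caller would likely compare hits against the actual tool log before blocking a turn.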
GLM-5.2 matches Opus 4.7 at 1/5 the price in Snowflake coding test
Zhipu AI's GLM-5.2 matched Claude Opus 4.7 on a Snowflake coding benchmark at one-fifth the cost, threatening Western AI lab pricing and IPO valuations.
MCP Agents Log 'Success: True' While Tasks Go Nowhere — Protocol Bug
MCP returns null results inside HTTP 200 responses, causing agents to log success while tasks never run. Vouqis proxy catches this with structured audit logs.
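The failure mode above can be illustrated with a small check that refuses to treat an HTTP 200 as success on its own. This is a sketch under stated assumptions: it assumes the response body is a JSON-RPC 2.0 style object with `result` and `error` fields, and the function name `classify_response` and its return labels are hypothetical. It is not how the Vouqis proxy works, which the summary does not describe.

```python
import json


def classify_response(status_code: int, body: str) -> str:
    """Classify a tool-call response instead of trusting the HTTP status.

    Assumed body shape: a JSON-RPC 2.0 style object carrying either a
    `result` or an `error` member.
    """
    if status_code != 200:
        return "http_error"
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return "invalid_json"
    if not isinstance(payload, dict):
        return "invalid_json"
    if payload.get("error") is not None:
        return "rpc_error"
    # HTTP 200 with a missing or null result: the case that gets
    # logged as success even though nothing ran.
    if payload.get("result") is None:
        return "null_result"
    return "ok"


if __name__ == "__main__":
    print(classify_response(200, '{"jsonrpc": "2.0", "id": 1, "result": null}'))
```

An agent that logs success only when this returns `"ok"` would surface the null-result case as a distinct outcome rather than folding it into the success count.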
Claude Code's June 15 Agentic Credit Split: How to Avoid Hitting the $20 Wall
Claude Code's June 15 agentic credit split moves `claude -p` and CI workflows to a separate $20/month bucket on Pro. Upgrade to Max 5x or switch to direct API for production pipelines.
/loop in Claude Code: How to Build Multi-Agent Workflows Without Leaving
The /loop command in Claude Code enables autonomous multi-agent workflows, cycling through coding tasks until completion. Developers should use it to automate iterative processes like TDD cycles.
Dynamic Workflows: A New Agent Primitive Emerges
Dynamic workflows generate harnesses on the fly for agent orchestrators, enabling branching and verified tasks across coding agents like Claude Code and Codex.
Anthropic's 80% Code Stat: What It Means for Your CLAUDE.md and Workflow Design
Anthropic's 80% code stat reveals a recursive self-improvement loop. For Claude Code users, invest in CLAUDE.md, MCP servers, and task decomposition to replicate this.
Microsoft RAMPART Brings Pytest-Based Safety Testing to AI Agents
Microsoft's RAMPART brings pytest-native safety testing to AI agents, covering adversarial attacks and benign failures, addressing a critical gap in agent development.
Amazon's SageMaker Agentic Fine-Tuning Supports Llama, Qwen, DeepSeek, Nova
Amazon launched an AI agent on SageMaker that automates fine-tuning of Llama, Qwen, DeepSeek, and Nova models via plain-language instructions, abstracting API fragmentation.
OpenAI Agents Now Ask Questions Good Enough for Research Papers
Sébastien Bubeck revealed on the OpenAI Podcast that internal AI agents now ask research questions so insightful they're inspiring papers and correcting published mistakes, with a 1-2 year timeline for full researcher-level capabilities.
Agent Harnessing: The Infrastructure That Makes AI Agents Work
A detailed technical guide argues that the model is not the hard part of building AI agents. The six-component harness — context management, memory, tools, control flow, verification, and coordination — is what separates production-grade agents from those that fail silently.
OpenCLAW-P2P v6.0 Cuts Paper Lookup Latency to <50ms
OpenCLAW-P2P v6.0 introduces a multi-layer persistence architecture and live reference verification, reducing paper retrieval latency from >3s to <50ms and operating with 14 autonomous agents that scored 50+ papers.
AutoZone, Home Depot, Macy’s, and Ulta Partner with Google for Agentic AI
AutoZone, Home Depot, Macy’s, and Ulta Beauty have entered into partnerships with Google Cloud to implement agentic AI solutions. These systems, built on Google's Gemini models, aim to handle complex, multi-step customer interactions. The move signals a shift from experimental chatbots to more autonomous, task-completing AI agents in retail.
Semantic Needles in Document Haystacks
Researchers developed a framework to test how LLMs score similarity between documents with subtle semantic changes. They found models exhibit positional bias, are sensitive to topical context, and produce unique scoring 'fingerprints'. This matters for any application relying on LLM-as-a-Judge for document comparison.
POTEMKIN Framework Exposes Critical Trust Gap in Agentic AI Tools
A new paper formalizes Adversarial Environmental Injection (AEI), a threat model where compromised tools deceive AI agents. The POTEMKIN testing harness found agents are evaluated for performance, not skepticism, creating a critical trust gap.
PoisonedRAG Attack Hijacks LLM Answers 97% of Time with 5 Documents
Researchers demonstrated that inserting only 5 poisoned documents into a 2.6-million-document database can hijack a RAG system's answers 97% of the time, exposing critical vulnerabilities in 'hallucination-free' retrieval systems.
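To put the scale of that attack in perspective, the arithmetic below computes what share of the corpus the poisoned documents make up. It uses only the two figures quoted in the summary and assumes the 2.6 million count is a rounded value.

```python
# Figures as quoted in the summary; the corpus size is assumed to be rounded.
poisoned_docs = 5
corpus_size = 2_600_000

fraction = poisoned_docs / corpus_size

print(f"fraction of corpus: {fraction:.2e}")      # 1.92e-06
print(f"as a percentage: {fraction * 100:.5f}%")  # 0.00019%
```

In other words, roughly two documents per million are enough, according to the reported result, to steer the system's answers in the large majority of cases.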
Anthropic's Claude Promoted for Stock Picking with 12-Prompt Guide
A viral X thread promotes using Anthropic's Claude AI to identify potential '100-bagger' stocks with a set of 12 prompts. This highlights growing experimentation with general-purpose LLMs for specialized financial analysis, despite inherent risks.
Google Launches PaperBanana AI to Format Raw Methods into Publication Text
Google has launched PaperBanana, an AI tool designed to transform unstructured methodology notes into polished, publication-ready text. This targets a key bottleneck in academic writing, automating the formatting and structuring of methods sections.