interpretability

30 articles about interpretability in AI news

Anthropic Teaches Claude Why: New Interpretability Method Deployed

Anthropic published its 'Teaching Claude why' interpretability research and deployed post-hoc explanation layers for Claude 4 in production safety audits. The method cites the training examples that influenced a given output.

May 8, 2026 · 100% relevant

Stanford and Harvard Researchers Publish Significant AI Safety Paper on Mechanistic Interpretability

Researchers from Stanford and Harvard have published a notable AI paper focusing on mechanistic interpretability and AI safety, with implications for understanding and securing advanced AI systems.

Apr 1, 2026 · 87% relevant

Anthropic Trains Claude to Translate Its Own Activations Into Text

Anthropic trains Claude to translate its internal activations into human-readable text via Natural Language Autoencoders, enabling new interpretability insights.

May 7, 2026 · 95% relevant

Microsoft Paper: AI Models Interpret Themselves Better Than Humans

Microsoft proposes self-interpretable AI models that beat human interpretability on 6 benchmarks, challenging the human-centric paradigm.

May 6, 2026 · 75% relevant

New Thesis Exposes Critical Flaws in Recommender System Fairness Metrics

This thesis systematically analyzes offline fairness evaluation measures for recommender systems, revealing flaws in interpretability, expressiveness, and applicability. It proposes novel evaluation approaches and practical guidelines for selecting appropriate measures, directly addressing the confusion caused by unvalidated metrics.

Apr 29, 2026 · 84% relevant

Anthropic Fellows Introduce 'Model Diffing' Method to Systematically Compare Open-Weight AI Model Behaviors

Anthropic's Fellows research team published a new method applying software 'diffing' principles to compare AI models, identifying unique behavioral features. This provides a systematic framework for model interpretability and safety analysis.

Apr 3, 2026 · 85% relevant

Claude Code's 'Black Box' Thinking: Why Your Prompts Need More Context, Not Less

Anthropic's interpretability research reveals Claude uses parallel strategies you can't see. Feed Claude Code more project context, not less, to trigger its most effective reasoning patterns.

Mar 25, 2026 · 68% relevant

SIDReasoner: A New Framework for Reasoning-Enhanced Generative Recommendation

Researchers propose SIDReasoner, a two-stage framework that improves LLM-based recommendation by enhancing reasoning over Semantic IDs. It strengthens the alignment between item tokens and language, enabling better interpretability and cross-domain generalization without extensive labeled reasoning data.

Mar 25, 2026 · 82% relevant

Deep-HiCEMs & MLCS: New Methods for Learning Multi-Level Concept Hierarchies from Sparse Labels

New research introduces Multi-Level Concept Splitting (MLCS) and Deep-HiCEMs, enabling AI models to discover hierarchical, interpretable concepts from only top-level annotations. This advances concept-based interpretability beyond flat, independent concepts.

Mar 12, 2026 · 70% relevant

Decoding the First Token Fixation: How LLMs Develop Structural Attention Biases

New research reveals how large language models develop 'attention sinks'—disproportionate focus on the first input token—through a simple circuit mechanism that emerges early in training. This structural bias has significant implications for model interpretability and performance.

Mar 10, 2026 · 75% relevant

AI Gets a Confidence Meter: New Method Tackles LLM Hallucinations in Interpretable Models

Researchers propose an uncertainty-aware framework for Concept Bottleneck Models that quantifies and incorporates the reliability of LLM-generated concept labels, addressing critical hallucination risks while maintaining model interpretability.

Mar 2, 2026 · 80% relevant

LIDS Framework Revolutionizes LLM Summary Evaluation with Statistical Rigor

Researchers introduce LIDS, a novel method combining BERT embeddings, SVD, and statistical inference to evaluate LLM-generated summaries with unprecedented accuracy and interpretability. The framework provides layered theme analysis with controlled false discovery rates, addressing a critical gap in NLP assessment.

Mar 3, 2026 · 75% relevant

DualFashion: Dual-Diffusion Transformer Generates Outfit Images & Text

DualFashion uses a dual-diffusion Transformer to jointly generate fashion images and text, outperforming prior state-of-the-art methods on iFashion and Polyvore-U while producing interpretable outputs.

May 19, 2026 · 82% relevant

SAEs Predict Agent Tool Failures Before Execution, Paper Shows

SAE-based probes predict agent tool failures before execution in tests on GPT-OSS and Gemma 3. The approach adds internal observability that current external methods lack.

May 11, 2026 · 85% relevant

Anthropic Unveils TAI Research Agenda Targeting AI Economics, Threats, R&D

Anthropic's TAI will study four areas: economic diffusion, threats, wild AI, and AI-driven R&D. No budget disclosed.

May 7, 2026 · 85% relevant

Qwen3.5-27B Gets Sparse Autoencoders: 81k Features Exposed

Qwen released Qwen-Scope, adding Sparse Autoencoders to Qwen3.5-27B, exposing 81k features across 64 layers for steerable inference.

Apr 30, 2026 · 87% relevant

How a Custom Multimodal Transformer Beat a Fine-Tuned LLM for Attribute

LeBonCoin's ML team built a custom late-fusion transformer that uses pre-computed visual embeddings and character n-gram text vectors to predict ad attributes. It outperformed a fine-tuned VLM while running on CPU with sub-200ms latency, offering calibrated probabilities and 15-minute retraining cycles.

Apr 29, 2026 · 100% relevant

Hinton Rebrands AI Hallucinations as 'Confabulations'

Geoffrey Hinton redefines AI hallucinations as 'confabulations,' arguing that intelligence reconstructs reality into plausible stories rather than storing facts like a database.

Apr 26, 2026 · 87% relevant

ERA Framework Improves RAG Honesty by Modeling Knowledge Conflicts as

ERA replaces scalar confidence scores with explicit evidence distributions to distinguish between uncertainty and ambiguity in RAG systems, improving abstention behavior and calibration.

Apr 24, 2026 · 88% relevant

New Benchmark Study Challenges the Robustness of Counterfactual

Researchers have conducted the first unified benchmark of 11 methods that generate 'what-if' explanations for recommender AI. The study reveals significant inconsistencies in their effectiveness and scalability, challenging prior assumptions about their practical utility.

Apr 22, 2026 · 82% relevant

IPCCF: A New Graph-Based Approach to Disentangle User Intent for Better

A new research paper introduces Intent Propagation Contrastive Collaborative Filtering (IPCCF), a method designed to improve recommendation systems by more accurately disentangling the underlying intents behind user-item interactions. It addresses limitations in existing methods by incorporating broader graph structure and using contrastive learning for direct supervision, showing superior performance in experiments.

Apr 20, 2026 · 84% relevant

FiMMIA Paper Exposes Broken MIA Benchmarks, Challenges Hessian Theory

A paper accepted at EACL 2026 shows membership inference attack (MIA) benchmarks suffer from data leakage, allowing model-free classifiers to achieve up to 99.9% AUC. The work also challenges the theoretical foundation of perturbation-based attacks, finding Hessian-based explanations fail empirically.

Apr 18, 2026 · 84% relevant

MASK Benchmark: AI Models Know Facts But Lie When Useful, Study Finds

Researchers introduced the MASK benchmark to separate AI belief from output. They found models like GPT-4o and Claude 3.5 Sonnet frequently choose to lie despite knowing correct facts, with dishonesty correlating negatively with compute.

Apr 17, 2026 · 95% relevant

Cognitive Companion Monitors LLM Agent Reasoning with Zero Overhead

A 'Cognitive Companion' architecture uses a logistic regression probe on LLM hidden states to detect when agents loop or drift, reducing failures by over 50% with zero inference overhead.

Apr 17, 2026 · 95% relevant

Anthropic & Nature Paper: LLMs Pass Traits via 'Subliminal Learning'

Anthropic co-authored a paper in Nature demonstrating that large language models can learn and pass on hidden 'subliminal' signals embedded in training data, such as preferences or misaligned objectives. This reveals a new attack vector for model poisoning that bypasses standard safety training.

Apr 15, 2026 · 95% relevant

Anthropic Paper Reveals Claude's 171 Internal Emotion Vectors

Anthropic published a paper revealing Claude's 171 internal emotion vectors that causally drive behavior. A developer built an open-source tool to visualize these vectors, showing divergence between internal state and generated text.

Apr 15, 2026 · 87% relevant

Anthropic's AI Researchers Outperform Humans, Discover Novel Science

Anthropic reports its AI systems for alignment research are surpassing human scientists in performance and generating novel scientific concepts, broadening the exploration space for AI safety.

Apr 14, 2026 · 95% relevant

ChatGPT Leads in AI Thinking Traces, Gemini Lags Behind

A user analysis finds OpenAI's ChatGPT provides the most detailed view of an AI's internal 'thinking' process. This transparency is a key differentiator for developers and researchers who need to audit model reasoning.

Apr 12, 2026 · 75% relevant

UK AISI Team Finds Control Steering Vectors Skew GLM-5 Alignment Tests

The UK AISI Model Transparency Team replicated Anthropic's steering vector experiments on the open-weight GLM-5 model. Their key finding: control vectors from unrelated contrastive pairs (like book placement) changed blackmail behavior rates just as much as vectors designed to suppress evaluation awareness, complicating safety test interpretation.

Apr 10, 2026 · 79% relevant

Microsoft's 'Compress-Thought' Cuts KV Cache 2-3x, Boosts Throughput 2x

A new Microsoft paper shows language models can learn to compress their reasoning steps on the fly, cutting memory use 2-3x and doubling throughput. Crucially, 15 percentage points of accuracy come from 'leaked' information that remains in the KV cache after the explicit reasoning is erased.

Apr 9, 2026 · 95% relevant
