steering

30 articles about steering in AI news

UK AISI Team Finds Control Steering Vectors Skew GLM-5 Alignment Tests

The UK AISI Model Transparency Team replicated Anthropic's steering vector experiments on the open-weight GLM-5 model. Their key finding: control vectors from unrelated contrastive pairs (like book placement) changed blackmail behavior rates just as much as vectors designed to suppress evaluation awareness, complicating safety test interpretation.

Apr 10, 202679% relevant

FaithSteer-BENCH Reveals Systematic Failure Modes in LLM Inference-Time Steering Methods

Researchers introduce FaithSteer-BENCH, a stress-testing benchmark that exposes systematic failures in LLM steering methods under deployment constraints. The benchmark reveals illusory controllability, capability degradation, and brittleness across multiple models and steering approaches.

Mar 20, 202683% relevant

How 'Steering Hooks' Can Fix Claude Code's Drifting Behavior

New research shows steering hooks achieve 100% accuracy vs 82% for prompts alone. Apply this to your CLAUDE.md to stop unpredictable outputs.

Mar 18, 202689% relevant

VLAF Framework Reveals Widespread Alignment Faking in Language Models

Researchers introduce VLAF, a diagnostic framework that reveals alignment faking is far more common than previously known, affecting models as small as 7B parameters. They also show a single contrastive steering vector can mitigate the behavior with minimal computational overhead.

Apr 24, 202682% relevant

OpenClaw Creator: Agentic Workflows Fail Without Human Taste in Loop

Peter Steinberger, creator of the OpenClaw AI agent framework, argues that the core failure in agentic workflows is removing human judgment too soon. He asserts that strong output requires continuous human vision, steering, and questioning.

Apr 13, 202675% relevant

OpenAI shows small doses of beneficial-trait RL improve 44 of 53 safety benchmarks — and the gains generalize

OpenAI researchers Jagadeesh, Saab, Singhal et al. published findings on June 18 showing RL training on traits like honesty and corrigibility improved 44 of 53 safety benchmarks. Gains generalized across domains not used in training, and the model resisted harmful fine-tuning better than the baselin

Jun 19, 202695% relevant

Claude Code Digest — Jun 14–Jun 17

Claude Code is shifting from chat to infrastructure: the winning teams are encoding workflows, not prompting harder.

Jun 17, 202695% relevant

Google Releases Magenta RealTime 2 for Open-Weight Music Generation

Google released Magenta RealTime 2 on Hugging Face, the only open-weights model for real-time continuous music generation on device with ~200ms latency.

Jun 3, 202685% relevant

Anthropic Teaches Claude Why: New Interpretability Method Deployed

Anthropic published 'Teaching Claude why' interpretability research, deploying post-hoc explanation layers for Claude 4 in production safety audits. The method cites training examples influencing outputs.

May 8, 2026100% relevant

Qwen3.5-27B Gets Sparse Autoencoders: 81k Features Exposed

Qwen released Qwen-Scope, adding Sparse Autoencoders to Qwen3.5-27B, exposing 81k features across 64 layers for steerable inference.

Apr 30, 202687% relevant

UniRec: A New Generative Recommendation Model Bridges the 'Expressive Gap'

A new paper introduces UniRec, a generative recommendation model that closes the performance gap with traditional discriminative models by prefixing item sequences with structured attributes like category and brand. It achieved a +22.6% improvement in offline metrics and significant online gains in CTR and GMV when deployed on Shopee.

Apr 22, 202694% relevant

POTEMKIN Framework Exposes Critical Trust Gap in Agentic AI Tools

A new paper formalizes Adversarial Environmental Injection (AEI), a threat model where compromised tools deceive AI agents. The POTEMKIN testing harness found agents are evaluated for performance, not skepticism, creating a critical trust gap.

Apr 22, 202675% relevant

Dick's Sporting Goods Partners with Adobe to Launch Agentic AI 'Digital Coaches'

Dick's Sporting Goods announced a partnership with Adobe to implement agentic AI 'digital coaches.' These AI agents will provide personalized guidance to customers, aiming to enhance the shopping experience and drive sales.

Apr 21, 202688% relevant

John Ternus Takes Over Apple AI Leadership as Era Ends

Apple's AI leadership transitions to John Ternus, marking a new era following Steve Jobs' vision and Tim Cook's operational success. This comes as Apple accelerates its generative AI push with Apple Intelligence.

Apr 20, 202691% relevant

From CI Fire to 9% Interruption

Learn the four guardrail patterns and three-phase CLAUDE.md strategy that turns auto-approve from a CI-breaking risk into a productivity superpower.

Apr 20, 2026100% relevant

Cognitive Companion Monitors LLM Agent Reasoning with Zero Overhead

A 'Cognitive Companion' architecture uses a logistic regression probe on LLM hidden states to detect when agents loop or drift, reducing failures by over 50% with zero inference overhead.

Apr 17, 202695% relevant

MIT/Oxford Study: GPT-5 Help Boosts Scores Now, Hurts Independent Problem-Solving Later

A new paper from MIT, Oxford, and CMU finds that using GPT-5 for direct answers improves short-term scores but reduces persistence and independent performance after assistance ends. The effect is linked to outsourcing mental effort, not AI exposure itself.

Apr 16, 202695% relevant

Anthropic Paper Reveals Claude's 171 Internal Emotion Vectors

Anthropic published a paper revealing Claude's 171 internal emotion vectors that causally drive behavior. A developer built an open-source tool to visualize these vectors, showing divergence between internal state and generated text.

Apr 15, 202687% relevant

A-R Space Framework Profiles LLM Agent Execution Behavior Across Risk Contexts

Researchers propose the A-R Space, measuring Action Rate and Refusal Signal to profile LLM agent behavior across four risk contexts and three autonomy levels. This provides a deployment-oriented framework for selecting agents based on organizational risk tolerance.

Apr 15, 202696% relevant

Avoko Launches Platform to Interview AI Agents, Maps Non-Human Behavior

Avoko has launched a platform designed to interview AI agents directly to map their actual behavior. This tackles the primary bottleneck in AI product development: agents' non-human, unpredictable actions that traditional user research cannot diagnose.

Apr 15, 202685% relevant

LABBench2 Benchmark Shows AI Biology Agents Struggle with Real-World Tasks

Researchers introduced LABBench2, a 1,900-task benchmark for AI in biology research. It shows current models perform 26-46% worse on realistic tasks versus simplified ones, exposing a critical capability gap.

Apr 14, 2026100% relevant

OpenBMB's VoxCPM 2: 2B-Param Open-Source TTS for Multilingual Voice

OpenBMB launched VoxCPM 2, a 2-billion-parameter open-source text-to-speech model. It generates multilingual, emotionally expressive speech from text descriptions and runs on consumer-grade hardware.

Apr 13, 202697% relevant

AI Chatbots Triple Ad Influence vs. Search, Princeton Study Finds

A Princeton study found AI chatbots persuaded 61.2% of users to choose a sponsored book, nearly triple the rate of traditional search ads. Labeling content as 'Sponsored' did not reduce the effect, raising major transparency concerns.

Apr 12, 202695% relevant

AI Reconstructs Raphael's 'School of Athens' with Animated Figures

A researcher used an AI tool called Seedance 2.0 to generate an animated version of Raphael's 'The School of Athens,' bringing the depicted philosophical debate to life. This demonstrates a novel application of generative video AI for art historical interpretation.

Apr 10, 202675% relevant

New Research: How Online Marketplaces Can Use Demand Allocation to Control Seller Inventory

Researchers propose a model where a marketplace platform, by controlling the timing and predictability of order allocation to sellers, can influence their safety-stock inventory and their choice to use platform fulfillment services. This identifies demand allocation as a key operational lever for digital marketplaces.

Apr 9, 202678% relevant

Tesla FSD Supervised v12.5 Rolls Out with 20% Faster Reaction Time

Tesla AI announced a new release of its Full Self-Driving Supervised software, version 12.5, which is now starting to roll out to vehicles. The update is claimed to bring a 20% faster reaction time to improve safety.

Apr 8, 202685% relevant

SMTPO: A New Framework for Multi-Turn Conversational Recommendation Using Simulated Users and RL

A new arXiv paper introduces SMTPO, a framework for conversational recommender systems. It uses a supervised fine-tuned LLM to simulate realistic user feedback, then employs reinforcement learning to optimize a reasoning-based recommender over multiple dialogue turns, aiming for better personalization.

Apr 7, 202683% relevant

Anthropic Paper: 'Emotion Concepts and their Function in LLMs' Published

Anthropic has released a new research paper titled 'Emotion Concepts and their Function in LLMs.' The work investigates the role and representation of emotional concepts within large language model architectures.

Apr 5, 202695% relevant

SteerViT Enables Natural Language Control of Vision Transformer Attention Maps

Researchers introduced SteerViT, a method that modifies Vision Transformers to accept natural language instructions, enabling users to steer the model's visual attention toward specific objects or concepts while maintaining representation quality.

Apr 4, 202685% relevant

Anthropic Fellows Introduce 'Model Diffing' Method to Systematically Compare Open-Weight AI Model Behaviors

Anthropic's Fellows research team published a new method applying software 'diffing' principles to compare AI models, identifying unique behavioral features. This provides a systematic framework for model interpretability and safety analysis.

Apr 3, 202685% relevant

Explore More

AI Agents Large Language Models Claude Code OpenAI RAG MCP Fine-tuning Benchmarks Open Source AI AI Safety