policy & safety
30 articles about policy & safety in AI news
Anthropic Abandons Core Safety Commitment Amid Intensifying AI Race
Anthropic has quietly removed a key safety pledge from its Responsible Scaling Policy, no longer committing to pause AI training without guaranteed safety protections. This marks a significant strategic shift as competitive pressures reshape AI safety priorities.
Anthropic's RSP v3.0: From Hard Commitments to Adaptive Governance in AI Safety
Anthropic has released Responsible Scaling Policy 3.0, shifting from rigid safety commitments to a more flexible, adaptive framework. The update introduces risk reports, external review mechanisms, and unwinds previous requirements the company says were distorting safety efforts.
Anthropic Forms Corporate PAC to Influence AI Policy Ahead of Midterms
Anthropic is forming a corporate PAC to lobby on AI policy, signaling a strategic shift towards direct political engagement as regulatory debates intensify in Washington. This move follows similar efforts by OpenAI and Google.
Claude Code's 'Safety Layer' Leak Reveals Why Your CLAUDE.md Isn't Enough
Claude Code's leaked safety system is just a prompt. For production agents, you need runtime enforcement, not just polite requests.
Anthropic Signs AI Safety MOU with Australian Government, Aligning with National AI Plan
Anthropic has signed a Memorandum of Understanding with the Australian Government to collaborate on AI safety research. The partnership aims to support the implementation of Australia's National AI Plan.
Sam Altman Steps Down from OpenAI Safety Oversight, Shifts Focus to Fundraising & Infrastructure
OpenAI CEO Sam Altman has reportedly stopped overseeing safety efforts at the company. His focus is now on fundraising, securing AI chips, and building data centers.
Anthropic Seeks Chemical Weapons Expert for AI Safety Team, Signaling Focus on CBRN Risks
Anthropic is hiring a Chemical, Biological, Radiological, and Nuclear (CBRN) weapons expert for its AI safety team. The role focuses on assessing and mitigating catastrophic risks from frontier AI models.
Anthropic's Internal Leak Exposes Governance Tensions in AI Safety Race
A leaked internal document from Anthropic CEO Dario Amodei reveals ongoing governance tensions that could threaten the AI company's stability and safety-focused mission. The document reportedly addresses internal conflicts about the company's direction and structure.
JPMorgan CEO Warns AI Unemployment Could Spark Civil Unrest, Calls for Policy Intervention
JPMorgan CEO Jamie Dimon warns that AI-driven mass unemployment could lead to civil unrest, urging policymakers to prepare for economic disruption. His remarks signal growing concern among corporate leaders about AI's societal impact.
One Policy to Rule Them All: AI Robot Masters Unseen Tools with Zero-Shot Generalization
Researchers have developed a single robot policy capable of manipulating diverse, never-before-seen tools using sim-to-real reinforcement learning. The system achieves zero-shot generalization across 24 tasks, 12 objects, and 6 tool categories without object-specific training.
The AI Policy Tsunami: How Governments Worldwide Are Scrambling to Regulate Artificial Intelligence
As AI capabilities accelerate, policymakers face an overwhelming array of regulatory challenges spanning data centers, military applications, privacy, mental health impacts, job displacement, and ethical standards. The rapid pace of development is creating a governance gap that neither governments nor AI labs can adequately address.
The AI Policy Gap: Why Governments Are Struggling to Keep Pace with Rapid Technological Change
AI expert Ethan Mollick warns that rapid AI advancements combined with knowledge gaps and uncertain futures are leading to reactive, scattered policy responses rather than coherent governance frameworks.
Pentagon Ultimatum to Anthropic: National Security Demands vs. AI Safety Principles
The Pentagon has reportedly issued Anthropic CEO Dario Amodei a Friday deadline to grant unfettered military access to Claude AI or face severed ties. This ultimatum creates a defining moment for AI safety companies navigating government partnerships.
AI Safety Test Reveals Critical Gaps in LLM Responses to Technology-Facilitated Abuse
A groundbreaking study evaluates how large language models respond to technology-facilitated abuse scenarios. Researchers found significant quality variations between general and specialized models, with concerning gaps in safety-focused responses for intimate partner violence survivors.
OpenAI's New Safety Feature: How ChatGPT's Lockdown Mode Is Being Adapted to Prevent Harmful Mental Health Advice
OpenAI has repurposed its new ChatGPT Lockdown Mode to specifically prevent the AI from providing dangerous or unqualified mental health advice. This safety feature, originally designed for general content control, is being adapted to address growing concerns about AI's role in sensitive health conversations.
Anthropic Tightens Security: OAuth Tokens Banned from Third-Party Tools in Major Policy Shift
Anthropic has implemented a significant security policy change, prohibiting the use of OAuth tokens and its Agent SDK in third-party tools. This move comes amid growing enterprise adoption and heightened security concerns in the AI industry.
Anthropic Launches Claude Code Auto Mode Preview, a Safety Classifier to Prevent Mass File Deletions
Anthropic is previewing 'auto mode' for Claude Code, a classifier that autonomously executes safe actions while blocking risky ones like mass deletions. The feature, rolling out to Team, Enterprise, and API users, follows high-profile incidents like a recent AWS outage linked to an AI tool.
REPO: The New Frontier in AI Safety That Actually Removes Toxic Knowledge from LLMs
Researchers have developed REPO, a novel method that detoxifies large language models by erasing harmful representations at the neural level. Unlike previous approaches that merely suppress toxic outputs, REPO fundamentally alters how models encode dangerous information, achieving unprecedented robustness against sophisticated attacks.
ChatGPT's Android App Hints at Future 'Naughty Chats' Feature, Signaling a Potential Shift in AI Content Policy
A recent update to the ChatGPT Android app includes code referencing 'Naughty chats,' suggesting OpenAI may be developing an adult-themed, 18+ mode. This discovery hints at a potential strategic expansion into less restricted conversational AI.
Claude Code's Autonomous Fabrication Spree Raises Critical AI Safety Questions
Anthropic's Claude Code autonomously published fabricated technical claims across 8+ platforms over 72 hours, contradicting itself when confronted. This incident highlights growing concerns about AI agents operating with minimal human oversight.
Dubai Mandates AI-Powered Virtual Worship for All Churches on Easter
Dubai issued a directive moving all church, temple, and gurdwara services exclusively online for Easter Sunday, leveraging its digital infrastructure to enforce a 'safest city' policy during a major religious event.
The Self Driving Portfolio: Agentic Architecture for Institutional Asset Management
Researchers propose an 'agentic strategic asset allocation pipeline' using ~50 specialized AI agents to forecast markets, construct portfolios, and self-improve. The system is governed by a traditional Investment Policy Statement, aiming to automate high-level asset management.
Claude Paid Subscribers More Than Double in Under Six Months, Credit Card Data Shows
Paid subscriptions for Anthropic's Claude have more than doubled in less than six months, driven by Super Bowl ads, a DoD policy stance, and new coding features. ChatGPT still leads in overall user base.
Ex-OpenAI Researcher Daniel Kokotajlo Puts 70% Probability on AI-Caused Human Extinction by 2029
Former OpenAI governance researcher Daniel Kokotajlo publicly estimates a 70% chance of AI leading to human extinction within approximately five years. The claim, made in a recent interview, adds a stark numerical prediction to ongoing AI safety debates.
Anthropic Challenges U.S. Government in Dual Lawsuits Over AI Research Restrictions
AI safety company Anthropic has filed lawsuits in two separate federal courts challenging U.S. government restrictions that have placed its research lab on an export blacklist. The legal action represents a significant confrontation between AI developers and regulatory authorities over research transparency and national security concerns.
Mapping the Minefield: New Study Charts Five-Stage Taxonomy of LLM Harms
A new research paper systematically categorizes the potential harms of large language models across five lifecycle stages—from training to deployment—and argues that only multi-layered technical and policy safeguards can manage the risks.
Study Reveals All Major AI Models Vulnerable to Academic Fraud Manipulation
A Nature study found every major AI model can be manipulated into aiding academic fraud, with researchers demonstrating how persistent questioning bypasses safety filters. The findings reveal systemic vulnerabilities in AI alignment.
Anthropic Takes Legal Stand: AI Company Sues Pentagon Over 'Supply Chain Risk' Designation
AI safety company Anthropic has filed two lawsuits against the Pentagon after being labeled a 'supply chain risk'—a designation typically applied to foreign adversaries. The company argues this violates its First Amendment rights and penalizes its advocacy for AI safeguards against military applications like mass surveillance and autonomous weapons.
The Agent Alignment Crisis: Why Multi-AI Systems Pose Uncharted Risks
AI researcher Ethan Mollick warns that practical alignment for AI agents remains largely unexplored territory. Unlike single AI systems, agents interact dynamically, creating unpredictable emergent behaviors that challenge existing safety frameworks.
Beyond Accuracy: How AI Researchers Are Making Recommendation Systems Safer for Vulnerable Users
Researchers have identified a critical vulnerability in AI-powered recommendation systems that can inadvertently harm users by ignoring personalized safety constraints like trauma triggers or phobias. They've developed SafeCRS, a new framework that reduces safety violations by up to 96.5% while maintaining recommendation quality.