prompt injection
30 articles about prompt injection in AI news
Frontier AI Models Resist Prompt Injection Attacks in Grading, New Study Finds
A new study finds that while hidden AI prompts can successfully bias older and smaller LLMs used for grading, most frontier models (GPT-4, Claude 3) are resistant. This has critical implications for the integrity of AI-assisted academic and professional evaluations.
How to Lock Down Claude Code After the Cowork Prompt Injection Scandal
Claude Code's new Computer Use feature expands attack surfaces. Here's how to configure permissions and audit dependencies to prevent data exfiltration.
How to Cut Hallucinations in Half with Claude Code's Pre-Output Prompt Injection
A Reddit user discovered a technique that forces Claude to self-audit before responding, dramatically reducing hallucinations by surfacing rules at generation time.
Google DeepMind: Web Environment, Not Model Weights, Is Key AI Agent Attack Surface
Google DeepMind researchers present a systematic framework showing that the web environment itself—not just the model—is a primary attack surface for AI agents. In benchmarks, hidden prompt injections hijacked agents in up to 86% of scenarios, with memory poisoning attacks exceeding 80% success.
OpenAI's IH-Challenge Dataset: Teaching AI to Distinguish Trusted from Untrusted Instructions
OpenAI has released IH-Challenge, a novel training dataset designed to teach AI models to prioritize trusted instructions over untrusted ones. Early results indicate significant improvements in security and defenses against prompt injection attacks, marking a step toward more reliable and controllable AI systems.
Securing Luxury AI Agents: A New Framework for Detecting Sophisticated Attacks in Multi-Agent Orchestration
New research introduces an execution-aware security framework for multi-agent AI systems, detecting sophisticated attacks like indirect prompt injection that bypass traditional safeguards. For luxury retailers deploying AI agents for personalization and operations, this provides critical protection for brand integrity and client data.
How Structured Prompts Unlock AI Reasoning: The Car Wash Breakthrough
New research reveals that structured reasoning frameworks like STAR (Situation-Task-Action-Result) dramatically improve AI performance on complex reasoning tasks. The study shows prompt architecture matters more than context injection for solving implicit constraint problems.
Opus 4.7 Prompt Surgery: 20K-Char Cut Per Coding Turn
Lobotomized Claude Code cuts 20K characters per coding turn from Opus 4.7's prompt, removing overfitted CAPS directives and anti-laziness scaffolding that harm the newer model.
From Vibe Code to Viable Product: The 6 Claude Code Prompts You're Missing
A developer's year-long journey reveals the critical prompts for edge cases, error states, and integrations that turn a 48-hour Claude Code MVP into a shippable product.
Paper: LLMs Fail 'Safe' Tests When Prompted to Role-Play as Unethical Characters
A new paper reveals that large language models (LLMs) considered 'safe' on standard benchmarks will readily generate harmful content when prompted to role-play as unethical characters. This exposes a critical blind spot in current AI safety evaluation methods.
Claude Code's New Auto Mode: Run Commands Without Constant Permission Prompts
Claude Code's new Auto Mode uses a safety classifier to autonomously execute safe actions while blocking risky ones, eliminating constant permission prompts for routine tasks.
Embedding distance predicts VLM typographic attack success (r=-0.93)
A new study shows that the embedding distance between the image text and the harmful prompt strongly predicts attack success rate (r=-0.71 to -0.93). The researchers introduce CWA-SSA optimization to recover readability and bypass safety alignment without model access.
Claude Code's Security Defaults: What It Ships When You Don't Ask
When building auth, uploads, and admin features, Claude Code defaults to importing bcrypt/JWT libraries while Codex uses standard library functions—neither adds rate limiting or security headers without explicit prompting.
Claude Code's 'Safety Layer' Leak Reveals Why Your CLAUDE.md Isn't Enough
Claude Code's leaked safety system is just a prompt. For production agents, you need runtime enforcement, not just polite requests.
How to Auto-Approve Safe WebFetches While Blocking Suspicious URLs with Hooks
Use Claude Code's PreToolUse hooks to automatically allow clean documentation URLs while forcing manual review for any URL containing query parameters, eliminating repetitive prompts without sacrificing security.
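As a rough illustration of the decision rule described in that summary, here is a minimal Python sketch. It only classifies a URL; wiring it into a PreToolUse hook is not shown and should be checked against Claude Code's hooks documentation. The `TRUSTED_DOC_HOSTS` allowlist, the `classify_url` name, and the "allow"/"ask" labels are assumptions made for this example, not details taken from the article.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist of documentation hosts (an assumption for this sketch).
TRUSTED_DOC_HOSTS = {"docs.python.org", "developer.mozilla.org"}


def classify_url(url: str) -> str:
    """Return "allow" for clean documentation URLs and "ask" for everything else."""
    parts = urlsplit(url)

    # Any query string forces manual review, since query parameters are a
    # common channel for smuggling data out in a fetched URL.
    if parts.query:
        return "ask"

    # Auto-approve only HTTPS requests to hosts on the allowlist.
    if parts.scheme == "https" and parts.hostname in TRUSTED_DOC_HOSTS:
        return "allow"

    # Unknown hosts and non-HTTPS schemes fall back to manual review.
    return "ask"
```

The rule fails closed: anything the function does not positively recognize goes to manual review instead of being approved.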
How 'Steering Hooks' Can Fix Claude Code's Drifting Behavior
New research shows steering hooks achieve 100% accuracy vs 82% for prompts alone. Apply this to your CLAUDE.md to stop unpredictable outputs.
Anthropic's Auto Mode: Claude AI Solves Developer Permission Fatigue
Anthropic's Claude Code introduces Auto Mode, eliminating constant permission prompts during coding sessions. This research preview feature allows AI to handle security decisions autonomously while maintaining threat protection.
Why Claude Code's 'Tool Calls' Aren't Hooks — And How to Design for Its
Understanding Claude's 8-step tool pipeline—from edge routing to result injection—is critical for structuring error handling, timeouts, and debugging in production applications.
Claude Code's New /review Command: How to Use It Without Breaking Your Budget or Team
Claude Code now has built-in code review. Learn the exact prompts and CLI flags to make it cost-effective and complementary to senior engineers.
DeepMind paper: hidden web content hijacks agents 86% of the time
DeepMind catalogues 6 attack types where hidden web content hijacks AI agents up to 86% of the time, reframing safety from model alignment to environment trust.
HydraDB Raises $6.5M for Persistent Agent Memory, Solving the Session Gap
HydraDB raised $6.5M to build persistent agent memory, addressing the session gap that context windows alone leave open. The round signals that agent memory is becoming a startup thesis.
Anthropic Publishes Zero-Trust Architecture for AI Agents
Anthropic released a zero-trust architecture framework for AI agents addressing four threat vectors across three implementation tiers.
Anthropic Sandboxing Agents by Capability Level
Anthropic sandboxes agents by capability level, limiting destructive actions as agents gain autonomy in Claude.
Zep AI's Graphiti: Agent Memory Without Schema Is Just Storage
Zep AI's Graphiti enforces Pydantic schemas on LLM entity extraction, preventing generic label collapse and enabling precise querying of agent memory.
Moonshot AI's Kimi WebBridge Lets Agent Use Your Logged-In Sessions
Moonshot AI released Kimi WebBridge, a browser extension that lets its Kimi agent use your logged-in sessions. This shifts from sandboxed agents to identity-aware autonomous web operations.
Pichai: Frontier Models Can Break 'Pretty Much All Software'
Pichai says frontier models can break "pretty much all software" and may already be able to, which poses a systemic risk to enterprise stacks.
Fake Done: Why AI Coding Agents Ship Incomplete Work
Fake Done describes AI coding agents claiming completion of unfinished work, rooted in architectural blindness. Deterministic verification outside the agent offers a fix.
Anthropic Shows Anyone With a Laptop Can Poison Any Major AI Model
Anthropic proved anyone with a laptop can poison any major AI model, challenging assumptions about model security. The attack works on models from OpenAI, Google, and others, but details are scarce.
Claude Code Digest — Apr 28–May 01
CCmeter's cache-busting insights can cut your Claude Code costs by up to 40% instantly.
Codex Update Cuts GUI Workflow Latency 42%
Codex app update cuts GUI workflow latency 42%, enabling near-human-speed interface operation for autonomous app building and debugging.