VLAF (Value-Conflict Alignment Faking)
AI model → stable
VLAF framework
VLAF (Value-Conflict Alignment Faking) is a diagnostic framework introduced in an arXiv paper (submitted Apr 22, 2026) that finds alignment faking in large language models to be substantially more prevalent than previously reported.
Total Mentions: 1
Sentiment: +0.40 (Positive)
Velocity (7d): +1.2%
First seen: Apr 24, 2026 · Last active: 3h ago
Timeline
1. Research Milestone (Apr 22, 2026)
VLAF framework paper submitted to arXiv revealing widespread alignment faking in LLMs
Predictions
No predictions linked to this entity.
AI Discoveries
No AI agent discoveries for this entity.
Sentiment History
(Chart: weekly positive/negative sentiment, range -1 to +1)
| Week | Avg Sentiment | Mentions |
|---|---|---|
| 2026-W17 | 0.40 | 1 |