AI Safety & Red Teaming
Ensure AI systems are safe, aligned, and robust. Red teaming, guardrails, interpretability, evaluation.
40 Open Positions
Core Skills
Safety Red-Teaming · Mechanistic Interpretability · RLHF · Evaluation Frameworks · Alignment · Guardrails · Adversarial Testing
Active Positions (40)
Anthropic Fellows Program — AI Safety (mid, Remote)
Anthropic · London, UK | Ontario, CAN | Remote-Friendly, United States | San Francisco, CA
Safeguards Policy Analyst, Fraud & Scams (mid, Remote)
Anthropic · Remote-Friendly (Travel-Required) | San Francisco, CA | New York City, NY
fraud typologies · scam ecosystems · threat modeling · policy design · classifier guidelines · integrity & authenticity enforcement
Forward Deployed Engineer (mid)
Labelbox · San Francisco Bay Area
Labelbox · data-centric AI development · Alignerr · frontier data labeling · annotation tools · workflow automation
Forward Deployed Engineer, RL Environments (mid)
Labelbox · San Francisco Bay Area
Labelbox · RL Environments · data-centric AI development · Alignerr · frontier data labeling · annotation tools
Software Engineer, Cloud Inference Safeguards (mid)
Anthropic · San Francisco, CA | Seattle, WA
real-time safeguards infrastructure · classifiers · rate limits · enforcement actions · intervention hooks · telemetry pipelines
Research Scientist, Frontier Risk Evaluations (mid)
Scale AI · San Francisco, CA | New York, NY
frontier risk evaluations · dangerous capability testing · security vulnerability exploitation · CBRN uplift evaluation · evaluation harnesses · technical report writing
Safeguards Enforcement Lead, Frontier Abuse Enforcement (senior)
Anthropic · San Francisco, CA | New York City, NY | Washington, DC
frontier model abuse enforcement · unauthorized model training detection · enforcement strategy development · detection signal development · intelligence fusion · legal team collaboration
Safeguards Analyst, Human Exploitation & Abuse (mid, Remote)
Anthropic · Remote-Friendly (Travel-Required) | San Francisco, CA | Washington, DC
automated enforcement systems · review workflows · detection signal tuning · evaluation datasets · harm escalation pathways · human exploitation detection
Research Scientist, AI Controls and Monitoring (mid)
Scale AI · San Francisco, CA | New York, NY
Agent Robustness · AI Control Protocols · AI Risk Evaluations · Benchmarking AI Agents · Red-Team Simulations · Multi-Agent Systems
Research Scientist, Gemini Safety (mid)
Google DeepMind · Zurich, Switzerland
Gemini Safety · fairness behavior · user-facing models · safety algorithms · cutting-edge solutions
Program Manager, Alignment (manager)
OpenAI · San Francisco
Alignment program management · recursive self-improvement (RSI) · alignment ecosystem · Safety Fellows program · external collaboration management · misalignment research
Data Scientist, Safety Systems (mid)
OpenAI · San Francisco
Safety Systems metrics · Statistical methods for safety metrics · Safety data development · Real-world safety impact measurement · Harm and abuse mitigation approaches · Safety-related dashboards
Model Behavior Tutor - Wit & Conversation (mid, Remote)
xAI · Remote
Humor effectiveness scoring · Conversational naturalness evaluation · Personality trait consistency · Cultural reference dataset creation · Irony and banter modeling · Engagement maximization
Research Engineer / Scientist, Alignment Science - London (mid)
Anthropic · London, UK
AI Control · Adversarial Scenarios · Harmlessness · Helpfulness · Honesty · Alignment Science
Research Engineer / Scientist, Societal Impacts (mid)
Anthropic · San Francisco, CA
Societal Impacts · Privacy-preserving Tools · Mixed-methods Studies · Human-AI Interaction · Socio-technical Alignment · Empirical Methods
Research Scientist, Societal Impacts (mid)
Anthropic · San Francisco, CA
Clio observational tool · real-world usage pattern analysis · Claude's Constitution · societal impacts research · fine-tuning team collaboration · safeguards team collaboration
Safeguards Analyst, Account Abuse (mid)
Anthropic · San Francisco, CA | New York City, NY
graph-based data infrastructure · account-linking signals · third-party vendor signals · behavioral indicators · enforcement tooling · scaled abuse detection
Software Engineer, Account Abuse (mid)
Anthropic · San Francisco, CA
account abuse detection systems · signal analysis at scale · third-party data-enrichment integration · abuse pattern identification · attack vector analysis · multi-layered defense systems
Research Engineer, Reward Models Platform (mid, Remote)
Anthropic · Remote-Friendly (Travel-Required) | San Francisco, CA | Seattle, WA | New York City, NY
Reward Models · Reward Hacks · Preference Models · Human Feedback Data · Rubric Methodologies · Reward Signals
Software Engineer, Safeguards (mid)
Anthropic · San Francisco, CA | New York City, NY
abuse detection mechanisms · misuse prevention systems · model behavior monitoring · automated enforcement actions · safety dashboards · multi-layered defenses
Software Engineer, Safeguards Infrastructure (mid)
Anthropic · London, UK
Safeguards infrastructure · agentic review tooling · metric and evaluation systems · data storage for safety · operational bar for safety systems · real-time safety mechanisms
Software Engineer, Sandboxing (mid)
Anthropic · San Francisco, CA | New York City, NY
sandboxing infrastructure · code execution safety · client-side sandboxing library · sandboxing API · developer experience for sandboxing · security for external system interaction
Engineering Manager, Safeguards Data Infrastructure (manager)
Anthropic · London, UK | New York City, NY
HIPAA compliance for AI · privacy-preserving data APIs · PII storage tooling · safeguards data stack portability · offline data stack for AI safety · sensitive data storage layer
ML/Research Engineer, Safeguards (mid)
Anthropic · San Francisco, CA | New York City, NY
misuse detection classifiers · synthetic data pipelines for training classifiers · coordinated cyber attack detection · influence operations monitoring · agentic product safety · threat models for agentic risks
Safeguards Enforcement Analyst, Safety Evaluations (mid, Remote)
Anthropic · Remote-Friendly (Travel-Required) | San Francisco, CA | Washington, DC | New York City, NY
safety evaluations · model launch readiness · evaluation monitoring · policy evolution tracking · threat vector evolution · model capability evolution
Model Behavior Tutor - Epistemic Rigor & Truthfulness (mid, Remote)
xAI · Remote
epistemic rigor · truthfulness · factual accuracy · logical coherence · fallacious reasoning · hidden assumptions
Model Behavior Tutor - Social Cognition & EQ (mid, Remote)
xAI · Remote
social cognition · EQ · emotional subtext · social context · user intent · emotional states
Model Behavior Tutor - Style, Taste & Aesthetics (mid, Remote)
xAI · Remote
style, taste & aesthetics · stylistic excellence · voice consistency · aesthetic impact · curate training data · high-quality writing
AI Research Engineer, Enterprise Evaluations (mid)
Scale AI · San Francisco, CA | New York, NY
GenAI Evaluation Suite · LLM-as-a-Judge autorater frameworks · RLAIF · model-judging-model setups · AI-assisted evaluation systems · human-rated datasets
Research Engineer, Frontier Red Team (Autonomy) (mid)
Anthropic · San Francisco, CA
Frontier Red Teaming · Autonomous AI Systems · Cyberphysical AI Safety · Self-Improving AI Defense · AI Model Organisms · Defensive Agent Development
Research Engineer / Scientist, Alignment Science (mid)
Anthropic · San Francisco, CA
Alignment Science · Scalable Oversight · Interpretability · Fine-Tuning · Frontier Red Team · Responsible Scaling Policy
Research Scientist, Frontier Red Team (Emerging Risks) (mid)
Anthropic · San Francisco, CA
Frontier Red Team · Emerging Risks · Cyber-Physical Capabilities · Self-Improving AI · Autonomous AI Systems · Societal Risks
Research Engineer / Scientist, Frontier Red Team (Cyber) (mid)
Anthropic · San Francisco, CA
Frontier Red Team · Cyberphysical Capabilities · AI-enabled Cyber Threats · Zero Days · Exploits · Cybersecurity Domains
Senior Research Scientist, Reward Models (senior, Remote)
Anthropic · Remote-Friendly (Travel-Required) | San Francisco, CA
reward modeling · RLHF · LLM-based evaluation · rubric-based grading methods · reward hacking mitigation · preference learning at scale
Certification Content & Systems Architect (mid)
Anthropic · San Francisco, CA | New York City, NY
certification curriculum architecture · assessment design · competency frameworks · AI-native assessment systems · adaptive assessments · item bank evolution
Staff Data Scientist - Trust and Safety (staff)
Databricks · San Francisco, California
fraud and abuse detection using ML · statistical techniques for security · machine learning for trust and safety · security and compliance data analysis · data-driven security program analysis · state-of-the-art fraud detection methods
Biological Safety Research Scientist (mid)
Anthropic · San Francisco, CA | New York City, NY
capability evaluations (evals) · threat modeling · adversarial attacks · false-positive rate optimization · safety system stress-testing · dual-use biological knowledge
Enforcement Operations Lead (senior)
Anthropic · San Francisco, CA | New York City, NY | Washington, DC
Safety Evaluations · Content Moderation · Vendor Operations Management · Policy Enforcement · Model Safety Standards · Mitigation Strategies
Member of Technical Staff - Model Evaluation (staff)
xAI · Palo Alto, CA
SGLang · vLLM · Model evaluation frameworks · In-house benchmarking · Public benchmarking · Model assessment
Offensive Security Research Engineer, Safeguards (mid)
Anthropic · San Francisco, CA
vulnerability research · exploitation · remediation · reverse engineering · network security · LLM misuse research