AI Safety & Red Teaming
Ensure AI systems are safe, aligned, and robust. Red teaming, guardrails, interpretability, evaluation.
0
Open Positions
Core Skills
Safety Red-TeamingMechanistic InterpretabilityRLHFEvaluation FrameworksAlignmentGuardrailsAdversarial Testing
Active Positions (35)
Data Scientist, Safety Systemsmid
OpenAI·San Francisco
Safety Systems metricsStatistical methods for safety metricsSafety data developmentSafety dashboardsReal-world safety impact measurementHarm and abuse mitigation measurement
Research Scientist, Multimodal Alignment, Safety, and Fairnessmid
Google DeepMind·Kirkland, Washington, US; Mountain View, California, US; New York City, New York, US
multimodal alignmentsociotechnical modelingAI alignment researchfrontier AI safetyinterdisciplinary AI safetyAGI safety
[Expression of Interest] Research Scientist / Engineer, Alignment Finetuningmid
Anthropic·San Francisco, CA
Alignment finetuning techniquesSynthetic Data GenerationHonesty evaluation frameworksHarmlessness evaluationCharacter evaluationMoral reasoning training
ML/Research Engineer, Safeguardsmid
Anthropic·San Francisco, CA | New York City, NY
misuse detection classifierssynthetic data pipelines for training classifierscoordinated cyber attack detectioninfluence operations monitoringagentic product safetythreat models for agentic risks
Red Team Engineer, SafeguardsmidRemote
Anthropic·Remote-Friendly (Travel-Required) | San Francisco, CA | Washington, DC
Red Teaming for AI SafetyAdversarial Testing of AI SystemsAgent System TestingTool Use ExploitationFull Kill Chain Attack SimulationAutomated Testing Frameworks
Research Engineer, Frontier Red Team (Autonomy)mid
Anthropic·San Francisco, CA
Frontier Red TeamingAutonomous AI System DefenseCyberphysical Capability SafetyModel Organism Development for Autonomous SystemsDefensive Agent DevelopmentProject Vend
Research Engineer, Interpretabilitymid
Anthropic·San Francisco, CA
InterpretabilityMechanistic understandingMonosemantic FeaturesCircuitsTransformer CircuitsNeuroscience of neural networks
Research Engineer / Scientist, Frontier Red Team (Cyber)mid
Anthropic·San Francisco, CA
Frontier Red TeamCyberphysical CapabilitiesAI-enabled Cyber ThreatsZero DaysExploitsCybersecurity Domains
Research Engineer / Scientist, Societal Impactsmid
Anthropic·San Francisco, CA
Societal ImpactsPrivacy-preserving ToolsMixed-methods StudiesHuman-AI InteractionSocio-technical AlignmentEmpirical Methods
Research Scientist, Frontier Red Team (Emerging Risks)mid
Anthropic·San Francisco, CA
Frontier Red TeamingCyber-Physical AI SystemsSelf-Improving AIAI for CyberdefenseAI Robotics SafetyProject Vend
Research Scientist, Societal Impactsmid
Anthropic·San Francisco, CA
ClioConstitutional AISocietal Impacts ResearchModel Behavior EvaluationFine-TuningInterpretability
Software Engineer, Beneficial Deployments mid
Anthropic·San Francisco, CA | New York City, NY
MCP (Model Context Protocol) IntegrationsSelf-Serve Onboarding Systems for Global MarketsAccessibility systems for AIMultilingual AI CapabilitiesSocial impact measurement systemsVerification systems for nonprofits
Software Engineer, Account Abusemid
Anthropic·San Francisco, CA
Account abuse detection systemsSignal analysis at scaleThird-party data-enrichment vendor integrationMulti-layered defense systemsAbuse pattern identificationAttack Vector Discovery
Research Engineer / Scientist, Alignment Sciencemid
Anthropic·San Francisco, CA
Alignment ScienceScalable OversightInterpretabilityFine-TuningFrontier Red TeamResponsible Scaling Policy
Researcher, Automated Red Teamingmid
OpenAI·San Francisco
Automated Red Teaming (ART)Classifier Jailbreak DiscoveryCatastrophic Risk AssessmentSafety SystemsPreparedness FrameworkAutomated Bio Risk Discovery
Software Engineer - Human Alignment, Consumer Devicesmid
OpenAI·San Francisco
Multimodal AI evaluation frameworksHuman feedback toolingLong-term memory systemsUser modeling for personalizationHuman-aligned AI systemsProduct-grounded research infrastructure
Research Engineer/Scientist - Human Alignment, Consumer Devicesmid
OpenAI·San Francisco
RLHFreward modelingpreference learninglong-horizon evaluationpolicy improvementpost-training
Research Scientist, Gemini Safetymid
Google DeepMind·Mountain View, California, US
post-traininginstruction tuningmultimodal safetymodel fairnesstext-to-text tuningimage-to-text tuning
Senior Psychologist or Sociologist - AI Psychology & Safety senior
Google DeepMind·Mountain View, California, US
AI psychologydevelopmental psychology for AI safetytechnical safety protocols for AIcognitive principles for AIbehavioral analysis for AIsociotechnical AI safety
Staff Machine Learning Research Scientist, LLM Evalsstaff
Scale AI·San Francisco, CA; Seattle, WA; New York, NY
LLM Evaluation MethodologiesLLM BenchmarkingInstruction Following EvaluationFactuality EvaluationRobustness EvaluationFairness Evaluation
Tech Lead/Manager, Machine Learning Research Scientist- LLM Evalssenior
Scale AI·San Francisco, CA; Seattle, WA; New York, NY
LLM Evaluation MethodologiesLLM Evaluation FrameworksInstruction Following EvaluationFactuality EvaluationRobustness EvaluationFairness Evaluation
Model Behavior Tutor - Style, Taste & AestheticsmidRemote
xAI·Remote
AI Stylistic Excellence EvaluationAI Voice Consistency AssessmentAesthetic Impact Assessment in AI OutputsTraining Data Curation for High-Quality WritingElimination of Clichés in AI OutputsElimination of Corporate-Speak in AI Outputs
Software Engineer, Safeguards Infrastructure mid
Anthropic·London, UK
safety infrastructureagentic review toolingmetric and evaluation systemsmulti-layered defensesreal-time safety mechanisms
Research Engineer / Scientist, Alignment Science - Londonmid
Anthropic·London, UK
AI ControlAdversarial ScenariosHarmlessnessHelpfulnessHonestyAlignment Science
Software Engineer, Sandboxingmid
Anthropic·San Francisco, CA | New York City, NY
sandboxing infrastructurecode execution securityclient-side sandboxing APIdeveloper experience for sandboxingsecure external system interaction
Member of Technical Staff - Model Evaluationstaff
xAI·Palo Alto, CA
SGlangvLLMModel evaluation frameworksIn-house benchmarkingPublic benchmarkingModel assessment
Biological Safety Research Scientistmid
Anthropic·San Francisco, CA | New York City, NY
capability evaluations (evals)threat modelingadversarial attacksfalse-positive rate optimizationsafety system stress-testingdual-use biological knowledge
Model Behavior Tutor - Wit & ConversationmidRemote
xAI·Remote
Humor Effectiveness Evaluation in AIConversational Naturalness AssessmentAI Personality Trait ConsistencyTraining Dataset Creation for Humor and IronyTraining Dataset Creation for Cultural ReferencesAI Engagement Maximization Techniques
Software Engineer, Safeguardsmid
Anthropic·San Francisco, CA | New York City, NY
abuse detectionmisuse preventionmodel behavior monitoringautomated enforcementsafety dashboardsintegrity systems
Applied Safety Research Engineer, Safeguardsmid
Anthropic·San Francisco, CA
Safety EvaluationsEvaluation FrameworksMulti-turn Conversation AnalysisTool-Augmented Model SafetyLong Context Safety AnalysisUser Diversity Simulation
[Expression of Interest] Research Scientist / Engineer, Honestymid
Anthropic·New York City, NY; San Francisco, CA
Hallucination Mitigation StrategiesTruthfulness enhancementData Curation Pipelines for AccuracySpecialized classifiers for hallucinationsHonesty benchmarksRetrieval-Augmented Generation (RAG) systems
Model Behavior Tutor - Social Cognition & EQmidRemote
xAI·Remote
Emotional Subtext Detection in AISocial Context InterpretationUser Intent DetectionEmotional Response CalibrationDetection of Emotionally Tone-Deaf AI ResponsesDataset Construction for Dark-Pattern Social Tactics
Model Behavior Tutor - Epistemic Rigor & TruthfulnessmidRemote
xAI·Remote
Model Output Assessment for Factual AccuracyModel Output Assessment for Logical CoherenceDetection of Ideological CaptureDetection of Statistical FallaciesDetection of Rhetorical Sleights of HandAdversarial Example Construction for Epistemic Weaknesses
Anthropic AI Safety FellowmidRemote
Anthropic·London, UK; Ontario, CAN; Remote-Friendly, United States; San Francisco, CA
AI Safety ResearchEmpirical AI ProjectsPublic Research OutputsExternal Compute Infrastructure
Research Scientist, Safety and Alignment for Humanoid Roboticsmid
Google DeepMind·New York City, New York, US
humanoid robotics safetyVLA modelsagentic reasoningHuman-Robot Interaction (HRI)embodiment understandingactuator fault responses