ML Infrastructure & Training
Build the infrastructure for training and serving models at scale. GPU clusters, distributed systems, MLOps.
52 Open Positions
Core Skills
Distributed Training, GPU Clusters, CUDA Kernels, Ray, Kubernetes, NCCL, Triton, Model Serving
Active Positions (50)
Research Engineer, Infrastructure · mid
Cognition (Devin) · San Francisco Bay Area
Distributed training infrastructure, GPU cluster management, Training job orchestration, Checkpointing and recovery systems, Agent rollout infrastructure, VM sandbox environments
Infrastructure Engineer, Pre-training · mid
Anthropic · San Francisco, CA
Large language model training, Tokenization, Deduplication, Data quality assurance, Distributed computing systems, Web-scale datasets
Staff Software Engineer, Applied Training · staff
CoreWeave · New York, NY / Sunnyvale, CA / Bellevue, WA
Kubernetes-native research cluster platform, agentic training, sandbox client for evaluation
Sr. Director of Product, Research and Training Infrastructure · senior
CoreWeave · Livingston, NJ / New York, NY / Sunnyvale, CA / Bellevue, WA
Research Training Stack, high-performance computing (HPC), cloud-native agility, massive-scale pre-training, post-training orchestration
Senior MLOps Engineer - Data Ingestion - Paris · senior
Doctolib · Paris, Paris, France
MLOps, ML pipelines, data pseudo-anonymization, MLflow, Braintrust, golden datasets
Anthropic Fellows Program — ML Systems & Performance · mid · Remote
Anthropic · London, UK; Ontario, CAN; Remote-Friendly, United States; San Francisco, CA
Customer Support Engineer (GPU Cluster) · mid
Together AI · San Francisco
Kubernetes GPU clusters
Director, Data Center Operations · director
Together AI · San Francisco
white space deployments, power distribution (PDUs), cooling distribution (CDUs), high-density GPU workloads
Senior Software Engineer II, Applied Training · senior
CoreWeave · New York, NY / Sunnyvale, CA / Bellevue, WA
Kubernetes-native research clusters, agentic training, agentic evaluation, research infrastructure for AI labs
Senior Software Engineer II, AI Workload Orchestration · senior
CoreWeave · Sunnyvale, CA / Bellevue, WA
AI Workload Orchestration, Kueue, Volcano, Ray, SUNK (Slurm on Kubernetes)
GPU Infrastructure Software Engineer · mid
CoreWeave · Sunnyvale, CA
GPU performance testing platform, infrastructure validation, hardware validation, performance tests automation
Applied Research Engineer – Training Infra · mid · Remote
Snorkel AI · Redwood City, CA (Hybrid); San Francisco, CA (Hybrid); United States (Remote)
AWS HyperPod, GPU cluster infrastructure, Distributed model training, Training pipelines, Model evaluation infrastructure
Head of Data Center Delivery · director
OpenAI · San Francisco
hyperscale data center delivery, Cerebras, AMD, TPUs, Nvidia platforms, Trainium
SWE - Grids - Fixed Term Contract - 6 Months - London, UK · mid
Google DeepMind · London, UK
JAX, Power Grid Optimization
Software Engineer, Workload Enablement · mid
OpenAI · San Francisco
NCCL, RCCL, RDMA collectives, NVLink, distributed training performance, inference performance
Software Engineer, System Enablement · mid
OpenAI · San Francisco
Terraform, Chef, VMSS, instance pools, golden image provisioning, bare metal bring-up
Member of Technical Staff - Infrastructure Supercompute · staff
xAI · London, UK
Pulumi, Flux, ArgoCD, IaC for supercomputing, GPU supercomputing cluster operations, stateful automation libraries
Software Engineer, Machine Learning Infrastructure · mid
Stripe · N/A
ML lifecycle, feature generation, experimentation, ML model training, ML model serving, LLM applications
Member of Technical Staff - Hardcore Supercompute · staff
xAI · Palo Alto, CA; Seattle, WA
GPU optimization, supercomputing clusters, Linux kernel, filesystems, networking, hardware-software co-design
Machine Learning, Platform Engineer · mid
Together AI · San Francisco
PyTorch, TGI, vLLM, TensorRT-LLM, Optimum, speculative decoding
AI Training Infrastructure Engineer · mid
Figure AI · San Jose, CA
HPC clusters, distributed training, PyTorch, SLURM, LSF, Ansible
Research Engineer, Production Model Post-Training · mid
Anthropic · San Francisco, CA | New York City, NY | Seattle, WA
Production Model Post-Training, Constitutional AI, RLHF, Alignment methodologies, Model fine-tuning pipelines, Model evaluation pipelines
Senior Manager, Infrastructure Data Science · senior
Databricks · Mountain View, California
performance optimization, reliability engineering, infrastructure efficiency, customer experience analytics, data-driven infrastructure decisions, infrastructure data science
Engineering Manager - Compute Infra · manager
Databricks · Mountain View, California; San Francisco, California
Compute Infrastructure organization, stateful workloads, extreme elasticity, cost efficiency, platform mindset, cross-team initiatives
Research Engineer, Pretraining · mid · Remote
Anthropic · Remote-Friendly (Travel-Required) | San Francisco, CA | Seattle, WA | New York City, NY
Pretraining, Model architecture, Optimizer development, Large-scale ML systems, Training infrastructure scaling, Dev tooling for ML
Machine Learning Systems Engineer, Research Tools · mid
Anthropic · San Francisco, CA | New York City, NY | Seattle, WA
Tokenization systems, Finetuning workflows, Pretraining workflows, Encoding techniques, Data representation, Tokenization debugging
Research Engineer, Environment Scaling · mid · Remote
Anthropic · Remote-Friendly (Travel Required) | San Francisco, CA
RL Environment Scaling, Fine-Tuning Strategies for Claude, Reward Signal Design, Reward Hacking Prevention, Domain Adaptation for LLMs, Vendor Data Quality Evaluation
Staff / Senior Software Engineer, Compute Capacity · senior
Anthropic · San Francisco, CA | New York City, NY
accelerator fleet management, Accelerator Capacity Engineering (ACE), compute utilization optimization, telemetry ingestion pipelines, observability tooling for fleet health, performance instrumentation
Technical Program Manager, Compute · manager
Anthropic · San Francisco, CA | New York City, NY | Seattle, WA
Compute infrastructure management, Compute fleet optimization, Capacity allocation strategies, Cross-cloud provider migration, Hardware platform transitions, Supply chain coordination for compute
AI Tooling Engineer, Helix · mid
Figure AI · San Jose, CA
preference optimization, RLHF, DPO, GRPO, RLAIF, Torchtune
Engineering Manager, GPU (ML Accelerator) · manager
Anthropic · San Francisco, CA | New York City, NY | Seattle, WA
GPU performance optimization, ML accelerator management, inference efficiency, training system scaling, distributed systems for AI, compute resource maximization
Machine Learning Systems Engineer, RL Engineering · mid
Anthropic · San Francisco, CA | New York City, NY | Seattle, WA
Reinforcement Learning from Human Feedback (RLHF), Finetuning systems, Model training algorithms, Pair programming, Research tooling, Model training infrastructure
Member of Technical Staff - RL Infrastructure · staff
xAI · Palo Alto, CA
Sandbox service, secure scalable system, computational environments, provision containers, virtual machines, large-scale clusters
Customer Support Engineer (GPU Cluster), India · mid
Together AI · India
data platform primitives, event-driven architectures, high-quality event streams, LLM-adjacent services, prompt categorization, prompt taxonomy
Engineering Manager, Machine Learning Platform · manager
Stripe · Toronto
ML Platform, ML Data Foundations, ML Serving, ML observability, ML-driven products
Tech Lead Manager - MLRE, ML Systems · senior
Scale AI · San Francisco, CA; New York, NY
Multi-node LLM training, Multi-node LLM inference, Distributed ML systems, Post-training methods, RLHF, RLVR
ML Research Engineer, ML Systems · mid
Scale AI · San Francisco, CA; Seattle, WA; New York, NY
RLXF (Scale's ML platform), distributed LLM training framework, multi-node LLM training, multi-node LLM inference, flash attention, CUDA optimization
Machine Learning Research Scientist / Research Engineer, Post-Training · mid
Scale AI · San Francisco, CA; Seattle, WA; New York, NY
LLM post-training, SFT (Supervised Fine-Tuning), RLHF (Reinforcement Learning from Human Feedback), reward modeling, preference optimization, bias mitigation
Engineering Manager, Model Serving · manager
Together AI · San Francisco, CA
model serving, ML API, inference services, fine-tuning services, multi-tenant serverless, dedicated endpoints
Software Engineer, Deployment Infrastructure · mid
Vercel · Hybrid - San Francisco, New York City
LLM-driven abuse detection, threat actor behavior analysis, abuse pattern detection, trust and safety engineering, security engineering, applied LLM techniques
Manager I, Engineering - AI Platform - Training & Serving · manager
Datadog · Paris, France
Natural Language Query (NLQ), Command-I (Cmd-I), LLM-powered query translation, context-aware retrieval systems, agentic architectures, evaluation frameworks
Manager II, Engineering - AI Platform Training, Serving and Storage · manager
Datadog · Paris, France
semantic search, agentic search, intent understanding, natural language querying (NLQ), intent models, semantic retrieval systems
Member of Technical Staff - ML & Data Infrastructure · staff
xAI · Palo Alto, CA
petabyte-scale data acquisition, multimodal web crawling, web-scale search/retrieval systems, high-throughput inference serving, low-level GPU/kernel optimizations, compiler/runtime innovations
Research Engineer, Science of Scaling · mid
Anthropic · London, UK
Science of Scaling, Training Infrastructure Optimization, Dev Tooling, Compute Efficiency, Experimental Design, Large Language Model Development
ML Infrastructure Engineer, Safeguards · mid
Anthropic · San Francisco, CA
ML infrastructure for AI safety, real-time safety evaluations, batch classifier evaluations, safety-critical monitoring, productionizing safety research, inference latency optimization for safety
Staff Software Engineer - ML Observability · staff
Datadog · Boston, Massachusetts, USA; New York, New York, USA
Real User Monitoring (RUM), session replay, error tracking, feature flags, Digital Experience Monitoring (DEM), frontend observability
Research Engineer, Pretraining Scaling · mid
Anthropic · San Francisco, CA
Large-scale ML systems, Production pretraining pipeline, Training efficiency optimization, Step time reduction, Hardware debugging, Training dynamics
Research Engineer, Pretraining Scaling - London · mid
Anthropic · London, UK
Large-scale ML systems, Production pretraining pipeline, Training efficiency optimization, Step time reduction, Hardware debugging, Training dynamics
Research Engineer / Research Scientist, Pre-training · mid
Anthropic · Zürich, CH
Pre-training, multimodal LLMs, model architecture, optimizer development, training infrastructure scaling, dev tooling
Research Engineer / Research Scientist, Tokens · mid
Anthropic · New York City, NY; New York City, NY | Seattle, WA; San Francisco, CA
Large scale ML systems, Cluster reliability, Throughput optimization, Efficiency improvement, Scientific experimentation, Dev tooling for ML