ML Infrastructure & Training
Build the infrastructure for training and serving models at scale. GPU clusters, distributed systems, MLOps.
Core Skills
Distributed Training, GPU Clusters, CUDA Kernels, Ray, Kubernetes, NCCL, Triton, Model Serving
Active Positions (35)
Software Engineer, Machine Learning Infrastructure·mid
Stripe·Toronto, Canada
Engineering Manager - Compute Infra·manager
Databricks·Mountain View, California; San Francisco, California
Compute Infrastructure organization, stateful workloads, extreme elasticity, cost efficiency, platform mindset, cross-team initiatives
ML Infrastructure Engineer, Safeguards·mid
Anthropic·San Francisco, CA
ML infrastructure for AI safety, real-time safety evaluations, batch classifier evaluations, safety-critical monitoring, productionizing safety research, inference latency optimization for safety
Research Lead, Training Insights·senior·Remote
Anthropic·Remote-Friendly (Travel Required) | San Francisco, CA; San Francisco, CA | New York City, NY
Training Insights, Long-horizon Evaluations, Emerging Capabilities Measurement, Model Capabilities Characterization, Evaluation Methodologies, Reinforcement Learning Training
Software Engineer, ML Networking·mid
Anthropic·San Francisco, CA | New York City, NY | Seattle, WA
Kernel Networking (TCP/IP Stack Internals), XDP (eXpress Data Path), eBPF (Extended Berkeley Packet Filter), io_uring, DPDK (Data Plane Development Kit), Remote Direct Memory Access (RDMA)
Staff / Senior Software Engineer, Compute Capacity·senior
Anthropic·San Francisco, CA | New York City, NY
Accelerator Capacity Engineering (ACE), AI Accelerator Fleet Management, Accelerator telemetry, Kubernetes-Native Infrastructure
Training Content & Systems Architect·mid
Anthropic·San Francisco, CA | New York City, NY
AI Content, Modular content architectures, Automated Quality Loops, Trainer-facing content systems, Content drift detection
Engineering Manager ChatGPT Infra·manager
OpenAI·London, UK
ChatGPT Runtime Architecture, Runtime Framework Development, Service Integration Patterns, Extensibility Mechanisms, Load-Bearing Infrastructure, Follow-the-Sun Global Infrastructure
Training, Process Management Engineer·mid
OpenAI·London, UK
Training Runtime, distributed training workloads, process management, large-scale training clusters, Rust asynchronous systems, distributed OS
Member of Technical Staff - Mid-training·staff
xAI·Palo Alto, CA
synthetic coding data, large-scale docker verification, model distillation, synthetic data generation, mid-training data mixtures, long-context data recipes
Member of Technical Staff - ML & Data Infrastructure·staff
xAI·Palo Alto, CA
Multimodal data crawling, Web-scale search/retrieval systems, High-throughput inference serving, Low-Level GPU/Kernel Optimizations, Compiler/runtime innovations, High-speed interconnect fabrics
Member of Technical Staff - RL Infrastructure·staff
xAI·Palo Alto, CA
RL Infrastructure, agentic model capability, evaluation framework, open-source evaluation dataset, automation frameworks, observability
Applied Research Science Lead, Model Scaling·senior·Remote
Runway·Remote
Model Pretraining, Multimodal AI, Model Scaling, Human-in-the-Loop Systems, Research-to-application pipeline
ML Research Engineer, ML Systems·mid
Scale AI·San Francisco, CA; Seattle, WA; New York, NY
RLXF (Scale's ML platform), distributed LLM training framework, multi-node LLM training, multi-node LLM inference, flash attention, CUDA optimization
Tech Lead Manager - MLRE, ML Systems·senior
Scale AI·San Francisco, CA; New York, NY
Multi-node LLM training, Multi-node LLM inference, Large-scale distributed ML systems, Reinforcement Learning from Human Feedback (RLHF), Reinforcement Learning with Verifiable Rewards (RLVR), Proximal Policy Optimization (PPO)
Member of Technical Staff - ML Training Systems·staff
Modal·New York
Torch (PyTorch), Hugging Face, Verl (training framework), Slime (training framework), ML Training Optimization, Data Loading Bottleneck Elimination
Senior Data Center Capacity Delivery Manager·senior
Anthropic·San Francisco, CA | New York City, NY
Data Center Capacity Delivery, Leased Data Center Capacity, Partnered Data Center Capacity, Construction Site Safety, Operational Readiness, Compute Infrastructure for Frontier AI
Senior Engineer, Datacenter Server Lifecycle·senior
Anthropic·London, UK
Datacenter Machine Lifecycle, Hardware Provisioning at Scale, Hardware Decommissioning, Trusted Compute Standards, Hardware Integrity Attestation, Infrastructure Security Integration
Manager I, Engineering - AI Platform - Training & Serving·manager
Datadog·Paris, France
distributed training of foundation models, AI model serving at scale, large-scale training infrastructure, inference platform, AI platform infrastructure, Training & Serving systems
Member of Technical Staff - Compute Infrastructure·staff
xAI·Palo Alto, CA
Exascale Compute Resource Management, Container Orchestration Beyond Kubernetes, Massive-Scale Cluster Design, AI Agent Workload Infrastructure, High-Performance Training Run Optimization
Member of Technical Staff - RL Training Framework·staff
xAI·Palo Alto, CA
End-to-end RL training framework, Distributed RL systems, Asynchronous Reinforcement Learning (RL) Training Frameworks, Pre-training-scale RL, Software and algorithm co-design, JAX Framework
Machine Learning Systems Engineer, Research Tools·mid
Anthropic·San Francisco, CA | New York City, NY | Seattle, WA
tokenization systems, Finetuning workflows, Pretraining workflows, encoding techniques, data representation, tokenization debugging
Research Engineer, Environment Scaling·mid·Remote
Anthropic·Remote-Friendly (Travel Required) | San Francisco, CA
Reinforcement Learning (RL) at Scale, RL Environments, Reward Design, Reward Hacking Detection, Fine-Tuning Strategies, QA Frameworks
Research Engineer, Reward Models Platform·mid·Remote
Anthropic·Remote-Friendly (Travel-Required) | San Francisco, CA | Seattle, WA | New York City, NY
Reward Models, Reward Hacks, Preference Models, Human Feedback Data, Rubric Methodologies, Reward Signals
ML HW-SW Co-design Software Manager·manager
Google DeepMind·Mountain View, California, US
HW-SW Co-design, Machine Learning Accelerators
Engineering Manager, Machine Learning Platform·manager
Stripe·Toronto
ML Platform, ML Foundations, ML Serving Infrastructure, ML Observability
Software Engineer, Machine Learning Infrastructure·mid
Stripe·Seattle, SF
ML Infrastructure, LLM Applications, ML Training Infrastructure, ML Serving Infrastructure, ML Experimentation Platforms, Feature Generation Systems
Senior / Staff Software Engineer, Continuous Integration·senior
Anthropic·London, UK
Continuous Integration (CI) infrastructure, Monorepo scaling, Test Infrastructure, Code quality assurance systems, Developer productivity tooling
Software Engineer, Observability·mid
OpenAI·San Francisco
OpenTelemetry, Prometheus, Grafana, ELK stack, Jaeger, Distributed tracing
Customer Support Engineer (GPU Cluster), India·mid
Together AI·India
GPU clusters, AI inference, fine-tuning services, Gen AI solutions, customer technical support, troubleshooting guides
Rack Product Engineer, AI Rack Infrastructure - Stargate·mid
OpenAI·San Francisco
Stargate AI Rack Infrastructure, AI Datacenter Rack Design, AI Compute Platform Infrastructure, Rack Manufacturing Readiness, Rack Lifecycle Performance Management, Rack Architecture Definition
Member of Technical Staff – Model Training·staff
Inflection AI·Palo Alto, CA
Post-Training, Fine-Tuning Large Transformer Models, Preference Optimization, Reinforcement Learning from Human Feedback (RLHF), DPO (Direct Preference Optimization), GRPO (Group Relative Policy Optimization)
Manager II, Engineering - AI Platform Training, Serving and Storage·manager
Datadog·Paris, France
AI Platform Training, AI Platform Serving, AI Platform Storage, Large-scale training infrastructure, Large-scale inference infrastructure, LLMObs (LLM Observability)
Engineering Manager, GPU (ML Accelerator)·manager
Anthropic·San Francisco, CA | New York City, NY | Seattle, WA
GPU Optimization, ML Accelerator Efficiency, Inference system scaling, Training system scaling, Distributed systems for AI, Compute Resource Efficiency
Machine Learning Systems Engineer, RL Engineering·mid
Anthropic·San Francisco, CA | New York City, NY | Seattle, WA
Reinforcement Learning Engineering, RLHF, finetuning systems, model training pipeline, Claude models training