gentic.news — AI News Intelligence Platform
ML Infrastructure & Training

Build the infrastructure for training and serving models at scale. GPU clusters, distributed systems, MLOps.

208 Open Positions

Core Skills

Distributed Training, GPU Clusters, CUDA Kernels, Ray, Kubernetes, NCCL, Triton, Model Serving

Active Positions (50)

Sr. Software Engineer (Data Center Automation) · senior
xAI·Memphis, TN
Model Monitoring & Observability, GPU Clusters, Real-time Systems, MLOps
Senior ML Systems Engineer, Frameworks & Tooling · senior
Cohere·London
Distributed Training, JAX, PyTorch, CUDA, NCCL, Megatron
Software Engineer, Monetization ML Infrastructure · mid
OpenAI·San Francisco
Feature Engineering, Feature Stores, Recommendation Systems, Learning-to-Rank, Model Serving, MLOps
Member Of Technical Staff - Cloud Infrastructure · staff
xAI·Palo Alto, CA
GPU Clusters, Distributed Training, Model Serving, MLOps, Cloud Security, Model Monitoring & Observability
Software Engineer, GPU Infrastructure (HPC) · mid
Cohere·Canada
GPU Clusters, Distributed Training, NCCL, JAX, PyTorch, MLOps
Software Engineer, Compute Infrastructure · mid
OpenAI·San Francisco
GPU Clusters, Distributed Training, NCCL, CUDA Programming, GPU Optimization, Model Serving
Software Engineer, Productivity - Model Performance · mid
OpenAI·San Francisco
Triton (GPU Kernels), PyTorch, Inference Optimization, MLOps, Distributed Training
Software Engineer, ML Systems & Training Architecture · mid
OpenAI·San Francisco
Distributed Training, GPU Clusters, PyTorch, MLOps, Embodied AI Systems
Host Systems Software Engineer · mid
OpenAI·San Francisco
CUDA Programming, Distributed Training, GPU Clusters, GPU Optimization
Senior Site Reliability Engineer · senior · Remote
Photoroom·Paris | Remote
Model Serving, GPU Clusters, MLOps, Inference Optimization, Model Monitoring & Observability
Software Engineer, Productivity - Networking · mid
OpenAI·San Francisco
Distributed Training, Model Serving, MLOps
CPU/Storage/PoP-WAN Program Manager · manager
OpenAI·San Francisco
GPU Clusters, Distributed Training, Model Serving
Software Engineer, Distributed Data Systems - Robotics · mid
OpenAI·San Francisco
Distributed Training, Apache Kafka, MLOps, Data Curation, Multimodal AI
Tokens-as-a-Service (Taas) Software Engineer · mid
OpenAI·San Francisco
GPU Clusters, Inference Optimization, Model Serving, Model Monitoring & Observability, Distributed Training
Engineering Manager, Research Tools · manager
Anthropic·San Francisco, CA | New York City, NY
Distributed Training, Model Fine-Tuning, Data Curation, MLOps, ETL Pipelines
Capacity Systems Software Engineer · mid
OpenAI·San Francisco
GPU Clusters, MLOps, Model Monitoring & Observability, Distributed Training
Software Engineer, Core Network Engineering · mid
OpenAI·San Francisco
GPU Clusters, Distributed Training, NCCL, Model Serving, GPU Optimization
Tech Lead, Deployment & Operations — Custom Infrastructure · senior
OpenAI·San Francisco
GPU Clusters, Model Serving, MLOps
Engineering Manager, Core Services · manager
OpenAI·San Francisco
Distributed Training, MLOps, Model Monitoring & Observability
CPU Storage Tech Lead · senior
OpenAI·San Francisco
GPU Clusters, Distributed Training, Inference Optimization, Model Serving
Software Engineer, RL Training Infra · mid
OpenAI·San Francisco
Reinforcement Learning from Human Feedback (RLHF), Distributed Training, Multi-Agent Systems, Inference Optimization, Reinforcement Learning, Model Fine-Tuning
Software Engineer, Reliability · mid
OpenAI·San Francisco
Model Serving, GPU Clusters, Reliability Engineering, Model Monitoring & Observability, Distributed Training
Token-as-a-Service Technical Program Manager · manager
OpenAI·San Francisco
GPU Clusters, Distributed Training, Model Serving, Inference Optimization
Training Performance Engineer · mid
OpenAI·San Francisco
Distributed Training, GPU Optimization, CUDA Programming, NCCL, MLOps, FlashAttention
Software Engineer, Data Infrastructure - Research · mid
OpenAI·San Francisco
Distributed Training, Multimodal AI, MLOps, Data Curation, GPU Clusters
Research Infrastructure Engineer, Training Systems · mid
OpenAI·San Francisco
Distributed Training, PyTorch, GPU Clusters, MLOps, Model Monitoring & Observability, Large Language Models (LLMs)
Senior Software Engineer, Simulation ML Infrastructure · senior
Waymo·Mountain View, CA, USA; San Francisco, CA, USA
Foundation Models, Distributed Training, Scaling Laws, GPU Clusters, Data Curation, Synthetic Data Generation
Site Reliability Engineer, Frontier Systems Infrastructure · mid
OpenAI·San Francisco
GPU Clusters, Distributed Training, MLOps, Model Monitoring & Observability
Principal Software Engineer, ML System Architect · staff
Waymo·Mountain View, CA, USA
Foundation Models, Distributed Training, Data Curation, Evaluation Frameworks, Model Serving, Distillation
Software Engineer, Productivity - Training Runtime · mid
OpenAI·San Francisco
Distributed Training, MLOps, Model Monitoring & Observability
Senior Software Engineer, Training Efficiency · senior
Waymo·Mountain View, California
Distributed Training, JAX, TensorFlow, MLOps, Data Curation, GPU Clusters
Systems Engineer, HPC (US & Canada) · mid
Mistral AI·Montreal
GPU Clusters, Distributed Training, MLOps
Technical Lead Manager - Training Runtime, Data(set) Movement · senior
OpenAI·San Francisco
Distributed Training, Pre-Training, MLOps, Data Curation, Multimodal AI, Reinforcement Learning
Systems Engineer (Network / Storage / Systems) · mid
OpenAI·San Francisco
GPU Clusters, Distributed Training, Model Serving
Senior Machine Learning Infrastructure Engineer, Simulation · senior
Waymo·Mountain View, CA, USA; San Francisco, CA, USA
Foundation Models, Distributed Training, Model Serving, Data Curation, MLOps, GPU Clusters
Senior Staff+ Infrastructure Engineer, Cluster Infrastructure · senior
Anthropic·San Francisco, CA | New York City, NY | Seattle, WA
GPU Clusters, Distributed Training, MLOps, Model Monitoring & Observability, Agentic AI
Full-Stack Software Engineer, Compute Foundations · mid
OpenAI·San Francisco
Distributed Training, GPU Clusters, Model Serving, MLOps
Principal Technical Program Manager · staff
Roblox·San Mateo, CA, United States
MLOps, Technical Program Management
Software Engineer, Hardware Health · mid
OpenAI·San Francisco
GPU Clusters, GPU Optimization, Model Monitoring & Observability, Distributed Training, Reliability Engineering
Senior Staff+ Software Engineer, Node Infra · senior
Anthropic·London, UK
GPU Clusters, Distributed Training, MLOps, GPU Optimization, Model Monitoring & Observability
Software Engineer, GPU Infrastructure - HPC · mid
OpenAI·San Francisco
GPU Clusters, Distributed Training, GPU Optimization, Reliability Engineering, Model Monitoring & Observability
Senior Staff+ Software Engineer, Kubernetes Platform · senior
Anthropic·San Francisco, CA | New York City, NY | Seattle, WA
GPU Clusters, Distributed Training, MLOps, Model Serving, GPU Optimization
Engineer, Supercomputing & Distributed Systems · mid
Krea·San Francisco
Distributed Training, GPU Clusters, NCCL, ETL Pipelines, Data Curation, Apache Kafka
Staff Engineer, Datacenter Server Lifecycle · staff
Anthropic·San Francisco, CA | New York City, NY | Seattle, WA
GPU Clusters, MLOps, Model Monitoring & Observability, Hardware Testing
Training: ML Framework Engineer · mid
OpenAI·San Francisco
Distributed Training, GPU Clusters, PyTorch, Megatron, Scaling Laws, Inference Optimization
Software Engineer, Fleet Management · mid
OpenAI·San Francisco
GPU Clusters, MLOps, LLM Integration, Model Monitoring & Observability
Software Engineer, Research Developer Productivity · mid
OpenAI·San Francisco
MLOps, Distributed Training
Software Engineer, Platform Systems · mid
OpenAI·San Francisco
Distributed Training, GPU Clusters, Model Monitoring & Observability, MLOps
Software Engineer, Fleet Hardware Health · mid
OpenAI·San Francisco
GPU Clusters, Model Monitoring & Observability, Distributed Training
Site Reliability Engineer, Inference Infrastructure · mid
Cohere·Toronto
MLOps, Model Serving, GPU Clusters, Model Monitoring & Observability, Reliability Engineering