Inference & Model Optimization
Optimize models for fast, cheap inference. Quantization, distillation, compilation, serving at scale.
95 Open Positions
Core Skills
Inference Optimization, Quantization, Model Compression, Distillation, TensorRT, vLLM, ONNX, Triton Inference Server
Active Positions (50)
Research Engineer, Model Inference & Serving - Paris · mid
H Company · Hybrid Paris
vLLM, SGLang, Inference Optimization, Model Serving, Speculative Decoding, Long-Context Modeling
ML Engineer - Life Sciences (Early Talent) · mid
Nebius · Amsterdam, Netherlands
Inference Optimization, Model Compression, Quantization, Foundation Models
Senior ML Solutions Architect - Token Factory · senior
Nebius · United States
Model Fine-Tuning, LoRA / QLoRA, Speculative Decoding, Quantization, Inference Optimization, Model Serving
Performance Architect, AI HW · mid
Tenstorrent · Toronto, Ontario, Canada
GPU Optimization, Inference Optimization, Distributed Training, GPU Clusters
Sr Engineer, Server Inference · senior
Tenstorrent · Belgrade, Serbia
Model Serving, Inference Optimization, KV-Cache Optimization, Distributed Training
Staff Technical Program Manager, AI Systems and IP Delivery · staff
Tenstorrent · North America
Model Serving, Compiler Design, Inference Optimization
Software Engineer – Performance Profiling · mid
Etched · San Jose
Inference Optimization, GPU Optimization, Model Serving, Transformer Architectures
ASIC Architect · mid
Etched · San Jose
Transformer Architectures, Inference Optimization, Scaling Laws
Design Verification Engineer - SoC · mid
Etched · San Jose
Transformer Architectures, Inference Optimization
Chip Simulation Software Engineer · mid
Etched · San Jose
Transformer Architectures, Embedded Software, Inference Optimization
Firmware Engineer (Taiwan) · mid
Etched · Taipei
Firmware Development, Embedded Software, Transformer Architectures
Accelerator Software Engineer · mid
Etched · San Jose
Firmware Development, Embedded Linux, Inference Optimization, Model Serving
Engineering Manager, Inference ML Runtime · manager
Cerebras · Sunnyvale CA or Toronto Canada
Model Serving, Inference Optimization, Multimodal AI, Distributed Training, Large Language Models (LLMs)
LLM Inference Performance & Evals Engineer · mid
Cerebras · Toronto, Ontario, Canada
Mixture-of-Experts, Speculative Decoding, KV-Cache Optimization, Quantization, Inference Optimization, Evaluation Frameworks
Member of Technical Staff (Software Engineer) · staff
Cerebras · Sunnyvale, CA
Model Serving, Inference Optimization, MLOps, Model Monitoring & Observability, Distributed Training
ML Performance Benchmarking Engineer · mid
Cerebras · Toronto, Ontario, Canada
Inference Optimization, Model Monitoring & Observability, GPU Optimization, Speculative Decoding, Evaluation Frameworks, Model Serving
ML Research Engineer (Inference) · mid
Cerebras · Bengaluru, Karnataka, India
Speculative Decoding, Inference Optimization, Long-Context Modeling, Mixture-of-Experts, Foundation Models, KV-Cache Optimization
ML Systems Performance Engineer · mid
Cerebras · Sunnyvale CA or Toronto Canada
Inference Optimization, GPU Optimization, CUDA Kernels, Model Monitoring & Observability, Compiler Design
Principal Engineer, AI Inference Reliability · staff · Remote
Cerebras · Remote, California, United States; Sunnyvale CA or Toronto Canada
Model Serving, Inference Optimization, Reliability Engineering, Distributed Training, Model Monitoring & Observability
Senior/Staff Engineer : Post Silicon- Bring Up · senior
Cerebras · Bengaluru, Karnataka, India; Sunnyvale, CA; Toronto, Ontario, Canada
GPU Optimization, Inference Optimization, Hardware Testing, Reliability Engineering
Senior/Staff- Engineer: Post Silicon- Bring Up · senior
Cerebras · Bengaluru, Karnataka, India; Sunnyvale, CA; Toronto, Ontario, Canada
GPU Optimization, Inference Optimization, Hardware Testing, Reliability Engineering
Staff Inference ML Runtime Engineer · staff
Cerebras · Sunnyvale CA or Toronto Canada
Model Serving, Inference Optimization, Distributed Training, Large Language Models (LLMs), PyTorch, Foundation Models
Senior Software Engineer, Performance · senior
Nuro · Mountain View, California (HQ)
GPU Optimization, Real-time Systems, Inference Optimization, Edge AI
Senior/Staff Software Engineer, ML Infrastructure, Optimization · senior
Nuro · Mountain View, California (HQ)
Quantization, Distillation, Model Compression, Inference Optimization, Compiler Design, GPU Optimization
Software Engineer, ML Infrastructure, Optimization · mid
Nuro · Mountain View, California (HQ)
Quantization, Inference Optimization, Model Compression, Compiler Design, GPU Optimization, Large Language Models (LLMs)
Research Engineer, Model Inference & Serving - London · mid
H Company · Hybrid London
vLLM, SGLang, Inference Optimization, Model Serving, PyTorch, JAX
Backend Engineer - API · mid
xAI · Palo Alto, CA
Model Serving, vLLM, SGLang, TensorRT, Agent Orchestration, LLM Integration
Engineering Manager, API Core · manager
Anthropic · San Francisco, CA | New York City, NY
Model Serving, Inference Optimization, LLM Integration, KV-Cache Optimization
Performance Engineer, Inference Systems · mid
Anthropic · San Francisco, CA | New York City, NY | Seattle, WA
Inference Optimization, KV-Cache Optimization, Model Monitoring & Observability, Quantization, Speculative Decoding, GPU Optimization
Staff + Senior Software Engineer, Inference · senior
Anthropic · San Francisco, CA | New York City, NY | Seattle, WA
Model Serving, Inference Optimization, GPU Clusters, Model Monitoring & Observability, Distributed Training
Staff+ Software Engineer, Inference Runtime · staff · Remote
Anthropic · Remote-Friendly (Travel-Required) | San Francisco, CA | Seattle, WA | New York City, NY
Inference Optimization, Model Serving, GPU Optimization, KV-Cache Optimization, Speculative Decoding
Staff + Sr. Software Engineer, Cloud Inference · senior
Anthropic · San Francisco, CA
Model Serving, Inference Optimization, GPU Clusters, MLOps, Model Monitoring & Observability
ChatGPT Performance Engineer · mid
OpenAI · San Francisco
Model Monitoring & Observability, Inference Optimization, GPU Optimization, Distributed Training, Model Serving, Reliability Engineering
Software Engineer, Caching Infrastructure · mid
OpenAI · San Francisco
KV-Cache Optimization, Inference Optimization
Software Engineer, Inference – AMD GPU Enablement · mid
OpenAI · San Francisco
vLLM, Triton (GPU Kernels), CUDA Programming, Distributed Training, Inference Optimization, NCCL
ML Research Engineer - Hardware Codesign · mid
OpenAI · San Francisco
Quantization, Model Compression, CUDA Programming, Scaling Laws, GPU Optimization
Software Engineer, Inference - Performance Optimization · mid
OpenAI · San Francisco
Inference Optimization, KV-Cache Optimization, Model Serving, GPU Optimization, Speculative Decoding, FlashAttention
Performance & Systems Engineer, Codex · mid
OpenAI · San Francisco
Inference Optimization, Model Serving, Agentic AI, Large Language Models (LLMs), GPU Optimization
Software Engineer, Productivity - Inference Runtime · mid
OpenAI · San Francisco
Inference Optimization, Model Serving, Model Monitoring & Observability, MLOps, Speculative Decoding
Backend Software Engineer, ChatGPT ImageGen · mid
OpenAI · San Francisco
Diffusion Models, Multimodal AI, Model Serving, Inference Optimization, Distributed Training
Staff Research Engineer, Model Efficiency · staff
Cohere · New York
Inference Optimization, Mixture-of-Experts (MoE), Speculative Decoding, KV-Cache Optimization, FlashAttention, Quantization
Staff Software Engineer, Inference Infrastructure · staff
Cohere · San Francisco
Model Serving, Inference Optimization, GPU Clusters, MLOps, Distributed Training, Model Monitoring & Observability
Lead Member of Technical Staff, Inference Infrastructure · senior
Cohere · San Francisco
Model Serving, Inference Optimization, GPU Clusters, Distributed Training, Large Language Models (LLMs), MLOps
AI Inference Internship · intern
Perplexity AI · London
Inference Optimization, Quantization, CUDA Programming, Triton (GPU Kernels), JAX, Mixture-of-Experts
Forward Deployed Engineer (Inference & Post-Training) · mid
Together AI · San Francisco
vLLM, SGLang, TensorRT, KV-Cache Optimization, Speculative Decoding, LoRA / QLoRA
Member of Technical Staff - ML Performance · staff
Modal · New York
vLLM, TensorRT, CUDA Programming, Inference Optimization, PyTorch, GPU Optimization
Software Engineer - Edge · mid
Palantir · Washington, D.C.
Edge AI, Edge Inference, Inference Optimization, ETL Pipelines, Real-time Systems
Machine Learning Infrastructure Engineer- Model Inference · mid
Abridge · SF Office
Model Serving, Inference Optimization, vLLM, Triton Inference Server, TensorRT, GPU Optimization
Engineering Manager, Model Inference · manager
Abridge · SF Office
Model Serving, Inference Optimization, GPU Optimization, Quantization, vLLM, Distributed Training
Cloud Site Reliability Engineer · mid
SambaNova · San Jose, California, United States
Model Serving, Inference Optimization, Model Monitoring & Observability, MLOps