Inference & Model Optimization

Optimize models for fast, low-cost inference: quantization, distillation, compilation, and serving at scale.


Core Skills

Inference Optimization · Quantization · Model Compression · Distillation · TensorRT · vLLM · ONNX · Triton Inference Server

Active Positions (28)

Infrastructure Engineering Manager, Forward Deployed Engineering (manager)
OpenAI·New York City
Forward Deployed Engineering · Single-tenant deployments · Infrastructure patterns · Security posture · Customer connectivity · Provisioning workflows
Engineering Manager, Inference (manager)
Anthropic·San Francisco, CA | New York City, NY | Seattle, WA
Model Inference Optimization · Distributed Systems for AI · AI Safety Integration · Compute Resource Efficiency
Model Quality Software Engineer, Claude Code (mid)
Anthropic·San Francisco, CA | New York City, NY
eval systems for coding tasks · experiment scaling infrastructure · data collection pipelines · researcher productivity tooling · product intuition for AI capabilities · research-to-engineering translation
Senior Software Engineer, Inference (senior)
Anthropic·Dublin, IE
LLM Inference Optimization · Compute-Agnostic Inference Deployments · AI Accelerator Orchestration · Intelligent Request Routing · Inference Batching Strategies · Inference Caching Strategies
Software Engineer, Compute Efficiency (mid)
Anthropic·San Francisco, CA | New York City, NY
Cost attribution frameworks · Multi-tenant infrastructure
Software Engineer, Inference Deployment (mid)
Anthropic·San Francisco, CA | New York City, NY | Seattle, WA
inference deployment · GPU deployment · TPU deployment · Trainium deployment · capacity-aware scheduling · deployment orchestration
Staff / Senior Software Engineer, Cloud Inference (senior)
Anthropic·San Francisco, CA | Seattle, WA
Cloud Inference · intelligent request routing · inference execution · capacity management · compute-agnostic inference deployments
Staff / Senior Software Engineer, Inference (senior)
Anthropic·San Francisco, CA | New York City, NY | Seattle, WA
Compute-Agnostic Inference Deployments · AI Accelerator Orchestration · Distributed AI inference systems
Inference Runtime, Engineering Manager (manager)
OpenAI·San Francisco
Inference Runtime · model inference · low-latency inference · high-availability inference · distributed systems for AI · model architecture optimization
Forward Deployed Engineer - Semiconductor (mid)
OpenAI·San Francisco
Semiconductor AI Deployment · Chip Design Workflow Integration · RTL Codebase AI Integration · Verification Quality Improvement · EDA Vendor Integration · Fabrication Partner Systems
Inference Technical Lead, On-Device Transformers (senior)
OpenAI·San Francisco
On-device transformers · Edge deployment · NPUs (Neural Processing Units) · Specialized accelerators · Transformer model optimization · KV-cache behavior
Member of Technical Staff - Inference (staff)
xAI·Palo Alto, CA
Model Inference Optimization · GPU Optimization · Quantization Techniques · Model Distillation · Speculative Decoding · Low-Precision Numerics
Software Engineer - GenAI Inference (mid)
Databricks·San Francisco, California
GenAI inference · Foundation Model API · LLM serving systems · Model-serving stack · Sparsity · Activation compression
Sr. Software Engineer - Performance (senior)
Databricks·Mountain View, California
Competitive benchmark analysis · Cloud-scale performance optimization · Latency optimization · Compute scalability · Price-to-performance ratio analysis
Engineering Manager, Cloud Inference AWS (manager)
Anthropic·San Francisco, CA | Seattle, WA
AWS inference infrastructure · LLM Serving on AWS · Inference capacity management · Load balancing for AI APIs · Inference technology packaging
Senior Staff GenAI Engineer - Application Performance Monitoring (APM) (senior)
Datadog·New York, New York, USA
LLM-based agents · prompts · evals · agent tools · CoPilot-like assistant · Dynamic Instrumentation
Product Solutions Architect 3 - LLM Observability (mid)
Datadog·Boston, Massachusetts, USA; Denver, Colorado, USA; New York, New York, USA; San Francisco, California, USA
LLM Observability · reference architectures · technical guides · cookbooks · proofs of concept · small-scale deployments
Staff Software Engineer, Foundational Model Serving (staff)
Databricks·San Francisco, California
Foundation Models · vLLM · SGLang · GPU Optimization · High-throughput inference · Low-latency inference
Performance Engineer, GPU (mid)
Anthropic·San Francisco, CA | New York City, NY | Seattle, WA
GPU programming at scale · tensor core optimizations · distributed GPU orchestration · custom kernel development · inference efficiency optimization · GPU utilization maximization
Senior Software Engineer, Model Serving (senior)
Databricks·San Francisco, California
Model Serving Infrastructure · Low-Latency Inference · GPU Optimization · Model Container Builds · Intelligent Autoscaling · Model Deployment Workflows
Engineering Manager, Model Serving (manager)
Together AI·San Francisco, CA
model serving · ML API · inference services · fine-tuning services · multi-tenant serverless · dedicated endpoints
Software Engineer - Acceleration (mid)
Perplexity AI·San Francisco
AI Acceleration · Agentic AI · AI Tools and Agents · AI Developer Experience · AI-Assisted Code Generation · Quality-Velocity Pareto Frontier
AI Deployment Engineer (mid)
OpenAI·London, UK
OpenAI API · ChatGPT API · LangChain · LlamaIndex · Vector databases · RAG (Retrieval-Augmented Generation)
Technical Deployment Lead, Forward Deployed Engineering (FDE) - NYC (senior)
OpenAI·New York City
0→1 prototypes · MVP · change management · technical project management · field insights · delivery reliability
Member of Technical Staff - Applied Inference (staff)
xAI·Palo Alto, CA
Distributed Model Serving Infrastructure · Global KV-cache Systems · Inference Engine Benchmarking · Inference Engine Fine-Tuning · GPU Optimization · Tail Performance Optimization
Engineering Manager, Inference Developer Productivity (manager)
Anthropic·San Francisco, CA | New York City, NY | Seattle, WA
Inference Developer Productivity · ML Infrastructure · Developer Experience · Accelerator Platforms · GPU · TPU
Performance Engineer (mid)
Anthropic·San Francisco, CA | New York City, NY | Seattle, WA
supercomputing-scale ML · GPU/accelerator programming · ML framework internals · low-latency high-throughput sampling · low-precision inference kernels · custom load-balancing algorithms
Research Engineer, Production Model Post-Training (mid)
Anthropic·Zürich, CH
Post-training · Constitutional AI · RLHF · Alignment methodologies · Model fine-tuning pipelines · Production model evaluation