Inference & Model Optimization

Optimize models for fast, low-cost inference: quantization, distillation, compilation, and serving at scale.


Core Skills

Inference Optimization · Quantization · Model Compression · Distillation · TensorRT · vLLM · ONNX · Triton Inference Server

Active Positions (28)

Infrastructure Engineering Manager, Forward Deployed Engineering (manager)
OpenAI·New York City
Forward Deployed Engineering · Single-tenant deployments · Infrastructure patterns · Security posture · Customer connectivity · Provisioning workflows
Engineering Manager, Inference (manager)
Anthropic·San Francisco, CA | New York City, NY | Seattle, WA
Model Inference Optimization · Distributed Systems for AI · AI Safety Integration · Compute Resource Efficiency
Model Quality Software Engineer, Claude Code (mid)
Anthropic·San Francisco, CA | New York City, NY
eval systems for coding tasks · experiment scaling infrastructure · data collection pipelines · researcher productivity tooling · product intuition for AI capabilities · research-to-engineering translation
Senior Software Engineer, Inference (senior)
Anthropic·Dublin, IE
LLM Inference Optimization · Compute-Agnostic Inference Deployments · AI Accelerator Orchestration · Intelligent Request Routing · Inference Batching Strategies · Inference Caching Strategies
Software Engineer, Compute Efficiency (mid)
Anthropic·San Francisco, CA | New York City, NY
Cost attribution frameworks · Multi-tenant infrastructure
Software Engineer, Inference Deployment (mid)
Anthropic·San Francisco, CA | New York City, NY | Seattle, WA
inference deployment · GPU deployment · TPU deployment · Trainium deployment · capacity-aware scheduling · deployment orchestration
Staff / Senior Software Engineer, Cloud Inference (senior)
Anthropic·San Francisco, CA | Seattle, WA
Cloud Inference · intelligent request routing · inference execution · capacity management · compute-agnostic inference deployments
Staff / Senior Software Engineer, Inference (senior)
Anthropic·San Francisco, CA | New York City, NY | Seattle, WA
Compute-Agnostic Inference Deployments · AI Accelerator Orchestration · Distributed AI inference systems
Inference Runtime, Engineering Manager (manager)
OpenAI·San Francisco
Inference Runtime · model inference · low-latency inference · high-availability inference · distributed systems for AI · model architecture optimization
Forward Deployed Engineer - Semiconductor (mid)
OpenAI·San Francisco
Semiconductor AI Deployment · Chip Design Workflow Integration · RTL Codebase AI Integration · Verification Quality Improvement · EDA Vendor Integration · Fabrication Partner Systems
Inference Technical Lead, On-Device Transformers (senior)
OpenAI·San Francisco
On-device transformers · Edge deployment · NPUs (Neural Processing Units) · Specialized accelerators · Transformer model optimization · KV-cache behavior
Member of Technical Staff - Inference (staff)
xAI·Palo Alto, CA
Model Inference Optimization · GPU Optimization · Quantization Techniques · Model Distillation · Speculative Decoding · Low-Precision Numerics
Software Engineer - GenAI Inference (mid)
Databricks·San Francisco, California
GenAI inference · Foundation Model API · LLM serving systems · Model-serving stack · Sparsity · Activation compression
Sr. Software Engineer - Performance (senior)
Databricks·Mountain View, California
Competitive benchmark analysis · Cloud-scale performance optimization · Latency optimization · Compute scalability · Price-to-performance ratio analysis
Engineering Manager, Cloud Inference AWS (manager)
Anthropic·San Francisco, CA | Seattle, WA
AWS inference infrastructure · LLM Serving on AWS · Inference capacity management · Load balancing for AI APIs · Inference technology packaging
Senior Staff GenAI Engineer - Application Performance Monitoring (APM) (senior)
Datadog·New York, New York, USA
LLM-based agents · prompts · evals · agent tools · CoPilot-like assistant · Dynamic Instrumentation
Product Solutions Architect 3 - LLM Observability (mid)
Datadog·Boston, Massachusetts, USA; Denver, Colorado, USA; New York, New York, USA; San Francisco, California, USA
LLM Observability · reference architectures · technical guides · cookbooks · proofs of concept · small-scale deployments
Staff Software Engineer, Foundational Model Serving (staff)
Databricks·San Francisco, California
Foundation Models · vLLM · SGLang · GPU Optimization · High-throughput inference · Low-latency inference
Performance Engineer, GPU (mid)
Anthropic·San Francisco, CA | New York City, NY | Seattle, WA
GPU programming at scale · tensor core optimizations · distributed GPU orchestration · custom kernel development · inference efficiency optimization · GPU utilization maximization
Senior Software Engineer, Model Serving (senior)
Databricks·San Francisco, California
Model Serving Infrastructure · Low-Latency Inference · GPU Optimization · Model Container Builds · Intelligent Autoscaling · Model Deployment Workflows
Engineering Manager, Model Serving (manager)
Together AI·San Francisco, CA
model serving · ML API · inference services · fine-tuning services · multi-tenant serverless · dedicated endpoints
Software Engineer - Acceleration (mid)
Perplexity AI·San Francisco
AI Acceleration · Agentic AI · AI Tools and Agents · AI Developer Experience · AI-Assisted Code Generation · Quality-Velocity Pareto Frontier
AI Deployment Engineer (mid)
OpenAI·London, UK
OpenAI API · ChatGPT API · LangChain · LlamaIndex · Vector databases · RAG (Retrieval-Augmented Generation)
Technical Deployment Lead, Forward Deployed Engineering (FDE) - NYC (senior)
OpenAI·New York City
0→1 prototypes · MVP · change management · technical project management · field insights · delivery reliability
Member of Technical Staff - Applied Inference (staff)
xAI·Palo Alto, CA
Distributed Model Serving Infrastructure · Global KV-cache Systems · Inference Engine Benchmarking · Inference Engine Fine-Tuning · GPU Optimization · Tail Performance Optimization
Engineering Manager, Inference Developer Productivity (manager)
Anthropic·San Francisco, CA | New York City, NY | Seattle, WA
Inference Developer Productivity · ML Infrastructure · Developer Experience · Accelerator Platforms · GPU · TPU
Performance Engineer (mid)
Anthropic·San Francisco, CA | New York City, NY | Seattle, WA
supercomputing-scale ML · GPU/accelerator programming · ML framework internals · low-latency high-throughput sampling · low-precision inference kernels · custom load-balancing algorithms
Research Engineer, Production Model Post-Training (mid)
Anthropic·Zürich, CH
Post-training · Constitutional AI · RLHF · Alignment methodologies · Model fine-tuning pipelines · Production model evaluation