gentic.news — AI News Intelligence Platform

Inference & Model Optimization

Optimize models for fast, cheap inference: quantization, distillation, compilation, and serving at scale.

43 Open Positions

Core Skills

Inference Optimization · Quantization · Model Compression · Distillation · TensorRT · vLLM · ONNX · Triton Inference Server
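
Quantization is the most commonly requested of these core skills. As a rough, listing-agnostic illustration, a minimal post-training dynamic quantization sketch in PyTorch might look like this (the toy model and layer sizes are purely illustrative):

import torch

# Toy stand-in for a model; the layer sizes are illustrative only.
model = torch.nn.Sequential(
    torch.nn.Linear(512, 512),
    torch.nn.ReLU(),
    torch.nn.Linear(512, 128),
).eval()

# Post-training dynamic quantization: Linear weights are stored as int8,
# and activations are quantized on the fly at inference time.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

x = torch.randn(4, 512)
with torch.no_grad():
    print(quantized(x).shape)  # same output shape, smaller and faster Linear layers

Static and quantization-aware approaches trade more calibration or training effort for better accuracy at lower precision.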

Active Positions (43)

Software Engineer, Inference AI/ML (mid)
CoreWeave · Sunnyvale, CA / Bellevue, WA
Triton Inference Server · vLLM · TensorRT-LLM · Ray Serve · Model-serving optimization · Request batching
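
The role above lists vLLM and request batching; a minimal offline sketch (model name illustrative, assuming vLLM is installed) looks roughly like this:

from vllm import LLM, SamplingParams

# vLLM batches these prompts internally (continuous batching over a paged KV cache).
llm = LLM(model="facebook/opt-125m")  # illustrative model choice
params = SamplingParams(temperature=0.8, max_tokens=64)

prompts = [
    "Explain KV-cache reuse in one sentence.",
    "Why does request batching improve GPU utilization?",
]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
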
AI Success Engineer - NYC (mid)
OpenAI · New York City
AI Success Engineer · post-sales technical relationship · account health · adoption velocity · workflow transformation · technical integration
AI Success Engineer - San Francisco (mid)
OpenAI · San Francisco
AI Success Engineer · post-sales technical relationship · account health · adoption velocity · workflow transformation · technical integration
GPU Performance Engineer (mid)
CoreWeave · Warsaw, Poland
GPU performance testing · Kubernetes-based solutions · hardware performance testing · automation platforms
Senior Software Engineer II, Inference (senior)
CoreWeave · Sunnyvale, CA / Bellevue, WA
micro-batch schedulers · speculative decoding · KV-cache reuse · P99 SLAs · SLIs/SLOs · Kubernetes-native inference platform
Senior Software Engineer I, Inference (senior)
CoreWeave · Sunnyvale, CA / Bellevue, WA
micro-batch schedulers · speculative decoding · KV-cache reuse · P99 SLAs · SLIs/SLOs · Kubernetes-native inference platform
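
Speculative decoding, which the two roles above call out, pairs a small draft model with the target model. A rough sketch using Hugging Face assisted generation (the model pairing is illustrative, not from any listing):

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2-large")
target = AutoModelForCausalLM.from_pretrained("gpt2-large")
draft = AutoModelForCausalLM.from_pretrained("gpt2")  # small draft model

inputs = tokenizer("Speculative decoding speeds up inference by", return_tensors="pt")

# The draft model proposes several tokens per step; the target model verifies
# them in one forward pass and keeps the longest accepted prefix.
out = target.generate(**inputs, assistant_model=draft, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
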
Sr. Software Engineer - Perf and Benchmarking (senior)
CoreWeave · Sunnyvale, CA / Bellevue, WA
Kubernetes-native benchmarking services · MLPerf · P99 SLAs · planet-scale performance data warehouse
Member of Technical Staff (AI Inference Engineer) (staff)
Perplexity AI · San Francisco
AI inference engine · CUDA · CuTe DSL · KV-cache management · continuous batching · Triton
Engineering Manager, Model Routing & Inference (manager)
Cursor · San Francisco
inference gateway · model routing · dynamic model selection · GPU cluster utilization · routing backpressure · admission control
Software Engineer, Model Routing & Inference (mid)
Cursor · New York
inference gateway · cross-provider failover · routing backpressure · admission control · GPU utilization optimization · provider economics
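
The routing roles above mention backpressure and admission control. One simple pattern (all names hypothetical, not Cursor's implementation) caps in-flight requests and sheds load early so callers can fail over to another provider:

import asyncio

MAX_IN_FLIGHT = 8   # concurrent requests the backend may serve
MAX_PENDING = 32    # beyond this, reject instead of queueing

_slots = asyncio.Semaphore(MAX_IN_FLIGHT)
_pending = 0

async def admit(handler, request):
    """Run handler(request) only if the gateway is not already overloaded."""
    global _pending
    if _pending >= MAX_PENDING:
        # Backpressure: fail fast so the caller can retry elsewhere.
        raise RuntimeError("gateway overloaded, retry later")
    _pending += 1
    try:
        async with _slots:
            return await handler(request)
    finally:
        _pending -= 1

async def _demo():
    async def fake_backend(req):
        await asyncio.sleep(0.01)
        return req
    results = await asyncio.gather(*(admit(fake_backend, i) for i in range(10)))
    print(f"served {len(results)} requests")

asyncio.run(_demo())
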
Sr. Manager, Engineering - AI Gateway (LLM Inference) (senior)
Databricks · New York
Databricks AI Gateway · LLM inference governance
Forward Deployed Engineer - Sydney (mid)
OpenAI · Sydney, Australia
AI Success Engineer - US Remote (mid, Remote)
OpenAI · Remote - US
AI Success Engineer · post-sales technical relationship · account health · adoption velocity · workflow transformation · technical integration
AI deployment engineer (UK) (mid)
Writer · London, UK
WRITER platform · enterprise-grade LLMs · AI agent deployment · AI solution architecture · AI-powered applications
Technical Deployment Lead, Forward Deployed Engineering - London (senior)
OpenAI · London, UK
Forward Deployed Engineering · Technical Delivery Plan · 0→1 Prototyping · MVP to Scale · Customer Workflow Mapping · Change Management for Adoption
Forward Deployed Engineer (FDE), Life Sciences - SF (mid)
OpenAI · San Francisco
Forward Deployed Engineer · Life Sciences · Regulated Environments · Production Adoption · Eval Loops · Workflow-specific Benchmarks
Forward Deployed Engineer (FDE), Life Sciences - NYC (mid)
OpenAI · New York City
Forward Deployed Engineering (FDE) - Life Sciences · regulated environment deployments · clinical development workflows · submissions workflows · scientific operations · workflow-specific benchmarks
AI Deployment Engineer (mid)
OpenAI · Tokyo, Japan
OpenAI API platform deployment · application ideation with AI · production application scaling · high-fidelity feedback synthesis · developer resources creation · enterprise resources codification
Optimization Software Engineer (mid)
Anduril · Washington, District of Columbia, United States
Lattice OS · classical optimization algorithms · hybrid quantum optimization · Claude Code · GenAI-powered development tools · multi-domain optimization
Engineering Manager, Inference Routing and Performance (manager)
Anthropic · San Francisco, CA | New York City, NY
Dystro · custom load-balancing algorithms · cache placement optimization · latency spike debugging · kernel-boundary debugging · ML framework internals
Senior Backend Engineer, Inference Platform (senior)
Together AI · San Francisco
SGLang · vLLM · NVIDIA Dynamo · global request routing · load balancing · large-scale resource allocation
Generative AI Inference Engineer (mid)
Stability AI · United States
generative AI inference · multi-modal models · diffusion model architectures · ComfyUI · inference optimization · Triton
Machine Learning Engineer - Inference (mid)
Together AI · San Francisco
container platform for custom models · dedicated inference · autoscaling optimization · cold start minimization · video or audio generation · CUDA kernels
LLM Inference Frameworks and Optimization Engineer (mid)
Together AI · San Francisco, Singapore, Amsterdam
LLM Inference Frameworks · Mixture of Experts (MoE) Parallelism · Tensor Parallelism · Pipeline Parallelism · CUDA Graph Optimizations · TensorRT/TRT-LLM
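
CUDA graph optimizations, listed in the Together AI role above, amortize kernel launch overhead by replaying a recorded sequence of kernels. A bare-bones PyTorch sketch (requires a CUDA GPU; the model and shapes are illustrative):

import torch

model = torch.nn.Linear(1024, 1024).cuda().eval()
static_input = torch.randn(8, 1024, device="cuda")

# Warm up on a side stream before capture, as CUDA graph capture requires.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(s)

# Capture one forward pass, then replay it with new data copied in place.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    static_output = model(static_input)

static_input.copy_(torch.randn(8, 1024, device="cuda"))
graph.replay()  # reruns the captured kernels against the updated static_input
print(static_output.shape)
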
Software Engineer, Inference Deployment (mid)
Anthropic · San Francisco, CA | New York City, NY | Seattle, WA
inference deployment · deployment orchestration · capacity-aware deployment scheduling · GPU deployment · TPU deployment · Trainium deployment
Software Engineer - GenAI inference (mid)
Databricks · San Francisco, California
GenAI inference · Foundation Model API · LLM serving systems · Model-serving stack · Sparsity · Activation compression
Sr. Software Engineer - Performance (senior)
Databricks · Mountain View, California
Performance Engineering · Performance Benchmarks · Competitive Benchmark Analysis · Telemetry Analysis · Scalability Analysis · Latency Optimization
Staff Software Engineer, Foundational Model Serving (staff)
Databricks · San Francisco, California
Foundation Models · vLLM · SGLang · GPU Optimization · High-throughput inference · Low-latency inference
Engineering Manager, Cloud Inference AWS (manager)
Anthropic · San Francisco, CA | Seattle, WA
Cloud Inference · LLM Serving · Inference Optimization · Load Balancing · Model Deployment · Performance Standards
Model Quality Software Engineer, Claude Code (mid)
Anthropic · San Francisco, CA | New York City, NY
Claude Code · eval systems · coding capabilities evaluation · research infrastructure scaling · data collection pipelines · researcher productivity tooling
Engineering Manager, Inference (manager)
Anthropic · San Francisco, CA | New York City, NY | Seattle, WA
inference system scaling · model performance optimization · compute efficiency · distributed systems for inference · bottleneck removal · robust inference solutions
Performance Engineer (mid)
Anthropic · San Francisco, CA | New York City, NY | Seattle, WA
supercomputing-scale ML · GPU/accelerator programming · ML framework internals · low-latency high-throughput sampling · low-precision inference kernels · custom load-balancing algorithms
Senior Software Engineer, Inference (senior)
Anthropic · Dublin, IE
LLM inference optimization · inference batching strategies · inference caching strategies · compute-agnostic inference deployments · AI accelerator orchestration · fleet-wide orchestration
Software Engineer, Compute Efficiency (mid)
Anthropic · San Francisco, CA | New York City, NY
telemetry and monitoring systems · cost attribution frameworks · infrastructure performance optimization · distributed systems analysis · capacity hotspots · multi-tenant infrastructure
Staff / Senior Software Engineer, Cloud Inference (senior)
Anthropic · San Francisco, CA | Seattle, WA
Cloud Inference · intelligent request routing · inference execution · capacity management · multi-cloud platforms · compute-agnostic inference deployments
Staff / Senior Software Engineer, Inference (senior)
Anthropic · San Francisco, CA | New York City, NY | Seattle, WA
compute-agnostic inference deployments · intelligent request routing · fleet-wide orchestration · diverse AI accelerators · distributed systems for AI · high-performance inference infrastructure
Technical Program Manager, Inference Performance (manager)
Anthropic · San Francisco, CA | Seattle, WA
Inference runtime optimization · Accelerator performance management · Cross-platform validation · Hardware target reliability · Infrastructure integration · Platform modernization
Senior Software Engineer, Model Serving (senior)
Databricks · San Francisco, California
Model Serving · Low-latency inference · GPU workloads · Autoscaling · Model container builds · Deployment workflows
Member of Technical Staff - Applied Inference (staff)
xAI · Palo Alto, CA
model serving · KV-cache systems · inference engines · tail performance · GPU kernels · batch scheduling
Member of Technical Staff - Inference (staff)
xAI · Palo Alto, CA
Model Inference Optimization · GPU Optimization · Quantization Techniques · Model Distillation · Speculative Decoding · Low-Precision Numerics
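
Model distillation, which both xAI roles above list, trains a small student to match a larger teacher's output distribution. The classic soft-target loss is a temperature-scaled KL divergence, sketched here with illustrative tensor shapes:

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-target knowledge-distillation loss (temperature-scaled KL)."""
    t = temperature
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    # Scale by T^2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (t * t)

# Illustrative shapes: batch of 4, vocabulary of 32 tokens.
student_logits = torch.randn(4, 32, requires_grad=True)
teacher_logits = torch.randn(4, 32)
print(distillation_loss(student_logits, teacher_logits).item())
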
Senior Software Engineer, macOS Specialist (senior)
Datadog · Bordeaux, France; Grenoble, France; Lyon, France; Montpellier, France; Nantes, France; Nice, France; Paris, France; Sophia Antipolis, France
agentic workflows · agentic investigations · incident troubleshooting · GenAI model deployment at scale · distributed tracing · application profiling
Senior Staff GenAI Engineer - Application Performance Monitoring (APM) (senior)
Datadog · New York, New York, USA
large-scale systems design · observability data platforms · distributed query engines · real-time event streaming · trillions of data points per day · technical leadership at scale
Staff GenAI Engineer - Application Performance Monitoring (APM) (staff)
Datadog · New York, New York, USA
AI Observability · LLM Observability · LLM evaluation · prompt iteration · agent debugging · LLM optimization