Site Reliability Engineer (SRE) - AI/Defense
Ensure reliability of deployed AI systems and defense infrastructure
31 Open Positions
Core Skills
Kubernetes, Prometheus, Grafana, Incident Response, Reliability Engineering
Active Positions (31)
Senior Site Reliability Engineer·senior
Anduril·Boston, Massachusetts, United States
CI/CD pipeline optimization, observability culture, Business Systems infrastructure
Site Reliability Engineer, Discovery·mid
Anduril·Seattle, Washington, United States
Lattice OS, mission autonomy, mesh networking, sensor fusion, systems integration, robotics
Senior Site Reliability Engineer - Automation Platform (x/f/m)·senior
Doctolib·Paris, Paris, France
Engineering Manager - Platform Reliability·manager
Databricks·London, United Kingdom
Staff Site reliability engineer·staff
Writer·New York City, NY
AI native approach, Site reliability for AI platforms, Enterprise generative AI reliability, AI-powered workflow availability, Proactive systemic challenge solving, Resilient AI systems
Senior Site Reliability Engineer, Data Infrastructure·senior
CoreWeave·New York, NY / Bellevue, WA
Kubernetes-based data platform, multi-region system reliability, DevSecOps for data infrastructure, data platform observability
Storage Reliability Engineer·mid
CoreWeave·Livingston, NJ / New York, NY / Sunnyvale, CA / Bellevue, WA
storage systems for AI workloads, kernel-level debugging, production infrastructure triage, storage stack improvements
Lead/Manager Site Reliability Engineering Team (Amsterdam)·senior
Together AI·Amsterdam
PagerDuty, Ansible, Terraform
Senior Site Reliability Engineer, Production Engineering·senior
Anduril·Costa Mesa, California, United States
Lattice OS, Production Engineering, mission-critical systems, autonomous command and control, operational environments, reliability engineering
Staff Software Engineer, Site Reliability (SRE)·staff
Harvey AI·Bengaluru
site reliability, scaling across 50+ regions, mission-critical operations
Sr. Engineering Manager, SRE·senior
Abridge·SF Office
SLOs, multi-region deployment, multi-cloud deployment, application reliability roadmap, software replatforming, rearchitecture
Senior Site Reliability Engineer (x/f/m)·senior
Doctolib·Paris, Paris, France
Database optimization, Datastores health, Database reliability, Database availability, Database performance, Database automation
Software Engineer, Infrastructure Reliability·mid
OpenAI·San Francisco
Distributed system performance optimization, System resilience improvement, Observability platform development, Incident response postmortems, Infrastructure scalability patterns, Reliability guardrails
Senior Software Engineer, Site Reliability·senior
Anduril·Sydney, New South Wales, Australia
Tactical Networking, command and control (C2), collaborative autonomy, OSI reference model, Layer-1 Physical, Layer-2 Data
Engineering Manager - Observability & Reliability Engineering Obsession (x/f/m)·manager
Doctolib·Berlin, Berlin, Germany
Ruby on Rails backend foundations, PostgreSQL scalability, MongoDB integration, Platform as a Product mindset, Backend foundation management, CI pipeline automation
SRE / Incident Manager Team Leader (x/f/m)·senior
Doctolib·Paris, Paris, France
Incident Management, Problem Management, Operational Excellence, Reliability Engineering, Change Safety, Observability
Senior DevOps Engineer, Space·senior
Anduril·Costa Mesa, California, United States
Lattice OS, Space Domain Awareness (SDA), Space Control, SDANet, Infrastructure pipeline hardening, Test and release pipeline development
Senior Site Reliability Engineer - Observability (x/f/m)·senior
Doctolib·Berlin, Berlin, Germany; Paris, Paris, France
observability strategy, logging, metrics, tracing, alerting, incident detection
Site Reliability Engineer - Cybersecurity·mid
xAI·Palo Alto, CA
X Money platform, P2P payments, money transmission, hybrid cloud security, distributed systems security, security automation
Site Reliability Engineer (SRE)·mid
xAI·London, UK
Buildkite, ArgoCD, Prometheus, Grafana, PagerDuty, Pulumi
DevOps Engineer, IPS·mid
Scale AI·Doha, Qatar
Infrastructure as Code (Terraform), CloudFormation, CI/CD pipelines, containerized applications, VPCs, VPNs
Site Reliability Engineer / DevOps·mid
Scale AI·Mexico City, MX
robot stations, technical facilities management, on-site infrastructure, network installations, hardware troubleshooting, physical infrastructure
Deployment Site Reliability Engineer, Connected Warfare·mid
Anduril·Costa Mesa, California, United States
Lattice OS, autonomy, computer vision, sensor fusion, first principles aircraft design, system safety
Site Reliability Engineer Intern·intern
Dataiku·France, Paris
Dataiku Cloud, fully-managed offering, launchpad, SaaS portal, Cloud Engineering, SRE
Senior Site Reliability Engineer - Deployed, Connected Warfare·senior
Anduril·Costa Mesa, California, United States
system deployment, hardware installation, software installation, network expansion, customer mission support, mission critical capabilities
Senior Site Reliability Engineer - Developer, Connected Warfare·senior
Anduril·Costa Mesa, California, United States
warfighter capability delivery, deployment engineer support, system integration strategies, fault tolerant system delivery, scalable system delivery, modern technology solutions
Senior Site Reliability Engineer - Database (x/f/m)·senior
Doctolib·Nantes
LLM, VLM, RAG-based systems, AI Medical Companion, Vector Databases, Google Cloud Platform (GCP)
Senior Site Reliability Engineer - Tactical Reconnaissance & Strike·senior
Anduril·Atlanta, Georgia, United States
Lattice OS, autonomous drones, solid rocket motors, Ghost, Anvil, Bolt
Engineering Manager SRE (x/f/m)·manager
Doctolib·Paris, Paris, France
Automation Platform, CI/CD automation, Testing infrastructure, Ephemeral development environments, Developer productivity tooling, Contract testing
Member of Technical Staff - Infrastructure Reliability·staff
xAI·Palo Alto, CA
GPU supercluster reliability, high-QPS production systems, infrastructure automation in Rust, distributed infrastructure monitoring, training throughput optimization, storage infrastructure evolution
Site Reliability Engineer II·mid
Dataiku·United States, New York
pretraining, posttraining, science organization, technical operations, program management, execution engine