
Bridge Defense · 1 month ago

Lead Site Reliability Engineer

Bridge Defense is redefining how modern defense technology is delivered, focusing on national security solutions for the Department of Defense and federal law enforcement. The Lead Site Reliability Engineer will be responsible for ensuring the reliability and performance of advanced hardware and AI infrastructure, leading deployment and automation efforts across secure environments.

Defense & Space
No H1B · Security Clearance Required · U.S. Citizen Only

Responsibilities

Lead infrastructure design, deployment, and operations for ComputeBridge hardware clusters across secure and distributed environments
Install and configure physical systems, including high-density GPU servers, networking gear, and storage arrays
Build and deploy secure Linux images and containerized workloads using OpenShift and other orchestration platforms
Develop and manage automation pipelines for provisioning, configuration management, and monitoring using modern DevOps toolchains (Ansible, Terraform, etc.)
Operate and maintain distributed networking meshes across multiple classified and unclassified domains
Implement and manage out-of-band management tools (IPMI, iDRAC, BMC, etc.) for remote troubleshooting and control
Integrate and optimize NVIDIA GPU infrastructure for AI/ML training and inference workloads
Collaborate with mission engineers, software teams, and government operators to ensure system readiness and performance
Provide on-site technical leadership for deployments, troubleshooting, and continuous improvement
Mentor junior engineers and establish operational best practices across the ComputeBridge program as the contract grows

Qualifications

Site Reliability Engineering
Linux Systems Administration
DevOps Automation
NVIDIA GPU Infrastructure
Distributed Systems Management
Networking Configuration
Out-of-band Management
AI/ML Frameworks
Mentoring Junior Engineers
Experience in DoD Environments
Communication
Problem-solving Skills
Documentation Skills

Required

3+ years of experience in site reliability, systems engineering, or hardware operations roles
Deep expertise with physical infrastructure: server racking, cabling, diagnostics, and troubleshooting
Strong experience with Linux systems administration, imaging, and automated deployment
Hands-on experience managing large-scale clusters or distributed systems in OpenShift or Kubernetes environments
Familiarity with DevOps automation (Ansible, Terraform, CI/CD pipelines)
Experience configuring and managing networking and mesh architectures
Direct experience with NVIDIA GPUs, CUDA, and related AI/ML frameworks
Proficiency with out-of-band management and IPMI/iDRAC tooling
Certifications: Linux+ and Security+ (required or in progress)
Excellent communication, documentation, and problem-solving skills
Clearance: Active TS/SCI, or the ability to obtain one, required

Preferred

Experience operating in secure DoD or intelligence environments
Familiarity with Palantir platforms or other government data systems
Prior experience supporting AI/ML infrastructure in production or tactical settings
Experience with performance tuning and monitoring of HPC or GPU-accelerated clusters

Benefits

Competitive compensation
Robust benefits
Professional development and certification opportunities
Clear paths for growth

Company

Bridge Defense

Bridge Defense is an investment firm that builds and scales defense technology.

Funding

Current Stage
Early Stage
Company data provided by Crunchbase