Talent Vine · 3 weeks ago
Lead Site Reliability Engineer
Talent Vine's client, based in Washington, D.C., is redefining modern defense technology delivery. The Lead Site Reliability Engineer will oversee the reliability, scalability, and performance of a major AI infrastructure project, ensuring operational excellence for a significant government program.
Responsibilities
Lead infrastructure design, deployment, and operations for large-scale hardware clusters across secure and distributed environments
Install and configure physical systems, including high-density GPU servers, networking gear, and storage arrays
Build and deploy secure Linux images and containerized workloads using OpenShift and other orchestration platforms
Develop and manage automation pipelines for provisioning, configuration management, and monitoring using modern DevOps toolchains (Ansible, Terraform, etc.)
Operate and maintain distributed networking meshes across classified and unclassified domains
Implement and manage out-of-band management tools (IPMI, iDRAC, BMC, etc.) for remote troubleshooting and control
Integrate and optimize NVIDIA GPU infrastructure for AI/ML training and inference workloads
Collaborate with mission engineers, software teams, and government operators to ensure system readiness and performance
Provide on-site technical leadership for deployments, troubleshooting, and continuous improvement
Mentor junior engineers and establish operational best practices as the program scales
Qualifications
Required
3+ years of experience in site reliability, systems engineering, or hardware operations roles
Deep expertise with physical infrastructure: server racking, cabling, diagnostics, and troubleshooting
Strong Linux systems administration experience, including imaging and automated deployment
Hands-on experience managing large-scale clusters or distributed systems in OpenShift or Kubernetes
Familiarity with DevOps automation (Ansible, Terraform, CI/CD pipelines)
Experience configuring and managing networking and mesh architectures
Direct experience with NVIDIA GPUs, CUDA, and AI/ML frameworks
Proficiency with out-of-band management tools (IPMI/iDRAC)
Certifications: Linux+ and Security+ (required or in progress)
Excellent communication, documentation, and problem-solving skills
Clearance: Active TS/SCI required
Preferred
Experience operating in secure DoD or intelligence environments
Familiarity with Palantir platforms or other government data systems
Experience supporting AI/ML infrastructure in production or tactical settings
Experience tuning and monitoring HPC or GPU-accelerated clusters
Benefits
Competitive compensation
Robust benefits
Professional development and certification opportunities
Clear paths for growth
Company
Talent Vine
Founded on integrity, Talent Vine is the “anti-agency” recruiting firm that gives early- and growth-stage companies the support they need to find the right talent for the right role at the right time.
Funding
Current Stage
Early Stage
Company data provided by Crunchbase