Talent Vine · 3 weeks ago

Lead Site Reliability Engineer

Washington, D.C.

Full-time

Onsite

Senior Level, Lead/Staff

3+ years of experience

Talent Vine's client, based in Washington, D.C., is redefining modern defense technology delivery. The Lead Site Reliability Engineer will oversee the reliability, scalability, and performance of a major AI infrastructure project, ensuring operational excellence for a significant government program.

Human Resources

No H1B

Security Clearance Required

U.S. Citizen Only

Responsibilities

Lead infrastructure design, deployment, and operations for large-scale hardware clusters across secure and distributed environments

Install and configure physical systems, including high-density GPU servers, networking gear, and storage arrays

Build and deploy secure Linux images and containerized workloads using OpenShift and other orchestration platforms

Develop and manage automation pipelines for provisioning, configuration management, and monitoring using modern DevOps toolchains (Ansible, Terraform, etc.)

Operate and maintain distributed networking meshes across classified and unclassified domains

Implement and manage out-of-band management tools (IPMI, iDRAC, BMC, etc.) for remote troubleshooting and control

Integrate and optimize NVIDIA GPU infrastructure for AI/ML training and inference workloads

Collaborate with mission engineers, software teams, and government operators to ensure system readiness and performance

Provide on-site technical leadership for deployments, troubleshooting, and continuous improvement

Mentor junior engineers and establish operational best practices as the program scales

Qualifications

Site Reliability Engineering, Linux Systems Administration, DevOps Automation, NVIDIA GPU Infrastructure, Distributed Systems Management, Networking Architecture, Out-of-Band Management, AI/ML Frameworks, Mentoring Junior Engineers, Experience in DoD Environments, Palantir, Communication, Problem-Solving Skills, Documentation Skills

Required

3+ years of experience in site reliability, systems engineering, or hardware operations roles

Deep expertise with physical infrastructure: server racking, cabling, diagnostics, and troubleshooting

Strong Linux systems administration experience, including imaging and automated deployment

Hands-on experience managing large-scale clusters or distributed systems in OpenShift or Kubernetes

Familiarity with DevOps automation (Ansible, Terraform, CI/CD pipelines)

Experience configuring and managing networking and mesh architectures

Direct experience with NVIDIA GPUs, CUDA, and AI/ML frameworks

Proficiency with out-of-band management tools (IPMI/iDRAC)

Certifications: Linux+ and Security+ (required or in progress)

Excellent communication, documentation, and problem-solving skills

Clearance: Active TS/SCI required

Preferred

Experience operating in secure DoD or intelligence environments

Familiarity with Palantir platforms or other government data systems

Experience supporting AI/ML infrastructure in production or tactical settings

Experience tuning and monitoring HPC or GPU-accelerated clusters

Benefits

Competitive compensation

Robust benefits

Professional development and certification opportunities

Clear paths for growth

Company

Talent Vine

Founded on integrity, Talent Vine is the “anti-agency” recruiting firm empowering early- and growth-stage companies with the support needed to find the right talent for the right role at the right time.

Founded in 2021

Brooklyn, New York, US

2-10 employees

http://talent-vine.io

Funding

Current Stage

Early Stage

Company data provided by Crunchbase