
Sandia National Laboratories · 6 hours ago

Senior/Principal - Artificial Intelligence Infrastructure, NM/CA - Hybrid

Sandia National Laboratories is the nation’s premier science and engineering lab for national security and technology innovation. The role involves designing, deploying, and operating the unified compute-and-data fabric for the U.S. Department of Energy's next-generation AI Platform, and calls for innovative solutions that meet the unique needs of DOE applications.

Government · Information Technology · National Security
Growth Opportunities
No H1B · Security Clearance Required · U.S. Citizen Only

Responsibilities

Architect and implement the hybrid compute fabric
  Integrate exascale HPC systems with elastic cloud resources and specialized AI accelerator clusters (on-prem and in-cloud)
  Deploy ruggedized edge servers and digital-twin infrastructure for sub-millisecond inference and real-time physics simulations
Develop infrastructure services and orchestration
  Build federated Kubernetes clusters, container registry services, resource registry, and job scheduling abstractions
  Implement self-configuring distributed clusters with intelligent network overlays, AI-driven traffic steering, and sensor-driven control loops
Design secure networking and enclaves
  Configure ESnet-backed, multi-tier WAN overlays with low-latency, geo-diverse routing, failover, and encryption protocols
  Provide software-defined, dynamic security enclaves for CUI/Restricted Data with attested runtime and curated egress
Enable observability, provenance & monitoring
  Deploy unified logging, metrics, dashboards, and trace-analysis across cloud and on-prem environments using OpenTelemetry, Prometheus, ELK, or equivalent
  Automate provenance capture for compute jobs, data movements, and AI workflows
Support federated identity and access control
  Integrate multiple identity providers, attribute-based access controls, and allocation models for risk-shared governance
  Manage enterprise licensing, token agreements, and software audits for AI and HPC frameworks
Manage the full lifecycle of the AI platform's infrastructure, including capacity planning, upgrades, documentation, and performance monitoring
Implement and enforce security best practices within container environments, including Role-Based Access Control (RBAC), secrets management, network policies, and vulnerability scanning
Stand up a new GPU-accelerated cluster, configure Slurm/Kubernetes, and validate performance benchmarks
Troubleshoot cross-site data transfers over ESnet and optimize WAN throughput for a petabyte-scale lakehouse
Deploy a hardened enclave for a classified ML training job with differential-privacy egress controls
Script an IaC workflow (Terraform/Ansible) to provision edge compute nodes
Collaborate with the Models team to tune network and storage parameters for distributed training jobs
Present real-time infrastructure status and forecasts to stakeholders
Present prototype demos and research results to stakeholders across DOE, DoD, IC, and industry

Qualifications

HPC systems administration · Container orchestration · Infrastructure as code · Networking background · Observability toolchains · Software development · Digital-twin architectures · DevSecOps principles · Secure enclave technologies · Collaboration skills · Mentoring skills

Required

Bachelor's degree in Computer Science, Electrical Engineering, Mathematics, or a related STEM field plus five (5) years of directly relevant experience, or an equivalent combination of education and experience
Ability to acquire and maintain a DOE Q clearance

Preferred

Graduate degree in a relevant computationally-intensive discipline where an independent research project was a graduation requirement (e.g., independent project, thesis, or dissertation)
Experience in developing software and AI systems for enterprise and national security applications
Demonstrated software development skills and familiarity with modern software development practices
Proven ability to work and communicate effectively in a collaborative, interdisciplinary team environment, including mentoring junior engineers
Ph.D. in a STEM field with a focus on high-performance or distributed computing (Data Science, Data and Computing Systems, Informatics, or a related STEM field with a significant data systems research component)
Experience with HPC systems administration (Slurm, PBS, Flux) and cloud platforms (AWS, Azure, GCP)
Proficiency in container orchestration (Kubernetes, Docker) and infrastructure as code (Terraform, Ansible)
Networking background: ESnet, VLANs, WAN overlays, encryption, and failover design
Hands-on experience with storage architectures: Lustre, GPFS, object stores, multi-tier caching
Experience implementing DevSecOps principles and security best practices in containerized infrastructure, including network policies, classification management, and vulnerability scanning
Experience with federated Kubernetes and large-scale container platforms across classification domains
Familiarity with secure enclave technologies and zero-trust security models
Background in digital-twin or real-time simulation steering architectures
Experience with SIEM environments, such as Splunk, and operational management of application infrastructure services
Proficiency in observability toolchains (OpenTelemetry, Prometheus, Grafana, ELK) and automated log analytics
Knowledge of DOE/NNSA compute and networking environments (Frontier, Aurora, Perlmutter, ESnet)
Experience integrating experimental facilities, robotics, or 3D-printing systems into automated AI workflows
Experience deploying large-scale secure enclaves for CUI and RD applications
Experience coordinating public-private partnerships on HPC and AI infrastructure deployments
Experience working in cross-lab federated teams with shared governance and risk models
Ability to obtain and maintain an SCI clearance, which may require a polygraph test

Benefits

Generous vacation
Strong medical and other benefits
Competitive 401k
Learning opportunities
Relocation assistance
Amenities aimed at creating a solid work/life balance

Company

Sandia National Laboratories

Sandia conducts research and development into the non-nuclear components of nuclear weapons.

Funding

Current Stage: Late Stage
Total Funding: $4.4M
Key Investors: US Department of Energy, ARPA-E
2023-09-21 · Grant · $0.5M
2023-07-27 · Grant
2023-01-10 · Grant · $3.7M

Leadership Team

Laura McGill
Deputy Laboratories Director - Nuclear Deterrence, and Chief Technology Officer
Maria Gallardo
CFO Enterprise Risk Management Program Lead

Recent News

Inside HPC & AI News | High-Performance Computing & Artificial Intelligence
Company data provided by crunchbase