Senior Datacenter Systems Architect jobs in United States

Sustainable Talent · 1 month ago

Senior Datacenter Systems Architect

Sustainable Talent is seeking a high-level UNIX/Linux Systems Engineer to architect and operate their next-generation private cloud and GPU compute infrastructure. The role involves designing, scaling, and optimizing a world-class HPC/GPU datacenter environment while ensuring operational excellence and leading automation efforts.

Consulting · Human Resources · Information Technology
Growth Opportunities

Responsibilities

Architect, scale, and optimize complex UNIX/Linux-based compute clusters, GPU farms, and high-density datacenter systems
Own the design and strategy for on-prem HPC/GPU compute environments including OS architecture, distributed storage, network tuning, and interconnects
Perform deep-dive troubleshooting across all layers: kernel, network stack, RPC/NFS, storage protocols, firmware, drivers, bootloaders, and orchestration systems
Lead automation efforts using Python, Bash, Ansible, and IaC to eliminate manual processes and improve system reliability
Drive configuration standards for compute, network, and storage layers across bare-metal systems
Collaborate with architects, system software teams, networking teams, and hardware engineering to ensure platform scalability
Own operational excellence: uptime, performance tuning, incident response processes, and long-term platform strategy
Mentor and technically lead junior engineers and datacenter technicians

Qualifications

UNIX/Linux systems engineering · HPC/GPU architecture · Automation (Python, Bash, Ansible) · Networking fundamentals · Root-cause analysis · Distributed storage systems · Certifications (RHCE, CCNP) · Mentoring junior engineers

Required

8–15+ years in UNIX/Linux systems engineering, system administration, or HPC/compute infrastructure roles
Expert-level knowledge of Linux internals (kernel, storage subsystems, networking stack, cgroups, systemd, NUMA, etc.)
Proven experience architecting and running large-scale compute clusters or farms (HPC, HCI, GPU clusters, or bare-metal automation environments)
Deep understanding of compute, network, and storage architectures end-to-end
Demonstrated skill in root-cause analysis at multiple layers, including NFSv3/v4 deep troubleshooting, packet-level analysis, kernel performance tuning, and distributed storage (NetApp, Ceph, Lustre, BeeGFS, etc.)
Strong networking fundamentals: TCP/IP, VLANs, BGP, LACP, RoCE/RDMA, NIC offloading
Strong automation skills: Python, Bash, Ansible, Terraform, or other IaC tools
Experience with PXE provisioning, Kickstart, bare-metal deployments, and OS image pipelines

Preferred

Certifications strongly preferred: UNIX/Linux certs (RHCE, RHCSA, Linux Foundation), networking certs (CCNP, CCIE, JNCIP, etc.), storage certs (NetApp NCIE/NCDA or similar)
Experience designing GPU clusters or accelerator-dense environments
Deep experience with distributed filesystems, block storage tuning, or NFS debugging
Strong background in systems and platform performance engineering
Ability to continuously evaluate emerging technologies and build long-term architectural recommendations
Experience leading and mentoring infrastructure teams

Company

Sustainable Talent

Sustainable Talent provides staffing, consulting and outsourcing services.

Funding

Current Stage
Growth Stage

Leadership Team

Stephanie Rodis
CEO & Founder
Gabriella Meneses
Senior Technical & Corporate Recruiter | Talent Acquisition Business Partner
Company data provided by Crunchbase