Staff Engineer, HPC Systems Software jobs in United States

Tenstorrent · 2 weeks ago

Staff Engineer, HPC Systems Software

Tenstorrent is leading the industry on cutting-edge AI technology, revolutionizing performance expectations, ease of use, and cost efficiency. We are seeking an HPC Systems Engineer to architect and maintain the operating system foundation that powers our global hardware design infrastructure.

AI Infrastructure, Application Specific Integrated Circuit (ASIC), Artificial Intelligence (AI), Electronics, Machine Learning, Semiconductor
Comp. & Benefits
No H1B · U.S. Citizen Only

Responsibilities

Design and maintain automated OS deployment pipelines for bare-metal HPC clusters globally
Manage large-scale configuration management using Ansible to ensure consistency across compute infrastructure
Deploy and manage the lifecycle of RHEL and Ubuntu systems across diverse hardware platforms
Implement infrastructure-as-code for repeatable, version-controlled system configurations
Troubleshoot OS-level issues, optimize kernel parameters, and resolve system performance bottlenecks
Collaborate with hardware design teams to standardize system configurations, toolchains, and development environments
Build automation and tooling to streamline provisioning, patching, and system updates at scale

Qualifications

Linux expertise, Ansible, HPC systems, RHEL administration, Ubuntu administration, Infrastructure-as-code, Bare-metal provisioning, Python scripting, Bash scripting, Collaboration

Required

Experienced in RHEL and Ubuntu administration in HPC or large-scale compute environments
Highly skilled in Ansible for automation and configuration management across hundreds of nodes
Proficient with bare-metal provisioning systems (MAAS, Foreman, Cobbler, Warewulf, or similar)
Deep understanding of Linux system internals, networking, kernel tuning, and performance troubleshooting
Familiar with HPC cluster architecture, workflows, and infrastructure-as-code practices
Capable of diagnosing and resolving complex infrastructure issues independently in fast-paced environments

Preferred

Hands-on experience with IBM Spectrum LSF or similar HPC workload managers
Integration with commercial HPC storage platforms (Pure Storage, Weka, NetApp, DDN, Vast Data)
Deep exposure to EDA tools and hardware design workflows in semiconductor development
Container technologies (Docker, Singularity, Podman) for reproducible compute environments
Cluster monitoring and observability at scale using Prometheus, Grafana, and custom tooling
Advanced provisioning techniques including PXE boot, kickstart, cloud-init, and BMC/IPMI integration
Security hardening and compliance frameworks for multi-tenant HPC environments
Python and Bash scripting for production-level infrastructure automation

Benefits

Highly competitive compensation package and benefits

Company

Tenstorrent

Tenstorrent develops AI hardware and software solutions for data processing and machine learning applications.

Funding

Current Stage: Late Stage
Total Funding: $1.03B
Key Investors: Fidelity, EPIQ Capital Group, Eclipse Ventures
2024-12-02 · Series D · $693M
2023-08-02 · Series Unknown · $100M
2021-05-20 · Series C · $200M

Leadership Team

Jim Keller, CEO
Keith Witek, Chief Operating Officer
Company data provided by crunchbase