Staff Engineer, HPC Infrastructure jobs in United States

Tenstorrent · 13 hours ago

Staff Engineer, HPC Infrastructure

Tenstorrent is leading the industry on cutting-edge AI technology, revolutionizing performance expectations, ease of use, and cost efficiency. They are seeking a Staff HPC Engineer who will design and maintain automated bare-metal provisioning pipelines and ensure the performance and reliability of RHEL/Ubuntu systems as compute demands scale.

AI Infrastructure · Application Specific Integrated Circuit (ASIC) · Artificial Intelligence (AI) · Electronics · Machine Learning · Semiconductor
Comp. & Benefits
No H1B · U.S. Citizen Only

Responsibilities

Design and maintain automated bare-metal provisioning pipelines that deploy hundreds of compute nodes globally with consistent configurations
Implement infrastructure-as-code practices using Ansible to manage large-scale OS configuration across diverse hardware platforms
Own the lifecycle management of RHEL and Ubuntu systems—from initial deployment through patching, upgrades, and performance tuning
Build automation and tooling to streamline provisioning, patching, and system updates as the compute environment scales
Troubleshoot OS-level issues, optimize kernel parameters, and resolve system performance bottlenecks that impact EDA workflows
Work directly with hardware design teams to standardize system configurations, toolchains, and development environments
Deploy systems and manage their lifecycle across Tenstorrent's global engineering sites, ensuring consistency and reliability

Qualifications

IBM Spectrum LSF · HPC storage platforms · Linux system administration · Container technologies · Infrastructure-as-code · HPC networking · Troubleshooting skills · Startup environment adaptability

Required

Deep experience with IBM Spectrum LSF or similar workload managers
Strong background in commercial HPC storage platforms such as Pure Storage FlashBlade, Weka, and NetApp
Hands-on experience with container technologies (Docker, Singularity, Podman)
Solid Linux system administration skills
Understanding of HPC networking, storage architectures, and job scheduling
Ability to diagnose and resolve complex infrastructure issues independently
Comfortable working in a startup environment with rapidly changing requirements

Preferred

Experience supporting EDA tools and hardware design workflows in production HPC environments
Hands-on expertise with commercial HPC storage platforms (Pure Storage, Weka, NetApp) and workload managers (LSF, Slurm)
Container technologies (Docker, Singularity, Podman) for reproducible compute environments at scale
Advanced provisioning techniques (PXE boot, kickstart, cloud-init) and modern infrastructure automation patterns
Cluster monitoring and observability tools (Prometheus, Grafana) for managing thousands of compute nodes
Security hardening and compliance frameworks for multi-tenant semiconductor design environments
Integration of open-source and commercial tools to improve provisioning efficiency and reliability
Work in a deeply technical environment solving infrastructure challenges that directly impact chip design velocity

Benefits

Highly competitive compensation package and benefits

Company

Tenstorrent

Tenstorrent develops AI hardware and software solutions for data processing and machine learning applications.

Funding

Current Stage: Late Stage
Total Funding: $1.03B
Key Investors: Fidelity, EPIQ Capital Group, Eclipse Ventures
2024-12-02 · Series D · $693M
2023-08-02 · Series Unknown · $100M
2021-05-20 · Series C · $200M

Leadership Team

Jim Keller
CEO
Keith Witek
Chief Operating Officer and Board Member
Company data provided by Crunchbase