Infrastructure Operations Engineer jobs in United States

TensorWave · 1 month ago

Infrastructure Operations Engineer

TensorWave is on a mission to build seamless, secure, reliable, and resilient AI infrastructure at scale. The Infrastructure Operations Engineer will manage enterprise hardware, monitor systems, and support compute clusters across multiple data centers.

AI Infrastructure, Artificial Intelligence (AI), Cloud Computing, Cloud Infrastructure, Generative AI, IaaS

Responsibilities

Manage and maintain enterprise-grade server hardware including diagnostics and break/fix for CPUs, memory, disks, PSUs, and NICs
Operate out-of-band management systems for remote access and recovery - iLO, iDRAC, IPMI, Redfish
Design, build, and maintain infrastructure monitoring and alerting - Prometheus, Grafana, SNMP, or similar
Administer and troubleshoot Linux systems - OS install, boot issues, services, networking, filesystems, and access controls
Own bare-metal provisioning workflows - PXE/UEFI boot and automated node bring-up using MAAS, Foreman, or equivalents
Build and maintain infrastructure automation - shell scripting and CLI tooling to improve reliability and scale operations
Manage core networking - subnets, IP address management, VLANs, routing, NAT, and firewall configuration
Configure and support secure connectivity such as VPNs - IPsec, WireGuard, OpenVPN
Support Kubernetes clusters at the infrastructure layer - node lifecycle, access, basic troubleshooting, and scaling
Partner with internal teams to ensure compute clusters remain reliable, secure, and scalable across multiple data centers

Qualifications

Enterprise hardware management, Infrastructure automation, Linux system administration, Network administration, Kubernetes support, Monitoring systems, Bare metal provisioning, Automation languages, Soft skills

Required

Bachelor of Science in Computer Science, Computer Engineering, or a related technical field, or equivalent practical experience
Proven experience managing enterprise-grade hardware at scale
Expertise with automation languages such as Python, Go, PHP, or Perl
Strong understanding of out-of-band management systems - IPMI, BMC, Redfish
Hands-on expertise with monitoring systems - Prometheus, Grafana, SNMP, Nagios, CheckMK, or similar
Solid knowledge of network administration - firewalls, routing, VPNs, NAT, and managed switches
Linux system administration experience - installation, configuration, troubleshooting
Experience with filesystems - RAID, partitioning, and general storage management
Familiarity with certificate management, key-based authentication, and cryptographic functions
Experience with bare metal provisioning - MAAS, Foreman, or similar
Understanding of PXE/UEFI/HTTP boot systems
Ability to write functional, maintainable bash scripts for automation

Preferred

Experience with Kubernetes - operators, cluster scaling, CRDs
Experience with Helm chart customization
Exposure to high-availability or distributed compute environments
Knowledge of infrastructure security and hardening practices

Benefits

100% paid Medical, Dental, and Vision insurance
Flexible PTO
Paid Holidays
401(k)
Parental Leave
Flexible Spending Account
Short Term Disability Insurance
Life and Voluntary Supplemental Insurance
Mental Health Benefits through Spring Health

Company

TensorWave

TensorWave is an AMD GPU-exclusive cloud that supports training and inference at scale.

Funding

Current Stage: Growth Stage
Total Funding: $146.71M
Key Investors: Nexus Venture Partners, FundNV
2025-05-14 · Series A · $100M
2024-10-08 · Seed · $43M
2024-04-23 · Seed · $0.89M

Leadership Team

Darrick Horton
Co-Founder / CEO
Piotr Tomasik
Co-Founder, President & COO
Company data provided by Crunchbase