Site Reliability Engineer, AI & HPC Infrastructure @ Tesla | Jobright.ai
This job has closed.

Tesla · 3 days ago

Site Reliability Engineer, AI & HPC Infrastructure

Automotive · Electric Vehicle
H1B Sponsorship


Responsibilities

Support the AI/ML cluster infrastructure on both GPU and Dojo platforms, focusing on systems automation, configuration management and deployment at scale
Improve our cluster health monitoring and auto-recovery pipeline
Work with users on debugging application performance issues
Work with hardware and storage vendors to tune and optimize servers, storage and network
Write Ansible playbooks for configuration management
Performance tuning and OS provisioning on Linux systems
Manage HPC clusters, workloads and applications
Automation and systems engineering in Python, Golang or Bash/Shell
Participate in 24x7 on-call rotation

Qualification


Python, Golang, Bash, Linux Fundamentals, Performance Optimizations, TCP/IP, IPoIB, Filesystems, Storage Technologies, Storage Protocols, Configuration Management, Systems Monitoring, Alerting, HPC Workload Managers, High-Throughput Networks, Low-Latency Networks, GPU-Based Computing Systems, High Performance Storage Systems, Slurm, LSF, Storage Management, Distributed Parallel File Systems

Required

Proficiency in Python, Golang and/or Bash
Proficiency with Linux fundamentals and performance optimizations (Ubuntu/RHEL OS)
Demonstrable knowledge of TCP/IP, IPoIB, Linux operating system internals, filesystems, disk/storage technologies and storage protocols
Experience collaborating with network and data center teams for large scale cluster builds
Experience with configuration management software (Ansible, etc.), systems monitoring & alerting (Prometheus, Grafana, Telegraf, Splunk, etc.), and/or administering HPC workload managers
Experience with high-throughput low-latency networks, GPU-based computing systems, and/or high performance storage systems
Bachelor's Degree in Computer Science, Computer Engineering, Electrical Engineering, Physics or proof of exceptional skills in related field
3+ years of additional equivalent experience or evidence of exceptional ability related to the position

Preferred

Experience with Slurm, LSF, and storage management of distributed parallel file systems

Benefits

Aetna PPO and HSA plans
Family-building, fertility, adoption and surrogacy benefits
Dental and vision plans
Company-paid Health Savings Account (HSA) contribution
Healthcare and Dependent Care Flexible Spending Accounts
LGBTQ+ care concierge services
401(k) with employer match
Employee Stock Purchase Plans
Company paid Basic Life, AD&D, short-term and long-term disability insurance
Employee Assistance Program
Sick and Vacation time
Paid Holidays
Back-up childcare and parenting support resources
Voluntary benefits: critical illness, hospital indemnity, accident insurance, theft & legal services, pet insurance
Weight Loss and Tobacco Cessation Programs
Tesla Babies program
Commuter benefits
Employee discounts and perks program

Company

Tesla Motors is an electric vehicle and clean energy company that provides electric cars, solar, and renewable energy solutions.

H1B Sponsorship

Tesla has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data powered by the US Department of Labor)
Distribution of Different Job Fields Receiving Sponsorship
Trends of Total Sponsorships
2023 (1061)
2022 (1216)
2021 (1146)
2020 (782)

Funding

Current Stage
Public Company
Total Funding
$19.37B
Key Investors
European Union, PennDOT, Industrial and Commercial Bank of China
2023-09-13 · Grant · $159.6M
2023-08-15 · Grant · $0.23M
2022-06-30 · Post-IPO Equity · $20M

Leadership Team

Tom Zhu
Senior Vice President
Kenneth Morgan
Vice President, Sales Finance
Company data provided by Crunchbase