Boson AI · 6 days ago

Site Reliability Engineer, AI/ML Infrastructure

Santa Clara, CA

Full-time

Onsite

Senior Level

$150K/yr - $250K/yr

5+ years exp

Boson AI is looking for a Senior Site Reliability Engineer to help manage their GPU clusters in Toronto. The role involves overseeing HPC infrastructure, troubleshooting issues, and collaborating with engineering and science teams to ensure optimal performance and capacity planning.

Artificial Intelligence (AI) · Information Technology · Market Research

H1B Sponsor Likely

Responsibilities

Manage and optimize HPC cluster operations

Deploy and maintain infrastructure-as-code solutions

Support ML/research teams with cluster usage optimization

Operate, troubleshoot and optimize Ceph storage clusters

Develop automation and tooling

Qualifications

HPC operations · Linux systems administration · Kubernetes · Ceph storage · Infrastructure-as-code · Python scripting · Bash scripting · Networking fundamentals · Security best practices · GitOps · Deep learning frameworks · Cloud platforms

Required

5+ years of experience in SRE or HPC operations

Proficiency in Linux systems administration (Ubuntu/Debian)

Experience with Kubernetes and container orchestration

Experience deploying and maintaining Ceph clusters larger than 1 PB

Knowledge of security best practices in multi-tenant environments

Understanding of L2/L3 networking fundamentals

Skilled in Python and Bash scripting

Preferred

Experience with infrastructure-as-code tools (Ansible/Terraform)

Experience with GitOps (Helm, ArgoCD)

Strong grasp of RDMA, InfiniBand, and GPUDirect technologies

Familiarity with deep learning frameworks such as PyTorch and TensorFlow

Familiarity with at least one cloud platform: AWS, Azure, or GCP

Company

Boson AI

Boson AI is an AI company that develops large language model tools.

Founded in 2023

Santa Clara, California, USA

11-50 employees

https://boson.ai/

H1B Sponsorship

Boson AI has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data from the US Department of Labor)

Distribution of Different Job Fields Receiving Sponsorship

Trends of Total Sponsorships

2025 (4)

2024 (7)

2023 (2)

Funding

Current Stage

Early Stage

Leadership Team

Alex Smola

CEO & Co-Founder

Mu Li

Co-Founder & CTO

Company data provided by Crunchbase