Boson AI · 6 days ago
Site Reliability Engineer, AI/ML Infrastructure
Boson AI is looking for a Senior Site Reliability Engineer to help manage their GPU clusters in Toronto. The role involves overseeing HPC infrastructure, troubleshooting issues, and collaborating with engineering and science teams to ensure optimal performance and capacity planning.
Artificial Intelligence (AI), Information Technology, Market Research
Responsibilities
Manage and optimize HPC cluster operations
Deploy and maintain infrastructure-as-code solutions
Support ML/research teams with cluster usage optimization
Operate, troubleshoot and optimize Ceph storage clusters
Develop automation and tooling
Qualifications
Required
5+ years of experience in SRE or HPC operations
Proficiency in Linux systems administration (Ubuntu/Debian)
Experience with Kubernetes and container orchestration
Experience deploying and maintaining Ceph clusters larger than 1 PB
Knowledge of security best practices in multi-tenant environments
Understanding of L2/L3 networking fundamentals
Skilled in Python and Bash scripting
Preferred
Experience with infrastructure-as-code tools (Ansible/Terraform)
Experience with GitOps (Helm, ArgoCD)
Strong grasp of RDMA, InfiniBand, and GPUDirect technologies
Familiarity with deep learning frameworks such as PyTorch and TensorFlow
Familiarity with at least one cloud platform: AWS, Azure, or GCP
Company
Boson AI
Boson AI is an AI company that develops large language model tools.
H1B Sponsorship
Boson AI has a track record of offering H1B sponsorships. Please note that this does not
guarantee sponsorship for this specific role. Additional information is provided below for
your reference. (Data powered by the US Department of Labor)
Distribution of Job Fields Receiving Sponsorship (chart not shown; the highlighted field is the one similar to this job)
Trends of Total Sponsorships
2025 (4)
2024 (7)
2023 (2)
Funding
Current Stage
Early Stage
Company data provided by Crunchbase