The Voleon Group · 16 hours ago

Senior Cluster Site Reliability Engineer

United States

Full-time

Remote

Senior Level

$205K/yr - $235K/yr

5+ years exp

The Voleon Group is a technology company that applies advanced machine learning techniques to finance. As a Senior Cluster Site Reliability Engineer, you will be responsible for maintaining the uptime and reliability of research compute clusters, ensuring they meet the needs of the organization while collaborating with engineering teams to improve operational frameworks and metrics.

Financial Services, Venture Capital

H1B Sponsor Likely

Responsibilities

Be a first responder in the event of cluster outages or issues. Triage and resolve urgent issues as they arise

Ensure a high degree of cluster uptime (measured in multiple nines), and define and track SLAs to quantify reliability

Diagnose systemic/recurring patterns of problems, and engineer precision solutions to them in collaboration with engineering teams

Develop robust metrics and observability for cluster health and use those metrics to inform your work. Build out custom observability mechanisms when off-the-shelf ones won't do

Help software and research teams design policies around fair cluster usage, and help develop enforcement mechanisms for said policies

Assist in forecasting cluster growth, and help select appropriate scale-up strategies. Help optimize operations across dimensions of cost and usability

Qualifications

HPC frameworks, Infrastructure-as-code, Cloud infrastructure, Observability stacks, Scripting languages, Distributed storage, Containerization, System engineer mindset, Security/IAM foundations, Machine learning frameworks

Required

5+ years of experience in SRE or DevOps roles, preferably working as a senior engineer or tech lead

Knowledge of HPC/batch compute frameworks (Slurm, Kueue, AWS/GCP Batch) and/or machine learning training systems (Kubeflow, MLflow, Horovod)

Ability to develop scripts and utilities of moderate complexity in a common scripting language (Python, Ruby, etc.)

Familiarity with infrastructure-as-code and configuration management tools (Terraform, Ansible)

Experience with cloud infrastructure (AWS or GCP)

Familiarity designing and implementing modern observability stacks (Prometheus, Grafana, Loki, ELK, OpenTelemetry)

Experience with distributed storage technologies (Lustre, Ceph, S3)

Embodies a 'system engineer' rather than 'system administrator' mindset, thinking systematically and leveraging automation

Bachelor's degree in computer science

Preferred

Hands-on experience with HPC frameworks (Slurm, Grid Engine) and Kubernetes-based job orchestrators (Airflow, Kueue, Kubeflow Pipelines), along with other distributed computing frameworks (Ray, Modin, Dask, Spark)

Familiarity with ML frameworks (PyTorch/TensorFlow, JAX, Horovod, DeepSpeed)

Familiarity with hybrid/on-prem environments

Experience with containerization (Docker, Podman, Singularity), particularly for HPC/batch compute environments

Experience with HPC networking (InfiniBand, RDMA)

Solid security/IAM foundations (Identity management systems, AWS/GCP IAM, Zero Trust)

Benefits

Medical, dental and vision coverage

Life and AD&D insurance

20 days of paid time off

9 sick days

401(k) plan with a company match

Company

The Voleon Group

Glassdoor: 4.0

The Voleon Group is a family of companies committed to the development & deployment of cutting-edge technologies in investment management.

Founded in 2008

Berkeley, California, USA

201-500 employees

http://voleon.com/

H1B Sponsorship

The Voleon Group has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data powered by the US Department of Labor.)

Trends of Total Sponsorships

2025 (2)

2024 (2)

2023 (3)

2022 (4)

2021 (1)

2020 (1)

Funding

Current Stage

Growth Stage

Leadership Team

Jon McAuliffe

Chief Investment Officer and Co-founder

Prem Gopalan

Chief Technology Officer

Company data provided by Crunchbase