
Autopoiesis Sciences · 1 hour ago

Senior DevOps Engineer

Autopoiesis Sciences is an applied AI lab based in San Francisco, California on a mission to accelerate scientific discovery across every discipline. As a Senior DevOps Engineer, you will design and operate the infrastructure that enables ML researchers to train massive language models and deploy AI systems that interact with real-world scientific workflows.

Artificial Intelligence (AI) · Biotechnology · Machine Learning · Medical

Responsibilities

Design and implement distributed training infrastructure for large language models and RL systems, including multi-node GPU orchestration, fault tolerance, and experiment tracking
Architect and lead the build-out of GPU clusters for ML workloads, including resource scheduling, utilization monitoring, and cost optimization across cloud and on-premise environments
Own the technical vision and roadmap for core infrastructure platforms that support the entire research organization
Develop data pipeline architecture for ingesting, processing, and serving scientific literature, experimental data, and training datasets at petabyte scale
Create model serving infrastructure that enables efficient inference for large language models in production research workflows, including batching, caching, and autoscaling
Implement monitoring, logging, and observability systems that provide deep visibility into training runs, model performance, and system health
Build CI/CD pipelines for ML workflows, including automated testing for model training code, reproducible experiment environments, and deployment automation
Develop infrastructure-as-code solutions using tools like Terraform and Kubernetes to ensure reproducible, version-controlled infrastructure
Collaborate with ML researchers to understand computational requirements and translate them into scalable infrastructure solutions
Provide technical leadership and mentorship to junior engineers, establishing best practices and standards for infrastructure engineering
Make critical architectural decisions that will shape the foundation of our AI research platform for years to come
Optimize cost and performance across cloud providers (AWS, GCP, Azure) and specialized ML infrastructure providers
Ensure security, compliance, and data governance practices are embedded throughout the infrastructure stack

Qualifications

DevOps · Infrastructure Engineering · GPU Infrastructure · Cloud Platforms · Python · Containerization · Infrastructure-as-Code · Monitoring Tools · Technical Leadership · Soft Skills

Required

Bachelor's degree or higher in Computer Science, Systems Engineering, or a related technical field
5+ years of experience in DevOps, SRE, infrastructure engineering, or related areas, with demonstrated technical leadership
Strong programming skills in Python, Go, or similar languages, with deep experience in infrastructure automation and systems design
Demonstrated expertise in containerization (Docker, Kubernetes) and orchestration of distributed systems at scale
Proven track record of architecting and operating GPU infrastructure for ML workloads, including familiarity with CUDA, GPU drivers, and ML-specific optimization
Deep understanding of distributed training frameworks (PyTorch Distributed, DeepSpeed, Ray, etc.) and their infrastructure requirements
Expert-level grasp of cloud platforms (AWS, GCP, or Azure), networking, storage systems, and compute optimization
Experience with infrastructure-as-code tools (Terraform, Ansible, CloudFormation) and CI/CD systems
Proven ability to debug complex distributed systems and optimize performance bottlenecks
Experience with monitoring and observability tools (Prometheus, Grafana, ELK stack, Datadog, or similar)

Preferred

Experience with MLOps platforms (MLflow, Weights & Biases, etc.)
Contributions to open-source infrastructure projects
Experience with HPC or scientific computing environments
Background in ML or data science
Experience scaling training infrastructure for large language models

Benefits

Competitive compensation, equity, and benefits
Collaborate in building a thoughtful, mission-driven team culture focused on scientific discovery
Frequent team events, dinners, off-sites, and gatherings

Company

Autopoiesis Sciences

Autopoiesis Sciences develops AI models that enable genuine scientific breakthroughs and discovery through advanced reasoning capabilities.

Funding

Current Stage: Early Stage
Total Funding: unknown
Key Investors: Informed Ventures (2025-07-30 · Seed)
Company data provided by Crunchbase