Moonlite AI · 9 hours ago

Sr. Site Reliability Engineer (SRE)

United States

Full-time

Remote

Senior Level

$165K/yr - $225K/yr

5+ years exp

Moonlite AI delivers high-performance AI infrastructure for organizations running intensive computational research and large-scale model training. The role involves building and operating production-grade AI infrastructure with a focus on Kubernetes, ensuring enterprise-grade reliability and operational excellence.

Computer Software

Responsibilities

Kubernetes Infrastructure Engineering: Design, build, and operate production Kubernetes clusters on bare-metal infrastructure – including cluster bootstrapping, control plane architecture, etcd management, and scaling strategies for high-performance compute workloads

Kubernetes Networking & CNIs: Implement and operate custom Kubernetes networking solutions with SR-IOV for high-performance GPU interconnects, multi-tenant isolation, and advanced network policies. Configure CNI plugins and network segmentation for research workloads

Custom Operators & Controllers: Develop and maintain custom Kubernetes operators and controllers for bare-metal provisioning, infrastructure lifecycle management, and resource orchestration across compute, storage, and networking domains

GPU Infrastructure Integration: Deploy and optimize the NVIDIA GPU Operator, device plugins, and custom scheduling logic for GPU workload placement and utilization optimization

Platform Integration & Storage: Build deep integrations between Kubernetes and underlying infrastructure including CSI drivers for storage, custom admission controllers for policy enforcement, and scheduling extensions for specialized hardware placement

Infrastructure Automation: Design and implement automation using Terraform, Ansible, Helm, and custom operators to orchestrate infrastructure workflows and enable deployments across multiple regions

Production Operations & Reliability: Manage production bare-metal infrastructure across multiple regions. Build systems ensuring high availability, fault tolerance, and graceful degradation – establishing SLIs, SLOs, and monitoring to meet enterprise reliability commitments

Observability & Incident Response: Build comprehensive monitoring, logging, and alerting using Prometheus, Grafana, and the ELK stack. Lead incident response, conduct postmortems, and implement preventative measures to improve reliability and reduce MTTR

Performance & Capacity Planning: Identify and resolve performance bottlenecks across infrastructure domains. Monitor utilization trends, forecast capacity needs, and optimize resource allocation for various workloads

Qualifications

Kubernetes Infrastructure, Infrastructure Automation, Observability & Monitoring, Linux Systems, Networking Fundamentals, Scripting & Automation, Reliability Practices, Problem-Solving Under Pressure, Collaboration & Communication

Required

5+ years in SRE, DevOps, or infrastructure engineering roles with proven experience operating production infrastructure at scale

Deep hands-on experience building and operating production Kubernetes clusters on bare-metal infrastructure – not just deploying workloads in managed clusters. Must understand cluster bootstrapping, control plane architecture, etcd operations, and scaling strategies

Strong understanding of Kubernetes internals including custom resource definitions (CRDs), operators, controllers, admission webhooks, and scheduling. Experience integrating storage (CSI drivers), networking (CNI, SR-IOV), and specialized hardware (GPU device plugins) with Kubernetes

Strong fundamentals in Linux systems administration, performance tuning, troubleshooting, and automation in production environments

Proficiency with infrastructure-as-code tools (Terraform, Ansible, Helm) and building automation to reduce operational overhead

Solid understanding of networking concepts including IPAM, DNS, DHCP, VLAN/VXLAN, routing, and load balancing, plus experience troubleshooting network issues in production

Experience building and maintaining comprehensive monitoring solutions using tools like Prometheus, Grafana, and centralized logging systems

Understanding of SRE principles including SLIs/SLOs/SLAs, error budgets, incident management, and blameless postmortems

Strong scripting skills in Go, Python, or Bash for automation, tooling development, and operational efficiency

Demonstrated ability to troubleshoot complex issues under pressure, manage incidents effectively, and communicate clearly during outages

Excellent communication skills and ability to work across teams including systems engineers, network engineers, and software developers

Preferred

Experience building custom Kubernetes operators or controllers for infrastructure orchestration

Deep familiarity with Kubernetes networking (Calico, Cilium, Multus), service mesh technologies, and network policy management

Experience with GPU workload orchestration including NVIDIA GPU Operator, MIG, time-slicing, and device plugins

Background with advanced Kubernetes features including custom schedulers, admission controllers, and API server extensions

Experience with Kubernetes cluster federation or multi-cluster management

Knowledge of high-performance networking technologies (InfiniBand, RDMA, RoCE) and their integration with Kubernetes

Experience with enterprise storage systems (VAST, Lightbits, Ceph, or similar)

Familiarity with configuration management at scale and GitOps practices

Understanding of security best practices for Kubernetes and bare-metal infrastructure

Experience operating infrastructure in regulated industries or co-located data center environments

Background supporting research institutions, technical computing environments, or enterprise AI infrastructure

Benefits

6% 401(k) match

Fully covered health insurance premiums

Other comprehensive offerings to support your well-being and success as we grow together

Company

Moonlite AI

Moonlite is building a cloud-native experience on-prem. Our software provides the control and customization enterprises need for AI.

Founded in 2024

Chicago, US

2-10 employees

https://moonlite.ai

Funding

Current Stage: Early Stage
