Moonlite AI · 9 hours ago
Sr. Site Reliability Engineer (SRE)
Moonlite AI delivers high-performance AI infrastructure for organizations running intensive computational research and large-scale model training. The role involves building and operating production-grade AI infrastructure with a focus on Kubernetes, ensuring enterprise-grade reliability and operational excellence.
Computer Software
Responsibilities
Kubernetes Infrastructure Engineering: Design, build, and operate production Kubernetes clusters on bare-metal infrastructure – including cluster bootstrapping, control plane architecture, etcd management, and scaling strategies for high-performance compute workloads
Kubernetes Networking & CNIs: Implement and operate custom Kubernetes networking solutions with SR-IOV for high-performance GPU interconnects, multi-tenant isolation, and advanced network policies. Configure CNI plugins and network segmentation for research workloads
Custom Operators & Controllers: Develop and maintain custom Kubernetes operators and controllers for bare-metal provisioning, infrastructure lifecycle management, and resource orchestration across compute, storage, and networking domains
GPU Infrastructure Integration: Deploy and optimize the NVIDIA GPU Operator, device plugins, and custom scheduling logic for GPU workload placement and utilization optimization
Platform Integration & Storage: Build deep integrations between Kubernetes and underlying infrastructure including CSI drivers for storage, custom admission controllers for policy enforcement, and scheduling extensions for specialized hardware placement
Infrastructure Automation: Design and implement automation using Terraform, Ansible, Helm, and custom operators to orchestrate infrastructure workflows and enable deployments across multiple regions
Production Operations & Reliability: Manage production bare-metal infrastructure across multiple regions. Build systems ensuring high availability, fault tolerance, and graceful degradation – establishing SLIs, SLOs, and monitoring to meet enterprise reliability commitments
Observability & Incident Response: Build comprehensive monitoring, logging, and alerting using Prometheus, Grafana, and the ELK stack. Lead incident response, conduct postmortems, and implement preventative measures to improve reliability and reduce MTTR
Performance & Capacity Planning: Identify and resolve performance bottlenecks across infrastructure domains. Monitor utilization trends, forecast capacity needs, and optimize resource allocation for various workloads
Qualifications
Required
5+ years in SRE, DevOps, or infrastructure engineering roles with proven experience operating production infrastructure at scale
Deep hands-on experience building and operating production Kubernetes clusters on bare-metal infrastructure – not just deploying workloads in managed clusters. Must understand cluster bootstrapping, control plane architecture, etcd operations, and scaling strategies
Strong understanding of Kubernetes internals including custom resource definitions (CRDs), operators, controllers, admission webhooks, and scheduling. Experience integrating storage (CSI drivers), networking (CNI, SR-IOV), and specialized hardware (GPU device plugins) with Kubernetes
Strong fundamentals in Linux systems administration, performance tuning, troubleshooting, and automation in production environments
Proficiency with infrastructure-as-code tools (Terraform, Ansible, Helm) and building automation to reduce operational overhead
Solid understanding of networking concepts including IPAM, DNS, DHCP, VLAN/VXLAN, routing, and load balancing, plus experience troubleshooting network issues in production
Experience building and maintaining comprehensive monitoring solutions using tools like Prometheus, Grafana, and centralized logging systems
Understanding of SRE principles including SLIs/SLOs/SLAs, error budgets, incident management, and blameless postmortems
Strong programming and scripting skills in Go, Python, or Bash for automation, tooling development, and operational efficiency
Demonstrated ability to troubleshoot complex issues under pressure, manage incidents effectively, and communicate clearly during outages
Excellent communication skills and ability to work across teams including systems engineers, network engineers, and software developers
Preferred
Experience building custom Kubernetes operators or controllers for infrastructure orchestration
Deep familiarity with Kubernetes networking (Calico, Cilium, Multus), service mesh technologies, and network policy management
Experience with GPU workload orchestration including NVIDIA GPU Operator, MIG, time-slicing, and device plugins
Background with advanced Kubernetes features including custom schedulers, admission controllers, and API server extensions
Experience with Kubernetes cluster federation or multi-cluster management
Knowledge of high-performance networking technologies (InfiniBand, RDMA, RoCE) and their integration with Kubernetes
Experience with enterprise storage systems (VAST, Lightbits, Ceph, or similar)
Familiarity with configuration management at scale and GitOps practices
Understanding of security best practices for Kubernetes and bare-metal infrastructure
Experience operating infrastructure in regulated industries or co-located data center environments
Background supporting research institutions, technical computing environments, or enterprise AI infrastructure
Benefits
6% 401(k) match
Fully covered health insurance premiums
Other comprehensive offerings to support your well-being and success as we grow together
Company
Moonlite AI
Moonlite is building a cloud-native experience on-prem. Our software provides the control and customization enterprises need for AI.
Funding
Current Stage
Early Stage
Company data provided by Crunchbase