Site Reliability Engineer (SRE) jobs in United States

Denvr · 1 month ago

Site Reliability Engineer (SRE)

Denvr is a vertically integrated AI Platform Services company that provides foundational compute infrastructure and services to support the AI ecosystem. The Site Reliability Engineer will be responsible for driving infrastructure reliability, observability, and scalability, while designing and operating high-performance systems for data solutions.

Artificial Intelligence (AI), Cloud Computing, Cloud Data Services, Cloud Infrastructure, Generative AI, Machine Learning, Natural Language Processing, Private Cloud

Responsibilities

Design, implement, and maintain observability systems with Grafana, Prometheus, VictoriaMetrics, and PromQL to monitor system health and performance
Explore opportunities to improve the overall observability of the HPC environment using industry best practices
Participate in on-call rotations, rapidly diagnose and resolve incidents, and perform postmortem reviews to drive continuous improvements
Hands-on experience automating DevOps pipelines using GitHub Actions (or similar tools)

Qualifications

Kubernetes Proficiency, AWS Cloud/Hybrid Cloud, Observability Tools, Infrastructure as Code (IaC), DevOps & CI/CD, HPC Knowledge, Linux Systems, Networking, Incident Management, Security Best Practices, Programming Experience, Software Development Background

Required

3-5 years in a Site Reliability Engineering (SRE) or DevOps role
Strong software development background and computer science fundamentals
Familiarity with tools such as Terraform, Helm, Ansible, and Python for automated infrastructure provisioning
Knowledge of security practices and compliance standards for enterprise environments
Familiarity with high-performance computing, specifically in administering GPU-related workloads
Strong experience in managing Kubernetes clusters in production environments
Expertise in observability platforms (Grafana, Prometheus, PromQL) for tracking and analyzing system metrics
Solid understanding of networking fundamentals (TCP/IP, DNS, load balancing, VPNs)
Hands-on experience developing and deploying production-grade applications on AWS within a hybrid cloud architecture
Proficiency in Linux administration, shell scripting, and performance tuning
Strong software development skills (e.g., Bash, Python, Golang) to automate infrastructure and operational tasks

Company

Denvr

Denvr AI Platforms provide foundational AI services for the AI ecosystem and end users of AI. These comprise cloud-enabled services for inferencing, computing, and data processing and storage, along with software toolsets for the accelerated development, operations, adoption, and integration of AI technologies. The services are delivered through the public Denvr AI Cloud and through Denvr AI Platform Services for private, fully dedicated, sovereign, and highly secure AI services. The latter includes private platform infrastructure deployments that consist of advanced data centers, compute architectures, and data processing and storage fabrics, with integrated platform operations software.

Funding

Current Stage
Growth Stage

Leadership Team

Geoff Gordon
Founder, Chairman and Chief Executive Officer
Company data provided by Crunchbase