Denvr · 1 month ago
Site Reliability Engineer (SRE)
Denvr is a vertically integrated AI Platform Services company that provides foundational compute infrastructure and services to support the AI ecosystem. The Site Reliability Engineer will be responsible for driving infrastructure reliability, observability, and scalability, while designing and operating high-performance systems for data solutions.
Artificial Intelligence (AI), Cloud Computing, Cloud Data Services, Cloud Infrastructure, Generative AI, Machine Learning, Natural Language Processing, Private Cloud
Responsibilities
Design, implement, and maintain observability systems with Grafana, Prometheus, VictoriaMetrics, and PromQL to monitor system health and performance
Explore opportunities to improve overall observability of the HPC environment using industry best practices
Participate in on-call rotations, rapidly diagnose and resolve incidents, and perform postmortem reviews to drive continuous improvements
Hands-on experience automating DevOps pipelines using GitHub Actions (or similar tools)
Qualifications
Required
3-5 years in a Site Reliability Engineering (SRE) or DevOps role
Strong software development background and computer science fundamentals
Familiarity with tools such as Terraform or Helm, Ansible, and Python for automated infrastructure provisioning
Knowledge of security practices and compliance standards for enterprise environments
Familiarity with high-performance computing, specifically in administering GPU-related workloads
Strong experience in managing Kubernetes clusters in production environments
Expertise in observability platforms (Grafana, Prometheus, PromQL) for tracking and analyzing system metrics
Solid understanding of networking fundamentals (TCP/IP, DNS, load balancing, VPNs)
Hands-on experience developing and deploying production-grade applications on AWS in a hybrid cloud architecture
Proficiency in Linux administration, shell scripting, and performance tuning
Strong software development skills (e.g., Bash, Python, Golang) to automate infrastructure and operational tasks
Company
Denvr
Denvr AI Platforms provide foundational AI services for the AI ecosystem and end users of AI. These comprise cloud-enabled services for inferencing, computing, and data processing & storage, along with software toolsets for the accelerated development, operations, adoption, and integration of AI technologies. The services are delivered through the public Denvr AI Cloud and through Denvr AI Platform Services for private, fully dedicated, sovereign, and highly secure AI services, including private platform infrastructure deployments that consist of advanced data centers, compute architectures, and data processing & storage fabrics with integrated platform operations software.
Funding
Current Stage
Growth Stage
Recent News
linkedin.com
2025-09-09
SiliconANGLE
2024-12-03
2024-11-20
Company data provided by crunchbase