Network to Code · 20 hours ago
Site Reliability Engineer, Cloud
Network to Code is dedicated to pioneering network automation technologies. As a Site Reliability Engineer, you will operate, support, and evolve customer environments in AWS, focusing on maintaining uptime, performance, and security for our managed Nautobot SaaS offering.
Responsibilities
Operate and support Nautobot Cloud deployments in AWS, including EKS, EC2, RDS, and associated services
Use Jira to manage operational and project-related tasks, track incidents, and document changes
Support resolution of escalated issues for customers on other Kubernetes platforms, including AKS or on-prem, as needed
Deploy and update Nautobot instances using Helm charts, Kubernetes manifests, and automation workflows
Automate improvements to CI/CD pipelines (GitHub Actions, Terraform, Ansible) for provisioning, upgrades, and configuration management
Maintain observability tools (Prometheus, Loki, Grafana) to ensure accurate monitoring, alerting, and logging
Troubleshoot application and infrastructure issues across containerized environments
Collaborate with engineers across Cloud Operations, Nautobot Core, and Nautobot Apps teams to deliver cross-functional solutions
Contribute to documentation for operational runbooks, troubleshooting guides, and architecture diagrams
Participate in Agile ceremonies, including standups and retrospectives
Qualifications
Required
3–5 years of experience applying DevOps or SRE practices to production systems
2+ years of experience operating workloads in AWS, with a focus on EKS, EC2, IAM, and networking
2+ years working with Kubernetes (preferably in production) and Helm
Experience with IaC tools such as Terraform and configuration management tools like Ansible
Familiarity with CI/CD pipelines (GitHub Actions, Jenkins, CircleCI, etc.)
Proficiency in scripting languages such as Python or Bash
Comfortable working in Linux-based environments
Familiarity with monitoring, logging, and alerting solutions (Prometheus, Loki, Grafana, Datadog, ELK)
Skilled in using Jira to manage operational tasks, incident response, sprint planning, and project tracking. Experience with similar ticketing systems is also a plus
Analytical and troubleshooting skills using k9s for real-time Kubernetes management and Terraform for diagnosing and resolving Infrastructure-as-Code deployment issues. Prior experience with these tools is a plus
Knowledge of networking fundamentals (equivalent to a CCNA-level understanding) is a plus
Passion for reliability, customer success, and operational excellence
Ability to troubleshoot complex distributed systems and quickly identify root causes
Strong communication skills—able to clearly convey technical concepts to both peers and customers
A proactive mindset, looking for opportunities to improve processes and prevent issues before they occur
Flexibility to adapt to changing priorities and technologies
Benefits
Discretionary bonuses
Option grants
Comprehensive benefits package
Company
Network to Code
Network to Code is THE network automation solution provider changing the way networks are managed, consumed, and operated.
Funding
Current Stage
Growth Stage