Network to Code · 20 hours ago

Site Reliability Engineer, Cloud

United States

Full-time

Remote

Mid Level

3+ years exp

Network to Code is dedicated to pioneering network automation technologies. As a Site Reliability Engineer, you will operate, support, and evolve customer environments in AWS, focusing on maintaining uptime, performance, and security for our managed Nautobot SaaS offering.

Information Technology · Internet · Software

Hiring Manager

Kim Brown

Responsibilities

Operate and support Nautobot Cloud deployments in AWS, including EKS, EC2, RDS, and associated services

Use Jira to manage operational and project-related tasks, track incidents, and document changes

Support resolution of escalated issues for customers on other Kubernetes platforms, including AKS or on-prem, as needed

Deploy and update Nautobot instances using Helm charts, Kubernetes manifests, and automation workflows

Automate and improve CI/CD pipelines (GitHub Actions, Terraform, Ansible) for provisioning, upgrades, and configuration management

Maintain observability tools (Prometheus, Loki, Grafana) to ensure accurate monitoring, alerting, and logging

Troubleshoot application and infrastructure issues across containerized environments

Collaborate with engineers across Cloud Operations, Nautobot Core, and Nautobot Apps teams to deliver cross-functional solutions

Contribute to documentation for operational runbooks, troubleshooting guides, and architecture diagrams

Participate in Agile ceremonies, including standups and retrospectives

Qualifications

AWS · Kubernetes · DevOps practices · CI/CD pipelines · Terraform · Ansible · Python · Linux environments · Jira · Observability tools · Networking fundamentals · Analytical skills · Proactive mindset · Communication skills

Required

3–5 years of experience applying DevOps or SRE practices to production systems

2+ years of experience operating workloads in AWS, with a focus on EKS, EC2, IAM, and networking

2+ years working with Kubernetes (preferably in production) and Helm

Experience with IaC tools such as Terraform and configuration management tools like Ansible

Familiarity with CI/CD pipelines (GitHub Actions, Jenkins, CircleCI, etc.)

Proficiency in scripting languages such as Python or Bash

Comfortable working in Linux-based environments

Familiarity with monitoring, logging, and alerting solutions (Prometheus, Loki, Grafana, Datadog, ELK)

Skilled in using Jira to manage operational tasks, incident response, sprint planning, and project tracking. Experience with similar ticketing systems is also a plus

Analytical and troubleshooting skills, including the use of k9s for real-time Kubernetes management and Terraform for diagnosing and resolving Infrastructure-as-Code deployment issues. Prior experience with these tools is a plus

Knowledge of networking fundamentals (equivalent to CCNA-level understanding) is a plus

Passion for reliability, customer success, and operational excellence

Ability to troubleshoot complex distributed systems and quickly identify root causes

Strong communication skills, with the ability to clearly convey technical concepts to both peers and customers

A proactive mindset, looking for opportunities to improve processes and prevent issues before they occur

Flexibility to adapt to changing priorities and technologies