31 applicants

Starkflow · 2 days ago

Senior Site Reliability Engineer

Arkansas, United States

Full-time

Onsite

Senior Level

5+ years exp

Responsibilities

Participate in system design, reliability, monitoring, incident response, and automation initiatives

Design, analyze, develop, and troubleshoot large-scale distributed systems

Build tools for automation, minimize downtime, and provide self-service solutions

Improve observability and monitoring of systems, and track capacity, performance, and cost metrics

Share on-call duties, respond to incidents, lead triage efforts, and conduct postmortems

Collaborate with engineering, security, and product teams to ensure reliable and efficient services

Promote SRE best practices, explore new technologies, and push capabilities forward

Qualifications

Required

5+ years of combined experience as a Software Engineer, Site Reliability Engineer, or DevOps Engineer

Proven technical abilities in the areas of reliability, monitoring, self-service tool development, incident response, and build and release

Experience in at least one of these languages: Python, Go, or Java

Strong experience with Linux environments

Demonstrated expertise designing, building, and triaging highly scaled production infrastructure in AWS

Experience with infrastructure automation technologies like Terraform

Experience in container/container-fleet-orchestration technologies like AWS ECS or EKS

An automation and software engineering mindset

Passion for uptime, observability, and full stack monitoring

Experience participating in a team’s 24x7 incident response efforts

Experience building CI/CD pipelines that are fast and informative, drive quality, and achieve zero-downtime releases

Ability to work across functional and domain boundaries to improve system reliability and deliver solutions on time and with quality

Preferred

Prior software development experience

Databricks experience would be ideal

Common Technologies in Our Ecosystem

Java, Go, Node, PHP

Linux-based, some Windows

Apache Web, Nginx, IIS, Apache Tomcat, Jetty

Docker, AWS ECS, AWS EKS, and home-grown Kubernetes

ELB, CloudFront, S3, EC2, RDS, IAM, SQS, SES, SNS, Lambda, API Gateway, Kinesis, ElastiCache, Elasticsearch, SSM, Control Tower, and much more

MySQL, Oracle, PostgreSQL, SQL Server

Artifactory, GitHub Enterprise, CircleCI, Jenkins, GitHub Actions, SonarQube, JFrog Xray, Control Tower

Terraform (preferred), CloudFormation

Packer, Puppet, Ansible

New Relic, CloudWatch, PagerDuty