BRAMKAS INC · 2 days ago

Site Reliability Engineer (SRE)

Virginia, United States

Full-time

Remote

Senior Level

5+ years exp

Analytics, Cloud Computing

Responsibilities

Work with DevOps teams to build, release, monitor, and run services to improve service reliability.

Write software to automate API-driven tasks at scale and contribute to the product codebase in Java, JS, React, Node, Go, and Python.

Write automation to reduce toil and eliminate repeatable manual tasks.

Work with Ansible, Puppet, Chef, Terraform, or another configuration management / orchestration suite; know where it is broken, work towards fixing it, and explore new alternatives.

Maintain services once they are live by measuring and monitoring availability, latency, and overall system reliability.

Handle cross-team performance issues end to end: identify the cause, determine the areas of improvement, and drive those actions to closure.

Baseline the performance and maturity of DevOps processes, tool maturity and coverage, metrics, technology, and engineering practices.

Define, measure, and improve reliability metrics (SLO/SLI), observability (monitoring, logging, and tracing solutions), and ops processes (incident and problem management); streamline and automate release management.

Build dashboards to provide visibility into application performance.

Understand the current process and system setup, and propose the process and technology improvements needed for the application to exceed the desired Service Level Objective.

Be a strong believer in automation as a source of sustained continuous improvement: automate toil and runbooks, and improve the applications' ability to auto-heal, leading to improved reliability.

Qualifications

Required

5+ years of development and operations experience building and running production applications with uptime over 99%

3-5 years of experience as an SRE handling web-scale applications

Strong hands-on coding experience in one or more programming languages such as Python, Golang, Java, or Bash

Good understanding of observability (monitoring, logging, tracing, metrics) and chaos engineering concepts

Proficiency with the Application Performance Monitoring (APM) tool New Relic for monitoring, logging, and tracing

Expert-level hands-on knowledge of the public cloud platforms AWS and/or Google Cloud Platform

Must have hands-on experience in using configuration management systems such as Ansible or SaltStack and infrastructure automation tools like Terraform or CloudFormation

Should have used alerting systems such as PagerDuty

Should have implemented solutions around Service Level Indicators (SLIs) and Service Level Objectives (SLOs) for services

Should have supported Production Incidents (PIs) on a company's critical applications

Ability to troubleshoot, debug, and diagnose operational issues and drive them to closure

Understanding of software delivery life cycles, particularly Agile/Lean & DevOps

Proven experience in handling large-scale and growing infrastructure across data centers and heterogeneous cloud platforms

Experience as a service owner managing a large, geographically diverse group of stakeholders

Ability to work with a creative, fast-growing engineering team and motivate it to deliver its best work

History of driving innovation

Bachelor’s/Master’s Degrees

Preferred

Professional-level certification on one of the public clouds is highly desirable

Familiarity with: containerization (Kubernetes, Docker, Rancher, etc.); Kafka, Yarn, Elasticsearch, etc.; source code management; and implementation of security best practices

Networking knowledge

Contribution to open source community

Tech Stack - Python, Falcon, Elasticsearch, MongoDB, AWS (SQS, S3), MapReduce
