This job has closed.

Peraton · 1 month ago

Cloud SRE Lead and Major Incident Digital Transformation Evangelist

United States

Full-time

Remote

Senior Level, Lead/Staff

$112K/yr - $179K/yr

8+ years exp

Peraton is a next-generation national security company that drives missions of consequence spanning the globe. They are seeking an experienced Cloud SRE Lead and Major Incident Digital Transformation Evangelist to evaluate and modernize incident management processes while ensuring the reliability and performance of their AWS cloud infrastructure.

Information Technology, Robotics

No H1B

Security Clearance Required

U.S. Citizen Only

Responsibilities

Execute Ideation Sessions: Run ideation sessions across multiple teams and companies to identify areas for improvement and ideas that could radically change the current incident management process

Establish Modern Incident Management Tooling: Review currently available tools and industry best-of-breed options to recommend and champion the right tools, technologies, and capabilities to empower, visualize, communicate, and activate cross-functional teams

Lead Major Incidents: Coordinate and lead Major Incidents by directing troubleshooting, communicating status, encouraging action, guiding the use of tools, and ensuring swift and complete resolution

Guide Postmortem Analysis: Schedule and lead blameless postmortems encouraging independent ideas, identification of true root causes, and communication of findings

Infrastructure Automation: Design, implement, and manage infrastructure as code (IaC) solutions using tools like AWS CloudFormation, Terraform or Helm Charts to automate deployment and scaling processes. Collaborate with development teams to integrate continuous deployment practices and ensure the reliability of applications

Monitoring and Alerting: Implement robust monitoring and alerting systems to proactively identify and address potential issues before they impact system performance. Analyze system metrics, logs, and alerts to troubleshoot and resolve issues promptly

Performance Optimization: Conduct performance analysis and optimization of AWS infrastructure components to enhance system efficiency and reduce latency. Identify and implement improvements to enhance system reliability and resilience

Incident Response: Participate in on-call rotations to respond to and resolve incidents promptly. Conduct post-incident reviews to identify root causes and implement preventive measures

Security and Compliance: Work closely with security teams to implement and enforce best practices for securing AWS environments. Ensure compliance with industry standards and regulations related to cloud infrastructure

Communication: Facilitate clear communication across teams, providing updates on release status, known issues, and any potential impact on stakeholders. Coordinate communication of release schedules and changes to all relevant parties

Release Planning and Coordination: Collaborate with development, QA, and operations teams to plan and coordinate software releases. Define release scope, schedule, and dependencies to ensure timely and smooth deployments. Create and submit change records as required for process and audit compliance. Participate in Technical Change Advisory and Review boards as required

Release Automation: Develop and maintain automated deployment pipelines using industry-standard tools such as AWS CI/CD, GitLab CI/CD, Jenkins, or similar. Automate and streamline release processes to improve efficiency and reduce manual errors

Continuous Improvement: Proactively identify areas for process improvement within the release management lifecycle. Implement feedback loops to capture lessons learned from each release and apply improvements iteratively. Stay up to date with industry best practices, emerging technologies, and trends related to release management and reliability engineering

Quality Assurance: Collaborate with QA teams to establish and execute release validation procedures. Ensure releases are thoroughly tested and meet quality standards before deployment. Drive continuous improvement by analyzing release management trends, identifying recurring issues, and working with teams to implement solutions

Qualifications

AWS services, Site Reliability Engineering, Digital Transformation, Infrastructure as Code, CI/CD tools, Programming languages, Monitoring tools, Containerization tools, Agile methodologies, Problem-solving skills, Communication skills

Required

Bachelor's degree and 8 years of experience, or a high school diploma and 12 years of experience

Proven experience as a Site Reliability Engineer or similar role

In-depth knowledge of AWS services and expertise in managing cloud infrastructure

Proven experience in a Digital Transformation role

Advanced level programming and/or scripting in 3 or more of the following languages: Python, Java, Chef, Helm, Playwright, Bash, JavaScript, Terraform

Strong understanding of DevOps principles and continuous integration/continuous deployment (CI/CD) pipelines

Proficiency in CI/CD tools such as AWS CI/CD, GitLab CI/CD, or others

Familiarity with infrastructure as code (IaC) tools like CloudFormation, Terraform, Helm Charts, Morpheus, or similar technologies

Hands-on experience with version control systems (GitLab, AWS CodeCommit, SVN) and branching strategies

Experience with containerization and orchestration tools (e.g., Amazon Elastic Container Service (ECS), Amazon Elastic Kubernetes Service (EKS), Docker, Kubernetes)

Familiarity with monitoring tools (e.g., CloudWatch, Prometheus, Grafana, Datadog, Dynatrace) and log analysis

Attention to detail, with a focus on maintaining high-quality software releases

Solid understanding of Agile methodologies and their application in release management and Cloud operations

Excellent problem-solving and troubleshooting skills

Strong communication and collaboration skills

Must be a US Citizen

Must be able to obtain and maintain a 6C Public Trust clearance

Preferred

Relevant certifications in DevOps or related fields are a plus

High Risk Public Trust or Secret Clearance preferred

3 or more years in an SRE or Platform Engineering group supporting high-availability, critical platforms and applications

Experience managing a distributed container platform, including but not limited to deployment/release management, provisioning, capacity management, and workload management