Phoenix Recruitment · 11 hours ago
Site Reliability Engineer
Computer Software
Responsibilities
Apply SRE principles to maintain the reliability, availability, and performance of software systems.
Automate deployment processes, configuration management, and CI/CD pipelines to streamline software development and delivery.
Plan and assist with the migration of Windows and Linux-based machines to containers.
Plan and assist with overall Disaster Recovery (DR) for infrastructure and operations (InfraOps).
Manage and maintain software infrastructure, ensuring proper configuration, security, and scalability.
Perform system administration tasks, monitor system performance, troubleshoot issues, and apply necessary fixes.
Act as a versatile problem solver, filling gaps in team knowledge and expertise to ensure smooth and efficient software operations.
Facilitate smooth team and project transitions, providing guidance, training, and support for development teams to manage their infrastructure independently.
Develop a reliability rating system to assess team and project performance, collecting and analyzing metrics to evaluate adherence to best practices.
Respond quickly and effectively to critical incidents, conducting post-incident reviews to identify root causes and implement preventive measures.
Develop and maintain automation tools and scripts to improve operational efficiency.
Identify performance bottlenecks and implement optimizations to enhance system response times and resource utilization.
Stay up to date with the latest industry trends, technologies, and best practices related to SRE, DevOps, and infrastructure management.
Collaborate effectively with cross-functional teams and communicate technical concepts and recommendations clearly to both technical and non-technical stakeholders.
Implement a reliability-based release management process, allowing teams with higher reliability scores to perform quick and frequent releases.
Proactively identify potential issues and implement preventive measures to reduce incidents and outages.
Implement observability practices to detect abnormal behaviors in the software and collect information for effective problem resolution.
Set and monitor critical metrics to gain insights into system reliability, including latency, traffic, errors, and saturation levels.
Establish Service-Level Objectives (SLOs) and measure Service-Level Indicators (SLIs) to assess the quality and reliability of service delivery.
Plan, participate in, and manage on-call rotations to ensure prompt response to reported software issues.
Utilize incident response tools to categorize the severity of reported cases and handle them promptly.
Implement configuration management tools to automate software workflows and enhance team productivity.
Qualifications
Required
1+ years of experience in Site Reliability Engineering or a related field
Ability to apply SRE principles to maintain the reliability, availability, and performance of software systems
Experience in automating deployment processes, configuration management, and CI/CD pipelines
Experience in migrating Windows and Linux-based machines to containers
Knowledge of Disaster Recovery (DR) planning and execution for infrastructure and operations
Experience in managing and maintaining software infrastructure with a focus on configuration, security, and scalability
Ability to perform system administration tasks and troubleshoot issues
Strong problem-solving skills to fill gaps in team knowledge and expertise
Experience in facilitating team and project transitions, providing guidance and training
Ability to develop a reliability rating system and analyze metrics for performance evaluation
Experience in responding to critical incidents and conducting post-incident reviews
Ability to develop and maintain automation tools and scripts for operational efficiency
Experience in identifying performance bottlenecks and implementing optimizations
Knowledge of industry trends, technologies, and best practices related to SRE and DevOps
Ability to collaborate effectively with cross-functional teams and communicate technical concepts clearly
Experience in implementing reliability-based release management processes
Ability to proactively identify potential issues and implement preventive measures
Experience in implementing observability practices for detecting abnormal software behavior
Ability to set and monitor critical metrics related to system reliability
Experience in establishing Service-Level Objectives (SLOs) and measuring Service-Level Indicators (SLIs)
Experience in managing on-call rotations for prompt response to software issues
Ability to utilize incident response tools for case categorization and handling
Experience in implementing configuration management tools to automate workflows
Company
Phoenix Recruitment
Phoenix Recruitment is a leading staffing and recruitment firm that helps companies of all sizes find the best possible talent.
Funding
Current Stage
Early Stage
Company data provided by Crunchbase