200+ applicants

Company

Groq · 9 hours ago

Principal Site Reliability Engineer, Infrastructure Platform

Mountain View, CA

Full-time

Remote

Senior Level

$165K/yr - $332K/yr

6+ years exp

Artificial Intelligence (AI) · Electronics

H1B Sponsor Likely

Responsibilities

Reliability Architecture: Design and implement scalable and reliable architectures for the platform infrastructure. Define and enforce operational standards and best practices for site reliability. Develop and implement disaster recovery and business continuity plans.

Monitoring & Alerting: Establish comprehensive monitoring systems to track key performance indicators (KPIs) and identify potential issues. Implement robust alerting and notification workflows to ensure timely response to incidents. Analyze data and identify opportunities for platform optimization.

Incident Management: Lead the investigation and resolution of production incidents. Develop and maintain incident response playbooks and escalation procedures. Work collaboratively with engineering teams to identify and mitigate potential risks.

Automation & Continuous Improvement: Develop and implement automated testing frameworks to ensure software quality and reliability. Drive continuous improvement by identifying and implementing process and tool enhancements.

Qualifications

Site Reliability Engineering, Cloud-native technologies, Infrastructure as a Service (IaaS), Monitoring systems, Incident management, Disaster recovery planning, Root cause analysis, Teamwork skills

Required

6 to 10+ years of experience in site reliability engineering or a related field.

Deep understanding of cloud-native technologies and infrastructure as a service (IaaS).

Expertise in monitoring and alerting systems, incident management processes, and disaster recovery planning.

Strong analytical and problem-solving skills with a focus on root cause analysis and mitigation.

Excellent communication and teamwork skills with the ability to collaborate effectively across engineering teams.

Benefits

Equity

Company

Groq

Groq radically simplifies compute to accelerate workloads in artificial intelligence, machine learning, and high-performance computing.

Founded in 2016

Mountain View, California, USA

51-200 employees

http://groq.com

H1B Sponsorship

Groq has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is presented below for reference (data from the US Department of Labor).

Trends of Total Sponsorships

2023 (4)

2022 (6)

2021 (18)

2020 (2)