Principal Site Reliability Engineer, Infrastructure Platform @ Groq | Jobright.ai
Principal Site Reliability Engineer, Infrastructure Platform jobs in Mountain View, CA
200+ applicants

Groq · 9 hours ago

Principal Site Reliability Engineer, Infrastructure Platform

Artificial Intelligence (AI) · Electronics
H1B Sponsor Likely

Responsibilities

Reliability Architecture: Design and implement scalable and reliable architectures for the platform infrastructure. Define and enforce operational standards and best practices for site reliability. Develop and implement disaster recovery and business continuity plans.
Monitoring & Alerting: Establish comprehensive monitoring systems to track key performance indicators (KPIs) and identify potential issues. Implement robust alerting and notification workflows to ensure timely response to incidents. Analyze data and identify opportunities for platform optimization.
Incident Management: Lead the investigation and resolution of production incidents. Develop and maintain incident response playbooks and escalation procedures. Work collaboratively with engineering teams to identify and mitigate potential risks.
Automation & Continuous Improvement: Develop and implement automated testing frameworks to ensure software quality and reliability. Drive continuous improvement by identifying and implementing process and tool enhancements.

Qualifications

Site Reliability Engineering · Cloud-native technologies · Infrastructure as a Service (IaaS) · Monitoring systems · Incident management · Disaster recovery planning · Root cause analysis · Teamwork skills

Required

6/10+ years of experience in site reliability engineering or a related field.
Deep understanding of cloud-native technologies and infrastructure as a service (IaaS).
Expertise in monitoring and alerting systems, incident management processes, and disaster recovery planning.
Strong analytical and problem-solving skills with a focus on root cause analysis and mitigation.
Excellent communication and teamwork skills with the ability to collaborate effectively across engineering teams.

Benefits

Equity
Benefits

Company

Groq

Groq radically simplifies compute to accelerate workloads in artificial intelligence, machine learning, and high-performance computing.

H1B Sponsorship

Groq has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is presented below for reference. (Data powered by the US Department of Labor)
Distribution of Different Job Fields Receiving Sponsorship (chart not reproduced)
Trends of Total Sponsorships
2023 (4)
2022 (6)
2021 (18)
2020 (2)

Funding

Current Stage: Late Stage
Total Funding: $1B
Key Investors: Social Capital
2024-08-05 · Series D · $640M
2024-06-20 · Secondary Market
2021-04-14 · Series C · $300M

Leadership Team

Jonathan Ross, CEO and Founder
Stuart C. Pann, COO
Company data provided by Crunchbase
