This job has closed.

RemoteWorker US · 2 days ago

Site Reliability Engineer

Santa Clara, CA

Full-time

Onsite

Entry, Mid Level

3+ years of experience

Staffing and Recruiting

Responsibilities

Infrastructure Management: Design, implement, and maintain scalable and resilient infrastructure using Terraform for infrastructure as code, ensuring high availability and performance.

Kubernetes and Containers: Deploy, manage, and optimize Kubernetes clusters and containerized applications using Docker. Implement best practices for container orchestration and management.

Systems and Application Monitoring/Observability: Develop and maintain comprehensive monitoring and observability solutions using Datadog. Ensure detailed visibility into system performance and application health.

SLO and SLA Management: Define, monitor, and maintain Service Level Objectives (SLOs) and Service Level Agreements (SLAs) to ensure reliable and consistent service delivery.

Incident Response and Troubleshooting: Respond to incidents, perform root cause analysis, and implement solutions to prevent recurrence. Participate in post-incident reviews and contribute to blameless postmortems.

Reliability and Production Environment Management: Ensure the reliability and stability of our production environments. Continuously assess and improve system reliability, identifying and addressing potential points of failure.

Automation and Scripting: Develop automation scripts and tools to reduce manual intervention and improve system reliability using Python, Bash, or Go. Implement and improve CI/CD pipelines.

CI/CD Pipeline Management: Enhance and maintain continuous integration and continuous deployment pipelines using GitLab CI. Ensure seamless and reliable deployment processes.

Capacity Planning and Scaling: Assist in capacity planning and ensure that systems are scalable to meet future demands. Implement auto-scaling strategies where applicable.

Security and Compliance: Implement security best practices and ensure compliance with industry standards. Regularly review and update security policies and procedures.

Collaboration and Support: Work closely with development teams to ensure reliability and scalability of new features and services. Provide technical support and guidance on infrastructure-related issues.

Software Engineering for Operations: Develop and maintain internal tools and services that enhance the efficiency and reliability of our operations.

On-Call Rotation: Participate in an on-call rotation to address production issues and collaborate in incident response efforts.

Qualifications

Skills: Terraform, Kubernetes, Docker, Datadog, Service Level Objectives, Service Level Agreements, Incident Response, Automation Scripts, Python, Bash, Go, Continuous Integration, Continuous Deployment, GitLab CI, Capacity Planning, Auto-Scaling, Security Best Practices, Compliance, Development Collaboration, Internal Tools, Operational Efficiency, On-call Rotation

Required

Experience designing, implementing, and maintaining scalable and resilient infrastructure using Terraform

Experience deploying, managing, and optimizing Kubernetes clusters and containerized applications using Docker

Experience developing and maintaining comprehensive monitoring and observability solutions using Datadog

Experience defining, monitoring, and maintaining Service Level Objectives (SLOs) and Service Level Agreements (SLAs)

Experience responding to incidents, performing root cause analysis, and implementing solutions to prevent recurrence

Experience ensuring the reliability and stability of production environments

Experience developing automation scripts and tools using Python, Bash, or Go

Experience enhancing and maintaining continuous integration and continuous deployment pipelines using GitLab CI

Experience in capacity planning and implementing auto-scaling strategies

Experience implementing security best practices and ensuring compliance with industry standards

Experience working closely with development teams to ensure reliability and scalability of new features and services

Experience developing and maintaining internal tools and services to enhance operational efficiency and reliability

Participation in an on-call rotation to address production issues and collaborate in incident response efforts

Company

RemoteWorker US

The Home of Remote Workers in the United States. We understand that outstanding performance begins with outstanding hiring, and this approach sits at the heart of everything we do.

Austin

2-10 employees

https://www.remoteworker.jobs

Funding

Current Stage

Early Stage

Company data provided by Crunchbase
