Site Reliability Engineer @ RemoteWorker US | Jobright.ai
Site Reliability Engineer jobs in Santa Clara, CA
Fewer than 25 applicants. Posted by an agency.
This job has closed.

RemoteWorker US · 2 days ago

Site Reliability Engineer

Staffing and Recruiting

Responsibilities

Infrastructure Management: Design, implement, and maintain scalable and resilient infrastructure using Terraform for infrastructure as code, ensuring high availability and performance.
Kubernetes and Containers: Deploy, manage, and optimize Kubernetes clusters and containerized applications using Docker. Implement best practices for container orchestration and management.
Systems and Application Monitoring/Observability: Develop and maintain comprehensive monitoring and observability solutions using Datadog. Ensure detailed visibility into system performance and application health.
SLO and SLA Management: Define, monitor, and maintain Service Level Objectives (SLOs) and Service Level Agreements (SLAs) to ensure reliable and consistent service delivery.
Incident Response and Troubleshooting: Respond to incidents, perform root cause analysis, and implement solutions to prevent recurrence. Participate in post-incident reviews and contribute to blameless postmortems.
Reliability and Production Environment Management: Ensure the reliability and stability of our production environments. Continuously assess and improve system reliability, identifying and addressing potential points of failure.
Automation and Scripting: Develop automation scripts and tools to reduce manual intervention and improve system reliability using Python, Bash, or Go. Implement and improve CI/CD pipelines.
CI/CD Pipeline Management: Enhance and maintain continuous integration and continuous deployment pipelines using GitLab CI. Ensure seamless and reliable deployment processes.
Capacity Planning and Scaling: Assist in capacity planning and ensure that systems are scalable to meet future demands. Implement auto-scaling strategies where applicable.
Security and Compliance: Implement security best practices and ensure compliance with industry standards. Regularly review and update security policies and procedures.
Collaboration and Support: Work closely with development teams to ensure reliability and scalability of new features and services. Provide technical support and guidance on infrastructure-related issues.
Software Engineering for Operations: Develop and maintain internal tools and services that enhance the efficiency and reliability of our operations.
On-Call Rotation: Participate in an on-call rotation to address production issues and collaborate in incident response efforts.

Qualifications

Terraform, Kubernetes, Docker, Datadog, Service Level Objectives, Service Level Agreements, Incident Response, Automation Scripts, Python, Bash, Go, Continuous Integration, Continuous Deployment, GitLab CI, Capacity Planning, Auto-Scaling, Security Best Practices, Compliance, Development Collaboration, Internal Tools, Operational Efficiency, On-call Rotation

Required

Experience designing, implementing, and maintaining scalable and resilient infrastructure using Terraform
Experience deploying, managing, and optimizing Kubernetes clusters and containerized applications using Docker
Experience developing and maintaining comprehensive monitoring and observability solutions using Datadog
Experience defining, monitoring, and maintaining Service Level Objectives (SLOs) and Service Level Agreements (SLAs)
Experience responding to incidents, performing root cause analysis, and implementing solutions to prevent recurrence
Experience ensuring the reliability and stability of production environments
Experience developing automation scripts and tools using Python, Bash, or Go
Experience enhancing and maintaining continuous integration and continuous deployment pipelines using GitLab CI
Experience in capacity planning and implementing auto-scaling strategies
Experience implementing security best practices and ensuring compliance with industry standards
Experience working closely with development teams to ensure reliability and scalability of new features and services
Experience developing and maintaining internal tools and services to enhance operational efficiency and reliability
Participation in an on-call rotation to address production issues and collaborate in incident response efforts

Company

RemoteWorker US

The Home of Remote Workers in the United States. We understand that outstanding performance begins with outstanding hiring, and this approach sits at the head of everything we do.

Funding

Current Stage: Early Stage
Company data provided by Crunchbase