Teraswitch Inc. · 4 months ago

Staff Software Engineer - Site Reliability and Observability

Pittsburgh, PA

Full-time

Onsite

Senior Level

$175K/yr - $250K/yr

7+ years exp

Teraswitch is a leading provider of high-performance bare metal servers with a global presence. They are seeking a Staff Software Engineer for Site Reliability and Observability to ensure the reliability, scalability, and performance of their software systems through monitoring, automation, performance optimization, and collaboration with development teams.

Responsibilities

Monitoring the performance and availability of software systems, identifying and resolving issues, and implementing proactive measures to prevent future incidents

Developing and maintaining automation tools and infrastructure to streamline software deployment, configuration management, and system monitoring

Analyzing system performance, identifying bottlenecks, and implementing optimizations to improve the efficiency and scalability of software systems

Responding to incidents, conducting root cause analysis, and implementing corrective actions to prevent similar incidents in the future

Collaborating with software development teams to ensure that reliability and scalability considerations are incorporated into the software design and implementation

Identifying opportunities for process improvement, implementing best practices, and driving initiatives to enhance the reliability and performance of software systems

Implement a scalable, reliable, and secure SRE and observability platform to monitor the health of our production systems and provide a holistic view of the environment

Deliver tools/software to improve the reliability, scalability and operability of services

Collaborate with engineering teams to analyze and provide input on architecture, infrastructure resources, and observability to achieve reliability and scalability goals

Serve as a technical leader for key initiatives across the organization, identify potential issues and opportunities, and lead teams to architect the next generation of reliability software

Deliver impact by building software that helps maintain reliability on our backend and frontend systems

Improve best practices by developing technical implementations that address multiple developer and business needs

Participate in a 24/7 on-call rotation for critical systems

Qualifications

Site Reliability Engineering · Software Development · Monitoring Tools · Docker · Kubernetes · Terraform · Incident Response · Continuous Improvement · Troubleshooting · Collaboration

Required

7+ years of hands-on SRE experience (software development, systems monitoring), with software development experience (Java, Go, Python)

Experience building and operating high-availability, fault-tolerant, scalable, distributed software in production: building monitoring, defining alerts, writing runbooks, establishing dashboards, etc.

Experience with monitoring and logging tools such as Grafana, Loki, Logstash, and ClickHouse

Experience owning and maintaining software across the SDLC, including deployment

Strong working knowledge of Docker, Kubernetes, Terraform, and Chef or Ansible

Experience troubleshooting production applications and driving mitigation and remediation