Apptad · 3 hours ago
Observability Specialist (W2 Only)
Apptad is seeking an experienced Observability Specialist. In this role, you will design, implement, and maintain comprehensive observability and monitoring solutions while collaborating with various teams to ensure system reliability and performance.
Responsibilities
Design and implement a comprehensive framework for the observability roadmap
Establish automated recovery mechanisms for common failure scenarios
Develop and enforce reliable monitoring solutions
Create technical standards for resilient monitoring solutions and approaches
Participate in Root Cause Analysis (RCA) and postmortem processes
Develop frameworks for establishing correlations among system failures
Design, implement, and manage end-to-end observability solutions encompassing metrics, logs, and traces across our infrastructure and applications
Evaluate, deploy, and maintain tools for monitoring, logging, tracing, alerting, and automation
Define intelligent alerting rules and escalation policies to ensure timely and effective incident response
Lead system performance benchmarking and optimization initiatives, leveraging observability data to identify bottlenecks and areas for improvement
Analyze observability data to identify trends, anomalies, and potential risks. Generate actionable insights and reports to improve system reliability and performance
Qualifications
Required
Significant experience as an Observability Specialist or in a similar role, with a strong focus on observability and monitoring
Deep understanding of observability principles and best practices (metrics, logging, tracing)
Experience implementing and managing centralized logging and monitoring systems
Experience with cloud platforms (e.g., AWS, Azure, GCP) and containerization technologies (e.g., Docker, Kubernetes, OpenShift)
Background in database performance monitoring and optimization
Knowledge of Service Level Objectives (SLOs) and KPI implementation
Experience participating in Root Cause Analysis (RCA) and postmortem processes
Understanding of compliance requirements related to monitoring and logging
Excellent problem-solving and analytical skills
Strong communication and collaboration skills
Preferred
Familiarity with AIOps and ML-based anomaly detection systems