ShiftCode Analytics, Inc. · 1 month ago

Lead Observability Engineer

St. Louis, MO

Full-time

Onsite

Senior Level, Lead/Staff

ShiftCode Analytics, Inc. is a company focused on performance engineering and observability. They are seeking a Lead Observability Engineer responsible for designing and operating end-to-end observability across hybrid cloud and AWS environments, ensuring full visibility into system performance and service interactions.

Analytics, Consulting, Information Technology

Responsibilities

Define service-level objectives (SLOs), performance budgets, and latency/throughput targets across services

Architect and champion comprehensive distributed tracing strategies (Dynatrace, AWS X-Ray, etc.)

Analyze application, platform, and cloud behavior using deep-dive techniques such as heap dumps, thread dumps, flame graphs, GC logs, network traces, and storage I/O profiling

Review service and system architectures for performance risks (e.g., synchronous hops, excessive dependencies, misconfigured connection pools, poor cache placement)

Conduct and lead root-cause analysis for performance incidents in production and pre-production environments

Develop capacity models and performance baselines for services running across cloud environments

Architect and implement a unified observability strategy using Dynatrace

Design and deploy distributed tracing across all Spring Boot microservices, ensuring end-to-end transaction visibility

Engineer golden signals dashboards and trace-driven diagnostics that support real-time incident response and long-term trend analysis

Lead instrumentation deep dives: JVM metrics, custom Micrometer metrics, trace attributes, log correlation, and database timing

Implement and tune anomaly detection, alerting strategies, and noise reduction techniques

Develop reference architectures and best practices for observability in hybrid cloud environments

Perform root cause analysis for latency issues, error spikes, and system degradation incidents

Mentor teams on observability tooling and ensure developers adopt instrumentation patterns by default

Qualifications

Spring Boot, Dynatrace, AWS, Kubernetes, Prometheus, Grafana, Container orchestration, Root cause analysis, Performance tuning, Soft skills

Required

Local candidates only (St. Louis, MO); proof of address required

Responsible for identifying and resolving end-to-end performance bottlenecks across distributed systems, Spring Boot services, middleware components, and hybrid cloud environments (private cloud + AWS)

Application Layer: Spring Boot internals, JVM tuning, thread/heap management, concurrency debugging, GC optimization

Container Runtime: PCF, Docker, container resource limits, CPU throttling, memory pressure

Orchestrators: PCF, Kubernetes, ECS (autoscaling, pod health, scheduling issues)

Networking: Service-to-service hops, TLS overhead, DNS, routing, load balancer configs (F5, Nginx, ALB/NLB), service mesh performance

Storage: Latency, IOPS constraints, distributed file system behavior

Caching & Middleware: Redis, Hazelcast, NATS, Kafka, RabbitMQ configuration and throughput tuning

Databases: Connection pool tuning, slow queries, indexing, replication lag

Cloud Layer: AWS compute/storage/network performance, regional latency, cross-cloud traffic patterns

Responsible for designing and operating end-to-end observability across hybrid private cloud and AWS environments