ShiftCode Analytics, Inc. · 1 month ago

Lead Observability Engineer

ShiftCode Analytics, Inc. focuses on performance engineering and observability. The company is seeking a Lead Observability Engineer to design and operate end-to-end observability across hybrid cloud and AWS environments, ensuring full visibility into system performance and service interactions.

Analytics · Consulting · Information Technology

Responsibilities

Define service-level objectives (SLOs), performance budgets, and latency/throughput targets across services
Architect and champion comprehensive distributed tracing strategies (Dynatrace, AWS X-Ray, etc.)
Analyze application, platform, and cloud behavior using deep-dive techniques such as heap dumps, thread dumps, flame graphs, GC logs, network traces, and storage I/O profiling
Review service and system architectures for performance risks (e.g., synchronous hops, excessive dependencies, misconfigured connection pools, poor cache placement)
Conduct and lead root-cause analysis for performance incidents in production and pre-production environments
Develop capacity models and performance baselines for services running across cloud environments
Architect and implement a unified observability strategy using Dynatrace
Design and deploy distributed tracing across all Spring Boot microservices, ensuring end-to-end transaction visibility
Engineer golden signals dashboards and trace-driven diagnostics that support real-time incident response and long-term trend analysis
Lead instrumentation deep dives: JVM metrics, custom Micrometer metrics, trace attributes, log correlation, and database timing
Implement and tune anomaly detection, alerting strategies, and noise reduction techniques
Develop reference architectures and best practices for observability in hybrid cloud environments
Perform root cause analysis for latency issues, error spikes, and system degradation incidents
Mentor teams on observability tooling and ensure developers adopt instrumentation patterns by default

Qualifications

Spring Boot · Dynatrace · AWS · Kubernetes · Prometheus · Grafana · Container orchestration · Root cause analysis · Performance tuning · Soft skills

Required

Local candidates only (Saint Louis, MO); proof of address required
Responsible for identifying and resolving end-to-end performance bottlenecks across distributed systems, Spring Boot services, middleware components, and hybrid cloud environments (private cloud + AWS)
Application Layer: Spring Boot internals, JVM tuning, thread/heap management, concurrency debugging, GC optimization
Container Runtime: PCF, Docker, container resource limits, CPU throttling, memory pressure
Orchestrators: PCF, Kubernetes, ECS (autoscaling, pod health, scheduling issues)
Networking: Service-to-service hops, TLS overhead, DNS, routing, load balancer configs (F5, Nginx, ALB/NLB), service mesh performance
Storage: Latency, IOPS constraints, distributed file system behavior
Caching & Middleware: Redis, Hazelcast, NATS, Kafka, RabbitMQ configuration and throughput tuning
Databases: Connection pool tuning, slow queries, indexing, replication lag
Cloud Layer: AWS compute/storage/network performance, regional latency, cross-cloud traffic patterns
Responsible for designing and operating end-to-end observability across hybrid private cloud and AWS environments
Application Instrumentation: Spring Boot metrics/logging/tracing, Micrometer, custom instrumentation, trace context propagation
Tracing & Telemetry: Dynatrace
Metrics Pipeline: Prometheus, Grafana, Dynatrace metrics, CloudWatch metrics, histogram management, RED/USE methodologies
Logging & Correlation: Structured logging, log-enrichment, log aggregation, trace-log correlation in Splunk
Container & Orchestrator Observability: PCF, Kubernetes, ECS — pod health, autoscaling, CPU throttling, memory pressure, node-level signals
Cloud & Infrastructure Visibility: AWS compute/network/storage telemetry, VPC flow logs, ALB/NLB observability, network path tracing
Database & Middleware Observability: Query latency, connection pool behavior, Redis/Kafka/Hazelcast metrics, MQ message flow visibility

Company

ShiftCode Analytics, Inc.

ShiftCode Analytics, Inc. is a Tampa, FL-based firm formed with the sole purpose of delivering fast, high-quality services to its clients nationwide.

Funding

Current Stage: Growth Stage
Company data provided by Crunchbase