Manager of Infrastructure Engineering (Observability) jobs in United States

Voltage Park · 2 weeks ago

Manager of Infrastructure Engineering (Observability)

Voltage Park is your enterprise AI factory, offering scalable compute power and on-demand AI infrastructure. They are seeking a Manager of Infrastructure Engineering to lead the observability strategy and manage a team focused on building automation and tooling for AI/ML workloads.

AI Infrastructure · Cloud Computing · Machine Learning
H1B Sponsor Likely

Responsibilities

Own Voltage Park’s observability strategy across infrastructure and platform layers
Define standards for metrics, logs, traces, alerts, dashboards, and SLOs
Drive architecture decisions for telemetry pipelines, storage, and retention
Balance signal quality, system performance, and cost at scale
Build, manage, and mentor a team of infrastructure engineers focused on observability
Set clear technical direction, priorities, and expectations
Review designs, guide implementation, and raise the bar on operational rigor
Partner closely with Engineering and Operations teams
Design and operate high-throughput observability pipelines (metrics, logs, traces)
Ensure observability platforms are reliable, scalable, and resilient
Improve alert quality and reduce noise across production systems
Enable self-service observability for internal engineering teams
Participate in and lead infrastructure incident response
Use observability data to drive root-cause analysis and systemic improvements
Build feedback loops from incidents into better tooling, alerts, and runbooks
Help establish a culture of measurement-driven reliability

Qualifications

Infrastructure engineering · Linux · GPU infrastructure · Observability systems · Distributed systems · Metrics systems · Logging systems · Distributed tracing · Kubernetes observability · Technical ownership · Incident response · Team leadership

Required

7+ years in infrastructure engineering, SRE, or platform roles
2+ years managing technical teams
Deep experience designing and operating observability systems at scale
Strong background in Linux, distributed systems, and production operations
Experience in GPU, HPC, or AI infrastructure environments
Hands-on experience with bare-metal systems and hardware-level telemetry (power, thermal, network, GPU)
Comfort operating in environments with hardware dependencies, physical failure modes, and tight SLAs
Strong technical background in metrics systems (Prometheus, VictoriaMetrics, Mimir, etc.)
Strong technical background in logging systems (ELK / OpenSearch, Loki, ClickHouse, Kafka-based pipelines)
Strong technical background in distributed tracing (OpenTelemetry, Jaeger, Tempo)
Strong technical background in Kubernetes observability (nodes, clusters, workloads, control plane)
Strong technical background in alerting strategy, SLOs, SLIs, and error budgets
Strong technical background in high-cardinality, high-volume telemetry tradeoffs

Preferred

Experience designing observability for hardware failure modes (GPU ECC, PCIe, NIC errors, power or thermal limits)
Experience operating observability platforms across multiple data centers and failure domains
Familiarity with capacity-aware or constraint-driven alerting (power, thermal, rack-level limits)
Experience balancing telemetry cost, retention, and fidelity at large scale
Prior experience evolving alerting from reactive to SLO-driven
Experience building or scaling observability teams or platforms in high-growth environments

Company

Voltage Park

Voltage Park provides infrastructure for machine learning.

H1B Sponsorship

Voltage Park has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data powered by the US Department of Labor)
Trends of Total Sponsorships: 2025 (5)

Funding

Current Stage: Growth Stage
Total Funding: $500M
2023-10-30 · Undisclosed · $500M

Leadership Team

Eric Park, Chief Executive Officer
Mike Xia, Chief Product Officer
Company data provided by Crunchbase