Voltage Park · 2 weeks ago
Manager of Infrastructure Engineering (Observability)
Voltage Park is an enterprise AI factory offering scalable compute power and on-demand AI infrastructure. The company is seeking a Manager of Infrastructure Engineering to lead its observability strategy and manage a team that builds automation and tooling for AI/ML workloads.
AI Infrastructure · Cloud Computing · Machine Learning
Responsibilities
Own Voltage Park’s observability strategy across infrastructure and platform layers
Define standards for metrics, logs, traces, alerts, dashboards, and SLOs
Drive architecture decisions for telemetry pipelines, storage, and retention
Balance signal quality, system performance, and cost at scale
Build, manage, and mentor a team of infrastructure engineers focused on observability
Set clear technical direction, priorities, and expectations
Review designs, guide implementation, and raise the bar on operational rigor
Partner closely with Engineering and Operations teams
Design and operate high-throughput observability pipelines (metrics, logs, traces)
Ensure observability platforms are reliable, scalable, and resilient
Improve alert quality and reduce noise across production systems
Enable self-service observability for internal engineering teams
Participate in and lead infrastructure incident response
Use observability data to drive root-cause analysis and systemic improvements
Build feedback loops from incidents into better tooling, alerts, and runbooks
Help establish a culture of measurement-driven reliability
Qualifications
Required
7+ years in infrastructure engineering, SRE, or platform roles
2+ years managing technical teams
Deep experience designing and operating observability systems at scale
Strong background in Linux, distributed systems, and production operations
Experience in GPU, HPC, or AI infrastructure environments
Hands-on experience with bare-metal systems and hardware-level telemetry (power, thermal, network, GPU)
Comfort operating in environments with hardware dependencies, physical failure modes, and tight SLAs
Strong technical background in metrics systems (Prometheus, VictoriaMetrics, Mimir, etc.)
Strong technical background in logging systems (ELK / OpenSearch, Loki, ClickHouse, Kafka-based pipelines)
Strong technical background in distributed tracing (OpenTelemetry, Jaeger, Tempo)
Strong technical background in Kubernetes observability (nodes, clusters, workloads, control plane)
Strong technical background in alerting strategy, SLOs, SLIs, and error budgets
Strong technical background in high-cardinality, high-volume telemetry tradeoffs
Preferred
Experience designing observability for monitoring hardware failure modes (GPU ECC, PCIe, NIC errors, power or thermal limits)
Experience operating observability platforms across multiple data centers and failure domains
Familiarity with capacity-aware or constraint-driven alerting (power, thermal, rack-level limits)
Experience balancing telemetry cost, retention, and fidelity at large scale
Prior experience evolving alerting from reactive to SLO-driven
Experience building or scaling observability teams or platforms in high-growth environments
Company
Voltage Park
Voltage Park provides infrastructure for machine learning.
H1B Sponsorship
Voltage Park has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. (Data powered by US Department of Labor)
Total sponsorships by year: 2025 (5)
Funding
Current Stage: Growth Stage
Total Funding: $500M
2023-10-30 · Undisclosed · $500M
Company data provided by crunchbase