Voltage Park · 2 weeks ago

Manager of Infrastructure Engineering (Observability)

Seattle, WA

Full-time

Onsite

Senior Level

7+ years exp

Voltage Park is your enterprise AI factory, offering scalable compute power and on-demand AI infrastructure. They are seeking a Manager of Infrastructure Engineering to lead the observability strategy and manage a team focused on building automation and tooling for AI/ML workloads.

AI Infrastructure · Cloud Computing · Machine Learning

H1B Sponsor Likely

Responsibilities

Own Voltage Park’s observability strategy across infrastructure and platform layers

Define standards for metrics, logs, traces, alerts, dashboards, and SLOs

Drive architecture decisions for telemetry pipelines, storage, and retention

Balance signal quality, system performance, and cost at scale

Build, manage, and mentor a team of infrastructure engineers focused on observability

Set clear technical direction, priorities, and expectations

Review designs, guide implementation, and raise the bar on operational rigor

Partner closely with Engineering and Operations teams

Design and operate high-throughput observability pipelines (metrics, logs, traces)

Ensure observability platforms are reliable, scalable, and resilient

Improve alert quality and reduce noise across production systems

Enable self-service observability for internal engineering teams

Participate in and lead infrastructure incident response

Use observability data to drive root-cause analysis and systemic improvements

Build feedback loops from incidents into better tooling, alerts, and runbooks

Help establish a culture of measurement-driven reliability

Qualifications

Infrastructure engineering · Linux · GPU infrastructure · Observability systems · Distributed systems · Metrics systems · Logging systems · Distributed tracing · Kubernetes observability · Technical ownership · Incident response · Team leadership

Required

7+ years in infrastructure engineering, SRE, or platform roles

2+ years managing technical teams

Deep experience designing and operating observability systems at scale

Strong background in Linux, distributed systems, and production operations

Experience in GPU, HPC, or AI infrastructure environments

Hands-on experience with bare-metal systems and hardware-level telemetry (power, thermal, network, GPU)

Comfort operating in environments with hardware dependencies, physical failure modes, and tight SLAs

Strong technical background in metrics systems (Prometheus, VictoriaMetrics, Mimir, etc.)

Strong technical background in logging systems (ELK / OpenSearch, Loki, ClickHouse, Kafka-based pipelines)

Strong technical background in distributed tracing (OpenTelemetry, Jaeger, Tempo)

Strong technical background in Kubernetes observability (nodes, clusters, workloads, control plane)

Strong technical background in alerting strategy, SLOs, SLIs, and error budgets

Strong technical background in high-cardinality, high-volume telemetry tradeoffs

Preferred

Experience designing observability for hardware failure modes (GPU ECC, PCIe, NIC errors, power or thermal limits)

Experience operating observability platforms across multiple data centers and failure domains

Familiarity with capacity-aware or constraint-driven alerting (power, thermal, rack-level limits)

Experience balancing telemetry cost, retention, and fidelity at large scale

Prior experience evolving alerting from reactive to SLO-driven

Experience building or scaling observability teams or platforms in high-growth environments

Company

Voltage Park

Voltage Park provides infrastructure for machine learning.

Founded in 2023

Berkeley, California, USA

51-200 employees

https://voltagepark.com/

H1B Sponsorship

Voltage Park has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference (data from the US Department of Labor).

[Chart: distribution of job fields receiving sponsorship]

Total sponsorships by year: 2025 (5)

Funding

Current Stage

Growth Stage

Total Funding

$500M

2023-10-30 · Undisclosed · $500M

Leadership Team

Eric Park

Chief Executive Officer

Mike Xia

Chief Product Officer

Recent News

Business Wire

JBK Group Partners with Voltage Park and Matrice.ai For Vision AI Factory Deployments Across Qatar

2025-11-08

Business Wire

Voltage Park Launches Its AI Factory: A Faster Path to AI Transformation

2025-10-21

SiliconANGLE

Nvidia will invest up to $100B in OpenAI to finance data center construction

2025-09-23

Company data provided by Crunchbase