AI Engineer - Infrastructure jobs in United States

Traversal · 2 hours ago

AI Engineer - Infrastructure

Traversal is an AI Site Reliability Engineer (SRE) for the enterprise, trusted by large companies to manage complex production incidents. As an AI Infrastructure Engineer, you will design and operate core systems for AI products, focusing on high-concurrency inference and Kafka data pipelines.

Artificial Intelligence (AI), Software, Software Engineering

Responsibilities

System Design & Architecture: Design scalable, reliable infrastructure for AI inference, data pipelines, and agentic workflows
Queue & Job Scheduling (K8s-native): Migrate from Python multiprocessing + Postgres-as-queue to Kubernetes-native queuing and orchestration (KEDA/HPA, Jobs/CronJobs, Kueue/Argo)
Managed Kafka Operations: Tune partitioning and throughput, design DLQ + replay runbooks, implement idempotent sinks to avoid duplicates
Autoscaling: Scale on real signals (queue lag, in-flight requests, latency); add burst capacity and safe drains
Per-Tool Reliability: Productionize MCP toolchains with circuit breaking, timeouts, sandboxing, and audit
Progressive Delivery: Implement canary and blue/green rollouts for stateful services, pre-warm caches/weights, and enable graceful termination
Observability: Build RED/USE dashboards and OpenTelemetry traces across gateway → agent → tool → Kafka → sinks
Infrastructure as Code: Evolve Terraform/Helm/Kustomize for multi-environment deployments, secrets, policy-as-code (OPA/Rego), and workload identity
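As a rough illustration of the "scale on real signals" item above, the sketch below sizes a Kafka consumer deployment from total consumer-group lag. It follows the general shape of average-value scaling used by Kubernetes autoscalers (desired replicas = ceil(total metric / target per replica)), but the function name, parameters, and default limits are hypothetical and not taken from the posting. It also assumes one consumer group, where replicas beyond the partition count would sit idle.

```python
import math


def desired_replicas(
    total_lag: int,
    target_lag_per_replica: int,
    partitions: int,
    min_replicas: int = 1,   # assumed floor, not from the posting
    max_replicas: int = 50,  # assumed ceiling, not from the posting
) -> int:
    """Return a replica count for a consumer group based on queue lag."""
    if target_lag_per_replica <= 0:
        raise ValueError("target_lag_per_replica must be positive")

    # Average-value scaling: enough replicas that each handles ~target lag.
    wanted = math.ceil(total_lag / target_lag_per_replica)

    # Consumers beyond the partition count receive no partitions, so cap there.
    ceiling = min(max_replicas, partitions)

    # Clamp into [min_replicas, ceiling]; the floor wins if the two conflict.
    return max(min_replicas, min(wanted, ceiling))
```

With a 12-partition topic and a target of 100 messages of lag per replica, a lag of 950 asks for 10 replicas, while a lag of 5000 is capped at 12.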

Qualifications

Python, Rust, Kafka, Kubernetes, AWS, Terraform, Helm/Kustomize, Incident response, Debugging skills, Security-minded

Required

3+ years of experience at technically rigorous companies or teams
Proven experience operating high-concurrency backends with managed Kafka fan-in/out and at-least-once processing
Experience designing idempotent systems (outbox, dedupe keys, safe replay)
Production experience building and maintaining systems in Python and Rust (Rust 2024 edition)
Incident response, chaos testing, capacity planning
Familiarity with AWS, EKS, Terraform, Helm/Kustomize
Strong debugging skills across runtime, Kafka, network, and auth layers
Security-minded, with experience implementing least privilege, default-deny egress, auditability, and policy-as-code
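For readers unfamiliar with the idempotency requirement above (dedupe keys, safe replay under at-least-once processing), here is a minimal in-memory sketch. The `Record`, `IdempotentSink`, and `replay` names are hypothetical. A production sink would more likely enforce the dedupe key with a unique constraint and write it in the same transaction as the row, which this sketch does not attempt.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Record:
    dedupe_key: str  # stable identifier assigned by the producer (assumed)
    payload: str


@dataclass
class IdempotentSink:
    """Toy sink: writing the same dedupe key twice has no further effect."""

    seen: set = field(default_factory=set)
    rows: list = field(default_factory=list)

    def write(self, record: Record) -> bool:
        # A redelivered record is skipped instead of creating a duplicate row.
        if record.dedupe_key in self.seen:
            return False
        self.rows.append(record.payload)
        self.seen.add(record.dedupe_key)
        return True


def replay(sink: IdempotentSink, records: list) -> int:
    """Feed records to the sink and return how many were newly written."""
    return sum(sink.write(r) for r in records)
```

Replaying the same batch a second time, as would happen after a consumer restart or a DLQ replay, writes nothing new, so at-least-once delivery does not turn into duplicate rows.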

Preferred

GPU workload operations (MIG, topology-aware placement), inference servers, token streaming gateways
Data governance (PII discovery/redaction), lineage, tokenization
Cross-region active/active for Kafka and stateless services
Service mesh (Envoy/Istio), Cilium/eBPF, ClickHouse for analytics

Benefits

Health insurance
Startup equity
Flexible time off
Plenty of in-office snacks

Company

Traversal

Traversal is building the AI SRE for the enterprise.

Funding

Current Stage: Early Stage
Total Funding: $48M
Key Investors: Sequoia Capital, Kleiner Perkins
2025-06-20: Seed
2025-06-18: Series A · $48M
Company data provided by Crunchbase