
Cisco · 3 hours ago

AI Infrastructure & Operations Engineer

Cisco Systems, a leader in technology solutions, is seeking an AI Infrastructure & Operations Engineer to join its AI Platform Team. The role owns the reliability and scalability of the multi-cluster GPU infrastructure that supports enterprise AI model development, working closely with ML engineers and maintaining operational excellence.
Communications Infrastructure · Enterprise Software · Hardware · Software
Growth Opportunities
H1B Sponsor Likely

Responsibilities

Build and operate multi-cluster GPU infrastructure: Stand up, configure, and operate GPU clusters across AWS, GCP, and Cisco IT, scaling from tens to hundreds (eventually thousands) of GPUs with high throughput and cost-efficiency
Own platform reliability and operations: Establish SLOs, implement monitoring/alerting, build operational runbooks, and drive incident response to ensure enterprise-grade uptime
Optimize costs and utilization: Implement cost optimization strategies (Spot instances, fractional GPU configs, autoscaling) and scheduling policies to maximize cluster utilization and ROI
Build infrastructure automation: Develop automation for cluster provisioning, deployment pipelines, and lifecycle management using Infrastructure-as-Code (Terraform) and CI/CD best practices
Enable distributed AI workloads: Configure and optimize networking for multi-node training (RDMA, EFA, NCCL), implement storage abstractions for large datasets, and ensure high-bandwidth GPU communication
Ensure security and compliance: Implement multi-tenant GPU isolation, namespace-level security policies, and access control mechanisms for enterprise workloads
Support model training workflows: Partner with ML engineers and researchers to ensure infrastructure supports their needs—from custom runtimes to storage performance and network bandwidth

Qualifications

GPU infrastructure management · Kubernetes orchestration · Infrastructure-as-Code (Terraform) · Cloud provider experience · CI/CD pipelines · Observability tools · Scripting skills · Networking fundamentals · Soft skills

Required

BS/MS in Computer Science, Engineering, or related technical field with 5+ years of experience in infrastructure engineering, DevOps, SRE, or platform engineering, or equivalent practical experience
Strong experience with Kubernetes orchestration, container technologies (Docker), and cloud-native infrastructure patterns
Hands-on experience managing production infrastructure on at least one major cloud provider (AWS, GCP, or Azure)
Proficiency with Infrastructure-as-Code tools such as Terraform, OpenTofu, or CloudFormation
Experience with observability tools: Prometheus, Grafana, or similar for metrics, logging, and alerting
Strong scripting and automation skills in Python, Bash, or Go
Understanding of networking fundamentals: VPCs, load balancers, DNS, firewalls, and cross-cloud connectivity
Experience with CI/CD pipelines and automation (GitHub Actions, GitLab CI, Jenkins, or ArgoCD)
Ability to troubleshoot complex distributed systems issues

Preferred

Experience with GPU infrastructure and AI/ML workloads: Ray clusters, Kubeflow, MLflow, or similar platforms
Hands-on experience with GPU orchestration: configuring NVIDIA GPUs (A100, H100), managing GPU drivers and CUDA runtimes
Knowledge of distributed training networking: RDMA, InfiniBand, EFA, NCCL
Experience with multi-cloud infrastructure: cross-cloud networking, unified storage abstractions, disaster recovery
Familiarity with cost optimization strategies: Spot instances, Reserved Instances, Savings Plans, and FinOps practices
Experience building SRE practices: SLOs/SLIs, on-call rotations, incident management, and operational runbooks
Track record of scaling infrastructure from prototype to production

Benefits

Medical, dental and vision insurance
A 401(k) plan with a Cisco matching contribution
Paid parental leave
Short- and long-term disability coverage
Basic life insurance
10 paid holidays per full calendar year
1 floating holiday for non-exempt employees
1 paid day off for employee’s birthday
Paid year-end holiday shutdown
4 paid days off for personal wellness determined by Cisco
16 days of paid vacation time per full calendar year
80 hours of sick time off provided on hire date and each January 1st thereafter
Up to 80 hours of unused sick time carried forward from one calendar year to the next
Additional paid time away may be requested to deal with critical or emergency issues for family members
Optional 10 paid days per full calendar year to volunteer
Annual bonuses subject to Cisco’s policies

Company

Cisco develops, manufactures, and sells networking hardware, telecommunications equipment, and other technology services and products.

H1B Sponsorship

Cisco has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is presented below for reference. (Data from the US Department of Labor)
Trends of Total Sponsorships
2025 (1238)
2024 (1231)
2023 (1273)
2022 (2127)
2021 (1991)
2020 (1173)

Funding

Current Stage: Public Company
Total Funding: unknown
IPO: 1990-02-13

Leadership Team

Chuck Robbins · Chair and CEO
Carl Solder · Chief Technology Officer, Cisco Systems Australia/New Zealand
Company data provided by Crunchbase