AI DevOps and Cloud Infrastructure Engineer jobs in United States

Crowe · 18 hours ago

AI DevOps and Cloud Infrastructure Engineer

Crowe is a leading public accounting, consulting, and technology firm in the United States, focused on leveraging AI to transform clients' business models. The AI DevOps and Cloud Infrastructure Engineer will lead teams in designing and managing AI/ML infrastructure and cloud platforms, ensuring efficient operations and compliance with security standards.

Accounting, Advice, Consulting, Finance, Financial Services, Information Technology, Professional Services, Tax Consulting
No H1B

Responsibilities

Leading engineering teams responsible for AI/ML infrastructure, cloud operations, and MLOps automation
Defining cloud, Kubernetes, and infrastructure strategy to support scalable model training, inference, and generative AI platforms
Guiding the design and operation of distributed compute environments, GPU clusters, and vector database infrastructure
Overseeing CI/CD pipelines that automate model training, testing, deployment, monitoring, and lifecycle management
Managing incident response, failure analysis, and reliability engineering across AI platforms
Directing performance testing, capacity planning, and cost optimization for AI infrastructure
Ensuring compliance with cloud security, IAM practices, governance requirements, and responsible AI frameworks
Implementing multi-cloud resilience patterns, high availability, and automated failover for critical AI workloads
Supporting platform modernization initiatives, including adoption of optimized LLM runtimes and new orchestration technologies
Evaluating third-party infrastructure tools, GPU scheduling solutions, and platform enhancements
Communicating system status, dependencies, risks, and technical decisions to senior leadership
Managing 4–5 direct reports, including coaching, performance management, and career development
Owning project delivery, including budget, timelines, and quality of outcomes
Coordinating with sales and stakeholders on project sizing, feasibility, and strategic opportunities
Driving continuous improvement initiatives to advance DevOps maturity and AI infrastructure operational readiness

Qualifications

DevOps, Cloud Engineering, MLOps, Kubernetes, Infrastructure-as-Code, Python, Bash, Monitoring, Reliability Engineering, Observability, Coaching, Leadership, Communication, Collaboration, Strategic Decision-Making

Required

7+ years of professional experience in DevOps, cloud engineering, MLOps, or platform engineering
2+ years of experience in engineering leadership or senior technical leadership roles
Expert proficiency with distributed cloud systems, Kubernetes, and infrastructure-as-code
Advanced ability to troubleshoot infrastructure, networking, container, and deployment issues
Proficiency in Python, Bash, or similar automation and scripting languages
Strong understanding of monitoring, observability, and reliability engineering patterns
Hands-on experience supporting infrastructure for ML or generative AI workloads
Strong leadership, communication, and cross-functional collaboration skills

Preferred

Bachelor's degree in computer science, engineering, cloud computing, or a related field
Master's degree in a technical discipline
Cloud and AI certifications, including Azure (AZ-900, AZ-104, AZ-305, AZ-700, AZ-800, AI-102) or equivalent AWS/GCP certifications
Extensive experience with Kubernetes platforms (EKS, AKS, GKE) and cloud ML services (Azure ML, SageMaker)
Experience with GPU workload orchestration, optimization, and multi-tenant inference environments
Expertise in observability and distributed tracing (Prometheus, Grafana, CloudWatch, OpenTelemetry)
Strong experience with Terraform and infrastructure governance at scale
Familiarity with service mesh architectures (Istio, Linkerd) and advanced deployment patterns (blue/green, canary)
Advanced experience supporting generative AI platforms, including LLM inference runtimes (vLLM, TGI), RAG infrastructure, and vector databases (Pinecone, Weaviate, FAISS)
Experience operating fine-tuned LLMs (LoRA, QLoRA), managing GenAI CI/CD pipelines, and implementing hallucination, drift, and reliability monitoring
Demonstrated ability to make strategic technical decisions within defined delivery and budget constraints

Benefits

Unlimited PTO
Flexible remote work policy

Company

Crowe LLP is a public accounting, consulting, and technology firm.

Funding

Current Stage: Late Stage
Total Funding: unknown
2023-08-29: Acquired

Leadership Team

James L. Powers, CEO
Joy Mikolajczak Duce, Managing Principal/Partner - Human Capital Consulting
Company data provided by Crunchbase