Walmart Canada · 2 months ago
Distinguished Architect - AI/ML
Walmart Inc. is a leading retailer seeking a Distinguished Architect - AI/ML to join its Global Tech team. The role focuses on architecting advanced AI systems to enhance reliability engineering across Walmart's technology ecosystem, impacting millions of customers and associates globally.
Delivery · Retail · Shopping
Responsibilities
Architect and develop advanced agentic AI systems that can autonomously handle complex reliability engineering workflows, predictive failure analysis, and self-optimization across all Walmart technology systems
Design and implement multi-agent orchestration platforms that coordinate between different AI agents for automated incident response, capacity planning, and performance optimization across e-commerce, supply chain, and in-store systems
Build intelligent observability and monitoring systems using ML-driven anomaly detection, predictive analytics, and autonomous incident resolution capabilities that span all of Walmart's technology ecosystem
Develop self-healing infrastructure platforms that leverage AI to predict, prevent, and automatically resolve system issues before they impact customers, associates, or business operations across any Walmart system
Design, write, and build advanced tools to improve the reliability, latency, availability, and scalability of all Walmart Tech systems, including: 1) engineering reliability and availability, starting with metrics and measurements across all domains; 2) enabling scaling by providing technical solutions, developing automation, and/or optimizing processes for all engineering teams; 3) building tools and automation to prevent recurrence of problems across all mission-critical Walmart services; 4) augmenting existing instrumentation to build a cohesive picture of system characteristics across the entire Walmart technology landscape, with special attention to points of failure
Architect and implement fault-tolerant systems and services across Walmart's hybrid cloud infrastructure with focus on autonomous recovery and intelligent failure prediction for e-commerce, supply chain, financial services, and in-store technology
Collaborate with engineering teams and leadership across all Walmart technology organizations to establish technical strategies and solutions to improve mean time to detect (MTTD) and mean time to restore (MTTR) through intelligent automation and predictive capabilities
Work with service owners across all domains (e-commerce, supply chain, stores, fintech, etc.) to define SLOs and build SLIs to ensure all critical systems are meeting SLAs while maintaining optimal performance and user experience
Perform complex troubleshooting and analysis of large-scale distributed systems across Walmart's entire technology stack, using expertise in coding, algorithms, and distributed system design
Partner closely with all engineering organizations including E-commerce, Supply Chain, Store Technology, Fintech, and Data Platform teams to deliver autonomous reliability solutions through advanced machine learning, natural language processing, and computer vision technologies
Drive the development of MLOps and AIOps platforms that enable continuous learning, model deployment, monitoring, and autonomous optimization of reliability engineering systems across all Walmart domains
Innovate in agentic AI technologies for SRE including large language models (LLMs) for automated incident response, reinforcement learning agents for capacity optimization, multi-modal AI for infrastructure monitoring, and federated learning for cross-domain reliability insights
Implement advanced CI/CD pipelines for reliability systems including automated deployment, validation, and rollback mechanisms for SRE tools and monitoring systems with built-in observability and performance monitoring
Establish platform engineering excellence by building reusable SRE infrastructure, intelligent monitoring platforms, and developer productivity tools that serve all Walmart engineering teams
Provide technical mentorship and guidance to engineering teams across all Walmart organizations on advanced SRE concepts, AI/ML for reliability, platform engineering best practices, and autonomous system design through code reviews, technical discussions, and knowledge sharing
Qualifications
Required
Bachelor's degree in computer science, computer engineering, computer information systems, software engineering, or related area and 6 years' experience in software engineering, architecture, or related area
Or, in place of the degree requirement above: 8 years' experience in software engineering, architecture, or related area
12+ years of expert-level experience with machine learning algorithms, deep learning frameworks (TensorFlow, PyTorch), and production ML deployment at enterprise scale
Deep hands-on experience building agentic AI systems, multi-agent frameworks, LLM-based agents, and autonomous decision-making platforms
Proven ability to architect and implement AI-driven solutions for complex technical challenges
Comprehensive SRE expertise including Service Management (Incident, Problem & Change), Performance Engineering, and capacity planning for mission-critical systems
Deep understanding of reliability KPIs (MTTD, MTTR, availability) with proven track record of improving system reliability at scale
Experience with chaos engineering, fault injection, and building self-healing systems across diverse technology stacks
Expert-level cloud engineering experience (Azure, GCP, AWS) with deep knowledge of containerization (Kubernetes, Docker) and serverless architectures
Strong platform engineering skills including Infrastructure as Code (Terraform, CloudFormation), service mesh architectures, and building developer productivity tools
Experience designing and implementing self-service ML deployment platforms and API gateways for enterprise environments
Deep expertise with distributed tracing (OpenTelemetry, Jaeger), metrics collection (Prometheus, Grafana), and log aggregation (ELK stack, Splunk)
Hands-on experience building AI-driven anomaly detection, predictive monitoring systems, and ML-specific dashboards
Proven ability to implement comprehensive observability solutions for complex AI/ML pipelines and distributed systems
Preferred
Master's degree in computer science, computer engineering, computer information systems, software engineering, or related area and 4 years' experience in software engineering, architecture, or related area
Background in creating inclusive digital experiences, with demonstrated knowledge of implementing Web Content Accessibility Guidelines (WCAG) 2.2 AA standards, working with assistive technologies, and integrating digital accessibility seamlessly
Knowledge of accessibility best practices and Walmart's accessibility standards and guidelines for supporting an inclusive culture
Benefits
401(k) match
Stock purchase plan
Paid maternity and parental leave
PTO
Multiple health plans
Health benefits include medical, vision and dental coverage.
Financial benefits include 401(k), stock purchase and company-paid life insurance.
Paid time off benefits include PTO (including sick leave), parental leave, family care leave, bereavement, jury duty, and voting.
Short-term and long-term disability
Company discounts
Military Leave Pay
Adoption and surrogacy expense reimbursement
Live Better U is a Walmart-paid education benefit program for full-time and part-time associates in Walmart and Sam's Club facilities.
Company
Walmart Canada
Walmart Canada is a subsidiary of Walmart that operates a chain of more than 400 stores nationwide.
Funding
Current Stage
Late Stage
Company data provided by Crunchbase