Walmart Canada · 1 month ago
Principal, Software Engineer - Cloud Storage
Walmart Inc. is seeking a highly skilled Principal Engineer with extensive experience in distributed storage systems. This role involves hands-on architecture, operations, and performance tuning of large-scale storage clusters, while leading innovation in private cloud storage.
Delivery · Retail · Shopping
Responsibilities
Extensive experience in the design, architecture, and management of scale-out distributed storage systems in large production environments
Demonstrated expertise in system performance tuning, data durability optimization (replication and/or erasure coding), and lifecycle management for petabyte-scale data deployments
Proven ability to drive the evaluation, selection, and deployment of best-of-breed software-defined storage (SDS) solutions that meet demanding SLAs for latency, throughput, and availability
Architect, deploy, and manage large-scale clusters across multiple production sites
Ensure storage availability, data durability, and cluster resiliency through advanced CRUSH map configurations, erasure coding, and replication strategies
Define upgrade strategy, cluster augmentation, node rebalancing, and hardware refreshes with minimal downtime
Own end-to-end lifecycle management of storage clusters, including OS/Kernel tuning, firmware upgrades, and hardware integration
Deep, hands-on architectural experience with the design, deployment, and management of large-scale OpenStack platforms in production environments
Expert-level knowledge of core OpenStack storage services, specifically Cinder (Block Storage), Swift (Object Storage), and/or the integration of Ceph or similar distributed storage solutions
Experience must include data center networking design, high-availability design and multi-region/multi-site OpenStack deployments
Identify, diagnose, and resolve performance bottlenecks across Ceph/scale-out storage solutions, the Linux kernel, networking, and hardware layers
Utilize tools such as perf, blktrace, iostat, tcpdump, bpftrace, and atop for advanced debugging
Perform deep analysis of OSD, MON, MDS, and RGW performance and optimize cluster parameters
Debug network congestion, packet loss, latency, and RDMA/Ethernet issues impacting storage
Drive root cause analysis (RCA) for critical production issues and provide long-term remediation
Build and standardize automation for cluster deployment, expansion, and monitoring using Ansible, Terraform, and custom Python/Shell scripts
Develop observability views for real-time monitoring of IOPS, throughput, latency, and cluster health
Automate alerting, log analysis, and anomaly detection for proactive incident response
Design storage solutions to scale to hundreds of nodes and multiple petabytes while ensuring high availability and fault tolerance
Collaborate with compute and networking teams to integrate storage clusters with Kubernetes, OpenStack, and VM workloads
Research and implement new features such as CephFS, RGW S3/Swift gateways, BlueStore optimizations, and RocksDB tuning
Evaluate next-gen hardware (NVMe SSDs, RDMA NICs, high-density HDDs) and their impact on storage performance
Evaluate next-gen server SKUs, perform benchmarking, and compare options to select the most appropriate storage hardware
Implement encryption (at-rest and in-transit), access controls, and audit mechanisms for secure data management
Ensure compliance with enterprise and regulatory standards (e.g., PCI-DSS, SOC, HIPAA)
Act as technical SME for Storage within the organization, mentoring junior engineers
Collaborate with cross-functional teams (Compute, Networking, Cloud, Security) to ensure seamless infrastructure integration
Partner with hardware and software stakeholders and the Ceph community to drive adoption of best practices and contribute to open-source improvements
Qualifications
Required
15–18 years of experience in scale-out distributed storage systems, infrastructure engineering, and Linux systems
10+ years hands-on experience with Ceph, including architecture, operations, and large-scale production support
Proven experience managing clusters at petabyte scale with high performance and resiliency requirements
Strong expertise in Linux Systems: Kernel tuning, cgroups, systemd, process/thread debugging
Strong expertise in Networking: TCP/IP, VLANs, BGP/OSPF, bonding, load balancing, RDMA, Jumbo Frames
Strong expertise in Storage Internals: LVM, OSD design, BlueStore, RocksDB tuning, journaling, caching layers
Strong expertise in Performance Tools: perf, iostat, atop, strace, tcpdump, Wireshark, eBPF
Strong expertise in Debugging: Core dump analysis, kernel crash dump (kdump), system call tracing
Proficiency in Python and Shell scripting for automation and tooling
Hands-on experience with configuration management (Ansible, Salt, Puppet) and IaC tools like Terraform
Knowledge of containerization (Docker, Kubernetes, LXC) and their storage backends (CSI, RBD)
Experience with monitoring and logging stacks (Prometheus, Grafana, ELK, OpenObserve)
Familiarity with cloud platforms (Azure, GCP, OpenStack, AWS) and hybrid cloud storage
Option 1: Bachelor's degree in computer science, computer engineering, computer information systems, software engineering, or related area and 5 years' experience in software engineering or related area
Option 2: 7 years' experience in software engineering or related area
Preferred
Master's degree in computer science, computer engineering, computer information systems, software engineering, or related area and 3 years' experience in software engineering or related area
Background in creating inclusive digital experiences, demonstrating knowledge in implementing Web Content Accessibility Guidelines (WCAG) 2.2 AA standards, assistive technologies, and integrating digital accessibility seamlessly
Knowledge of accessibility best practices
Company
Walmart Canada
Walmart Canada is a subsidiary of Walmart that operates a chain of more than 400 stores nationwide.
Funding
Current Stage
Late Stage
Recent News
Canada NewsWire
2025-12-18
Canada NewsWire
2025-12-03
Company data provided by crunchbase