Walmart Canada · 2 months ago
Software Engineer II - Site Reliability Operations Engineer
Walmart Canada is seeking a Site Reliability Operations Engineer to join its Global Technology Platforms team. This role involves maintaining mission-critical infrastructure and ensuring high availability and reliability for Walmart’s technology stack through monitoring, incident response, and collaboration with cross-functional teams.
Delivery · Retail · Shopping
Responsibilities
Acquire in-depth technical knowledge of omnichannel cloud platforms, web traffic flows, micro-services, and service dependencies for major incident resolution
Provide support for Unix and Linux systems from kernel to shell and beyond, taking into consideration system libraries, file systems, and client-server protocols
Leverage knowledge of network technologies, including protocols (TCP/IP, UDP, ICMP, etc.), MAC addresses, IP packets, DNS, CDNs, OSI layers, firewalls, gateways, proxies, and load balancers
Provide L1 and L2 production support for multiple cloud technologies such as OpenStack, cloud-native platforms, Microsoft Azure, and Google Cloud Platform, triaging critical issues using various internal and vendor tools
Analyze monitoring graphs and alerts to identify systems causing production impact, using tools such as Grafana, Prometheus, MMS, Kibana, Graphite, ServiceNow, JIRA, Dynatrace, New Relic, Omniture, Splunk, and CDN logs [Reduce MTTD – Mean Time to Detect]
Triage site-impacting production issues by quantifying impact, severity, and urgency, analyzing systems for quick remediation, engaging the right teams for recovery [Reduce MTTE – Mean Time to Engage], and focusing on immediate restoration [Reduce MTTR – Mean Time to Restore] of large-scale enterprise systems
Develop enterprise monitoring and use tooling such as Grafana, Kibana, Splunk, Graphite, and New Relic to improve visibility, proactively detect issues, and restore system availability
Design and implement JavaScript integrations between alerting tools and the service API endpoints of tools like ServiceNow, Spotlight, and xMatters
Design and develop solutions for widespread internal communications for cloud application support, and workflows for infrastructure availability issues, across various internal applications, using programming languages such as Java, JavaScript (React, Node.js), Python, and Shell, technologies like Prometheus, and database query languages
Demonstrate knowledge of scripting and software development for automation and self-healing of multi-cloud environments. Help enhance existing solutions by developing automation with Docker and Kubernetes and by working with DevOps and Engineering partners
Qualifications
Required
2+ years of experience in an infrastructure, systems, engineering, or development environment delivering operational excellence for highly complex distributed systems
Bachelor's Degree in Computer Science or a related field, or relevant work experience
Strong and demonstrable incident management skills with relevant experience in an enterprise organization
Experience working in a 24/7 operations support environment
Methodical and systematic problem-solving approach, combined with a strong sense of ownership, initiative, and drive
Experience investigating, analyzing, and troubleshooting large-scale enterprise systems
Knowledge of networking concepts such as protocols (TCP/IP, UDP, ICMP, etc.), MAC addresses, IP packets, DNS, OSI layers, and load balancing
Experience administering Unix/Linux in a production environment
Experience working with and developing enterprise monitoring, tooling, and logging solutions like Grafana, Kibana, Splunk, OpenObserve, Graphite, Nagios, New Relic, Dynatrace, and Prometheus
Working knowledge of one or more cloud technologies such as Azure, GCP, or OpenStack
Experience with a distributed version control system such as Git
Experience designing and implementing JavaScript integrations between alerting tools and the service API endpoints of tools like ServiceNow, Spotlight, Splunk, and xMatters
Programming experience in one or more of the following languages: Go, Java, Python, Shell, etc.
Experience in data science/machine learning would be advantageous
Preferred
Background in creating inclusive digital experiences
Knowledge in implementing Web Content Accessibility Guidelines (WCAG) 2.2 AA standards
Familiarity with assistive technologies
Knowledge of accessibility best practices
Benefits
Competitive pay
Performance-based bonus awards
Medical, vision and dental coverage
401(k)
Stock purchase
Company-paid life insurance
PTO (including sick leave)
Parental leave
Family care leave
Bereavement
Jury duty
Voting
Short-term and long-term disability
Company discounts
Military Leave Pay
Adoption and surrogacy expense reimbursement
Live Better U is a Walmart-paid education benefit program
Company
Walmart Canada
Walmart Canada is a subsidiary of Walmart that operates a chain of more than 400 stores nationwide.
Funding
Current Stage
Late Stage
Recent News
Canada NewsWire
2025-12-18
Canada NewsWire
2025-12-03
Company data provided by Crunchbase