Cardinal Integrated Technologies Inc · 2 months ago
SRE Consultant
Cardinal Integrated Technologies Inc is seeking an SRE Consultant to manage Nvidia's on-prem infrastructure and ensure the reliability and uptime of engineering cloud services. The role involves maintaining KPI pipelines, implementing monitoring and alerting systems, and providing day-to-day support for user-reported issues.
Banking, Information Technology, Insurance, Pharmaceutical, Web Development
Responsibilities
Manage Nvidia's on-prem infrastructure. Maintain uptime, reliability, and readiness of the on-prem engineering cloud spread across multiple data centers
Guard service level agreements (SLAs) for critical engineering services. Implement monitoring, alerting, and incident response procedures to ensure adherence to defined performance targets. Perform root cause analysis and post-mortems of incidents for any threshold breaches
Set up and manage monitoring and logging tools such as Prometheus, Grafana, or the ELK Stack to oversee system health and performance. Maintain KPI pipelines using Jenkins, Python and ELK
Improve monitoring systems by adding custom alerts based on business needs
Assist with capacity planning, optimization, and efforts to improve utilization
Support user-reported issues. Monitor alerts and take necessary action
Actively participate in war rooms for critical issues
Create and maintain documentation for operational procedures, configurations, and troubleshooting guides
Qualifications
Required
Manage Nvidia's on-prem infrastructure. Maintain uptime, reliability and readiness of on-prem engineering cloud spread across multiple data centers
Maintain KPI pipelines using Jenkins, Python and ELK
Bare-metal data center machine management tools such as IPMI, Redfish, and KVM
Guard service level agreements (SLAs) for critical engineering services. Implement monitoring, alerting, and incident response procedures to ensure adherence to defined performance targets. Perform root cause analysis and post-mortems of incidents for any threshold breaches
Set up and manage monitoring and logging tools such as Prometheus, Grafana, or the ELK Stack to oversee system health and performance
Improve monitoring systems by adding custom alerts based on business needs
Assist with capacity planning, optimization, and efforts to improve utilization
Support user-reported issues. Monitor alerts and take necessary action
Actively participate in war rooms for critical issues
Create and maintain documentation for operational procedures, configurations, and troubleshooting guides
Automation using Jenkins, Python, Go, Bash
Infrastructure tools like Kubernetes, MySQL, Prometheus, Grafana and ELK
Preferred
Familiarity with Nvidia hardware such as GPUs and Tegra devices is a plus
Company
Cardinal Integrated Technologies Inc
We are a company of IT professionals who passionately believe that good quality products & services are delivered by great resources.
H1B Sponsorship
Cardinal Integrated Technologies Inc has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for your reference. (Data powered by the US Department of Labor)
Trends of Total Sponsorships
2025 (5)
2024 (1)
2023 (3)
2022 (5)
2021 (4)
2020 (4)
Funding
Current Stage
Growth Stage
Company data provided by Crunchbase