SRE Consultant jobs in United States
This job has closed.

Cardinal Integrated Technologies Inc · 2 months ago

SRE Consultant

Cardinal Integrated Technologies Inc is seeking an SRE Consultant to manage Nvidia's on-prem infrastructure and ensure the reliability and uptime of engineering cloud services. The role involves maintaining KPI pipelines, implementing monitoring and alerting systems, and providing day-to-day support for user-reported issues.

Banking · Information Technology · Insurance · Pharmaceutical · Web Development
H1B Sponsor Likely

Responsibilities

Manage Nvidia's on-prem infrastructure. Maintain uptime, reliability and readiness of on-prem engineering cloud spread across multiple data centers
Guard service level agreements (SLAs) for critical engineering services. Implement monitoring, alerting, and incident response procedures to ensure adherence to defined performance targets. Perform root cause analysis and post-mortems of incidents for any threshold breaches
Set up and manage monitoring and logging tools such as Prometheus, Grafana, or the ELK Stack to oversee system health and performance. Maintain KPI pipelines using Jenkins, Python and ELK
Improve monitoring systems by adding custom alerts based on business needs
Help with capacity planning, optimization and utilization improvement efforts
Support user-reported issues. Monitor alerts and take necessary action
Actively participate in war rooms for critical issues
Create and maintain documentation for operational procedures, configurations, and troubleshooting guides

Qualifications

On-prem infrastructure management · Monitoring tools: Prometheus · Monitoring tools: Grafana · Monitoring tools: ELK · Automation: Jenkins · Automation: Python · Baremetal management tools · Kubernetes · MySQL · Go · Bash · Nvidia hardware familiarity

Required

Manage Nvidia's on-prem infrastructure. Maintain uptime, reliability and readiness of on-prem engineering cloud spread across multiple data centers
Maintain KPI pipelines using Jenkins, Python and ELK
Baremetal data center machine management tools such as IPMI, Redfish and KVM
Guard service level agreements (SLAs) for critical engineering services. Implement monitoring, alerting, and incident response procedures to ensure adherence to defined performance targets. Perform root cause analysis and post-mortems of incidents for any threshold breaches
Set up and manage monitoring and logging tools such as Prometheus, Grafana, or the ELK Stack to oversee system health and performance
Improve monitoring systems by adding custom alerts based on business needs
Help with capacity planning, optimization and utilization improvement efforts
Support user-reported issues. Monitor alerts and take necessary action
Actively participate in war rooms for critical issues
Create and maintain documentation for operational procedures, configurations, and troubleshooting guides
Automation using Jenkins, Python, Go, Bash
Infrastructure tools like Kubernetes, MySQL, Prometheus, Grafana and ELK

Preferred

Familiarity with Nvidia hardware such as GPUs and Tegras is a plus

Company

Cardinal Integrated Technologies Inc

We are a company of IT professionals who passionately believe that good quality products & services are delivered by great resources.

H1B Sponsorship

Cardinal Integrated Technologies Inc has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is presented below for your reference. (Data powered by US Department of Labor)
Trends of Total Sponsorships
2025 (5)
2024 (1)
2023 (3)
2022 (5)
2021 (4)
2020 (4)

Funding

Current Stage
Growth Stage
Company data provided by Crunchbase