Site Reliability Engineer - Monitoring Specialist jobs in United States

xAI · 2 months ago

Site Reliability Engineer - Monitoring Specialist

xAI is focused on creating AI systems to enhance human understanding of the universe. The Site Reliability Engineer - Monitoring Specialist will develop and manage monitoring solutions, emphasizing Grafana for dashboards that provide insights into datacenter health, while collaborating with teams to optimize operational reliability.

Artificial Intelligence (AI), Foundational AI, Generative AI, Information Technology, Machine Learning
Growth Opportunities
H1B Sponsor Likely

Responsibilities

Design, build, and maintain Grafana dashboards tailored for datacenter technician organizations, providing real-time views into system health, performance metrics, and monitoring alerts
Develop automation scripts and tools using languages such as Java, Golang, Python, C/C++/C#, Bash, or Linux shell scripting to integrate monitoring systems and process data in JSON format
Collaborate with Datacenter Operations Technicians to identify monitoring needs, troubleshoot issues, and ensure dashboards support efficient incident response and preventive maintenance
Evaluate and optimize existing dashboards for scalability, drawing from past experiences in creating monitoring solutions that have driven business growth
Manage dashboard lifecycle, including version control, updates, and performance tuning to handle large-scale datacenter environments
Participate in on-call rotations, incident analysis, and root cause investigations using monitoring data to improve system reliability
Document monitoring strategies, dashboard designs, and best practices to foster knowledge sharing within the team

Qualifications

Grafana, Java, Python, Linux, Bash scripting, JSON, Golang, C/C++/C#, Monitoring tools, Problem-solving, Collaboration, Communication

Required

Bachelor's degree in Computer Science, Software Engineering, or a related field (or equivalent experience)
5+ years of experience in site reliability engineering or monitoring roles, preferably in datacenter or cloud environments
Proficiency in at least two of the following programming languages: Java, Golang, Python, C/C++/C#, with strong skills in Linux and Bash scripting
Hands-on experience working with JSON for data parsing, integration, and API interactions
Expert-level knowledge of Grafana, including creating complex dashboards, queries, and integrations with data sources like Prometheus or InfluxDB
Proven track record of developing dashboards that provide health and monitoring views for operational teams, with examples of how they scaled business operations
Experience managing monitoring tools and dashboards, including optimization, alerting, and integration into CI/CD pipelines
Strong problem-solving skills with a focus on data-driven decision-making and collaboration in fast-paced environments

Preferred

Experience in AI/ML infrastructure or high-performance computing monitoring
Familiarity with monitoring tools other than Grafana and with observability practices
Prior work in a startup or tech company like xAI, with contributions to scalable monitoring systems

Company

xAI

xAI is an artificial intelligence startup that develops AI solutions and tools to enhance reasoning and search capabilities.

H1B Sponsorship

xAI has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is presented below for your reference. (Data powered by the US Department of Labor)
Chart: Distribution of Different Job Fields Receiving Sponsorship (highlighted segment represents job fields similar to this job)
Chart: Trends of Total Sponsorships: 2025 (1)

Funding

Current Stage: Late Stage
Total Funding: $42.73B
Key Investors: Neptune Digital Assets, SpaceX, Morgan Stanley
2026-01-06 · Series E · $20B
2025-12-11 · Secondary Market · $0.3M
2025-07-13 · Corporate Round · $5.32B

Leadership Team

Toby Pohlen
Founding Member
Company data provided by Crunchbase