Virtasant · 2 hours ago

Principal Site Reliability Engineer - Operations

Santa Clara, CA

Full-time

Hybrid

Senior Level, Lead/Staff

5+ years of experience

Virtasant is a fast-growing global consultancy transforming technology services. They are seeking a Principal-Level Site Reliability Engineer (Operations) to provide operational support for a major global client, ensuring system reliability and improving operational workflows.

Information Technology & Services

H1B Sponsor Likely

Responsibilities

Monitor dashboards, alerts, and system health in real time

Respond to incidents quickly and decisively, driving issues to resolution

Perform root-cause analysis and contribute to post-incident reviews

Troubleshoot complex system and infrastructure issues across distributed environments

Maintain and improve runbooks, playbooks, and operational documentation

Support and enhance the observability tooling used for metrics, logs, and alerting

Work cross-functionally with engineering teams to escalate system-level issues when required

Run routine operational checks to ensure platform stability

Tune alerts, update dashboards, and ensure monitoring accuracy

Identify recurring operational issues and recommend improvements

Implement small automation and scripting solutions to improve operational workflows

Keep services running smoothly through proactive maintenance

Partner with Engineering, SRE, and Product teams to ensure transparent communication during incidents

Provide clear, concise updates and documentation for operational work

Participate in shift patterns or rotational incident coverage depending on client needs

Qualifications

SRE experience, Troubleshooting distributed systems, Linux fundamentals, Monitoring tools, Incident management workflows, Root-cause analysis, CI/CD processes, Kubernetes, Scripting (Python), Scripting (Bash), Proactive mindset, Communication skills, Attention to detail

Required

Must be based in the San Francisco Bay Area, with the ability to be on site 2 days per week in Santa Clara

5–10+ years in SRE, Production Operations, or Infrastructure Engineering roles

Strong hands-on experience troubleshooting distributed systems in production

Proficiency in Linux fundamentals, including process management, networking, storage, and diagnostics

Solid understanding of cloud-native architectures, containers, and modern infrastructure tooling

Experience with monitoring and observability tools (e.g., Prometheus, Grafana, ELK, Datadog)

Experience with incident management workflows

Experience with root-cause analysis / postmortems

Experience with CI/CD operational processes

Strong Linux debugging and performance troubleshooting skills

Familiarity with Kubernetes, containers, or cloud-native runtime environments

Ability to write or modify scripts (Python, Bash, or similar) for operational automation

Hands-on experience with logs, metrics, traces, and alert lifecycle management

Calm, structured decision-making under pressure

Excellent communication — clear, concise, and reliable

Strong attention to detail and consistency in documentation

A proactive, ownership-driven mindset for reliability and operations

Company

Virtasant

Virtasant is a global team of cloud experts building the next generation of cloud solutions.

Founded in 2020

Austin, Texas, USA

51-200 employees

https://virtasant.com

H1B Sponsorship

Virtasant has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data provided by the US Department of Labor)

Trends of Total Sponsorships

2021 (1)

Funding

Current Stage

Growth Stage

Leadership Team

Michael Kearns

Founder and CEO

Recent News

The New Stack

Collaborative Coding and Generative AI: The Future of Code Pairing

2025-07-02

Company data provided by Crunchbase