Platform Site Reliability Engineer jobs in United States

Nexthink · 1 day ago

Platform Site Reliability Engineer

Nexthink is the leader in digital employee experience management software, providing IT leaders with insights to enhance employee experiences. The Platform Site Reliability Engineer will design and maintain the infrastructure of the SaaS platform, ensuring reliability, security, and scalability while managing cloud-native systems and enhancing incident response practices.

Analytics · Information Technology · Software

Responsibilities

Design, build, and maintain the infrastructure powering our multi-tenant SaaS platform with reliability, security, and scalability in mind
Implement and manage cloud-native systems (AWS) using best-in-class tools and automation
Operate and enhance Kubernetes clusters, deployment pipelines, and service meshes to support continuous delivery
Establish and enforce SLOs, SLAs, and error budgets, and proactively address availability and performance issues
Develop infrastructure as code (Terraform or similar) for repeatable and auditable provisioning
Experience programming platform tooling for automation, monitoring, and provisioning
Solid understanding of the network stack (TCP/IP, VPN, HTTP, SSL, routing, etc.), cloud topologies (VPC, virtual subnets, NACLs, NSG, ILB, ELB, etc.), and storage (S3, EBS, Azure Files, etc.)
Monitor system health, application performance, and user-facing SLAs using tools like Datadog, Prometheus, Grafana
Play a leading role in improving incident response practices and help reduce mean time to detect (MTTD) and mean time to recover (MTTR). Experience coordinating teams and individuals to maintain an SLA
Ability to troubleshoot, narrow down, and fix incidents with minimal intervention from other functions
Participate in a shared on-call rotation, responding to incidents, troubleshooting outages, and driving timely resolution and communication
Work closely with software engineers to embed reliability and observability into every service
Develop automated runbooks, health checks, and alerting to support reliable operations with minimal manual intervention
Support automated testing, canary deployments, and rollback strategies to ensure safe, fast, and reliable releases
Contribute to security best practices, compliance automation, and cost optimization

Qualifications

AWS · Kubernetes · Infrastructure as Code · CI/CD pipelines · Python · Linux systems · Observability stacks · Microservices architecture · Troubleshooting · Incident response · Service mesh · Compliance standards · Chaos engineering

Required

Minimum BS in Computer Science/Engineering
5+ years in an SRE/platform engineering role supporting SaaS platforms
Strong hands-on experience with public cloud services (AWS, GCP, Azure)
Proficiency with Kubernetes, container-based deployment and related ecosystems (Helm...), and containerized microservices
Strong programming or scripting skills (Python, Go, Bash...)
Experience with CI/CD pipelines (e.g., GitHub Actions, GitLab CI, ArgoCD)
Experience with observability stacks (Prometheus, ELK/EFK, Datadog, etc.)
Comfort with being part of a rotating on-call schedule, including handling critical incidents and conducting post-incident reviews
Strong system-level troubleshooting skills and a proactive mindset toward incident prevention
Deep understanding of Linux systems, networking, and common troubleshooting practices
Experience supporting multi-tenant microservices architectures
Familiarity with service mesh, e.g., Istio
Knowledge of zero-downtime deployment strategies, blue/green and canary releases
Exposure to compliance standards such as SOC 2, ISO 27001, or HIPAA. FedRAMP experience is a big plus
Experience with chaos engineering or resilience testing practices

Benefits

100% covered company benefits
Flexible hours and unlimited vacation (employees have unlimited paid time off on top of the 15 days of holidays we offer), 11 company-paid holidays, and 3 extra days for volunteering.
Hybrid work model that balances office and remote work, with structured onboarding to foster connections and team integration.
Free access to professional training platforms to explore your interests and enhance your skills.
Up to 16 weeks of paid leave for birthing parents/primary caregivers, 6 weeks for secondary caregivers.
Plan for the future with a 401(k) plan featuring up to 4% company matching contributions, vesting immediately, to grow your retirement savings.
Bonuses for referring successful hires after three months of continuous employment.

Company

Nexthink

Nexthink allows enterprises to create highly productive digital workplaces for their employees by delivering optimal end-user experience.

Funding

Current Stage: Late Stage
Total Funding: $345.88M
Key Investors: Permira, Index Ventures, Highland Europe
2025-10-27: Acquired
2021-02-08: Series D · $180.16M
2018-12-12: Series C · $86.4M

Leadership Team

Pedro Bados, CEO & Co-Founder
Patrick Hertzog, Co-founder & User Experience Officer
Company data provided by Crunchbase