
QBench · 13 hours ago

Senior DevOps Engineer

QBench is a fast-growing, fully remote SaaS company powering modern laboratory operations through its cloud-based LIMS platform. The Senior DevOps Engineer will ensure the reliability, availability, and performance of production systems, applying software engineering and systems thinking to infrastructure and operations with a focus on production stability and incident reduction.
Enterprise Software · Software · Management Information Systems
Culture & Values
No H1B

Responsibilities

Own the reliability, availability, and performance of production systems
Lead incident response for production issues, including coordination, mitigation, and communication
Conduct blameless postmortems and ensure follow-up actions are completed
Proactively identify reliability risks and eliminate single points of failure
Define reliability standards and operational practices that enable teams to ship safely
Design, provision, and manage AWS infrastructure using Infrastructure as Code (IaC) tools such as AWS CDK and AWS SAM (and/or Terraform)
Maintain clean, version-controlled, and auditable infrastructure across all environments
Manage and improve core AWS services, including ECS Fargate, Elastic Beanstalk, AWS Lambda, RDS Aurora MySQL, S3, VPC, IAM, and networking components
Lead infrastructure modernization initiatives, including re-architecting services for reliability and scalability
Own and continuously improve CI/CD pipelines to ensure fast, reliable, and secure builds, tests, and deployments
Standardize deployment workflows across services and environments (dev/staging/prod), including safe rollout strategies and automated rollback where appropriate
Implement quality and security gates in the delivery pipeline (unit/integration tests, linting, dependency scanning, IaC checks)
Improve deployment observability by integrating release markers, build metadata, and change tracking into monitoring and alerting
Partner with Engineering to reduce friction in the delivery process and improve developer velocity while maintaining reliability standards
Perform capacity planning and load analysis to ensure systems scale predictably
Optimize AWS resource usage for performance and cost efficiency
Tune and optimize MongoDB Atlas and RDS Aurora MySQL for throughput, latency, and reliability
Identify and remediate performance bottlenecks across the stack, including application-level configuration and runtime tuning
Lead and execute infrastructure and platform migrations, including migrating workloads from Elastic Beanstalk to ECS Fargate and migrating AWS resources to more resilient architectures
Perform server, runtime, and dependency upgrades, including Python version upgrades, library and dependency updates, and server platform upgrades (EC2, RDS, etc.)
Design and execute changes using safe rollout strategies, monitoring, and rollback plans
Build and maintain strong observability using metrics, logs, dashboards, and alerts
Ensure alerting is actionable and aligned with user-impacting symptoms
Reduce alert fatigue by continuously tuning thresholds and signals
Identify and eliminate sources of operational toil through automation
Perform operational tasks supporting SOC 2 Security and Availability controls
Manage secrets and credentials using AWS Secrets Manager and AWS Systems Manager Parameter Store
Rotate credentials and enforce least-privilege access across infrastructure
Partner with Operations to support audits, evidence collection, and remediation
Ensure production changes follow auditable and repeatable processes
Automate manual and repetitive operational tasks
Improve deployment safety and reliability through tooling and process improvements
Document runbooks, operational procedures, and reliability standards
Act as a DevOps and infrastructure advisor to Engineering teams during design and implementation

Qualifications

AWS · Infrastructure as Code · MySQL / Aurora · Incident response · SOC 2 · Service reliability metrics · Observability platforms · Python · Documentation skills

Required

5+ years of experience in DevOps, Site Reliability Engineering, or Cloud Infrastructure roles
Must be located in the U.S.
Strong hands-on experience with AWS, including ECS, Lambda, RDS, VPC, and IAM
Experience defining and operating service reliability metrics (SLIs/SLOs) and using them to drive operational improvements
Experience using Infrastructure as Code (AWS CDK, AWS SAM, Terraform, or similar)
Experience managing and optimizing MySQL / Aurora in production
Strong incident response and postmortem experience
Solid understanding of cloud security and secrets management
Experience operating systems in SOC 2–aligned environments
Excellent written communication and documentation skills
Comfortable operating independently in a remote, async-first company
Legal U.S. resident currently residing in the United States

Preferred

Experience migrating workloads from Elastic Beanstalk to ECS Fargate
Experience building or operating observability platforms
Familiarity with SOC 2 tooling (Drata, Vanta, Secureframe, or similar)
Experience supporting high-availability SaaS platforms
Background in Python application infrastructure

Benefits

Fully remote, US-based role
High ownership and autonomy
Direct impact on production reliability and customer trust
Competitive compensation and benefits
Opportunity to shape DevOps practices as the company scales

Company

QBench

QBench is a cloud-based Laboratory Information Management System (LIMS) that enables laboratories to streamline their operations.

Funding

Current Stage
Early Stage

Leadership Team

Nicholas Evans
CEO
Trevor Ewen
COO
Company data provided by Crunchbase