DryvIQ · 1 day ago

Production Operations Engineer

DryvIQ is a rapidly growing, venture-backed software company headquartered in the Ann Arbor tech cluster. The Production Operations Engineer role focuses on ensuring reliability, security, and consistency across DryvIQ’s hybrid environments, while bridging engineering and operations to manage deployment automation, monitoring, and incident response for critical data-management workloads.

Artificial Intelligence (AI) · Enterprise Software · Professional Services · Software
Growth Opportunities

Responsibilities

Understand, deploy, and maintain Helm charts and CI/CD workflows for AKS, EKS, and on-prem Kubernetes (K3s or RKE2) in customer environments
Standardize customer deployments (private cloud / air-gapped) using reproducible manifests and configuration validation tooling
Maintain our single-node and multi-node install processes; improve installer packaging
Monitor uptime, capacity, and performance across distributed clusters (migration, scan, OLAP DB node groups)
Implement proactive alerting (Prometheus, Grafana, Azure Monitor, CloudWatch) and ensure runbooks exist for all major services
Coordinate with customer IT/security teams to handle firewall, proxy, and credential configurations safely and consistently
Participate in release-readiness and hardening cycles; validate new images and Helm charts before customer rollout
Lead incident response for production issues—triage, communicate status, and drive post-incident reviews and root-cause documentation
Track reliability metrics (MTTR, deployment success rate, change-failure rate) and feed insights back into engineering planning
Integrate static/dynamic security scanning (GitHub Advanced Security / CodeQL / Dependabot) and image-signing pipelines
Ensure secrets, credentials, and certificates are rotated and stored per corporate security standards
Support ISO / SOC 2 audit evidence collection (CCR change control, deployment logs, access reviews)
Extend monitoring to include customer-facing telemetry where allowed; maintain log shipping and retention policies
Contribute to internal dashboards showing environment health, install duration, and customer success metrics
Work closely with Dev / QA / Support to reproduce issues in controlled environments and publish fixes or workarounds
Provide training and documentation for Services and Support engineers deploying or maintaining on-prem instances
Champion a 'build-to-run' culture: drive automation, resiliency testing, and feedback loops between engineering and field ops

Qualifications

Kubernetes · Helm · Docker · Network security · Scripting skills · CI/CD pipelines · Distributed systems · Training · Collaboration · Documentation

Required

5+ years in SRE, DevOps, or Production Ops roles supporting hybrid or on-prem software delivery
Minimum 3 years working with Fortune 500 companies implementing or maintaining enterprise software
Expertise with Kubernetes, Helm, and Docker in mixed cloud environments (Azure AKS, AWS EKS, on-prem K3s)
Solid understanding of network security (proxies, TLS, VPN, firewalls) and Linux administration
Strong scripting and automation skills (Bash, Python, PowerShell, YAML / Terraform), especially as they relate to Kubernetes
Familiarity with CI/CD pipelines (GitHub Actions, TeamCity, Argo CD or Flux)
Experience supporting distributed systems (e.g., Apache Pulsar, Postgres, ClickHouse, Redis, MinIO)
Comfort working directly with enterprise customer admins and security teams

Company

DryvIQ

Unified platform for discovering, migrating, and governing unstructured data.

Funding

Current Stage
Growth Stage

Leadership Team

Sean Nathaniel
President & Chief Executive Officer
Brad Chase
Vice President of Sales
Company data provided by Crunchbase