ChatGPT Jobs · 9 hours ago

Principal Reliability Engineer - EDS

Greater Hartford

Full-time

Hybrid

Senior Level, Lead/Staff

$153K/yr - $229K/yr

10+ years exp

The Hartford is seeking a Principal Reliability Engineer to serve as the senior technical authority responsible for the reliability, resilience, availability, and performance of all data platforms and cloud infrastructure. This role involves setting the strategic vision for Reliability Engineering and leading cross-organizational technical initiatives to enhance engineering excellence and proactive reliability improvement.

Computer Software

No H1B

Responsibilities

Work closely with the AVP, RE & Production Support, EDS, to define the Reliability Engineering strategy for data platforms, data cloud environments, and data products

Establish long-term RE roadmaps, target operating models, and architectural patterns that scale with organizational growth

Serve as the highest-level technical escalation point for systemic reliability issues, influencing executive stakeholders and engineering leaders

Leverage enterprise-provided standards and building blocks to architect and evolve highly reliable, performant, and cost-efficient cloud-based platforms across AWS and GCP for all EDS services

Influence and work directly with Platform Solution Architecture on new product enablement and hyper-automation (end-to-end blueprint automation)

Oversee reliability controls and fail-safe patterns for Snowflake, EMR, Hadoop/Spark clusters, container platforms (e.g., Kubernetes), and mission-critical data systems

Lead the creation and enforcement of SLO/SLI frameworks that span the entire data lifecycle

Develop and implement AI-driven automation for anomaly detection, alert correlation, autonomous remediation, and predictive capacity management

Leverage LLMs, prompt engineering, and cloud-native AI services (AWS Bedrock, SageMaker, Vertex AI) to build intelligent runbooks, advanced troubleshooting agents, and generative-AI-enabled operational tooling

Champion the adoption of machine learning-based observability and reliability analytics

Adopt and architect enterprise-wide data observability frameworks (including logging, metrics, tracing, distributed profiling, and event pipelines) for all data platforms and pipelines

Establish gold-standard incident response patterns, post-incident reviews, and continuous improvement processes

Drive elimination of toil across EDS, focusing on self-healing systems, proactive detection, and autonomous operations

Define RE best practices for modern data products, governed data pipelines, real-time/streaming systems, and operational analytics platforms

Ensure data quality, data timeliness, and SLAs for data products through automated checks, lineage-informed alerting, and pipeline reliability tooling

Partner with Data Engineering to embed resilience patterns (idempotency, checkpointing, replayability, disaster recovery) into pipeline architectures

Set and enforce standards for IaC, CI/CD, platform automation, reliability frameworks, operational readiness, and runbook quality across EDS

Provide technical leadership and mentorship to Staff/Senior Engineers in the RE team and Production Support teams, influencing engineering culture and helping grow RE capabilities across the organization

Represent Reliability Engineering in architectural reviews, enterprise governance forums, and executive-level discussions

Qualifications

Reliability Engineering, Cloud Platforms, Data Engineering, Infrastructure as Code, Machine Learning, Observability Tools, Scripting (Python), Data Quality Engineering, Mentoring Engineers, Cross-Functional Influence, Technical Leadership, Communication Skills

Required

10+ years in one or more of the following areas: data, cloud, platform engineering, site/reliability engineering, or large-scale distributed systems, with experience in leadership or technology leader roles

Proficiency with data or cloud platforms, including architectural patterns for resilience, networking, security, and distributed data infrastructure

Deep experience supporting or engineering platforms such as Snowflake, EMR, Hadoop/Spark, Data Integration, and cloud-native data ecosystems

Scripting and programming (preferably Python) for large-scale automation, platform tooling, and reliability frameworks

Experience with Infrastructure-as-Code (Terraform, CloudFormation) and enterprise CI/CD

Must be eligible to work in the US without company sponsorship

Preferred

Experience in regulated or highly complex enterprise environments (financial services, insurance, healthcare)

Prior experience as a Senior Staff Engineer, a hands-on Engineering or Architecture leader, or in a similar senior technical role

Knowledge of data governance, metadata, lineage systems, and data quality engineering practices

Certifications in AWS, GCP, Kubernetes, or SRE/DevOps frameworks

Background applying machine learning to operations: anomaly detection, event correlation, predictive modeling, and automated remediation

Understanding of AI-enabled developer/operations tools using LLMs, prompt engineering, or cloud AI services for reliability improvements

Expertise with enterprise observability stacks (Prometheus, Grafana, Datadog, Splunk, Dynatrace, OpenTelemetry)

Ability to design and enforce advanced SLI/SLO frameworks across complex data ecosystems

Demonstrated ability to lead technical strategy at scale, influence senior engineering leaders, and set enterprise-wide standards

Strong capability in mentoring engineers, providing architectural guidance, and fostering engineering excellence

Exceptional communication skills for interacting with executives, senior architects, product leaders, and engineering teams

Company

ChatGPT Jobs

We find the best job offers for experts in ChatGPT and related technologies.

Founded in 2024

New York, NY, US

2-10 employees

https://www.chatgpt-jobs.com

Funding

Current Stage

Early Stage

Company data provided by Crunchbase