Lovelace AI · 2 months ago

Software Engineer - Site Reliability Engineer (SRE)

Pittsburgh, PA

Full-time

Onsite

Senior Level

5+ years exp

Lovelace AI applies advanced AI and systems engineering to enhance human safety in critical situations. The company is seeking a highly skilled Site Reliability Engineer (SRE) to ensure the availability, scalability, and performance of its AI-powered applications and infrastructure, collaborate with software engineering teams, and lead troubleshooting efforts for production issues.

Artificial Intelligence (AI) · Machine Learning · Software

No H1B

U.S. Citizen Only

Responsibilities

Design, implement, and maintain robust monitoring, alerting, and observability solutions to proactively detect and resolve issues before they impact end-users

Lead troubleshooting efforts for complex production issues, providing detailed root cause analysis (RCA) and implementing preventative measures

Develop and maintain automation scripts, build systems (Bazel), and infrastructure as code (IaC) using tools like Terraform, Ansible, or CloudFormation to eliminate manual tasks and improve system reliability and efficiency

Collaborate closely with software engineering teams to influence the design of new services and applications, ensuring they are scalable, reliable, and resilient from the outset

Participate in on-call rotations to respond to platform emergencies, alerts, and escalations, ensuring high service uptime

Analyze system performance and recommend optimizations for scalability, reliability, and efficiency

Implement and enforce best practices in deployment, monitoring, and incident management to continuously improve overall system reliability and reduce downtime

Develop and maintain internal tools that streamline complex operations, track bugs, manage CI/CD pipelines, and facilitate cross-team communication

Conduct post-incident reviews, documenting software problems and solutions in a shared knowledge base to prevent similar issues in the future

Assist with vulnerability management, system patching, and implementing security measures to protect the integrity and availability of services

Qualifications

Site Reliability Engineering · Cloud Platforms · Linux/Unix Administration · Containerization Technologies · Monitoring Tools · CI/CD Tools · Infrastructure Automation · Distributed Systems · Analytical Skills · Problem-Solving Skills · Interpersonal Skills · Communication Skills

Required

5+ years of experience in site reliability engineering, DevOps, systems administration, or related roles

Proven track record of managing complex infrastructure, troubleshooting production issues, and optimizing system performance in high-scale environments

Strong experience with Linux/Unix administration and proficiency in scripting languages (e.g., Python, Bash, Go)

Deep understanding of cloud platforms (AWS, GCP, Azure) and related services (e.g., EC2, S3, Lambda, Kubernetes)

Experience with containerization and orchestration technologies like Docker and Kubernetes

Proficiency with monitoring and observability tools (e.g., Prometheus, Grafana, Datadog, Dynatrace, ELK Stack)

Strong understanding of networking fundamentals (DNS, HTTP, TCP/IP), load balancing, and CDNs

Experience with CI/CD tools (e.g., Jenkins, GitLab CI, CircleCI) and infrastructure automation

Familiarity with distributed systems and microservices architecture

Excellent problem-solving and troubleshooting skills

Strong analytical skills with the ability to identify Service Level Indicators (SLIs) and align efforts to meet availability and latency objectives

Ability to balance both development and support roles effectively

Strong interpersonal and communication skills, with the ability to collaborate effectively across teams

Experience working on projects that involve business segments

Must be a US Citizen

Benefits

Competitive compensation packages

Comprehensive benefits

Company

Lovelace AI

Lovelace AI was born from the desire to apply state-of-the-art AI and systems engineering to the question of human safety, especially in dangerous conditions such as conflict, disaster response, anti-terrorism, and deterrence against AIs designed by adversaries to harm civilians.

Founded in 2023

Pittsburgh, Pennsylvania, USA

11-50 employees