Software Engineer - Site Reliability Engineer (SRE) jobs in United States

Lovelace AI · 2 months ago

Software Engineer - Site Reliability Engineer (SRE)

Lovelace AI is a company focused on applying advanced AI and systems engineering to enhance human safety in critical situations. They are seeking a highly skilled Site Reliability Engineer (SRE) to ensure the availability, scalability, and performance of their AI-powered applications and infrastructure, while collaborating with software engineering teams and leading troubleshooting efforts for production issues.

Artificial Intelligence (AI) · Machine Learning · Software
No H1B · U.S. Citizen Only

Responsibilities

Design, implement, and maintain robust monitoring, alerting, and observability solutions to proactively detect and resolve issues before they impact end-users
Lead troubleshooting efforts for complex production issues, providing detailed root cause analysis (RCA) and implementing preventative measures
Develop and maintain automation scripts, build systems (Bazel), and infrastructure as code (IaC) using tools like Terraform, Ansible, or CloudFormation to eliminate manual tasks and improve system reliability and efficiency
Collaborate closely with software engineering teams to influence the design of new services and applications, ensuring they are scalable, reliable, and resilient from the outset
Participate in on-call rotations to respond to platform emergencies, alerts, and escalations, ensuring high service uptime
Analyze system performance and recommend optimizations for scalability, reliability, and efficiency
Implement and enforce best practices in deployment, monitoring, and incident management to continuously improve overall system reliability and reduce downtime
Develop and maintain internal tools that streamline complex operations, track bugs, manage CI/CD pipelines, and facilitate cross-team communication
Conduct post-incident reviews, documenting software problems and solutions in a shared knowledge base to prevent similar issues in the future
Assist with vulnerability management, system patching, and implementing security measures to protect the integrity and availability of services

Qualifications

Site Reliability Engineering · Cloud Platforms · Linux/Unix Administration · Containerization Technologies · Monitoring Tools · CI/CD Tools · Infrastructure Automation · Distributed Systems · Analytical Skills · Problem-Solving Skills · Interpersonal Skills · Communication Skills

Required

5+ years of experience in site reliability engineering, DevOps, systems administration, or related roles
Proven track record of managing complex infrastructure, troubleshooting production issues, and optimizing system performance in high-scale environments
Strong experience with Linux/Unix administration and proficiency in scripting languages (e.g., Python, Bash, Go)
Deep understanding of cloud platforms (AWS, GCP, Azure) and related services (e.g., EC2, S3, Lambda, Kubernetes)
Experience with containerization and orchestration technologies like Docker and Kubernetes
Proficiency with monitoring and observability tools (e.g., Prometheus, Grafana, Datadog, Dynatrace, ELK Stack)
Strong understanding of networking fundamentals (DNS, HTTP, TCP/IP), load balancing, and CDNs
Experience with CI/CD tools (e.g., Jenkins, GitLab CI, CircleCI) and infrastructure automation
Familiarity with distributed systems and microservices architecture
Excellent problem-solving and troubleshooting skills
Strong analytical skills with the ability to identify Service Level Indicators (SLIs) and align efforts to meet availability and latency objectives
Ability to balance both development and support roles effectively
Strong interpersonal skills and excellent communication skills, with the ability to collaborate effectively across various teams
Experience working on projects that involve business segments
Must be a US Citizen

Benefits

Competitive compensation packages
Comprehensive benefits

Company

Lovelace AI

Lovelace AI was born from the desire to apply state-of-the-art AI and systems engineering to the question of human safety, especially in dangerous conditions such as conflict, disaster response, anti-terrorism, and deterrence against AIs designed by adversaries to harm civilians.

Funding

Current Stage: Early Stage
Total Funding: $16.2M
Key Investors: RRE Ventures
2025-05-06 · Seed · $16.2M

Leadership Team

Andrew Moore
Chief Executive Officer

Company data provided by Crunchbase