Lovelace AI · 2 months ago
Software Engineer - Site Reliability Engineer (SRE)
Lovelace AI is a company focused on applying advanced AI and systems engineering to enhance human safety in critical situations. They are seeking a highly skilled Site Reliability Engineer (SRE) to ensure the availability, scalability, and performance of their AI-powered applications and infrastructure, while collaborating with software engineering teams and leading troubleshooting efforts for production issues.
Artificial Intelligence (AI) · Machine Learning · Software
Responsibilities
Design, implement, and maintain robust monitoring, alerting, and observability solutions to proactively detect and resolve issues before they impact end-users
Lead troubleshooting efforts for complex production issues, providing detailed root cause analysis (RCA) and implementing preventative measures
Develop and maintain automation scripts, build systems (Bazel), and infrastructure as code (IaC) using tools like Terraform, Ansible, or CloudFormation to eliminate manual tasks and improve system reliability and efficiency
Collaborate closely with software engineering teams to influence the design of new services and applications, ensuring they are scalable, reliable, and resilient from the outset
Participate in on-call rotations to respond to platform emergencies, alerts, and escalations, ensuring high service uptime
Analyze system performance and recommend optimizations for scalability, reliability, and efficiency
Implement and enforce best practices in deployment, monitoring, and incident management to continuously improve overall system reliability and reduce downtime
Develop and maintain internal tools that streamline complex operations, track bugs, manage CI/CD pipelines, and facilitate cross-team communication
Conduct post-incident reviews, documenting software problems and solutions in a shared knowledge base to prevent similar issues in the future
Assist with vulnerability management, system patching, and implementing security measures to protect the integrity and availability of services
Qualifications
Required
5+ years of experience in site reliability engineering, DevOps, systems administration, or related roles
Proven track record of managing complex infrastructure, troubleshooting production issues, and optimizing system performance in high-scale environments
Strong experience with Linux/Unix administration and proficiency in scripting languages (e.g., Python, Bash, Go)
Deep understanding of cloud platforms (AWS, GCP, Azure) and related services (e.g., EC2, S3, Lambda, Kubernetes)
Experience with containerization and orchestration technologies like Docker and Kubernetes
Proficiency with monitoring and observability tools (e.g., Prometheus, Grafana, Datadog, Dynatrace, ELK Stack)
Strong understanding of networking fundamentals (DNS, HTTP, TCP/IP), load balancing, and CDNs
Experience with CI/CD tools (e.g., Jenkins, GitLab CI, CircleCI) and infrastructure automation
Familiarity with distributed systems and microservices architecture
Excellent problem-solving and troubleshooting skills
Strong analytical skills with the ability to identify Service Level Indicators (SLIs) and align efforts to meet availability and latency objectives
Ability to balance both development and support roles effectively
Strong interpersonal skills and excellent communication skills, with the ability to collaborate effectively across various teams
Experience working on projects that involve business segments
Must be a US Citizen
Benefits
Competitive compensation packages
Comprehensive benefits
Company
Lovelace AI
Lovelace AI was born from the desire to apply state-of-the-art AI and systems engineering to the question of human safety, especially in dangerous conditions such as conflict, disaster response, anti-terrorism, and deterrence against AIs designed by adversaries to harm civilians.
Funding
Current Stage
Early Stage
Total Funding
$16.2M
Key Investors
RRE Ventures
2025-05-06 · Seed · $16.2M