Lead Cloud Infrastructure Engineer / Site Reliability Engineer (SRE) jobs in United States

Corelight · 2 days ago

Lead Cloud Infrastructure Engineer / Site Reliability Engineer (SRE)

Corelight is a cybersecurity company that transforms network and cloud activity into evidence for threat detection and response. The Lead Cloud Infrastructure Engineer / Site Reliability Engineer (SRE) will ensure the stability, performance, and security of the cloud platform, focusing on availability, performance optimization, and compliance with FedRAMP standards.

Analytics, Cyber Security, Network Security, Security, Software
Growth Opportunities
No H1B
U.S. Citizen Only

Responsibilities

Collaborate with software engineering teams to ensure the reliability, performance, and security of the Federal region's infrastructure
Design, deploy, and scale AI/ML/LLM infrastructure across cloud platforms (AWS, Azure, or GCP) ensuring high reliability and performance
Manage and optimize Kubernetes environments (EKS, AKS, GKE) for AI services, data pipelines, and model operations
Build and automate end-to-end data and model pipelines for fine-tuning, inference, and RAG workloads using Terraform, Python, and CI/CD tooling
Use GitOps, CI/CD pipelines, and containerization technologies (Docker, Kubernetes) to streamline ML/LLM tasks across the LLM lifecycle
Implement monitoring, observability, and reliability best practices using Prometheus, Grafana, ELK/EFK, Langfuse, and SLI/SLO/SLA frameworks
Participate in 24x7 on-call rotations, leading incident response, performance tuning, and cost optimization across the SaaS platform and production workloads
Own infrastructure end to end, leading scaling initiatives, deployments, and automation, and providing technical leadership across the team

Qualifications

Kubernetes, Python, Infrastructure-as-Code, Cloud Platforms, CI/CD, Monitoring Tools, AI/ML Infrastructure, Bash, Go, PowerShell, GitOps, Service Mesh, Configuration Management, Open Source Contributions

Required

Bachelor's or Master's degree in Computer Science, Engineering, or related field, or equivalent experience
8+ years in SRE, DevOps, Platform Engineering, MLOps, or Cloud Infrastructure roles
4+ years of production experience with Kubernetes (EKS, GKE, AKS) and containerization tools like Docker
Strong programming skills in Python and proficiency in Bash, Go, or PowerShell
Proficiency with Infrastructure-as-Code tools (Terraform, CloudFormation)
Experience with Kubernetes Operators, Helm, GitOps (ArgoCD, Flux), or Service Mesh (Istio, Linkerd)
Exposure to serverless compute (AWS Lambda, Azure Functions)
Experience building or automating data and model pipelines for AI/ML/LLM workloads (e.g., RAG, fine-tuning, inference)
Strong understanding of observability and monitoring using Prometheus, Grafana, ELK/EFK, Langfuse, or similar platforms
Familiarity with SLI/SLO/SLA practices, incident response, and reliability engineering in production environments
U.S. citizenship at the time of hire
Residence within the contiguous United States
Willingness to undergo a Single Scope Background Investigation, if required

Preferred

Cloud certifications (AWS, Azure, or GCP – e.g., Solutions Architect, DevOps Engineer)
Experience with agentic AI frameworks (CrewAI, LangGraph, AutoGen)
Background in hybrid or on-prem AI deployments, including OpenShift or Rancher
Familiarity with configuration management (Ansible, Chef, Puppet)
Contributions to open-source AI/ML, DevOps, or platform tooling
Experience with multimodal AI or model observability platforms (RAGAS, AgentOps, Langtrace), distributed tracing, and OpenTelemetry
Knowledge of performance tuning, cost efficiency, or capacity planning for AI/LLM infrastructure
Understanding of security controls and FedRAMP compliance for cloud and various workloads

Benefits

Equity
Additional benefits will also be awarded

Company

Corelight

Corelight is a cybersecurity company specializing in network traffic analysis (NTA) solutions.

Funding

Current Stage
Late Stage
Total Funding
$309.2M
Key Investors
Accel, Energy Impact Partners, General Catalyst
2024-04-30 · Series E · $150M
2021-09-02 · Series D · $75M
2019-10-17 · Series C · $50M

Leadership Team

Gregory Bell
Co-founder and Chief Strategy Officer
Robin Sommer
Co-Founder
Company data provided by crunchbase