FUSTIS LLC · 3 hours ago
Site Reliability Engineer (In-Person Interview)
FUSTIS LLC is seeking a Site Reliability Engineer to support production-grade cloud infrastructure and maintain Kubernetes-based platforms. The role involves troubleshooting distributed systems, implementing monitoring solutions, and participating in on-call rotations.
Qualifications
Required
BS degree in Computer Science or a related field plus 4 years of relevant technology experience, or an equivalent combination of education and experience; in lieu of a degree, 4+ years of relevant expertise is required
Hands-on experience supporting production-grade cloud infrastructure in at least one major cloud provider (AWS, GCP, or Azure)
Practical experience operating and maintaining Kubernetes-based platforms in production environments
Experience with Infrastructure as Code (IaC) tools such as Terraform, Helm, or CloudFormation
Working knowledge of CI/CD and GitOps practices, including automated testing and deployment pipelines
Experience implementing or supporting monitoring, alerting, and observability solutions (metrics, logs, traces)
Strong troubleshooting skills across distributed systems, including performance, availability, and reliability issues
Proficiency in at least one scripting or programming language (e.g., Python, Go, Bash)
Experience participating in on-call rotations, incident response, and root cause analysis
Preferred
Experience operating multi-cloud environments (AWS, GCP, Azure)
Experience with event streaming platforms such as Apache Kafka, Kafka Connect, or managed services (e.g., Amazon MSK)
Familiarity with service mesh technologies (e.g., Istio) and advanced traffic management patterns
Exposure to stream processing frameworks (e.g., Apache Flink) and CDC tools such as Debezium
Experience supporting MLOps or AI infrastructure, including ML pipelines, model deployment, or GenAI workloads
Familiarity with observability standards such as OpenTelemetry and Golden Signals (Latency, Traffic, Errors, Saturation)
Experience working in regulated environments and supporting compliance frameworks (HIPAA, SOC 2, ISO 27001)
Experience implementing security best practices for cloud-native platforms (IAM, secrets management, RBAC)
Prior experience in platform engineering or internal developer platforms
Exposure to cost optimization and FinOps practices in cloud environments