Apt · 18 hours ago
Cloud Engineer
Apt is seeking a seasoned reliability engineer to ensure its systems remain fast, stable, and resilient as they scale. The role involves monitoring platform health, collaborating with development teams, and driving improvements in operational practices. This position is critical for maintaining system reliability while supporting business growth.
Responsibilities
Keep a close watch on platform health by collecting and interpreting system metrics to anticipate issues and fine-tune performance
Collaborate with development teams to harden release processes, validate system changes, and build resilient environments
Contribute to the design and evolution of platforms, ensuring capacity and scalability align with business growth
Drive the right balance between rapid delivery and platform reliability, aligning with agreed service objectives
Play an active role during incidents, rapidly restoring stability and identifying root causes
Apply strong diagnostic skills to troubleshoot complex issues across multiple layers of the stack
Protect systems from unwanted traffic patterns by implementing intelligent controls and defenses
Leverage observability tools to catch problems before they escalate, driving proactive improvements
Continuously refine and mature operational practices and technologies to reduce friction and strengthen system resilience
Take on additional engineering or operational initiatives as the team evolves
Qualifications
Required
Bachelor's degree or equivalent experience, plus 2+ years in reliability, infrastructure, or platform engineering
Proven understanding of modern orchestration and container technologies (e.g., Kubernetes, cluster management, autoscaling)
Deep familiarity with site reliability principles and practices
Practical experience with major cloud environments (GCP preferred)
Solid foundation in API-driven and microservices architectures
Strong troubleshooting capabilities across infrastructure, networking, databases, OS, and security layers
Architecture-level proficiency in both Windows and Linux environments
Hands-on background supporting large-scale applications and production deployments, including observability tooling (experience with Dynatrace or a similar tool is a plus)
Experience with modern CI/CD pipelines and tooling
Demonstrated strength in performance optimization, capacity strategy, and system tuning
Strong blend of software engineering mindset and operational know-how
Understanding of web technologies and protocols (e.g., HTTP, proxies, Java stacks)
Exposure to tools like Azure DevOps, Dynatrace, Prometheus, Terraform, and Grafana