GMI Cloud · 1 day ago
Infra Engineer - SRE (Kubernetes)
GMI Cloud is a fast-growing AI infrastructure startup based in Silicon Valley, working on cutting-edge technologies that power the future of artificial intelligence. They are seeking a dynamic and hands-on Site Reliability Engineer to ensure the stability, efficiency, and reliability of large-scale AI/ML clusters in their data center.
Responsibilities
Design, implement and maintain scalable AI/ML infrastructure solutions
Proactively monitor GPU cluster health and performance, and troubleshoot issues across compute, accelerator, and storage systems
Automate deployment, configuration and management of infrastructure resources
Manage GPU node lifecycle workflows, including provisioning, scaling, maintenance, upgrades, and decommissioning
Implement CI/CD pipelines for infrastructure deployment and orchestration
Ensure security, compliance and best practices across infrastructure
Manage incident response for infrastructure resources (GPU, CPU, storage, network)
Handle customer provisioning requests for GPU resources, including onboarding, configuration and troubleshooting; resolve customer service requests related to GPU infrastructure, ensuring high customer satisfaction
Stay current with emerging GPU hardware and software technologies, integrating improvements as appropriate
Regional/international travel to GMI data center locations
Qualifications
Required
Bachelor's degree in Computer Science or related field
3+ years of experience in data center operations, infrastructure, or systems engineering
Proven experience in site reliability engineering and infrastructure automation (e.g. Ansible, Terraform)
Familiarity with container orchestration platforms (e.g. Kubernetes, NVIDIA GPU Operator, NVIDIA Network Operator, CNI, CSI) and job scheduling systems (e.g. Slurm)
Familiarity with Linux system administration and scripting (Python, Bash)
Familiarity with logging and monitoring tools such as Prometheus, Grafana, Loki
Strong troubleshooting skills and ability to analyze system logs and performance metrics
Excellent communication and teamwork abilities
Preferred
Good knowledge of GPU architecture, NVIDIA CUDA, NCCL, or related AI/ML frameworks
Company
GMI Cloud
GMI Cloud provides GPU cloud access for generative AI applications.
H1B Sponsorship
GMI Cloud has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for your reference. (Data powered by US Department of Labor)
Distribution of Different Job Fields Receiving Sponsorship (chart not shown)
Trends of Total Sponsorships
2025: 2
Funding
Current Stage: Growth Stage
Total Funding: $82M
Key Investors: Headline Asia (formerly Infinity Ventures), Banpu NEXT
2024-10-29 · Series A · $15M
2024-10-29 · Debt Financing · $67M
2024-07-16 · Corporate Round
Company data provided by Crunchbase