
SambaNova · 2 weeks ago

Cloud Platform Engineer

SambaNova is building the future of AI computing, with a focus on generative AI solutions for enterprise and government organizations. As a Cloud Platform Engineer, you will specialize in the AI Inferencing Service, ensuring its reliability, performance, and scalability while bridging software development and operations. You will maintain exceptional uptime and low-latency response times for inference endpoints, directly impacting customer experience and the success of the company's AI products.

Analytics, Artificial Intelligence (AI), Machine Learning, Semiconductor, Software
H1B Sponsor Likely

Responsibilities

Take shared ownership of the production inferencing service, including its availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning across multiple regions
Participate in a balanced on-call rotation to provide 24/7 support for the service
Lead the response to incidents affecting the inferencing service, driving blameless post-mortems and implementing corrective actions to prevent recurrence
Develop and maintain advanced monitoring, alerting, and dashboarding (using tools like Prometheus, Grafana, Datadog) to gain deep insights into service health, model performance (e.g., latency, throughput, error rates), and accelerator utilization
Proactively identify and eliminate performance bottlenecks
Design and implement auto-scaling policies to handle variable inference loads cost-effectively
Manage and evolve our cloud infrastructure (on AWS, GCP, and/or Azure along with on-prem) using tools like Terraform and Ansible, ensuring it is secure, repeatable, and scalable
Champion automation by building and improving CI/CD pipelines for the seamless and safe deployment of new model versions and service updates
Forecast infrastructure needs based on product roadmaps and usage trends
Define, measure, and report on Service Level Objectives (SLOs) and Indicators (SLIs) for the inferencing platform, using data to drive prioritization and reliability investments

Qualifications

Cloud infrastructure management, Monitoring, Containerization technologies, Programming/scripting skills, Observability, Infrastructure as Code, CI/CD principles, Performance optimization, Linux/Unix administration, MLOps principles, Problem-solving skills

Required

Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent practical experience
3-5+ years of experience as a Site Reliability Engineer or DevOps engineer, or in a related role, supporting a large-scale, customer-facing service in a public cloud environment (AWS, GCP, Azure)
Strong programming/scripting skills in languages like Python, Go, or Java
Proven experience with containerization and orchestration technologies (Docker, Kubernetes)
Deep understanding of monitoring and observability principles and tools (e.g., Prometheus, Grafana, ELK Stack, Datadog)
Solid experience with Infrastructure as Code (e.g., Terraform, CloudFormation)
Familiarity with CI/CD principles and tools (e.g., Jenkins, GitHub Actions, ArgoCD)
Excellent problem-solving skills and a systematic approach to troubleshooting complex distributed systems

Preferred

Experience in a hybrid environment bridging cloud and on-premises/data center infrastructure
Direct experience supporting ML/AI inferencing services in production
Familiarity with GPU-accelerated computing and with optimizing workloads for NVIDIA GPUs, with a view to mapping those workloads to SambaNova's Reconfigurable Dataflow Units (RDUs)
Knowledge of model serving frameworks like vLLM, SGLang, or Ray
Understanding of MLOps principles and practices
Experience with managing and tuning databases (SQL or NoSQL) and caching systems (Redis, Memcached)
Strong Linux/Unix system administration fundamentals

Benefits

95% premium coverage for employee medical insurance
77% premium coverage for dependents
Health Savings Account (HSA) with employer contribution
Dental insurance
Vision insurance
Short- and Long-Term Disability insurance
Basic Life insurance
Voluntary Life insurance
AD&D insurance plans
Flexible Spending Account (FSA) options like Health Care, Limited Purpose, and Dependent Care
Full subscription to Headspace
Gympass+ membership with access to physical gyms
One Medical membership
Counseling services with an Employee Assistance Program

Company

SambaNova

SambaNova develops software and hardware for artificial intelligence and machine learning applications.

H1B Sponsorship

SambaNova has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for your reference. (Data powered by the US Department of Labor)
Distribution of Different Job Fields Receiving Sponsorship (chart not shown; its legend marks the job field similar to this job)
Trends of Total Sponsorships
2025 (31)
2024 (27)
2023 (37)
2022 (41)
2021 (35)
2020 (29)

Funding

Current Stage: Late Stage
Total Funding: $1.14B
Key Investors: SoftBank Vision Fund, BlackRock, Intel Capital
2023-10-01 · Secondary Market
2021-04-13 · Series D · $676M
2020-02-25 · Series C · $250M

Leadership Team

Rodrigo Liang, Founder & CEO
Annie Weckesser, CMO
Company data provided by Crunchbase