SambaNova · 2 weeks ago
Cloud Platform Engineer
SambaNova is building the future of AI computing, focusing on generative AI solutions for enterprise and government organizations. As a Cloud Platform Engineer, you will specialize in the AI inferencing service, ensuring its reliability, performance, and scalability while bridging software development and operations. You will maintain exceptional uptime and low-latency response times for inference endpoints, directly impacting customer experience and AI product success.
Analytics, Artificial Intelligence (AI), Machine Learning, Semiconductor, Software
Responsibilities
Take shared ownership of the production inferencing service, including its availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning across multiple regions
Participate in a balanced on-call rotation to provide 24/7 support for the service
Lead the response to incidents affecting the inferencing service, driving blameless post-mortems and implementing corrective actions to prevent recurrence
Develop and maintain advanced monitoring, alerting, and dashboarding (using tools like Prometheus, Grafana, Datadog) to gain deep insights into service health, model performance (e.g., latency, throughput, error rates), and accelerator utilization
Proactively identify and eliminate performance bottlenecks
Design and implement auto-scaling policies to handle variable inference loads cost-effectively
Manage and evolve our cloud infrastructure (on AWS, GCP, and/or Azure along with on-prem) using tools like Terraform and Ansible, ensuring it is secure, repeatable, and scalable
Champion automation by building and improving CI/CD pipelines for the seamless and safe deployment of new model versions and service updates
Forecast infrastructure needs based on product roadmaps and usage trends
Define, measure, and report on Service Level Objectives (SLOs) and Indicators (SLIs) for the inferencing platform, using data to drive prioritization and reliability investments
Qualifications
Required
Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent practical experience
3-5+ years of experience in a Site Reliability Engineer, DevOps, or related role supporting a large-scale, customer-facing service in a public cloud environment (AWS, GCP, Azure)
Strong programming/scripting skills in languages like Python, Go, or Java
Proven experience with containerization and orchestration technologies (Docker, Kubernetes)
Deep understanding of monitoring and observability principles and tools (e.g., Prometheus, Grafana, ELK Stack, Datadog)
Solid experience with Infrastructure as Code (e.g., Terraform, CloudFormation)
Familiarity with CI/CD principles and tools (e.g., Jenkins, GitHub Actions, ArgoCD)
Excellent problem-solving skills and a systematic approach to troubleshooting complex distributed systems
Preferred
Experience in a hybrid environment bridging cloud and on-premise/data center infrastructure
Direct experience supporting ML/AI inferencing services in production
Familiarity with GPU-accelerated computing and with optimizing workloads for NVIDIA GPUs, with a view to mapping them to RDUs
Knowledge of model serving frameworks like vLLM, SGLang or Ray
Understanding of MLOps principles and practices
Experience with managing and tuning databases (SQL or NoSQL) and caching systems (Redis, Memcached)
Strong Linux/Unix system administration fundamentals
Benefits
95% premium coverage for employee medical insurance
77% premium coverage for dependents
Health Savings Account (HSA) with employer contribution
Dental insurance
Vision insurance
Short- and long-term disability insurance
Basic Life insurance
Voluntary Life insurance
AD&D insurance plans
Flexible Spending Account (FSA) options like Health Care, Limited Purpose, and Dependent Care
Full subscription to Headspace
Gympass+ membership with access to physical gyms
One Medical membership
Counseling services with an Employee Assistance Program
Company
SambaNova
SambaNova develops software and hardware for artificial intelligence and machine learning applications.
H1B Sponsorship
SambaNova has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for your reference. (Data powered by US Department of Labor)
Trends of Total Sponsorships
2025 (31)
2024 (27)
2023 (37)
2022 (41)
2021 (35)
2020 (29)
Funding
Current Stage: Late Stage
Total Funding: $1.14B
Key Investors: SoftBank Vision Fund, BlackRock, Intel Capital
2023-10-01: Secondary Market
2021-04-13: Series D · $676M
2020-02-25: Series C · $250M
Company data provided by Crunchbase