Baseten · 11 hours ago
Engineering Manager - Forward Deployed SRE
Baseten powers mission-critical inference for the world's most dynamic AI companies and is seeking an Engineering Manager focused on Support and Customer Engineering. The role involves leading a team responsible for the performance and reliability of large-scale ML workloads while providing hands-on technical support and coaching.
Artificial Intelligence (AI) · Developer Tools · Machine Learning · Software · Software Engineering
Responsibilities
Lead, mentor, and scale a team of Support Engineers specializing in AI and ML production environments, fostering technical depth, accountability, and a customer-first mindset
Serve as a player-coach, directly contributing to complex troubleshooting, inference optimization, and incident resolution for high-value enterprise customers
Diagnose and resolve runtime issues impacting model performance, such as latency spikes, memory pressure, GPU scheduling, and concurrency management
Debug Kubernetes infrastructure (pods, controllers, networking) and observability stacks using tools like Grafana, Loki, and Prometheus
Own critical incidents end-to-end — coordinating across Engineering, Product, and Sales to ensure timely resolution, transparent communication, and SLA compliance
Drive continuous improvement by enhancing diagnostic runbooks, refining alerting strategies, and developing internal automation for faster root-cause analysis
Collaborate with product and platform teams to surface insights from production issues — shaping roadmap priorities around reliability, inference efficiency, and operational scalability
Lead initiatives that enhance observability, monitoring, and alerting for AI workloads across distributed compute environments
Balance tactical execution with strategic vision, ensuring your team not only resolves today’s issues but also builds systems that prevent tomorrow’s
Qualifications
Required
Proven experience leading or mentoring technical teams in Support Engineering, Infrastructure, or Site Reliability within production AI/ML or distributed systems environments
Deep Kubernetes troubleshooting expertise, including advanced resource debugging, runtime performance analysis, and observability-driven diagnostics
Hands-on experience managing distributed systems or AI products at scale — optimizing GPU/CPU utilization, batch sizing, concurrency, and memory efficiency
Expertise with observability and monitoring tools (Grafana, Prometheus, Loki) and alerting best practices
Skilled in incident management and customer escalation handling, with a proven ability to drive clarity and confidence in high-stakes situations
Demonstrated project management and organizational skills, capable of orchestrating multi-stakeholder efforts from incident triage through resolution and RCA
Preferred
Experience implementing or managing incident-response and ticketing systems (e.g., Zendesk, Pylon)
Benefits
Competitive compensation, including meaningful equity.
100% coverage of medical, dental, and vision insurance for employees and dependents
Generous PTO policy, including a company-wide Winter Break (our offices are closed from Christmas Eve to New Year's Day!)
Paid parental leave
Company-facilitated 401(k)
Exposure to a variety of ML startups, offering unparalleled learning and networking opportunities.
Company
Baseten
Baseten is an AI infrastructure company that integrates machine learning into business operations, production, and processes.
H1B Sponsorship
Baseten has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for your reference. (Data powered by the US Department of Labor)
Trends of Total Sponsorships
2025 (6)
2024 (8)
2023 (1)
2020 (1)
Funding
Current Stage: Late Stage
Total Funding: $285M
Key Investors: Bond, Greylock
2025-09-05: Series D · $150M
2025-02-19: Series C · $75M
2024-03-04: Series B · $40M
Company data provided by Crunchbase