Senior Staff Software Engineer, SRE, ML Fleet Systems jobs in United States

Google · 7 hours ago

Senior Staff Software Engineer, SRE, ML Fleet Systems

Google is seeking a Senior Staff Software Engineer for its ML Fleet Systems team. The role involves shaping the architecture and implementation of systems that ensure the scalable and efficient deployment of machine learning resources, and tackling complex challenges that affect teams across the organization.

Apps · Artificial Intelligence (AI) · Cloud Storage · Search Engine · SEO
Growth Opportunities
H1B Sponsor Likely

Responsibilities

Define and drive the long-term technical outlook, strategy, and roadmap for critical software systems that manage Alphabet's ML fleet. This includes capacity management for all ML resources such as TPUs, GPUs, compute, storage, and networking
Act as the Technical Lead for the internal Capacity Management Business team within ML Fleet, providing technical direction, mentorship, and guidance to build and evolve our capacity management solutions from operations to robust engineered solutions
Collaborate closely with engineering partners (e.g., Onefleet, Spatial Flex, Operational Data Store (ODS)) to design and deliver joint engineered solutions to our customers
Identify, scope, and solve broad and ambiguous challenges that impact the efficiency, reliability, and cost-effectiveness of the entire ML fleet. Turn these challenges into strategic opportunities and actionable plans

Qualifications

Distributed systems · Infrastructure optimization · Machine Learning hardware · Technical leadership · Programming languages · Resource management systems · Communication · Collaboration skills

Required

Bachelor's degree in Computer Science, a related field, or equivalent practical experience
8 years of experience with software development in one or more programming languages
4 years of experience leading projects and providing technical leadership
3 years of experience in designing, analyzing, and troubleshooting distributed systems

Preferred

Master's degree or PhD in Computer Science, or a related technical field
Experience with infrastructure optimization, performance analysis, and cost reduction in large-scale environments
Experience with Colossus and other relevant Google storage systems (e.g., Bigtable, Spanner, Woodshed)
Understanding of resource management systems (e.g., Borg, Kubernetes, Flex), cluster management, and scheduling algorithms
Familiarity with Machine Learning hardware accelerators (e.g., TPUs, GPUs) and their lifecycle management
Excellent communication and collaboration skills, with the ability to build consensus across organizational boundaries

Benefits

Bonus
Equity
Benefits

Company

Google specializes in internet-related services and products, including search, advertising, and software. It is a sub-organization of Alphabet.

H1B Sponsorship

Google has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data powered by the US Department of Labor)
[Chart: Distribution of Different Job Fields Receiving Sponsorship (legend: represents job field similar to this job)]
Trends of Total Sponsorships
2025 (8763)
2024 (8872)
2023 (9682)
2022 (11626)
2021 (9109)
2020 (9785)

Funding

Current Stage
Public Company
Total Funding
$26.1M
Key Investors
Andy Bechtolsheim
2004-08-19 · IPO
1999-06-07 · Series Unknown · $25M
1998-11-01 · Angel · $1M

Leadership Team

Sundar Pichai
CEO
Thomas Kurian
CEO - Google Cloud
Company data provided by Crunchbase