Member of Technical Staff, AI Reliability & Monitoring Engineering Lead jobs in United States

Postman · 1 week ago

Member of Technical Staff, AI Reliability & Monitoring Engineering Lead

Postman is the world’s leading API platform, used by over 40 million developers and 500,000 organizations. They are seeking an experienced AI Systems Reliability Engineer to define, build, and maintain the infrastructure ensuring the reliability and performance of AI-powered systems.

Developer APIs · Developer Tools · Enterprise Software · SaaS
H1B Sponsor Likely

Responsibilities

Develop and manage reliability metrics (SLOs) for AI-driven API services and agentic AI platform features
Implement comprehensive observability and monitoring systems for real-time performance and fault detection
Design and drive automated failover, recovery, and incident response strategies for high-availability AI infrastructure
Optimize resource utilization, particularly GPU/accelerator efficiency, ensuring cost-effective AI system operation
Collaborate closely with engineering, platform, and product teams to align reliability efforts with broader organizational goals
Lead efforts to build internal tooling and automation focused on AI system stability and operational excellence
Drive continuous improvement in deployment practices, monitoring approaches, and incident management processes

Qualifications

AI reliability engineering · SRE · DevOps · Cloud platforms · Monitoring tools · Incident response automation · GPU optimization · Observability tools · Continuous improvement · Collaboration

Required

Have a strong background in AI reliability engineering, SRE, or DevOps for distributed systems
Understand the unique challenges of maintaining large-scale AI systems and integrating AI-specific metrics into reliability frameworks
Are experienced with cloud platforms, monitoring tools, and incident response automation
Are comfortable collaborating across teams to influence best practices for AI system reliability and operational health
Thrive in dynamic, fast-paced environments focusing on delivering reliable, safe AI-powered services

Preferred

Hands-on experience with AI/ML infrastructure, including GPU/xPU optimization and scaling
Familiarity with API platform operations and large-scale distributed services
Prior experience building or operating observability tools tailored for AI and agentic systems
Contribution to open-source projects or reliability engineering thought leadership

Benefits

Full medical coverage
Flexible PTO
Wellness reimbursement
Monthly lunch stipend
Wellness programs
Frequent and fascinating team-building events
Donation-matching program

Company

Postman is a software company that offers a platform for users to design, develop, test, and organize custom APIs.

H1B Sponsorship

Postman has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data powered by the US Department of Labor)
Trends of Total Sponsorships
2025 (22)
2024 (6)
2023 (4)
2022 (5)
2021 (2)
2020 (1)

Funding

Current Stage: Late Stage
Total Funding: $433M
Key Investors: Insight Partners, CRV, Nexus Venture Partners
2021-08-18 · Series D · $225M
2020-06-11 · Series C · $150M
2019-06-19 · Series B · $50M

Leadership Team

Abhinav Asthana · Founder and CEO
Abhijit Kane · Co-Founder
Company data provided by Crunchbase