Senior Machine Learning Platform Engineer @ Stability AI | Jobright.ai
58 applicants

Stability AI · 4 days ago

Senior Machine Learning Platform Engineer

Artificial Intelligence (AI) · Generative AI

Responsibilities

Design, develop, and maintain robust APIs that facilitate communication and data exchange between cloud-based services, particularly AWS, and HPC environments
Collaborate with cross-functional teams to understand the unique requirements of both cloud-based services and HPC systems, ensuring that the APIs developed meet the specific needs of these environments
Implement best practices for API design, including security, scalability, and performance optimization to ensure efficient interaction between cloud services and HPC clusters
Use services such as Cloudflare to enhance API performance, security, and reliability in cloud-to-HPC communication, optimizing for speed and resilience
Work closely with HPC engineers to identify and address integration challenges, striving for seamless connectivity between diverse systems and cloud-based platforms
Drive innovation by proposing and implementing new API strategies, enhancing the efficiency and functionality of data exchange between AWS, Cloudflare Workers, and on-premises HPC environments
Create comprehensive documentation and provide training to internal teams on the use and integration of developed APIs, focusing on AWS and Cloudflare environments
Monitor API performance and address issues related to data transfer, ensuring reliability and consistent operation between AWS, Cloudflare, and HPC systems (Slurm/AWS HyperPod)
Collaborate with the security team to ensure that the APIs comply with industry standards and best practices for data privacy and protection, especially in AWS and Cloudflare environments
Participate in incident management and root cause analysis to improve system reliability
Build containers with REST APIs for Gen AI functionality and host them on AWS and Azure

Qualifications

Cloud Computing · API Development · High-Performance Computing · AWS · HPC Cluster Management · Python · TypeScript · API Design · Docker · Kubernetes · CI/CD Automation · OAuth · JWT · Collaboration Skills

Required

8 years of experience in cloud computing and API development, with a deep understanding of High-Performance Computing environments, particularly in an AWS setting
Strong knowledge of HPC cluster management and job scheduling with Slurm and AWS HyperPod
Proficiency in programming languages such as Python and TypeScript, essential for API development and integration within AWS and/or Cloudflare Workers environments
Demonstrated expertise in API design, implementation, and maintenance, ensuring security and performance best practices within AWS and Cloudflare
Knowledge of containerization technologies (e.g., Docker, Kubernetes) for deployment of APIs within AWS, Cloudflare, and HPC systems
Experience with automating CI/CD pipelines
Familiarity with authentication and authorization protocols (e.g., OAuth, JWT) to ensure secure data exchange between AWS, Cloudflare, and HPC environments
Strong problem-solving skills and the ability to troubleshoot complex issues related to API integrations in a hybrid cloud-HPC setup, particularly in AWS and Cloudflare environments
Excellent communication and collaboration skills to work effectively with diverse teams and stakeholders in AWS and Cloudflare ecosystems

Benefits

Stock options
Benefits

Company

Stability AI

Stability AI is an artificial intelligence-driven visual art startup that designs and implements open AI tools.

Funding

Current Stage: Growth Stage
Total Funding: $256M
Key Investors: Intel
2024-06-25 · Series Unknown · $80M
2023-11-09 · Convertible Note · $50M
2023-05-01 · Convertible Note · $25M

Leadership Team

Hanno Basse, CTO
Emad Mostaque, Founder
Company data provided by Crunchbase