Site Reliability Engineer SRE – ML platform jobs in United States

Galent · 8 hours ago

Site Reliability Engineer SRE – ML platform

Galent is seeking a Site Reliability Engineer specializing in Machine Learning platforms. The role involves designing and implementing cloud solutions, building MLOps pipelines, and ensuring the reliability of ML systems.

Computer Software
Hiring Manager: Magesh Babu

Responsibilities

6+ years of experience in MLOps with strong knowledge of Kubernetes, Python, MongoDB, and AWS
Good understanding of Apache Solr
Proficient with Linux administration
Knowledge of ML models and LLMs
Ability to understand tools used by data scientists and experience with software development and test automation
Ability to design and implement cloud solutions and to build MLOps pipelines on AWS
Experience working with cloud computing and database systems
Experience building custom integrations between cloud-based systems using APIs
Experience developing and maintaining ML systems built with open-source tools
Experience with MLOps frameworks such as Kubeflow, MLflow, DataRobot, and Airflow, as well as Docker and Kubernetes
Experience developing containers and working with Kubernetes in cloud computing environments
Familiarity with one or more data-oriented workflow orchestration frameworks (Kubeflow, Airflow, Argo, etc.)
Ability to translate business needs to technical requirements
Strong understanding of software testing, benchmarking, and continuous integration
Exposure to machine learning methodology and best practices
Good communication skills and ability to work in a team

Qualification

MLOps, Kubernetes, Python, AWS, MongoDB, Apache Solr, Linux administration, MLOps frameworks, APIs, Software development, Test automation, Communication skills, Teamwork

Company

Galent

Galent is an AI-native digital engineering firm at the forefront of the AI revolution, dedicated to delivering unified, enterprise-ready AI solutions that transform businesses and industries.

Funding

Current Stage: Late Stage
Company data provided by Crunchbase