Arcee.ai · 1 day ago
Machine Learning Infrastructure Engineer
Artificial Intelligence (AI), Generative AI
Responsibilities
Design and implement scalable, efficient, and reliable machine learning infrastructure (e.g., containerization, orchestration, and cloud services).
Develop and maintain infrastructure as code (IaC) using tools like Terraform, AWS CloudFormation, or Google Cloud Deployment Manager.
Design and implement model serving platforms (e.g., TensorFlow Serving, AWS SageMaker, or Azure Machine Learning) for efficient model deployment and management.
Develop and maintain automated model deployment pipelines using tools like Jenkins, GitLab CI/CD, or CircleCI.
Collaborate with data engineers to design and implement data pipelines that feed machine learning models.
Ensure data quality, integrity, and security throughout the data lifecycle.
Develop and implement monitoring and logging solutions (e.g., Prometheus, Grafana, or ELK Stack) to track model performance, latency, and system health.
Optimize infrastructure resources and model performance using techniques like hyperparameter tuning, model pruning, and knowledge distillation.
Work closely with data scientists, engineers, and researchers to identify infrastructure needs and develop solutions.
Communicate technical information effectively to both technical and non-technical stakeholders.
Stay current with industry trends, emerging technologies, and best practices in machine learning infrastructure.
Participate in conferences, meetups, and online forums to expand knowledge and network with peers.
Qualifications
Required
Bachelor's or Master's degree in Computer Science, Engineering, or a related field.
3+ years of experience in machine learning infrastructure, DevOps, or a related field.
Experience with cloud providers (e.g., AWS, GCP, or Azure) and containerization (e.g., Docker).
Proficiency in programming languages like Python, Java, or C++.
Experience with machine learning frameworks like TensorFlow, PyTorch, or Scikit-learn.
Familiarity with infrastructure as code (IaC) tools like Terraform or CloudFormation.
Knowledge of container orchestration tools like Kubernetes or Docker Swarm.
Excellent communication, collaboration, and problem-solving skills.
Ability to work in a fast-paced environment and prioritize tasks effectively.
Preferred
Cloud provider certifications (e.g., AWS Certified DevOps Engineer or GCP Professional Cloud Developer).
Machine learning certifications (e.g., TensorFlow Certified Developer or PyTorch Certified Engineer).
Experience with model serving platforms like TensorFlow Serving or AWS SageMaker.
Experience with automated model deployment pipelines using tools like Jenkins or GitLab CI/CD.
Experience with monitoring and logging solutions like Prometheus or ELK Stack.
Familiarity with model explainability and interpretability techniques.
Knowledge of data privacy and security best practices.
Benefits
Health, dental, and vision insurance
401(k)
Opportunities for growth, training, and conference attendance
A dynamic, diverse team that values innovation and open communication
Company
Arcee.ai
Arcee.ai develops context-adapted LLMs through its domain-adapted language model system (DALM).