BlackLine · 3 hours ago

Principal Engineer

BlackLine is a leading provider of cloud software that automates and controls the entire financial close process. The Principal AI/ML Operations Engineer will lead the architecture, automation, and operationalization of machine learning and AI systems at scale, collaborating across teams to ensure the reliability, efficiency, and compliance of these systems in production.

Computer Software
H1B Sponsor Likely

Responsibilities

Define enterprise-level standards and reference architectures for ML-Ops and AIOps systems
Partner with data science, security, and product teams to set evaluation and governance standards (guardrails, bias, drift, latency SLAs)
Mentor senior engineers and drive design reviews for ML pipelines, model registries, and agentic runtime environments
Lead incident response and reliability strategies for ML/AI systems
Lead the deployment of AI models and systems in various environments
Collaborate with development teams to integrate AI solutions into existing workflows and applications
Ensure seamless integration with different platforms and technologies
Define and manage MCP Registry for agentic component onboarding, lifecycle versioning, and dependency governance
Build CI/CD pipelines automating LLM agent deployment, policy validation, and prompt-evaluation workflows
Develop and operationalize experimentation frameworks for agent evaluations, scenario regression, and performance analytics
Implement logging, metering, and auditing for agent behavior, function calls, and compliance alignment (a minimal audit-logging sketch follows this list)
Create scalable observability systems—tracking conversation outcomes, factual accuracy, latency, escalation patterns, and safety events
Architect end-to-end guardrails for AI agents including prompt injection protection, identity-aware routing, and tool usage authorization
Collaborate cross-functionally to standardize authentication, authorization, and session governance for multi-agent runtimes
Architect and standardize model registries and feature stores to support version tracking, lineage, and reproducibility across environments
Lead the deployment of machine learning models into production environments, ensuring scalability, reliability, and efficiency
Collaborate with software engineers to integrate machine learning models into existing applications and systems
Implement and maintain APIs for model inference
Design and manage training infrastructure including distributed training orchestration, GPU/TPU resource allocation, and automatic scaling
Implement CI/CD for model workflows using pipelines integrated with model validation, bias checks, and rollback automation
Build standardized experimentation frameworks for reproducible training, tuning, and deployment cycles (MLflow, W&B, Kubeflow); a minimal MLflow tracking sketch also follows this list
Manage and optimize the infrastructure required for machine learning operations in the cloud
Work closely with other teams to ensure the availability, security, and performance of machine learning systems
Implement robust monitoring solutions for deployed machine learning models to detect issues and ensure performance
Collaborate with data scientists and engineers to address and resolve model performance and data quality issues
Conduct regular system maintenance, updates, and optimizations to ensure optimal performance of machine learning solutions
Develop and maintain automation scripts and tools for managing machine learning workflows
Implement orchestration systems to streamline the end-to-end machine learning lifecycle, from data preparation to model deployment
Collaborate with data scientists to understand model requirements and constraints for deployment
Facilitate the transition of machine learning models from research to production, ensuring scalability and efficiency
Identify and implement optimizations to enhance the performance and efficiency of machine learning models in production
Conduct performance analysis and implement improvements based on resource-utilization metrics
Implement security measures to protect machine learning systems and data
Ensure compliance with regulatory requirements and industry standards related to machine learning and data privacy
Integrate audit controls, metadata storage, and lineage tracking across ML and AI workflows
Ensure complete monitoring and feedback loops including event logs, evaluations, and automated retraining triggers
Enforce secure deployment patterns with Infrastructure-as-Code and cloud-native secrets management
Define SLAs, error budgets, and compliance reporting mechanisms for ML and AI systems
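
To illustrate the agent audit-logging responsibility above, here is a minimal sketch in Python using only the standard library; the field names, tool name, and example values are hypothetical placeholders rather than an actual schema. In practice these records would be shipped to a central log pipeline for metering, compliance review, and observability dashboards.

    # Minimal sketch: structured audit logging for agent tool/function calls.
    # Field names and the example call are hypothetical; a real system would
    # forward these JSON records to a log pipeline for metering and audit.
    import json
    import logging
    import time
    import uuid

    logger = logging.getLogger("agent.audit")
    logging.basicConfig(level=logging.INFO, format="%(message)s")

    def log_tool_call(session_id, agent, tool, arguments, result_status, latency_ms):
        """Emit one JSON audit record per tool invocation."""
        record = {
            "event": "tool_call",
            "event_id": str(uuid.uuid4()),
            "timestamp": time.time(),
            "session_id": session_id,
            "agent": agent,
            "tool": tool,
            "arguments": arguments,  # consider redacting sensitive fields here
            "status": result_status,
            "latency_ms": latency_ms,
        }
        logger.info(json.dumps(record))

    # Example usage with made-up values:
    log_tool_call(
        session_id="sess-123",
        agent="support-agent",
        tool="lookup_invoice",
        arguments={"invoice_id": "INV-42"},
        result_status="success",
        latency_ms=87.5,
    )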
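Similarly, for the experimentation-framework and model-registry responsibilities, a minimal MLflow sketch is shown below; the experiment name, registered model name, and local SQLite tracking store are assumptions for illustration, and a production setup would point at a shared tracking server instead.

    # Minimal sketch: reproducible experiment tracking and model registration with MLflow.
    import mlflow
    import mlflow.sklearn
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # Local SQLite-backed tracking store so the model registry works out of the box;
    # a shared tracking server URI would replace this in a real deployment.
    mlflow.set_tracking_uri("sqlite:///mlflow.db")
    mlflow.set_experiment("demo-classifier")  # hypothetical experiment name

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    with mlflow.start_run():
        params = {"n_estimators": 100, "max_depth": 5}
        model = RandomForestClassifier(**params, random_state=42).fit(X_train, y_train)

        # Log parameters and metrics so runs are reproducible and comparable.
        mlflow.log_params(params)
        mlflow.log_metric("accuracy", accuracy_score(y_test, model.predict(X_test)))

        # Log the model artifact and register a version in the model registry,
        # providing the version tracking and lineage called out above.
        mlflow.sklearn.log_model(
            model,
            artifact_path="model",
            registered_model_name="demo-classifier",  # hypothetical registry name
        )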

Qualifications

ML-Ops architecture · AI system deployment · Machine learning frameworks · Cloud ecosystems · CI/CD pipelines · Python · DevOps practices · Containerization technologies · Observability stacks · Problem-solving · Critical thinking · Adaptability

Required

Bachelor's or Master's degree in Computer Science, Machine Learning, Data Science, or a related field
10+ years in ML infrastructure, DevOps, and software system architecture; 4+ years leading MLOps or AIOps platforms
Strong programming skills in languages such as Python, Java, or Scala
Expertise in ML frameworks (TensorFlow, PyTorch, scikit-learn) and orchestration tools (Airflow, Kubeflow, Vertex AI, MLflow)
Proven experience operating production pipelines for ML and LLM-based systems across cloud ecosystems (GCP, AWS, Azure)
Deep familiarity with LangChain, LangGraph, ADK, or similar frameworks for agentic system runtime management
Strong competencies in CI/CD, IaC, and DevSecOps pipelines integrating testing, compliance, and deployment automation
Hands-on with observability stacks (Prometheus, Grafana, New Relic) for model and agent performance tracking
Understanding of governance frameworks for Responsible AI, auditability, and cost metering across training and inference workloads
Proficiency in containerization technologies (e.g., Docker, Kubernetes)
Proficient in scripting languages (e.g., Bash, Python) for automation
Experience with workflow orchestration tools (e.g., Apache Airflow)
Expertise in managing and optimizing cloud-based infrastructure
Familiarity with DevOps practices and tools for automated deployment
Understanding of network configurations and security protocols
Ability to define problems, collect and analyze data, and propose innovative solutions
Strong critical thinking skills to evaluate models and identify limitations
Comfortable working in a fast-paced, rapidly evolving environment
Proactive in staying up to date with the latest trends, techniques, and technologies in AI/data science

Benefits

Short-term and long-term incentive programs
A robust offering of benefit and wellness plans

Company

BlackLine

Companies turn to BlackLine (Nasdaq: BL) to help solve their most complex finance and accounting challenges.

H1B Sponsorship

BlackLine has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data Powered by US Department of Labor)
[Chart: Distribution of Different Job Fields Receiving Sponsorship]
Trends of Total Sponsorships
2025 (41)
2024 (32)
2023 (41)
2022 (50)
2021 (40)
2020 (41)

Funding

Current Stage
Late Stage

Leadership Team

Owen Ryan
Co-Chief Executive Officer
Mark Partin
Chief Financial Officer
Company data provided by crunchbase