Senior ML/Data Ops Engineer II jobs in United States

Tabby | تابي · 2 hours ago

Senior ML/Data Ops Engineer II

Tabby is a financial technology company that reshapes how people shop, earn, and save. They are seeking a Senior ML/Data Ops Engineer II to manage model serving, optimize data pipelines, and ensure infrastructure reliability for their innovative payment solutions.

Artificial Intelligence (AI) · Billing · Finance · Financial Services · FinTech · Payments
Growth Opportunities
No H1B sponsorship

Responsibilities

Deep expertise in high-throughput serving using vLLM, NVIDIA TensorRT-LLM, and sglang to minimize latency and maximize hardware efficiency
Hands-on experience deploying and optimizing large-scale open-weights models, specifically DeepSeek 3.1/3.2, Qwen, and GPT-OSS variants
Advanced optimization and security hardening of Docker specifically for GPU environments
Managing model weights and orchestration within Kubernetes (GKE) environments
Designing and maintaining high-throughput CDC (Change Data Capture) pipelines using the Apache ecosystem (e.g., Debezium, Kafka) to sync data from Cloud PostgreSQL
Deploying and tuning ClickHouse for real-time analytics, ML feature storage, and high-speed logging
Orchestrating complex ML data workflows using Airflow (Google Cloud Composer) to ensure data reliability
Strong Linux systems expertise including internals, networking, and performance tuning for large-scale distributed systems
Experience with Istio service mesh to manage microservices communication and traffic
Provisioning and maintaining dedicated GPU nodes (A100/H100/H200/B200), including driver management and OS-level tuning using Ansible
Solid Kubernetes expertise: controllers, CRDs, CNI, and Ingress
Implementing pipelines as code within GitLab CI, managing runners, caching, and security scanning
Infrastructure as Code with Terraform and Terragrunt
Proficiency in Python/Bash for building custom automation and AI Agent tooling
Conducting rigorous load testing for GenAI applications, focusing on metrics like TTFT (time to first token), TPS (tokens per second), and RPS (requests per second)
Deploying and managing LiteLLM Gateway for unified API access, load balancing, and cost tracking
Experience with Datadog for monitoring GPU utilization, inference health, and log pipelines
Strong ownership mindset: balancing speed, reliability, and cost
Comfortable working cross-functionally with developers, security, and compliance
Excellent sense of responsibility and accountability
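The CDC bullet above (Debezium → Kafka → ClickHouse) implies flattening Debezium change events into rows an analytics table can ingest. A minimal sketch of that transform, assuming a ReplacingMergeTree-style dedup scheme with `_version`/`_deleted` columns (these column names and the function are hypothetical, not specified in the posting):

```python
import json

def cdc_to_clickhouse_row(event: str) -> dict:
    """Flatten one Debezium change event (JSON string) into a row dict
    suitable for inserting into a ClickHouse ReplacingMergeTree table.
    Debezium op codes: c=create, u=update, d=delete, r=snapshot read."""
    payload = json.loads(event)["payload"]
    op = payload["op"]
    # for deletes the last row image lives in `before`; otherwise in `after`
    row = dict(payload["before"] if op == "d" else payload["after"])
    row["_version"] = payload["ts_ms"]        # dedup/version key for ReplacingMergeTree
    row["_deleted"] = 1 if op == "d" else 0   # deletes become tombstone rows
    return row
```

Keeping deletes as tombstone rows (rather than dropping them) lets ClickHouse collapse the history at merge time while queries filter on `_deleted = 0`.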

Qualifications

vLLM · NVIDIA TensorRT-LLM · Kubernetes · Apache CDC · Docker · Python · Terraform · ClickHouse · Airflow · Ansible · Datadog · Istio · Bash · English B2

Required

English B2 or higher
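The load-testing requirement references TTFT, TPS, and RPS; each can be derived from per-request timestamps collected during a streamed generation run. A minimal sketch, with all names hypothetical:

```python
from dataclasses import dataclass

@dataclass
class RequestTrace:
    """Timestamps (seconds) collected for one streamed generation request."""
    sent_at: float          # request dispatched
    first_token_at: float   # first streamed token received
    done_at: float          # final token received
    tokens_out: int         # completion tokens generated

def ttft(trace: RequestTrace) -> float:
    """Time To First Token for a single request."""
    return trace.first_token_at - trace.sent_at

def tps(trace: RequestTrace) -> float:
    """Tokens Per Second over the decode phase of one request."""
    return trace.tokens_out / (trace.done_at - trace.first_token_at)

def rps(traces: list[RequestTrace]) -> float:
    """Requests Per Second completed over the whole test window."""
    start = min(t.sent_at for t in traces)
    end = max(t.done_at for t in traces)
    return len(traces) / (end - start)
```

In practice TTFT and TPS are reported as percentiles (p50/p95) across many concurrent requests rather than per request, since tail latency is what serving optimizations like vLLM target.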

Preferred

Experience working in PCI-DSS, SOC 2, or other regulated compliance environments

Benefits

Full-time B2B contract
Fully remote setup, work from anywhere in Europe
Up to 20% tax allowance
22 paid leave days annually
Stock options (ESOP) in a fast-scaling, pre-IPO company
Flexi benefits you can use for wellness, travel, or learning
Relocation support is available to our hubs in Armenia, Georgia, Serbia, and Spain, including flights, temporary accommodation, and legal setup.

Company

Tabby | تابي

Tabby is a financial technology company that helps millions of people in the Middle East to stay in control of their spending and make the most out of their money.

Funding

Current Stage
Late Stage
Total Funding
$1.85B
Key Investors
Hassana Investment Company (HIC) · JP Morgan · Wellington Management
2025-10-27 · Secondary Market
2025-02-12 · Series E · $160M
2023-12-21 · Series D · $50M

Leadership Team

Daniil Barkalov
Co-founder and COO
Company data provided by Crunchbase