Senior ML/Data Ops Engineer II jobs in United States

Tabby | تابي · 2 hours ago

Senior ML/Data Ops Engineer II

Tabby is a financial technology company that reshapes how people shop, earn, and save. They are seeking a Senior ML/Data Ops Engineer II to manage model serving, optimize data pipelines, and ensure infrastructure reliability for their innovative payment solutions.

Artificial Intelligence (AI) · Billing · Finance · Financial Services · FinTech · Payments
Growth Opportunities
No H1B sponsorship

Responsibilities

Deep expertise in high-throughput serving using vLLM, NVIDIA TensorRT-LLM, and sglang to minimize latency and maximize hardware efficiency
Hands-on experience deploying and optimizing large-scale open-weights models, specifically DeepSeek 3.1/3.2, Qwen, and GPT-OSS variants
Advanced optimization and security hardening of Docker specifically for GPU environments
Managing model weights and orchestration within Kubernetes (GKE) environments
Designing and maintaining high-throughput CDC (Change Data Capture) pipelines using the Apache ecosystem (e.g., Debezium, Kafka) to sync data from Cloud PostgreSQL
Deploying and tuning ClickHouse for real-time analytics, ML feature storage, and high-speed logging
Orchestrating complex ML data workflows using Airflow (Google Cloud Composer) to ensure data reliability
Strong Linux systems expertise including internals, networking, and performance tuning for large-scale distributed systems
Experience with Istio service mesh to manage microservices communication and traffic
Provisioning and maintaining dedicated GPU nodes (A100/H100/H200/B200), including driver management and OS-level tuning using Ansible
Solid Kubernetes expertise: controllers, CRDs, CNI, and Ingress
Implementing pipelines as code within GitLab CI, managing runners, caching, and security scanning
Infrastructure as Code with Terraform and Terragrunt
Proficiency in Python/Bash for building custom automation and AI Agent tooling
Conducting rigorous load testing for GenAI applications, focusing on metrics like TTFT (time to first token), TPS (tokens per second), and RPS (requests per second)
Deploying and managing LiteLLM Gateway for unified API access, load balancing, and cost tracking
Experience with Datadog for monitoring GPU utilization, inference health, and log pipelines
Strong ownership mindset: balancing speed, reliability, and cost
Comfortable working cross-functionally with developers, security, and compliance
Excellent sense of responsibility and accountability
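The CDC bullet above (Debezium → Kafka → ClickHouse) implies flattening Debezium change events into rows an analytics table can ingest. A minimal sketch of that transform, assuming a ReplacingMergeTree-style dedup scheme with `_version`/`_deleted` columns (these column names and the function are hypothetical, not specified in the posting):

```python
import json

def cdc_to_clickhouse_row(event: str) -> dict:
    """Flatten one Debezium change event (JSON string) into a row dict
    suitable for inserting into a ClickHouse ReplacingMergeTree table.
    Debezium op codes: c=create, u=update, d=delete, r=snapshot read."""
    payload = json.loads(event)["payload"]
    op = payload["op"]
    # for deletes the last row image lives in `before`; otherwise in `after`
    row = dict(payload["before"] if op == "d" else payload["after"])
    row["_version"] = payload["ts_ms"]        # dedup/version key for ReplacingMergeTree
    row["_deleted"] = 1 if op == "d" else 0   # deletes become tombstone rows
    return row
```

Keeping deletes as tombstone rows (rather than dropping them) lets ClickHouse collapse the history at merge time while queries filter on `_deleted = 0`.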

Qualifications

vLLM · NVIDIA TensorRT-LLM · Kubernetes · Apache CDC · Docker · Python · Terraform · ClickHouse · Airflow · Ansible · Datadog · Istio · Bash · English B2

Required

English B2 or higher
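The load-testing requirement references TTFT, TPS, and RPS; each can be derived from per-request timestamps collected during a streamed generation run. A minimal sketch, with all names hypothetical:

```python
from dataclasses import dataclass

@dataclass
class RequestTrace:
    """Timestamps (seconds) collected for one streamed generation request."""
    sent_at: float          # request dispatched
    first_token_at: float   # first streamed token received
    done_at: float          # final token received
    tokens_out: int         # completion tokens generated

def ttft(trace: RequestTrace) -> float:
    """Time To First Token for a single request."""
    return trace.first_token_at - trace.sent_at

def tps(trace: RequestTrace) -> float:
    """Tokens Per Second over the decode phase of one request."""
    return trace.tokens_out / (trace.done_at - trace.first_token_at)

def rps(traces: list[RequestTrace]) -> float:
    """Requests Per Second completed over the whole test window."""
    start = min(t.sent_at for t in traces)
    end = max(t.done_at for t in traces)
    return len(traces) / (end - start)
```

In practice TTFT and TPS are reported as percentiles (p50/p95) across many concurrent requests rather than per request, since tail latency is what serving optimizations like vLLM target.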

Preferred

Experience working in PCI-DSS, SOC 2, or other regulated compliance environments

Benefits

Full-time B2B contract
Fully remote setup, work from anywhere in Europe
Up to 20% tax allowance
22 paid leave days annually
Stock options (ESOP) in a fast-scaling, pre-IPO company
Flexi benefits you can use for wellness, travel, or learning
Relocation support is available to our hubs in Armenia, Georgia, Serbia, and Spain, including flights, temporary accommodation, and legal setup.

Company

Tabby | تابي

Tabby is a financial technology company that helps millions of people in the Middle East to stay in control of their spending and make the most out of their money.

Funding

Current Stage
Late Stage
Total Funding
$1.85B
Key Investors
Hassana Investment Company (HIC) · JP Morgan · Wellington Management
2025-10-27 · Secondary Market
2025-02-12 · Series E · $160M
2023-12-21 · Series D · $50M

Leadership Team

Daniil Barkalov
Co-founder and COO
Company data provided by Crunchbase