Software Engineer, Training & Inference Infrastructure jobs in United States

DatologyAI · 6 months ago

Software Engineer, Training & Inference Infrastructure

DatologyAI is a company focused on optimizing training data for machine learning models. They are seeking a Senior Software Engineer to design, implement, and maintain large-scale training and inference infrastructure, working closely with researchers and product engineers.

Artificial Intelligence (AI) · Data Center · Data Integration · Database · Information Technology
H1B Sponsor Likely

Responsibilities

Architect and maintain training infrastructure that is reliable, scalable, and cost-efficient
Build robust model serving infrastructure for low-latency, high-throughput inference across heterogeneous hardware
Automate resource orchestration and fault recovery across GPUs, networking, OS, drivers, and cloud environments
Partner with researchers to productionize new models and features quickly and safely
Optimize training and inference pipelines for performance, reliability, and cost
Ensure all infrastructure meets the highest bar for reliability, security, and observability

Qualifications

Python · Deep learning frameworks · Large-scale systems · NVIDIA GPUs · ML architectures · Inference tooling · HPC technologies · Collaborative · Ownership mindset

Required

At least 5 years of professional software engineering experience
Expertise in Python
Understanding of modern ML architectures and an intuition for how to optimize their performance, particularly for training and/or inference
Proven experience designing and running large-scale training or inference systems in production
Commitment to engineering excellence: strong design, testing, and operational discipline
Collaborative, humble, and motivated to help the team succeed
Ownership mindset: you're comfortable learning fast and tackling problems end-to-end

Preferred

Experience with deep learning frameworks (PyTorch preferred)
Familiarity with inference tooling like vLLM, SGLang, or custom model parallel systems
Familiarity with PyTorch, NVIDIA GPUs, and the software stacks that optimize them (e.g., NCCL, CUDA), as well as HPC technologies such as InfiniBand, NVLink, and AWS EFA

Benefits

100% covered health benefits (medical, vision, and dental).
401(k) plan with a generous 4% company match.
Unlimited PTO policy.
Annual $2,000 wellness stipend.
Annual $1,000 learning and development stipend.
Daily lunches and snacks are provided in our office!
Relocation assistance for employees moving to the Bay Area.

Company

DatologyAI

DatologyAI is an AI data-curation startup that develops deep learning tools for automatically selecting training data.

H1B Sponsorship

DatologyAI has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is presented below for your reference. (Data powered by the US Department of Labor)
Trends of Total Sponsorships
2025 (4)
2024 (2)

Funding

Current Stage: Early Stage
Total Funding: $57.65M
Key Investors: Felicis, Amplify Partners
2024-05-08: Series A · $46M
2024-02-22: Seed · $11.65M

Leadership Team

Ari Morcos, CEO and Co-Founder
Bogdan Gaza, Co-Founder & CTO

Company data provided by crunchbase