DatologyAI · 6 months ago
Software Engineer, Training & Inference Infrastructure
DatologyAI is a company focused on optimizing training data for machine learning models. They are seeking a Senior Software Engineer to design, implement, and maintain large-scale training and inference infrastructure, working closely with researchers and product engineers.
Artificial Intelligence (AI), Data Center, Data Integration, Database, Information Technology
Responsibilities
Architect and maintain training infrastructure that is reliable, scalable, and cost-efficient
Build robust model serving infrastructure for low-latency, high-throughput inference across heterogeneous hardware
Automate resource orchestration and fault recovery across GPUs, networking, OS, drivers, and cloud environments
Partner with researchers to productionize new models and features quickly and safely
Optimize training and inference pipelines for performance, reliability, and cost
Ensure all infrastructure meets the highest bar for reliability, security, and observability
Qualifications
Required
At least 5 years of professional software engineering experience
Expertise in Python
Understanding of modern ML architectures and an intuition for how to optimize their performance, particularly for training and/or inference
Proven experience designing and running large-scale training or inference systems in production
Commitment to engineering excellence: strong design, testing, and operational discipline
Collaborative, humble, and motivated to help the team succeed
Ownership mindset: you're comfortable learning fast and tackling problems end-to-end
Preferred
Experience with deep learning frameworks (PyTorch preferred)
Familiarity with inference tooling like vLLM, SGLang, or custom model parallel systems
Familiarity with PyTorch, NVIDIA GPUs, and the software stacks that optimize them (e.g., NCCL, CUDA), as well as HPC technologies such as InfiniBand, NVLink, and AWS EFA
Benefits
100% covered health benefits (medical, vision, and dental).
401(k) plan with a generous 4% company match.
Unlimited PTO policy.
Annual $2,000 wellness stipend.
Annual $1,000 learning and development stipend.
Daily lunches and snacks are provided in our office!
Relocation assistance for employees moving to the Bay Area.
Company
DatologyAI
DatologyAI is an AI data curation startup that develops deep learning tools for automatically selecting training data.
H1B Sponsorship
DatologyAI has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for your reference. (Data powered by the US Department of Labor.)
Trends of Total Sponsorships: 2025 (4), 2024 (2)
Funding
Current Stage: Early Stage
Total Funding: $57.65M
Key Investors: Felicis, Amplify Partners
2024-05-08 · Series A · $46M
2024-02-22 · Seed · $11.65M
Company data provided by Crunchbase