TikTok · 8 hours ago
Senior Platform Engineer - USDS
TikTok is the leading destination for short-form mobile video, and they are seeking a hands-on Platform Engineer to architect, build, and operate the greenfield on-premise infrastructure that powers their next-generation AI initiatives. The role involves collaborating cross-functionally to enhance security tools, manage high-performance workloads, and optimize resources for AI and LLM applications.
Content Creators · Content Discovery · Media and Entertainment · Social Media · Video
Responsibilities
Architect and operate a highly available, on-premise, Kubernetes-based GPU compute cluster. You will manage the scheduling and orchestration of high-performance workloads. Build robust CI/CD pipelines specifically for LLM applications
Infrastructure as Code (IaC): Lead the design and implementation of IaC (using Terraform, Ansible, or SaltStack) to fully automate the provisioning of bare-metal servers and the network and storage layers, ensuring the environment is reproducible and idempotent
Lead and perform hands-on technical work, including architecture design and code development for an on-premise, highly scalable, and parallelized infrastructure. The role includes developing internal tools to manage the entire lifecycle of a large-scale RAG pipeline
Architect, implement, and manage a high-performance compute cluster for LLM workloads. This involves the selection and configuration of specialized hardware like GPUs, as well as the design of a robust network fabric to facilitate efficient inter-node communication for parallel processing
Implement security best practices for a private data center environment. This includes configuring network firewalls, managing access controls, and encrypting data at rest and in transit
Establish comprehensive monitoring and alerting systems to track the health and performance of the compute cluster and LLM workloads. This involves analyzing metrics related to GPU utilization, memory usage, network throughput, and model inference latency. You will proactively resolve performance issues to enhance platform reliability and operational support for internal teams
Collaborate with internal stakeholders to optimize resource utilization and improve the platform's efficiency. You'll work closely with data scientists and machine learning engineers to understand their compute needs and ensure the infrastructure is optimized for their specific workloads
Qualifications
Required
Bachelor's degree in Computer Science, Information Technology, or a related field, with 5+ years of experience in platform, systems, or infrastructure engineering
Proven expertise in infrastructure automation using tools like Terraform, Ansible, or SaltStack, with strong hands-on experience in automating deployments and managing bare-metal hardware and virtual machines
Deep experience with on-premises infrastructure, with a solid understanding of large-scale data processing, distributed computing, and other big data infrastructure
Strong grasp of networking, storage, and virtualization technologies, with practical knowledge of building and supporting complex distributed systems for parallel processing of LLM workloads
Hands-on experience with container technologies (e.g., Docker) and managing on-premises Kubernetes clusters in production environments
Proficiency in scripting and automation using languages like Python and Go, and experience with full-stack development in languages such as Rust to build internal platforms
Preferred
Demonstrated ability to work collaboratively with cross-functional teams to build scalable, reliable, and high-performance internal platforms and services specifically tailored for AI and LLM use cases
Relevant certifications such as Oracle Cloud Infrastructure Architect, CKA (Certified Kubernetes Administrator), RHCE (Red Hat Certified Engineer), or other Network Engineering certifications
Technical writing and communication skills that enable effective problem-solving and strengthen interpersonal relationships
Critical thinking and architectural decision-making skills to support effective, collaborative leadership in a fast-paced, dynamic environment
Strong team and cross-functional (XFN) coordination and management abilities
Experience with cloud platforms other than OCI (e.g., AWS, Azure, GCP)
Benefits
Medical, dental, and vision insurance
401(k) savings plan with company match
Paid parental leave
Short-term and long-term disability coverage
Life insurance
Wellbeing benefits
10 paid holidays per year
10 paid sick days per year
17 days of Paid Personal Time (prorated upon hire with increasing accruals by tenure)
Company
TikTok
TikTok is a short-form video entertainment app and social network platform. It is a sub-organization of ByteDance.
H1B Sponsorship
TikTok has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is presented below for your reference. (Data powered by US Department of Labor)
Distribution of Different Job Fields Receiving Sponsorship (chart not shown; the highlighted segment represents job fields similar to this job)
Trends of Total Sponsorships
2025 (979)
2024 (601)
2023 (387)
2022 (322)
2021 (133)
2020 (72)
Funding
Current Stage
Late Stage
Recent News
2026-01-14
https://fastcompanyme.com
2026-01-14
Company data provided by Crunchbase