Senior Platform Engineer - USDS jobs in United States

TikTok · 12 hours ago

Senior Platform Engineer - USDS

TikTok is the leading destination for short-form mobile video, and they are seeking a hands-on Platform Engineer to architect, build, and operate the greenfield on-premise infrastructure that powers their next-generation AI initiatives. The role involves collaborating cross-functionally to enhance security tools, manage high-performance workloads, and optimize resources for AI and LLM applications.

Content Creators · Content Discovery · Media and Entertainment · Social Media · Video
H1B Sponsor Likely

Responsibilities

Architect and operate a highly available, on-premise Kubernetes-based GPU compute cluster. You will manage the scheduling and orchestration of high-performance workloads. Build robust CI/CD pipelines specifically for LLM applications
Infrastructure as Code (IaC): Lead the design and implementation of IaC (using Terraform, Ansible or Saltstack) to fully automate the provisioning of bare-metal servers and the network and storage layers, ensuring the environment is reproducible and idempotent
Lead and perform hands-on technical work, including architecture design and code development for an on-premise, highly scalable, and parallelized infrastructure. The role includes developing internal tools to manage the entire lifecycle of a large-scale RAG pipeline
Architect, implement, and manage a high-performance compute cluster for LLM workloads. This involves the selection and configuration of specialized hardware like GPUs, as well as the design of a robust network fabric to facilitate efficient inter-node communication for parallel processing
Implement security best practices for a private data center environment. This includes configuring network firewalls, managing access controls, and encrypting data at rest and in transit
Establish comprehensive monitoring and alerting systems to track the health and performance of the compute cluster and LLM workloads. This involves analyzing metrics related to GPU utilization, memory usage, network throughput, and model inference latency. You will proactively resolve performance issues to enhance platform reliability and operational support for internal teams
Collaborate with internal stakeholders to optimize resource utilization and improve the platform's efficiency. You'll work closely with data scientists and machine learning engineers to understand their compute needs and ensure the infrastructure is optimized for their specific workloads

Qualifications

Kubernetes · Infrastructure as Code · GPU computing · Terraform · Ansible · Python · Go · Docker · Networking · Cloud platforms · Team coordination · Technical writing · Critical thinking

Required

Bachelor's degree in Computer Science, Information Technology, or a related field, with 5+ years of experience in platform, systems, or infrastructure engineering
Proven expertise in infrastructure automation using tools like Terraform, Ansible or Saltstack, with strong hands-on experience in automating deployments and managing bare-metal hardware and virtual machines
Deep experience with on-premises infrastructure, with a solid understanding of large-scale data processing, distributed computing, and other big data infrastructure
Strong grasp of networking, storage, and virtualization technologies, with practical knowledge of building and supporting complex distributed systems for parallel processing of LLM workloads
Hands-on experience with container technologies (e.g., Docker) and managing on-premises Kubernetes clusters in production environments
Proficiency in scripting and automation using languages like Python and Go, and experience with full-stack development in languages such as Rust to build internal platforms

Preferred

Demonstrated ability to work collaboratively with cross-functional teams to build scalable, reliable, and high-performance internal platforms and services specifically tailored for AI and LLM use cases
Relevant certifications such as Oracle Cloud Infrastructure Architect, CKA (Certified Kubernetes Administrator), RHCE (Red Hat Certified Engineer), or other Network Engineering certifications
Technical writing and communication skills that enable effective problem-solving and strengthen interpersonal relationships
Critical thinking and architectural decision-making skills to support effective, collaborative leadership in a fast-paced, dynamic environment
Strong team and cross-functional (XFN) coordination and management abilities
Experience with cloud platforms other than OCI (e.g., AWS, Azure, GCP)

Benefits

Medical, dental, and vision insurance
401(k) savings plan with company match
Paid parental leave
Short-term and long-term disability coverage
Life insurance
Wellbeing benefits
10 paid holidays per year
10 paid sick days per year
17 days of Paid Personal Time (prorated upon hire with increasing accruals by tenure)

Company

TikTok is a short-form video entertainment app and social network platform. It is a sub-organization of ByteDance.

H1B Sponsorship

TikTok has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is presented below for your reference. (Data powered by US Department of Labor)
Trends of Total Sponsorships
2025 (979)
2024 (601)
2023 (387)
2022 (322)
2021 (133)
2020 (72)

Funding

Current Stage
Late Stage

Leadership Team

N Ali Mohamed
CEO
Blake Chandlee
VP Global Business Solutions
Company data provided by crunchbase