LLM AIOps Development Engineer - Data Center Networking jobs in United States

TikTok · 10 hours ago

LLM AIOps Development Engineer - Data Center Networking

TikTok is the leading destination for short-form mobile video, and they are seeking a passionate development engineer to join their Network Observation team. The role involves designing and implementing AIOps solutions for hyperscale data center networks, focusing on building intelligent, autonomous systems that enhance network operations through AI technologies.
Social Media · Video · Media and Entertainment · Marketing · Content Creators · Content Discovery
H1B Sponsor Likely

Responsibilities

Build a Panoramic Network Observability Platform: Develop a streaming telemetry data pipeline for both physical and virtual networks, integrating multi-source data from gNMI, Netconf, IPFIX/NetFlow, and SNMP to provide a high-quality, real-time data foundation for AIOps
Develop an Intelligent Diagnostics and Root Cause Analysis System: Apply machine learning and deep learning algorithms to perform anomaly detection, correlation analysis, and intelligent noise reduction on massive volumes of network metrics, logs, and events. Swiftly pinpoint root causes of failures across the entire stack, from optical transceivers and switch hardware to protocol adjacencies and application traffic
Explore Innovative Applications of LLMs and Agents:
  Intelligent Operations Assistant: Build a conversational chatbot powered by Retrieval-Augmented Generation (RAG) that understands natural language queries, automatically queries knowledge bases and monitoring data, and provides precise troubleshooting guidance and network status reports
  Automated Remediation and Smart Runbooks: Train operational Agents to safely and controllably invoke network change tools and APIs. Empower them to autonomously generate, recommend, or even execute remediation plans and emergency runbooks based on their understanding of failure scenarios
Establish Capacity and Risk Prediction Capabilities: Forecast network capacity bottlenecks, high-risk links, and "sub-healthy" devices based on historical data and business growth models, enabling proactive scaling and preventative maintenance
Forge a Rock-Solid Engineering System: Adhere to engineering best practices to design and develop a highly available and scalable AIOps platform. Guarantee the stability and performance of the entire pipeline, from data collection and model training to online inference and automated closed-loop actions

Qualifications

Data Center Networking · Machine Learning · Golang · Python · AIOps · Big Data Processing · Observability Technologies · Soft Skills

Required

Solid Fundamentals in Computer Science and Networking: A deep understanding of data center network architectures (e.g., Spine-Leaf Fabric), and proficiency in key protocols such as EVPN/VXLAN and BGP/OSPF. In-depth knowledge of the Linux network stack is essential
Excellent Software Engineering Skills: Mastery of Golang or Python with outstanding coding and system design abilities. Familiarity with modern software development workflows, including microservices, containerization (Docker/Kubernetes), and CI/CD
Rich Platform Development Experience: Practical experience in one or more of the following areas is highly desirable:
  Big Data Processing: Familiarity with Kafka, Flink, ClickHouse/TSDB, and experience building real-time data pipelines and analytics systems
  Observability Technologies: Experience with Prometheus/OpenTelemetry, graph databases (e.g., Neo4j), and developing alert and event platforms
A Passion for AIOps/ML/LLM Practices: A keen interest in the latest advancements in Large Models and Agent technologies, with thoughtful insights or hands-on experience in their application to operations (e.g., RAG, tool use, safety evaluation)

Preferred

Experience in operating or developing for hyperscale (100,000+ servers) data center networks
Proven experience leading or making significant contributions to an LLM/Agent-based intelligent operations project with measurable business impact
Active contributions to open-source communities such as SONiC, P4/PINS, eBPF, Prometheus, or OpenTelemetry
In-depth research or practical experience in high-performance networking (RDMA/RoCE), SmartNICs (NIC Offload), or DPDK/eBPF
Experience building network configuration and control systems (e.g., based on SONiC, gNMI, Netconf)

Benefits

Employees have day-one access to medical, dental, and vision insurance
A 401(k) savings plan with company match
Paid parental leave
Short-term and long-term disability coverage
Life insurance
Wellbeing benefits
10 paid holidays per year
10 paid sick days per year
17 days of Paid Personal Time (prorated upon hire with increasing accruals by tenure)

Company

TikTok is a short-form video entertainment app and social network platform. It is a sub-organization of ByteDance.

H1B Sponsorship

TikTok has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is presented below for reference. (Data powered by the US Department of Labor)
Trends of Total Sponsorships
2025 (979)
2024 (601)
2023 (387)
2022 (322)
2021 (133)
2020 (72)

Funding

Current Stage: Late Stage

Leadership Team

N Ali Mohamed, CEO
Blake Chandlee, VP Global Business Solutions
Company data provided by crunchbase