Advanced Microdevices Pvt. Ltd. (India) · 6 hours ago
Principal Networking Engineer - QoS / Networking
Advanced Micro Devices, Inc. is a leader in next-generation computing experiences, focusing on innovation and collaboration. They are seeking a Principal Networking Engineer to own the end-to-end QoS strategy and implementation across data center SmartNICs/DPUs, ensuring predictable performance for AI/ML and latency-sensitive services.
Biopharma · Biotechnology · Industrial · Manufacturing
Responsibilities
Own QoS architecture across network tiers (host and NIC/DPU), including classification, policing, shaping, queue mapping, and scheduling strategies for mixed workloads (AI collectives, storage, RPC, control plane)
Design and implement SmartNIC QoS: map DSCP/PCP to NIC traffic classes, configure hardware TX/RX queues, rate limiters, WFQ/DRR schedulers, and offload paths for RDMA/TCP/UDP
Switch QoS policy design: configure PFC, ETS, ECN/RED/WRED, buffer pools, queue thresholds, shared vs. dedicated buffers, and congestion control across multiple ASICs (e.g., Broadcom, NVIDIA/Mellanox, Marvell)
RDMA/RoCE tuning end‑to‑end: lossless/loss‑tolerant modes, CNP/ECN parameters, RNR/retry behavior, MTU/Jumbo frames, and scalable multi‑tenant profiles
Performance engineering: build test plans and run micro/macro benchmarks (e.g., ib_send_lat/ib_write_bw, RCCL/NCCL, iperf, switch counters/telemetry) to validate latency, throughput, tail performance, and fairness
Instrumentation & observability: define SLIs/SLOs for QoS (tail latency, drops, PFC events, ECN marks, queue depth, buffer occupancy); integrate with streaming telemetry (gNMI/INT/sFlow) and develop dashboards and alerts
Troubleshoot complex incidents: incast, PFC deadlocks, microbursts, head‑of‑line blocking, unfair scheduling, and noisy neighbors; lead root‑cause analysis and corrective actions
Scale & automation: deliver declarative QoS via intent‑based configs and CI/CD (e.g., Ansible/Salt, NAPALM, gNMI/gNOI, NETCONF/YANG), including pre‑deployment simulation and automated canary/rollback
Documentation & standards: author design docs, runbooks, and guidance for tenant teams; contribute to internal standards and vendor requirements
Qualification
Required
Strong experience in datacenter networking or systems engineering, with direct ownership of QoS on switches and/or SmartNICs/DPUs
Deep knowledge of QoS mechanisms: classification/marking (DSCP/PCP), policing, shaping, queueing (PRIO, WRR/WFQ/DRR), scheduling hierarchies, and buffer management
Hands‑on with PFC, ETS, ECN/WRED, explicit buffer tuning, and RDMA/RoCE performance/correctness in production
Experience configuring merchant switch silicon (e.g., Broadcom Trident/Tomahawk, NVIDIA Spectrum, Marvell Teralynx) via NOS CLIs/SDKs (e.g., SONiC, Cumulus, NX‑OS, EOS, Onyx)
SmartNIC/DPU experience (e.g., NVIDIA BlueField, Intel IPU, AMD Pensando, Netronome/Agilio): queue configuration, rate limiting, hardware offloads, and host‑NIC QoS mapping
Proficiency with Linux networking (TC, qdisc, mqprio, XDP/eBPF), ethtool, RDMA tools (perftest, rdma-core utilities), and packet/flow analysis (tcpdump, Wireshark, INT/sFlow)
Strong automation skills: Python and/or Go for network automation, telemetry pipelines, and CI/CD integration; Git‑based workflows
Demonstrated ability to debug low‑level performance issues (NIC queues, IRQ affinity, NUMA, PCIe/xGMI topology, driver/firmware interactions)
Excellent written/verbal communication; strong design documentation and cross‑team leadership
Preferred
Large‑scale operations experience (10K+ servers or multi‑region fabrics) with QoS at fleet scale and multi‑tenant isolation
Practical experience with AI/ML workloads (RCCL/NCCL AllReduce, parameter servers, distributed training) and storage (NVMe‑oF, NFS, SMB, object) QoS trade‑offs
Experience with traffic engineering and congestion control in Clos fabrics; familiarity with INT, gNMI, in‑band telemetry, and P4 concepts
Contributions to SONiC, DPDK, eBPF/XDP, or OpenConfig; experience with YANG/NETCONF, gNOI
Vendor engagement/bring‑up: working with ASIC/NIC vendors on buffer models, scheduling algorithms, and firmware roadmaps
Security awareness for multi‑tenant environments (DSCP abuse, QoS starvation, control‑plane protection, CoPP/ACL integration)
Benefits
AMD benefits at a glance.
Company
Advanced Microdevices Pvt. Ltd. (India)
Advanced Microdevices (mdi) is a leader in innovative membrane technologies.
Funding
Current Stage
Late Stage
Leadership Team
Nalini Kant Gupta
Founder & Managing Director
Company data provided by crunchbase