SK hynix America · 1 month ago

AI/ML Cluster System Design Engineer, Contractor

San Jose, CA

Contract

Onsite

Senior Level, Lead/Staff

$150K/yr - $200K/yr

8+ years exp

SK hynix America is a leader in semiconductor innovation, developing advanced memory solutions. They are seeking an AI/ML Cluster System Design Engineer to design and optimize large-scale GPU clusters for AI/ML workloads, ensuring performance and operational efficiency.

Semiconductors

H1B Sponsor Likely

Responsibilities

Architect robust, scalable, and efficient computing clusters that maximize AI workload performance while meeting operational and budgetary constraints

Bridge hardware capabilities and AI/ML framework requirements, translating model training needs and inference performance targets into concrete system specifications

Design end-to-end cluster architectures that encompass compute resources, networking fabric, storage subsystems, and power/cooling integration

Select appropriate GPU platforms based on workload characteristics, and design network topologies that minimize communication bottlenecks in distributed training scenarios

Architect storage solutions that can sustain the high-throughput demands of large-scale AI operations

Conduct detailed performance modeling and capacity planning exercises, predicting cluster behavior under various workload scenarios and identifying potential bottlenecks before deployment

Guide decisions on cluster topology, including considerations for rail-optimized designs, spine-leaf architectures, and direct GPU-to-GPU connectivity technologies such as NVLink and InfiniBand configurations

Understand and plan for the infrastructure requirements that support cluster operations, which includes calculating aggregate power requirements based on GPU selection and cluster scale, specifying cooling capacity needed to maintain optimal operating temperatures, determining network bandwidth requirements for different training paradigms, and identifying facility-level dependencies that impact cluster deployment feasibility

Contribute your expertise by conducting architecture reviews, optimizing existing cluster configurations, and prototyping new design approaches

Provide technical guidance on emerging technologies in AI accelerators, networking, and infrastructure, evaluate vendor solutions against architectural requirements, and benchmark alternative designs

Contribute insights that shape both immediate deployment plans and long-term infrastructure strategy and ensure AI computing capabilities remain competitive, efficient, and future-ready

Qualifications

AI/ML cluster design, GPU architecture expertise, High-performance networking, AI/ML frameworks knowledge, Performance optimization, Network design, Capacity planning, Collaboration skills, Problem-solving skills, Communication skills

Required

Proven experience designing and deploying large-scale AI/ML clusters in production environments, including clusters with 100+ GPUs

Direct involvement in hardware selection, network design, and performance optimization for AI workloads

Hands-on expertise with modern GPU architectures from NVIDIA or AMD, plus familiarity with emerging AI accelerator technologies

Comprehensive knowledge of AI/ML frameworks and their infrastructure requirements, including PyTorch and distributed training libraries such as DeepSpeed, Megatron-LM, and Ray

Understanding of how framework-specific optimizations impact cluster design decisions and how architectural choices affect model training efficiency and scalability

Strong background in high-performance networking, including designing low-latency, high-bandwidth network fabrics (e.g., InfiniBand, RoCE, or proprietary interconnects)

Understanding of network topology implications for distributed training patterns, including all-reduce operations, parameter server architectures, and pipeline parallelism

Practical experience integrating cluster design decisions with facility requirements, including power density considerations based on GPU selection, cooling architecture for varying cluster sizes, and space optimization and data center infrastructure alignment

Ability to collaborate effectively with facility engineers to ensure clusters are operationally feasible

Preferred

Bachelor's degree in an engineering or science discipline, with training equivalent to a standard college-level computer engineering curriculum

8+ years of professional experience in systems architecture

Minimum 3 years dedicated to AI/ML infrastructure design and deployment

Track record of designing clusters supporting diverse workloads, ranging from large language model training to high-performance computing and/or computer vision applications

Deep understanding of how workload characteristics influence architectural decisions

Proven ability to balance technical performance with practical constraints such as budget, timeline, and operational feasibility

Company

SK hynix America

Semiconductors are essential to all IT products, and their performance often determines the performance of the final products.

Founded in 1983

San Jose, CA, US

201-500 employees

https://www.skhynix.com/eng/index.jsp

H1B Sponsorship

SK hynix America has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is presented below for your reference. (Data powered by the US Department of Labor)

Trends of Total Sponsorships

2025 (2)

2024 (16)

2023 (3)

2022 (3)

2021 (2)

2020 (2)

Funding

Current Stage

Growth Stage

Leadership Team

Jennifer Lee

Director of Technology / Evangelist : Pathfinding & Partnerships

Company data provided by Crunchbase