AI/HPC Infrastructure Software Engineer @ Hewlett Packard Enterprise | Jobright.ai
AI/HPC Infrastructure Software Engineer jobs in Milpitas, CA

Hewlett Packard Enterprise · 5 hours ago

AI/HPC Infrastructure Software Engineer

Data Center · Enterprise Software
Actively Hiring


Responsibilities

Engage and work with GPU/CPU vendors, customers, AI ISVs, and open-source software communities to validate, tune, and enable high-performance AI applications on the Slingshot Ethernet fabric.
Work on partner engagements for the leading communication libraries, middleware, and frameworks used in AI development today (NCCL, RCCL, UCX, oneCCL, PyTorch, etc.).
Design, implement, and maintain system software that enables communication between GPUs, CPUs, and storage in scale-out AI and HPC systems. Work with all the leading architectures and vendors in the AI and data center markets: NVIDIA, AMD, and Intel.
Work with OEM, ODM, and VAR channel vendors to bring Slingshot to a broader set of customers. Validate and tune the applications driving those engagements.
Develop and own HPE product usage support, upstreaming and community engagements, and internal testing and infrastructure.
Work with cross-disciplinary teams to understand business requirements and align software direction to meet those needs.

Qualifications


Artificial Intelligence, High-Performance Computing, Networking Software Development, Kubernetes, Linux/Unix Administration, Python, Go, Bash Scripting, Cluster Management, Low-latency Infrastructure, HPC Storage Principles, Monitoring Solutions, Container Technologies, Network Automation, Distributed Computing, CI/CD Systems, Network Services

Required

Bachelor's or master's degree in computer science, engineering, or a related field.
5+ years of relevant experience with a background in networking and communications software development and/or architecture in data center, university, government lab, or AI-centric environments.
Experience scripting and programming in languages such as Python, Go, and Bash.
Experience with high-performance computing (HPC) concepts and the applications that drive this field.
Experience with large cluster deployment and management
Familiarity with low-latency/high-bandwidth, interconnected infrastructure (including InfiniBand, Ethernet, RDMA/RoCE, and others).
Knowledge of HPC storage principles (FC, SAS), file systems (NFS, Lustre, ZFS, etc.), compute node storage, and network-attached storage.
Experience in setting up and managing monitoring solutions (Elasticsearch, Logstash, Prometheus, Grafana, Kibana, etc.)
In-depth understanding of container technologies such as Docker, containerd, Singularity, and Podman.
Good understanding of Kubernetes environments (deployments, storage, services, jobs, ingress, egress, etc.).
Experience with Kubernetes extensions (device plugins, CRDs, CNIs, and CSIs).
Good understanding of Kubernetes design patterns (operators, Helm charts, Kustomize, etc.).
Experience deploying and managing HPC workloads (Slurm, PBS, etc.), including system architecture, job scheduling, and queue management.
Experience with Network Automation (Ansible, Terraform, etc.)
Expertise in administering, monitoring, and maintaining secure Linux/Unix operating systems (SLES, RHEL, Ubuntu).
Excellent communication, interpersonal, and customer collaboration skills

Preferred

Familiarity with HPC distributed computing
Familiarity with OpenFabrics Interfaces (libfabric)
Familiarity with MPI (Cray MPI, Open MPI, Intel MPI)
Familiarity with collective communications libraries (NCCL, RCCL, etc.)
Data center planning: rack elevations, cabling plans, cables/transceivers.
Familiarity with various CI/CD systems (Jenkins, GitHub Actions, etc).
Network services such as DDI, Firewalls, and Load Balancers
IP planning

Benefits

Health & Wellbeing
Personal & Professional Development
Diversity, Inclusion & Belonging

Company

Hewlett Packard Enterprise

Hewlett Packard Enterprise is an edge-to-cloud company that uses comprehensive solutions to accelerate business outcomes.

Funding

Current Stage
Public Company
Total Funding
$1.35B
2024-09-10 · Post-IPO Equity · $1.35B
2015-11-02 · IPO

Leadership Team

Antonio Neri
President & CEO
Irv Rothman
President & CEO
Company data provided by Crunchbase