Hewlett Packard Enterprise · 5 hours ago
AI/HPC Infrastructure Software Engineer
Data Center · Enterprise Software
Actively Hiring
Responsibilities
Engage and work with GPU/CPU vendors, customers, AI ISVs, and open-source software communities to validate, tune, and enable high-performance AI applications on the Slingshot Ethernet fabric.
Work on partner engagements for the leading communication libraries, middleware, and frameworks used in AI development today (NCCL, RCCL, UCX, oneCCL, PyTorch, etc.).
Design, implement, and maintain system software that enables communication between GPUs, CPUs, and storage in scale-out AI and HPC systems. Work with all the leading architectures and vendors in the AI and Data Center markets: Nvidia, AMD, and Intel.
Work with OEM, ODM, and VAR channel vendors to bring Slingshot to a broader set of customers. Validate and tune the applications driving those engagements.
Develop and own HPE product usage support, upstreaming and community engagements, and internal testing and infrastructure.
Work with cross-disciplinary teams to understand business requirements and align software direction to meet those needs.
Qualifications
Required
Bachelor's or master's degree in computer science, engineering, or a related field
5+ years of relevant experience with a background in networking and communications software development and/or architecture in the Data Center, university, government lab, or AI-centric environments.
Experience in scripting and programming using languages such as Python, Go, and Bash.
Experience with HPC (High-Performance Computing) concepts and the applications that drive this field.
Experience with large cluster deployment and management
Familiarity with low-latency/high-bandwidth, interconnected infrastructure (including InfiniBand, Ethernet, RDMA/RoCE, and others).
Knowledge of HPC storage (FC, SAS) principles, file systems (NFS, Lustre, ZFS, etc.), compute node storage, and network-attached storage.
Experience in setting up and managing monitoring solutions (Elasticsearch, Logstash, Prometheus, Grafana, Kibana, etc.)
In-depth understanding of container technologies like Docker, containerd, Singularity, and Podman
Good understanding of Kubernetes environments (deployments, storage, services, jobs, ingress, egress, etc.)
Experience with Kubernetes extensions (device plugins, CRDs, CNIs, and CSIs)
Good understanding of Kubernetes design patterns (operators, Helm charts, Kustomize, etc.)
Experience deploying and managing HPC workloads (Slurm, PBS, etc.), including system architecture, job scheduling, and queue management.
Experience with Network Automation (Ansible, Terraform, etc.)
Expertise in administering, monitoring, and maintaining secure Linux/Unix operating systems (SLES, RHEL, Ubuntu).
Excellent communication, interpersonal, and customer collaboration skills
Preferred
Familiarity with HPC distributed computing
Familiarity with OpenFabrics Interfaces (libfabric)
Familiarity with MPI (Cray MPI, Open MPI, Intel MPI)
Familiarity with Collective Communications Libraries (NCCL, RCCL, etc.)
Data Center planning - rack elevations, cabling plan, cables/transceivers.
Familiarity with various CI/CD systems (Jenkins, GitHub Actions, etc).
Network services such as DDI, Firewalls, and Load Balancers
IP planning
Benefits
Health & Wellbeing
Personal & Professional Development
Diversity, Inclusion & Belonging
Company
Hewlett Packard Enterprise
Hewlett Packard Enterprise is an edge-to-cloud company that uses comprehensive solutions to accelerate business outcomes.
Funding
Current Stage: Public Company
Total Funding: $1.35B
2024-09-10 · Post-IPO Equity · $1.35B
2015-11-02 · IPO
Company data provided by crunchbase