Senior Infrastructure Engineer (On-Prem & GPU) @ Intuitive.Cloud | Jobright.ai
Senior Infrastructure Engineer (On-Prem & GPU) jobs in United States
77 applicants

Intuitive.Cloud · 1 day ago

Senior Infrastructure Engineer (On-Prem & GPU)

Information Technology & Services
Growth Opportunities
H1B Sponsor Likely
Hiring Manager
Mitesh Kumar

Responsibilities

Architect and deploy high-performance computing clusters with multi-GPU support for AI/ML workloads.
Implement and optimize GPU resource scheduling, job queuing, and distributed training setups.
Leverage NVIDIA CUDA to optimize performance for AI/ML models and workloads.
Fine-tune GPU configurations for multi-GPU systems, ensuring maximum throughput and minimal latency.
Build Infrastructure as Code (IaC) solutions using Terraform to automate the provisioning and management of on-premises infrastructure.
Create scalable templates for consistent resource deployment.
Deploy and manage container orchestration systems (e.g., Kubernetes, Docker Swarm) to run scalable GPU-accelerated workloads.
Monitor and troubleshoot issues in distributed systems with tools like NVIDIA DCGM, Prometheus, or similar.
Optimize AI/ML pipelines for distributed training across multi-GPU nodes.
Develop strategies to efficiently utilize NVLink, NCCL, and other NVIDIA technologies.
Set up robust monitoring and alerting systems to track GPU utilization, node health, and workload performance.
Collaborate with MLOps teams to integrate GPU clusters into CI/CD pipelines.
Implement security best practices for sensitive AI/ML workloads in an on-premises environment.
Ensure compliance with organizational policies and industry standards.

Qualifications

NVIDIA CUDA, GPU-accelerated systems, Infrastructure as Code (IaC), Terraform, Kubernetes, Multi-GPU systems, NVIDIA tools, Libraries, Distributed training frameworks, Container orchestration, Performance tuning, Python, Bash, NVIDIA DCGM, Slurm, Storage, System architecture

Required

7+ years in infrastructure engineering, with at least 5 years of direct experience in GPU-accelerated systems and NVIDIA CUDA.
Proven experience in deploying and managing multi-GPU systems for AI/ML workloads.
Proficiency with NVIDIA CUDA for GPU programming and performance tuning.
Hands-on experience with NVIDIA technologies and libraries, including NVLink, NCCL, and cuDNN.
Familiarity with MIG (Multi-Instance GPU) configurations and multi-GPU scaling techniques.
Advanced knowledge of Terraform and scripting languages like Python or Bash for automation.
Proficiency with container orchestration tools like Kubernetes or similar.
Expertise in workload management systems (e.g., Slurm) and GPU monitoring tools (e.g., NVIDIA DCGM).
Experience in deploying and optimizing distributed training frameworks (e.g., TensorFlow MultiWorkerMirroredStrategy, PyTorch DDP).
Strong understanding of networking, storage, and system architecture for high-performance compute environments.
Strong problem-solving abilities and critical thinking skills.
Excellent communication skills for cross-functional collaboration.
Leadership capabilities to guide junior engineers and manage projects.

Preferred

Experience with hybrid cloud and on-prem integration strategies.
Familiarity with NVIDIA A100, H100, or similar GPUs.
Knowledge of distributed file systems like NFS, Lustre, or Ceph.
Background in optimizing AI/ML pipelines for inference and training.

Company

Intuitive.Cloud

Intuitive.Cloud is one of the fastest-growing (Inc. 5000, CRN) Cloud & SDx solutions and services providers, supporting enterprise customers on a global scale.

H1B Sponsorship

Intuitive.Cloud has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data powered by the US Department of Labor)
Trends of Total Sponsorships
2023 (4)
2022 (1)
2021 (3)
2020 (4)

Funding

Current Stage
Late Stage

Leadership Team

Jay Modh
Founder and CEO
Company data provided by Crunchbase