GCP Supercomputer Solutions Support jobs in United States
This job has closed.

Echo IT Solutions · 5 months ago

GCP Supercomputer Solutions Support

Echo IT Solutions is a supplier providing engineering, maintenance, and enhancement services for Google Cloud Platform's Supercomputer Solutions. The role involves ongoing operational tasks, testing, documentation, and specific development deliverables for the Cluster Toolkit and HyperCompute Cluster Service.

Analytics · Artificial Intelligence (AI) · Cloud Computing · Consulting · Information Technology · Machine Learning · Web Development
H1B Sponsor Likely

Responsibilities

The contractor must provide ongoing maintenance and enhancements for all 6 projects covered under the original Statement of Work
Stability Testing: Test the stability of new products, beginning with A3U. This includes building NVIDIA Collective Communications Library (NCCL) tests on a Slurm cluster, then setting up and running pairwise tests to identify and report bad nodes
Integration Test Triage: Perform rotational duties to manage and triage integration test failures. This includes monitoring daily failure chats and flake tools, reporting on failures, and performing advanced handling such as creating new bug reports and categorizations
Documentation: Improve, organize, and maintain the Cluster Toolkit documentation. This involves gathering existing documents and identifying information gaps; creating new documentation and updating existing materials; and organizing the information in g3docs, consolidating it in a team Google Drive, and establishing a review process
Project Cleanup: Once a week, clean up the 'hpc-toolkit-dev' project by identifying and deleting unused resources
Security: Triage and address security alerts by checking for them, creating pull requests (PRs) to resolve them, and applying the necessary updates
HPC VM Image Releases: Deliver 4-6 High-Performance Computing Virtual Machine (HPC VM) image releases during 2025
Software Widget Releases: Release new software widgets every two weeks during 2025, including managing any necessary hotfixes
API Integration Testing: Add comprehensive integration tests for all HCS Application Programming Interface (API) surfaces. Coverage must include: HypercomputeClusters (Create, Delete, Update, Get, and List requests and responses); Network (NetworkInitialize parameters); Storage (StorageInitialize, FileStoreInitialize, Filestore tier, ParallelstoreInitialize, and GcsInitialize parameters); and Compute (Resource request, Guest accelerator, Disk, Provisioning model, Reservation affinity and type, Orchestrator, Slurm, Node test, Storage configuration, and Slurm partition)
Critical User Journey (CUJ) Validation: Add integration tests to validate the following critical user journeys: creating a cluster that consumes a reservation; creating a cluster with a new network and new storage; creating a cluster using a pre-existing network and storage created both outside of HCS and by a previous HCS deployment; destroying all components of an HCS-created cluster; destroying a cluster while leaving the network and storage intact; and updating a Slurm cluster to add a new reservation to both new and existing partitions

Qualifications

GCP · High-Performance Computing · API Integration Testing · NVIDIA NCCL · Slurm · Security Triage · Documentation · Team Collaboration

Required

Experience with Google Cloud Platform (GCP)
Proficiency in high-performance computing (HPC), artificial intelligence (AI), and machine learning (ML) workloads
Experience with stability testing and integration test triage
Familiarity with NVIDIA Collective Communications Library (NCCL)
Ability to improve, organize, and maintain technical documentation
Experience with project cleanup and resource management
Knowledge of security alert triage and resolution
Experience with API integration testing
Ability to validate critical user journeys in a cloud environment
Familiarity with Slurm cluster management

Company

Echo IT Solutions

Echo IT Solutions provides IT consulting, managed services, cloud, cybersecurity, data, and custom software development.

H1B Sponsorship

Echo IT Solutions has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for your reference (data from the US Department of Labor).
[Chart: Distribution of Different Job Fields Receiving Sponsorship]
Trends of Total Sponsorships
2025 (17)
2024 (16)
2023 (7)
2022 (16)
2021 (3)

Funding

Current Stage
Growth Stage
Company data provided by Crunchbase