HPC Performance and Validation Engineer jobs in United States

NorthMark Strategies · 3 months ago

HPC Performance and Validation Engineer

NorthMark Strategies is a leader in high-performance computing and cloud infrastructure, dedicated to enhancing technology for scientific research and innovation. The HPC Performance and Validation Engineer will be responsible for developing and optimizing performance baselining frameworks for HPC workloads, ensuring system readiness and driving performance metrics for architectural decisions.

Advice, Financial Services, Venture Capital

Responsibilities

Architecting and implementing a validation framework to certify the readiness and utilization of GPU nodes across a large, distributed HPC environment
Defining methodologies to continually assess performance and optimise infrastructure across AI/ML workloads
Developing and executing comprehensive performance testing using industry and customer-specific benchmarks, ensuring optimal performance across HPC compute, storage and networking
Contributing to research reports that describe the benchmarking findings and evaluate overall hardware performance and efficiency
Leading efforts to identify, debug and resolve bottlenecks in system performance
Building robust, scalable tools for automated validation and testing, utilising Python, Go, Kubernetes and CI/CD pipelines to streamline continuous validation and benchmarking processes
Implementing monitoring solutions using Prometheus, Grafana and other modern monitoring technologies to track performance metrics and real-time health of the cluster
Defining and implementing best practices for continuous performance validation, ensuring that the infrastructure remains reliable and efficient as new technologies emerge
Staying informed on industry trends and advancements to ensure long-term strategic alignment
Working cross-functionally with engineering, infrastructure and research teams to align validation efforts with the broader business objectives, ensuring that the platform meets evolving research demands

Qualifications

HPC performance engineering
GPU optimization
Automation tools development
Monitoring solutions implementation
Networking performance optimization
Storage performance optimization
System benchmarking
Data-driven performance metrics
Emerging technologies assessment
Technical project leadership
Cross-functional collaboration

Required

Accelerator performance experience, including profiling and tuning with large-scale GPU clusters
In-depth understanding of NVIDIA ClusterKit, Nsight and Validation Suite, MLPerf and DCGM tools for GPUs and DPUs
Networking and storage performance experience, including profiling and optimisation with NVIDIA ClusterKit, iPerf or equivalent across InfiniBand/RoCE network implementations
System benchmarking experience across Linux and familiarity with the Phoronix Test Suite or equivalent
Experience with HPC workloads across distributed global locations, bringing data-driven performance insights to complement key architectural decisions
Strong proficiency in developing automation tools and microbenchmarking frameworks for validation using Python, Go and Kubernetes in an Ubuntu Linux environment
Expertise with key monitoring platforms including OTEL, Prometheus, ELK and Grafana, and in defining and implementing the overall observability strategy for HPC validation and performance monitoring
A deep understanding of emerging technologies, architectures and strategies, with the ability to assess their potential impact on infrastructure and adopt them as part of a long-term plan
Proven ability to lead complex technical projects, influence decisions and engage with stakeholders across technical and research teams

Company

NorthMark Strategies

NorthMark Strategies is a multi-strategy investment firm managing diverse portfolios and offering advisory services across sectors.

Funding

Current Stage
Growth Stage
Company data provided by crunchbase