Base-2 Solutions · 3 months ago
Graphics Processing Unit (GPU) Engineer
Base-2 Solutions is seeking a highly skilled Systems Engineer with deep expertise in operating systems, hardware, GPUs, and high-speed networking. In this role, you will design, develop, and optimize GPU clusters that power enterprise AI for our mission customers.
Big Data, Cloud Computing, Software Engineering, Technical Support
Responsibilities
GPU Cluster Engineering: Design, configure, and maintain GPU clusters. Collaborate with a multidisciplinary team to define and optimize cluster architectures so that they meet performance, power efficiency, and feature requirements
Operating System Integration: Work closely with AI/ML engineers to ensure smooth GPU integration with Linux-based systems. Optimize GPU drivers for compatibility, reliability, and performance. Provide regular maintenance and updates
Performance Optimization: Analyze GPU performance, identify bottlenecks, and develop strategies to improve efficiency across hardware and software layers
Tooling and Automation: Build and maintain debugging tools, profiling utilities, and performance analysis software for Linux environments. Leverage scripting and configuration tools such as Bash, Python, Ansible, Puppet, and Salt
Compliance & Documentation: Maintain technical documentation, architectural specifications, and Linux best practices. Support ATO (Authority to Operate) and ensure compliance with federal security standards
Qualifications
Required
Bachelor's or higher degree in Computer Science, Electrical Engineering, or a related field
10+ years of relevant systems engineering experience
Experience managing NVIDIA GPU data center platforms (DGX, HGX, H200, H100, L4s)
Knowledge of enterprise server components (storage/network controllers, HBA, SSDs)
Strong expertise with Linux distributions (RHEL, Ubuntu, Oracle, and Rocky)
Excellent problem-solving skills and the ability to collaborate within a team
Candidate must, at a minimum, meet DoD 8570.11 IAT Level II certification requirements (currently Security+ CE, CCNA-Security, GICSP, GSEC, or SSCP, along with an appropriate computing environment (CE) certification). An IAT Level III certification (CASP+, CCNP Security, CISA, CISSP, GCED, GCIH, or CCSP) is also acceptable
TS/SCI clearance with polygraph required, or TS/SCI clearance and willingness to obtain a polygraph before starting
Preferred
Experience with Kubernetes cluster management and AI/ML workflow orchestration (Argo, Airflow, and Kubeflow)
Familiarity with GPU virtualization and cloud computing
Experience with Prometheus/Grafana for monitoring
Knowledge of distributed resource scheduling systems such as Slurm (preferred) or LSF
Benefits
100% paid premiums for health insurance. Choose from over 80 gold-level medical plans from Aetna, CareFirst, Kaiser, and UnitedHealthcare. Plan types include PPO, EPO, POS, HMO, and HSA-compatible options.
HSA and FSA options.
100% paid premiums for dental insurance.
100% paid premiums for vision insurance.
100% paid premiums for short-term disability.
100% paid premiums for long-term disability.
100% paid premiums for accidental death & dismemberment.
100% paid premiums for life insurance with a $200,000 max benefit.
8% company contribution to 401k with immediate vesting.
401k pre-tax and Roth options.
Up to 20 days of flexible paid time off (PTO).
11 days of paid floating holidays.
Flexible work schedules including flex time and compressed work period.
Remote work including partial or fully remote (contract and project-dependent).