Hydra Host · 2 weeks ago

AI Solutions Engineer at Hydra Host

Miami, United States

Full-time

Hybrid

Mid, Senior Level

$150K/yr - $225K/yr

Hydra Host is a Founders Fund-backed NVIDIA cloud partner building infrastructure for AI at scale. The AI Solutions Engineer will ensure an exceptional technical experience for AI Platform and Enterprise customers, working on proof-of-concept AI platforms and collaborating with various teams to optimize performance and enhance customer enablement.

Artificial Intelligence (AI) · Cloud Infrastructure · Developer APIs · Web Hosting

Responsibilities

Prototype and operate proof-of-concept AI platforms and neo-clouds on top of Hydra using the Brokkr API to validate the developer experience

Build and maintain an open-source “neo-cloud in a box” reference implementation that demonstrates multi-tenancy, spins servers up and down based on demand, and exposes containerized or virtualized GPU access

Dogfood Hydra’s APIs, infrastructure, and tooling continuously to find gaps, sharp edges, and failure modes before customers do, and work with product and engineering to resolve them

Work closely with the API and monetization teams by incorporating direct customer feedback into feature prioritization, pricing models, and API design

Run and validate the latest AI platforms, inference stacks, and orchestration frameworks on Hydra to ensure first-class support

Collaborate closely with product and engineering to turn learnings into productized workflows, defaults, and automations

Create targeted provisioning templates (e.g., self-managed Kubernetes, specialized inference engines, custom OS images) by researching common software stacks, licenses, and dependencies used by AI platforms

Provide developers with high-quality technical enablement: code samples, SDK contributions, reference implementations, and clear documentation

Act as a technical voice for Hydra’s developer ecosystem: host webinars, write technical content, run demos, participate in events, and support hackathons showcasing what’s possible on Hydra

Document best practices and standardize configurations to scale customer success globally

Qualifications

NVIDIA GPU Stack · Bare Metal Linux · AI Workloads · Workload Orchestration · Scripting · Networking · Monitoring · Container Runtimes · Cloud Provisioning · Observability · HPC Clusters · TEE · Storage Systems · BMC Provisioning · Customer Obsession · Principled Thinking · Technical Curiosity · Systems Thinking

Required

NVIDIA GPU Stack - Deep knowledge of the NVIDIA hardware and software stack (drivers, firmware, NVLink, NCCL, CUDA, libraries) and how stack compatibility impacts performance

Bare Metal Linux - Strong experience in bare-metal Linux systems administration, driver stacks, and choosing appropriate kernel options

AI Workloads - Proficiency running a variety of AI workloads: Hugging Face, PyTorch, vLLM and other model deployment frameworks, and large-scale inference/training

AI Benchmarking - Hands-on experience benchmarking AI workloads such as Megatron

Workload Orchestration - Experience running Kubernetes clusters (CAPI), Slurm, and Ansible for cluster automation and workload management

Scripting - Solid scripting skills (e.g., shell scripts, Perl, Ruby, Python)

Networking - OSI Layer 2/Layer 3 fundamentals (TCP/IP, DNS), VLANs, bonding

East / West - Familiarity with RoCE or InfiniBand

Observability and Monitoring - nvidia-smi profiling, Prometheus/Grafana or ELK stack

Container Runtimes - Container runtimes such as Docker, Podman, and Singularity

Cloud Provisioning - Terraform, cloud-init, etc.