Brightvision · 16 hours ago

AI Infrastructure Engineer

United States

Full-time

Remote

Mid Level

Bright Vision Technologies is a forward-thinking software development company dedicated to building innovative solutions that help businesses automate and optimize their operations. They are seeking a skilled AI Infrastructure Engineer to design and manage AI/ML infrastructure, optimize workloads across cloud platforms, and implement orchestration tools.

Advertising · B2B · Marketing

Responsibilities

Design and manage AI/ML Infrastructure optimized for GPU Computing using NVIDIA CUDA, enabling high-throughput training and inference workloads

Develop and automate scalable environments with Python scripting on Linux, leveraging Docker for containerization and Kubernetes for orchestration

Deploy and optimize AI workloads across Cloud Platforms (AWS, Azure, GCP), configuring GPU clusters for cost-effective scaling

Implement AI Workload Orchestration tools to schedule, manage, and monitor distributed training jobs across multi-node setups

Build High-Performance Computing (HPC) systems with Distributed Systems expertise, focusing on low-latency Storage & Networking for AI (e.g., NVMe, InfiniBand)

Provision infrastructure using Infrastructure as Code (Terraform), ensuring reproducible and version-controlled deployments

Establish CI/CD pipelines with Git integration for automated building, testing, and rollout of AI infrastructure components

Set up Monitoring & Observability stacks (e.g., Prometheus, Grafana) to track GPU utilization, cluster health, and performance bottlenecks

Work within Agile methodologies, delivering iterative improvements to AI infrastructure through sprints and cross-functional teamwork

Optimize resource allocation for AI pipelines, reducing costs while maximizing throughput for large-scale model training and serving

Qualifications

NVIDIA CUDA · Python scripting · Cloud Platforms · Infrastructure as Code · Docker · Kubernetes · High-Performance Computing · CI/CD pipelines · Monitoring & Observability · Agile methodologies · Distributed Systems · Resource allocation

Required