AI Infrastructure Engineer jobs in United States

Brightvision · 16 hours ago

AI Infrastructure Engineer

Bright Vision Technologies is a forward-thinking software development company dedicated to building innovative solutions that help businesses automate and optimize their operations. They are seeking a skilled AI Infrastructure Engineer to design and manage AI/ML infrastructure, optimize workloads across cloud platforms, and implement orchestration tools.

Advertising · B2B · Marketing

Responsibilities

Design and manage AI/ML Infrastructure optimized for GPU Computing using NVIDIA CUDA, enabling high-throughput training and inference workloads
Develop and automate scalable environments with Python scripting on Linux, leveraging Docker for containerization and Kubernetes for orchestration
Deploy and optimize AI workloads across Cloud Platforms (AWS, Azure, GCP), configuring GPU clusters for cost-effective scaling
Implement AI Workload Orchestration tools to schedule, manage, and monitor distributed training jobs across multi-node setups
Build High-Performance Computing (HPC) systems, applying Distributed Systems expertise and focusing on low-latency Storage & Networking for AI (e.g., NVMe, InfiniBand)
Provision infrastructure using Infrastructure as Code (Terraform), ensuring reproducible and version-controlled deployments
Establish CI/CD pipelines with Git integration for automated building, testing, and rollout of AI infrastructure components
Set up Monitoring & Observability stacks (e.g., Prometheus, Grafana) to track GPU utilization, cluster health, and performance bottlenecks
Collaborate within Agile teams, delivering iterative improvements to AI infrastructure through sprints and cross-functional teamwork
Optimize resource allocation for AI pipelines, reducing costs while maximizing throughput for large-scale model training and serving

Qualifications

NVIDIA CUDA · Python scripting · Cloud Platforms · Infrastructure as Code · Docker · Kubernetes · High-Performance Computing · CI/CD pipelines · Monitoring & Observability · Agile methodologies · Distributed Systems · Resource allocation

Company

Brightvision

Brightvision is a lead generation agency for B2B tech companies.

Funding

Current Stage: Growth Stage
Company data provided by Crunchbase