MultiOn · 3 days ago
Senior Infrastructure Software Engineer
Software Development
Responsibilities
Proactively identify opportunities to introduce innovative technology and automation solutions that enhance our infrastructure's efficiency, effectiveness, and scalability
Oversee the provisioning, monitoring, and maintenance of hardware, software, and networks in new data centers
Conduct architecture and research work for distributed AI workloads
Collaborate with vendors to acquire, debug, and maintain next-generation hardware and software optimized for our workloads
Partner with stakeholders to make strategic hardware decisions
Provide technical leadership and guidance during deployment activities
Develop and maintain comprehensive documentation, including plans, SOPs, MOPs, etc.
Qualifications
Required
Minimum of 5 years of experience in DevOps and production-grade software infrastructure
Advanced software development skills in C++, Go, Rust, or similar system languages
Intermediate-level proficiency in Python
Extensive experience in maintaining production Linux systems, including the setup, management, and maintenance of networking, monitoring, and storage
Experience in Linux systems administration, preferably with contributions to open source projects
Strong expertise in network services, including REST APIs and HTTP
Significant experience in developing tooling and automation solutions
Knowledge of network fundamentals: subnetting, custom routing, firewalls, IPv6
Experience with continuous/rapid release engineering
Proficiency with infrastructure-as-code systems such as Terraform and Pulumi
Solid understanding of low-level operating system concepts, including multi-threading, memory management, networking, storage, performance, and scale
Experience in managing a production-grade issue response process using tools like PagerDuty, ensuring adherence to uptime SLAs
Familiarity with Kubernetes, containerization, VPNs, and GPU workloads
Experience with machine learning frameworks such as PyTorch or TensorFlow
Preferred
GPU programming and CUDA knowledge are advantageous