Axiomatic_AI
Senior Platform Engineer
Axiomatic AI is building a new class of AI systems designed to reason with the rigor of the scientific method. As a Senior Platform Engineer, you will own the reliability, deployment, and operational excellence of our AI platform, focusing on infrastructure, CI/CD, and operations.
Computer Software
Responsibilities
Lead deployment strategies and CI/CD pipelines across multiple environments
Architect and maintain multi-cloud infrastructure (Azure, AWS, GCP) and on-premise deployments
Own infrastructure as code using Terraform to automate provisioning and configuration
Build comprehensive observability systems: monitoring, metrics, logging, and alerting
Implement security controls, compliance frameworks, and data governance policies
Develop automation tools, APIs, and scripts (Python) to improve operational efficiency
Ensure system reliability, performance, and scalability
Drive incident response, postmortems, and continuous improvement
Troubleshoot infrastructure and application issues across multiple environments
Design and implement deployment pipelines for multi-environment releases (dev, staging, production)
Own the full deployment lifecycle: build, test, release, and rollback strategies
Implement blue-green deployments, canary releases, and progressive rollouts
Build automated deployment tooling and workflows
Ensure zero-downtime deployments and rollback capabilities
Optimize build and deployment performance
Manage artifact repositories and container registries
Design and operate multi-cloud infrastructure across Azure, AWS, and GCP
Architect and deploy on-premise solutions for enterprise customers (Linux-based)
Manage Kubernetes clusters, container orchestration, and networking
Implement disaster recovery, backup strategies, and business continuity
Optimize cloud costs and resource utilization
Define and track SLIs, SLOs, and error budgets for critical services
Write and maintain Terraform modules for infrastructure provisioning
Implement GitOps workflows for infrastructure changes
Automate infrastructure scaling, updates, and operations
Ensure reproducible and version-controlled infrastructure
Design comprehensive monitoring, logging, and alerting (Prometheus, Grafana, Datadog, or similar)
Build dashboards for system health, performance, and business metrics
Implement distributed tracing for microservices
Conduct capacity planning and performance analysis
Drive reliability improvements through data-driven insights
Implement security best practices: identity management, secrets management, network policies
Work towards or maintain security certifications (SOC 2, ISO 27001, or similar)
Conduct security audits and vulnerability remediation
Implement data governance policies for AI pipelines and user data
Ensure compliance with data privacy regulations (GDPR, CCPA)
Write automation scripts and tools in Python for operational tasks
Build internal tooling for deployments, monitoring, and incident response
Develop runbooks, automation, and self-healing systems
Create APIs for infrastructure operations when needed
Maintain high code quality and testing standards for tooling
Participate in on-call rotation and lead incident response
Conduct blameless postmortems and drive action items
Build and maintain incident response playbooks
Improve system resilience and the handling of failure modes
Partner with engineering teams on deployment strategies and architecture
Work with security team on compliance and governance
Mentor engineers on operational best practices
Document systems, procedures, and runbooks
Qualifications
Required
7+ years of experience in Platform Engineering, Site Reliability Engineering, DevOps, or Infrastructure Engineering roles
Deployment expert: Deep experience with CI/CD pipelines, release strategies, and production deployments at scale
Multi-cloud expertise: Hands-on experience with Azure and AWS required (GCP is a plus)
On-premise deployment experience: Linux system administration, bare-metal provisioning, networking
Terraform expert: Deep experience writing and maintaining infrastructure as code
Observability systems: Proven track record building monitoring, alerting, and metrics platforms
Security mindset: Experience implementing security controls and best practices. Security certification preferred (CISSP, CEH, AWS/Azure Security Specialty, or similar)
Data governance: Understanding of data privacy, residency requirements, and governance frameworks
Backend/scripting skills: Python (preferred) or Go for automation, tooling, and operational scripts
Kubernetes and container orchestration in production
Strong Linux/Unix administration and scripting (Bash, Python)
CI/CD platforms: GitHub Actions, GitLab CI, Jenkins, or similar
Version control and GitOps practices
Strong problem-solving and debugging skills
Fluent in English (Spanish is a plus)
Preferred
Python proficiency for automation and internal tooling
Experience with cloud AI platforms (Vertex AI, Azure ML, AWS SageMaker)
Service mesh experience (Istio, Linkerd) or API gateways
Experience with GPU workloads and ML infrastructure
FinOps and cloud cost optimization
Compliance frameworks experience (SOC 2, ISO 27001, HIPAA, FedRAMP)
Database operations: PostgreSQL, Redis administration
Experience with FastAPI or similar frameworks for internal tools
Contributions to open-source infrastructure projects
Background in hardware or semiconductor industries
Company
Axiomatic_AI
Axiomatic_AI is preparing to launch with the aim of accelerating R&D through "Automated Interpretable Reasoning" (AIR), a verifiably truthful AI model built for reasoning in science and engineering.
Funding
Current Stage
Early Stage
Company data provided by Crunchbase