🐶🐱 Staff Platform Engineer (Remote) jobs in United States

Radimal · 1 day ago

🐶🐱 Staff Platform Engineer (Remote)

Radimal is a veterinary radiology and AI diagnostics platform delivering 24/7 imaging insights to hospitals nationwide. The Staff Platform Engineer will own the technical foundations that enable Radimal’s engineering teams to move quickly and reliably, focusing on platform architecture, infrastructure, and production systems.

Artificial Intelligence (AI) · Computer Vision · Health Care · Health Diagnostics · Pet · Veterinary

Responsibilities

Own the core platform foundations that support all product and AI development
Build shared infrastructure, libraries, and patterns that make it easier to ship safely
Establish clear interfaces and ownership boundaries so teams can move independently
Improve developer experience through better CI/CD, local tooling, and observability
Raise the overall operational maturity of the engineering organization
Own and evolve Radimal’s AWS and Terraform footprint
Lead deployments across ECS, Fargate, EC2, containerized services, and GPU workloads
Manage and improve workloads running on Render and Modal
Make architectural decisions for scale, reliability, and cost efficiency
Reduce operational burden on product and AI engineers by owning reliability and tooling
Create guardrails that increase safety without slowing development
Enable engineers to self-serve infrastructure and diagnostics where appropriate
Own production uptime, SLOs, and operational health
Design and own on-call coverage and escalation models
Serve as senior escalation during incidents while building systems that minimize the need for escalation
Lead incident response and post-incident reviews with clear accountability
Eliminate ambiguity around who owns production at all times
Operate and extend Grafana and Prometheus monitoring stacks
Improve alerting, diagnostics, and operational visibility
Build high-availability and fault-tolerant architectures
Implement caching, CDN, and performance strategies for global scale
Investigate production issues across infrastructure, backend services, data pipelines, AI inference workflows, and frontend behavior
Trace request flow end to end across GraphQL APIs, Python services, and React applications
Read and debug React code as needed to understand client-side behavior and API usage
Form and test hypotheses during incidents to drive fast, accurate resolution
Know when to dive deep personally and when to pull in specialists
Understand ML Ops fundamentals including model deployment, versioning, and monitoring
Support GPU-backed inference workloads and AI service reliability
Partner with AI engineers to ensure models are observable, debuggable, and production-ready
Identify and mitigate operational risks related to model performance, latency, and failures
Partner with the CEO and VP of Engineering on platform strategy and architectural tradeoffs
Provide clear, grounded assessments of platform risk and readiness
Act as a trusted technical owner during high-impact decisions and incidents
Align platform reliability with product and business goals
Strengthen infrastructure security and access controls
Support enterprise security reviews, penetration testing, and SOC 2 readiness
Improve auditability, monitoring, and operational hygiene

Qualifications

AWS · Terraform · Python · Grafana · Prometheus · Docker · CI/CD · Postgres · Distributed systems · ML Ops · AI inference · GraphQL · Communication · Ownership mindset

Required

7+ years operating production systems at scale
Strong Python experience for automation and backend tooling
Deep AWS experience (ECS, Fargate, EC2, ECR, RDS, CloudFront, IAM)
Strong Terraform, Docker, CI/CD, and infrastructure-as-code expertise
Hands-on experience with Grafana and Prometheus
Experience with Postgres and modern backend architectures
Strong understanding of distributed systems, caching, and performance
Comfort debugging across GraphQL APIs, Python services, and React frontends
Working knowledge of ML Ops concepts and AI inference systems
Clear communicator with a strong ownership mindset

Preferred

Medical imaging or DICOM workflows
GPU compute, AI inference, or ML pipeline integration
Enterprise security reviews or penetration testing
GraphQL or Hasura-based platforms

Company

Radimal

Radimal provides veterinarians with instant, AI-generated radiology reports to assist in treating patients.

Funding

Current Stage
Early Stage

Leadership Team

Alan Weissman
Founder, CEO
Andrew Weissman VMD, DACVR (he, him, his)
Co-Founder and Head Radiologist

Recent News

Today's Veterinary Business
Company data provided by Crunchbase