Radimal · 1 day ago

🐶🐱 Staff Platform Engineer (Remote)

United States

Full-time

Remote

Senior Level, Lead/Staff

$175K/yr - $225K/yr

7+ years exp

Radimal is a veterinary radiology and AI diagnostics platform delivering 24/7 imaging insights to hospitals nationwide. The Staff Platform Engineer will own the technical foundations that enable Radimal’s engineering teams to move quickly and reliably, focusing on platform architecture, infrastructure, and production systems.

Artificial Intelligence (AI) · Computer Vision · Health Care · Health Diagnostics · Pet · Veterinary

Responsibilities

Own the core platform foundations that support all product and AI development

Build shared infrastructure, libraries, and patterns that make it easier to ship safely

Establish clear interfaces and ownership boundaries so teams can move independently

Improve developer experience through better CI/CD, local tooling, and observability

Raise the overall operational maturity of the engineering organization

Own and evolve Radimal’s AWS and Terraform footprint

Lead deployments across ECS, Fargate, EC2, containerized services, and GPU workloads

Manage and improve workloads running on Render and Modal

Make architectural decisions for scale, reliability, and cost efficiency

Reduce operational burden on product and AI engineers by owning reliability and tooling

Create guardrails that increase safety without slowing development

Enable engineers to self-serve infrastructure and diagnostics where appropriate

Own production uptime, SLOs, and operational health

Design and own on-call coverage and escalation models

Serve as senior escalation during incidents while building systems that minimize the need for escalation

Lead incident response and post-incident reviews with clear accountability

Eliminate ambiguity around who owns production at all times

Operate and extend Grafana and Prometheus monitoring stacks

Improve alerting, diagnostics, and operational visibility

Build high-availability and fault-tolerant architectures

Implement caching, CDN, and performance strategies for global scale

Investigate production issues across infrastructure, backend services, data pipelines, AI inference workflows, and frontend behavior

Trace request flow end to end across GraphQL APIs, Python services, and React applications

Read and debug React code as needed to understand client-side behavior and API usage

Form and test hypotheses during incidents to drive fast, accurate resolution

Know when to dive deep personally and when to pull in specialists

Understand ML Ops fundamentals including model deployment, versioning, and monitoring

Support GPU-backed inference workloads and AI service reliability

Partner with AI engineers to ensure models are observable, debuggable, and production-ready

Identify and mitigate operational risks related to model performance, latency, and failures

Partner with the CEO and VP of Engineering on platform strategy and architectural tradeoffs

Provide clear, grounded assessments of platform risk and readiness

Act as a trusted technical owner during high-impact decisions and incidents

Align platform reliability with product and business goals

Strengthen infrastructure security and access controls

Support enterprise security reviews, penetration testing, and SOC 2 readiness

Improve auditability, monitoring, and operational hygiene

Qualifications

AWS · Terraform · Python · Grafana · Prometheus · Docker · CI/CD · Postgres · Distributed systems · ML Ops · AI inference · GraphQL · Communication · Ownership mindset

Required

7+ years operating production systems at scale

Strong Python experience for automation and backend tooling

Deep AWS experience (ECS, Fargate, EC2, ECR, RDS, CloudFront, IAM)

Strong Terraform, Docker, CI/CD, and infrastructure-as-code expertise

Hands-on experience with Grafana and Prometheus

Experience with Postgres and modern backend architectures

Strong understanding of distributed systems, caching, and performance

Comfort debugging across GraphQL APIs, Python services, and React frontends

Working knowledge of ML Ops concepts and AI inference systems

Clear communicator with a strong ownership mindset

Preferred

Medical imaging or DICOM workflows

GPU compute, AI inference, or ML pipeline integration

Enterprise security reviews or penetration testing

GraphQL or Hasura-based platforms

Company

Radimal

Radimal provides veterinarians with instant, AI-generated radiology reports to assist in treating patients.

Founded in 2020

San Francisco, California, USA

2-10 employees

http://www.radimal.ai

Funding

Current Stage

Early Stage

Leadership Team

Alan Weissman

Founder, CEO

Andrew Weissman VMD, DACVR (he, him, his)

Co-Founder and Head Radiologist

Recent News

PR Newswire

Cornell Chooses Radimal as Preferred Imaging Provider

2025-10-07

Today's Veterinary Business

Unleashed Potential

2023-12-24

Pet Food Processing

14 startups tapped to present at Petcare Innovation USA

2023-12-24

Company data provided by Crunchbase