Harrison Clarke · 5 hours ago
Artificial Intelligence Engineer
Harrison Clarke is seeking an AI Infrastructure / Inference Engineer to build and scale systems that support real-time and batch AI workloads in production. The role involves deploying, optimizing, and operating inference services to ensure models are fast, reliable, and cost-efficient.
Consulting · DevOps · Human Resources · Information Technology · Staffing Agency
Responsibilities
Design, build, and run production inference platforms (real-time, streaming, and batch)
Optimize end-to-end inference performance: latency, throughput, memory, and cost
Build and maintain model serving and deployment pipelines (CI/CD for models, canarying, rollbacks)
Improve system reliability with monitoring, alerting, autoscaling, and incident response
Partner with ML researchers/engineers to translate models into robust production services
Implement tooling for experimentation, A/B testing, and safe rollout of model changes
Qualifications
Required
Strong software engineering skills in Python and/or Go/C++
Experience running production services on Kubernetes and cloud platforms (AWS/GCP/Azure)
Solid understanding of distributed systems, performance profiling, and reliability engineering
Hands-on experience with at least one inference/serving stack (e.g., Triton, TorchServe, Ray Serve, KServe, BentoML, or similar)
Familiarity with GPU systems and acceleration (e.g., CUDA basics, TensorRT, mixed precision, batching)
Ability to design clear APIs and service boundaries; comfort working with product and platform teams
Preferred
Experience with LLM inference (token streaming, KV cache, batching, quantization, speculative decoding)
Knowledge of ONNX, TensorRT-LLM, vLLM, DeepSpeed, Accelerate, or similar tooling
Observability expertise (e.g., Prometheus/Grafana, OpenTelemetry, distributed tracing)
Experience with capacity planning, GPU scheduling, and cost optimization
Background in MLOps / ML Platform engineering and model lifecycle tooling
Benefits
Competitive compensation
Benefits package
Flexible working options