GEICO · 3 months ago

Sr. Staff Software Engineer - AI/ML Infra

Chevy Chase, MD

Full-time

Hybrid

Senior Level, Lead/Staff

8+ years exp

GEICO is a leading insurance company that values innovation and quality coverage for its customers. They are seeking a Senior ML Platform Engineer to build and scale their machine learning infrastructure, focusing on Large Language Models and AI applications, while also providing technical leadership and mentoring to junior engineers.

Auto Insurance · Financial Services · Government · Insurance · Internet · Mobile

H1B Sponsor Likely

Responsibilities

Design and implement scalable infrastructure for training, fine-tuning, and serving open source LLMs (Llama, Mistral, Gemma, etc.)

Architect and manage Kubernetes clusters for ML workloads, including GPU scheduling, autoscaling, and resource optimization

Design, implement, and maintain feature stores for ML model training and inference pipelines

Build and optimize LLM inference systems using frameworks like vLLM, TensorRT-LLM, and custom serving solutions

Ensure 99.9%+ uptime for ML platforms through robust monitoring, alerting, and incident response procedures

Design and implement ML platforms using DataRobot, Azure Machine Learning, Azure Kubernetes Service (AKS), and Azure Container Instances

Develop and maintain infrastructure using Terraform, ARM templates, and Azure DevOps

Implement cost-effective solutions for GPU compute, storage, and networking across Azure regions

Ensure ML platforms meet enterprise security standards and regulatory compliance requirements

Evaluate and potentially implement hybrid cloud solutions with AWS/GCP for backup or specialized use cases

Design and maintain robust CI/CD pipelines for ML model deployment using Azure DevOps, GitHub Actions, and MLOps tools

Implement automated model training, validation, deployment, and monitoring workflows

Set up comprehensive observability using Prometheus, Grafana, Azure Monitor, and custom dashboards

Continuously optimize platform performance, reducing latency and improving throughput for ML workloads

Design and implement backup, recovery, and business continuity plans for ML platforms

Mentor junior engineers and data scientists on platform best practices, infrastructure design, and ML operations

Lead comprehensive code reviews focusing on scalability, reliability, security, and maintainability

Design and deliver technical onboarding programs for new team members joining the ML platform team

Establish and champion engineering standards for ML infrastructure, deployment practices, and operational procedures

Create technical documentation, runbooks, and deliver internal training sessions on platform capabilities

Work closely with data scientists to understand requirements and optimize workflows for model development and deployment

Collaborate with product engineering teams to integrate ML capabilities into customer-facing applications

Support research teams with infrastructure for experimenting with cutting-edge LLM techniques and architectures

Present technical solutions and platform roadmaps to leadership and cross-functional stakeholders

Qualifications

Machine Learning Infrastructure · Large Language Models · Kubernetes · Azure Services · Python · CI/CD Pipelines · Terraform · Docker · Prometheus · DataRobot · Analytical Skills · Leadership · Mentoring · Communication Skills · Collaboration

Required

Bachelor's degree in Computer Science, Engineering, or related technical field (or equivalent experience)

8+ years of software engineering experience with focus on infrastructure, platform engineering, or MLOps

3+ years of hands-on experience with machine learning infrastructure and deployment at scale

2+ years of experience working with Large Language Models and transformer architectures

Proficient in Python; strong skills in Go, Rust, or Java preferred

Proven experience working with open source LLMs (Llama 2/3, Qwen, Mistral, Gemma, Code Llama, etc.)

Proficient in Kubernetes, including custom operators, Helm charts, and GPU scheduling

Deep expertise in Azure services (AKS, Azure ML, Container Registry, Storage, Networking)

Experience implementing and operating feature stores (Chronon, Feast, Tecton, Azure ML Feature Store, or custom solutions)

Hands-on experience with inference optimization using vLLM, TensorRT-LLM, Triton Inference Server, or similar

Advanced experience with Azure DevOps, GitHub Actions, Jenkins, or similar CI/CD platforms

Proficiency with Terraform, ARM templates, Pulumi, or CloudFormation

Deep understanding of Docker, container optimization, and multi-stage builds

Experience with Prometheus, Grafana, ELK stack, Azure Monitor, and distributed tracing

Knowledge of both SQL and NoSQL databases, data warehousing, and vector databases

Demonstrated track record of mentoring engineers and leading technical initiatives

Experience leading design reviews with focus on compliance, performance, and reliability

Excellent ability to explain complex technical concepts to diverse audiences

Strong analytical and troubleshooting skills for complex distributed systems

Experience managing cross-functional technical projects and coordinating with multiple stakeholders

Preferred

Master's degree in Computer Science, Machine Learning, or related field

8+ years of platform engineering or infrastructure experience

Experience with Staff Engineer or Tech Lead roles in ML/AI organizations

Background in distributed systems and high-performance computing

Open-source contributions to ML infrastructure projects or LLM frameworks

Multi-Cloud Experience: Hands-on experience with Azure, AWS (SageMaker, EKS) and/or GCP (Vertex AI, GKE)

Experience with specialized hardware (A100s, H100s, TPUs, TEEs) and optimization

RLHF & Fine-tuning: Experience with Reinforcement Learning from Human Feedback and LLM fine-tuning workflows

Experience with Milvus, Pinecone, Weaviate, Qdrant, or similar vector storage solutions

Deep experience with MLflow, Kubeflow, DataRobot, or similar platforms

Understanding of AI safety principles, model governance, and regulatory compliance

Background in regulated industries with understanding of data privacy requirements

Experience supporting ML research teams and academic partnerships

Deep understanding of GPU optimization, memory management, and high-throughput systems

Benefits

Comprehensive Total Rewards program that offers personalized coverage tailor-made for you and your family’s overall well-being.

Financial benefits including market-competitive compensation; a 401(k) savings plan vested from day one that offers a 6% match; performance and recognition-based incentives; and tuition assistance.

Access to additional benefits like mental healthcare as well as fertility and adoption assistance.

Workplace flexibility, including our GEICO Flex program, which offers the ability to work from anywhere in the US for up to four weeks per year.

Company

GEICO

Glassdoor rating: 2.7

GEICO (Government Employees Insurance Company) has been providing affordable auto insurance since 1936. It is a subsidiary of Berkshire Hathaway.

Founded in 1936

Chevy Chase, Maryland, USA

10,001+ employees

http://www.geico.com

H1B Sponsorship

GEICO has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for reference. (Data powered by the US Department of Labor)

Distribution of Different Job Fields Receiving Sponsorship (chart not reproduced)

Trends of Total Sponsorships

2025 (128)

2024 (277)

2023 (338)

2022 (212)

2021 (148)

2020 (205)

Funding

Current Stage

Late Stage

Total Funding

unknown

1996-01-01: Acquired

Leadership Team

Todd Combs

Chairman, President, and Chief Executive Officer

Clayton Johnson

Sr. Director of Product Management

Recent News

Business Wire

NerdWallet Announces its 2026 Best-Of Awards Winners

2026-01-07

Investing.com

Warren Buffett’s successor Greg Abel had steady rise at Berkshire

2025-12-15

The Motley Fool

Todd Combs, Key Investment Manager, Just Left Berkshire Hathaway for JPMorgan Chase. Does the Shakeup Bode Well For the Stock?

2025-12-15

Company data provided by Crunchbase