AI Cybersecurity Company
Senior AI Infrastructure Engineer (LLMOps / MLOps)
AI Cybersecurity Company is an AI startup tackling challenges in the cybersecurity space. As a Senior AI Infrastructure Engineer, you will design, deploy, and scale AI infrastructure and production pipelines, bridging AI research and engineering to ensure reliable model performance in real-world applications.
Computer Software
Responsibilities
Own and manage the AI infrastructure stack — GPU clusters, vector databases, and model serving frameworks (vLLM, Triton, Ray, or similar)
Productionize LLMs and ML models developed by the AI team, deploying them into secure, monitored, and scalable environments
Design and maintain REST/gRPC APIs for inference and automation, integrating tightly with the core cybersecurity platform
Collaborate closely with AI scientists, backend engineers, and DevOps to streamline deployment workflows and ensure production reliability
Build and maintain infrastructure-as-code (IaC) setups using Terraform or Pulumi for reproducible environments
Implement observability and monitoring — latency, throughput, model drift, and uptime dashboards with Prometheus / Grafana / OpenTelemetry
Automate CI/CD pipelines for model training, validation, and deployment using GitHub Actions, ArgoCD, or similar tools
Architect scalable, hybrid AI systems across on-prem and cloud, enabling cost-effective compute scaling and fault tolerance
Enforce data privacy and compliance across AI pipelines (SOC2, encryption, access control, VPC isolation)
Manage data and model artifacts, including versioning, lineage tracking, and storage for models, checkpoints, and embeddings
Optimize inference latency, GPU utilization, and throughput, using batching, caching, or quantization techniques
Build fallback and failover mechanisms to maintain service reliability in case of model or API failure
Research and integrate emerging LLMOps and MLOps tools (e.g., LangGraph, Vertex AI, Ollama, Triton, Hugging Face TGI)
Create sandbox environments for AI researchers to experiment safely
Lead cost optimization and capacity planning, forecasting GPU and cloud needs
Document and maintain runbooks, architecture diagrams, and standard operating procedures
Mentor junior engineers and contribute to a culture of operational excellence and continuous improvement
Qualifications
Required
5+ years of experience in ML Infrastructure, MLOps, or AI Platform Engineering
Proven expertise with LLM serving, distributed systems, and GPU orchestration (e.g., Kubernetes, Ray, or vLLM)
Strong programming skills in Python and experience building APIs (FastAPI, Flask, gRPC)
Proficiency with cloud platforms (Azure, AWS, or GCP) and IaC tools (Terraform, Pulumi)
Solid understanding of CI/CD, Docker, containerization, and model registry practices
Experience implementing observability, monitoring, and fault-tolerant deployments
Preferred
Familiarity with vector databases (FAISS, Pinecone, Weaviate, Qdrant)
Exposure to security or compliance-focused environments
Experience with PyTorch / TensorFlow and MLflow / Weights & Biases
Knowledge of distributed training or large-scale inference optimization (e.g., DeepSpeed, TensorRT, quantization)
Prior work at startups or in fast-paced R&D-to-production environments
Benefits
Comprehensive health, dental, and vision insurance.
Wellness and professional development stipends.
Equity options — share in the company’s success.
Access to the latest tools and GPUs for AI/ML development.
Company
AI Cybersecurity Company
Funding
Current Stage
Early Stage
Company data provided by Crunchbase