Weights & Biases · 8 hours ago

AI Engineer - Gen AI/SWE - Weights & Biases

Sunnyvale, CA

Full-time

Hybrid

Senior Level

$188K/yr - $275K/yr

6+ years exp

Weights & Biases is part of CoreWeave, the AI Hyperscaler™, which aims to empower developers with tools and infrastructure for AI. The AI Engineer role involves designing, implementing, and evaluating LLM applications and agents, focusing on application rather than novel research, and ensuring responsible deployment and reproducibility.

Artificial Intelligence (AI) · Data Visualization · Developer Tools · Generative AI · Machine Learning

Comp. & Benefits

No H1B

U.S. Citizen Only

Responsibilities

Ship end-to-end GenAI workflows (prompting → RAG → tools/agents → eval → serve) with reproducible repos, W&B Reports, and dashboards others can run

Build agentic systems (tool use, function calling, multi-step planners) with MCP servers/clients and secure tool/resource integrations

Design evaluation harnesses (RAG/agent evals, golden sets, regression tests, telemetry) and drive continuous improvement via offline + online metrics

Build in public: Publish engineering artifacts (code, docs, talks, tutorials) and engage with OSS and customer engineers; turn repeated patterns into reusable templates

Partner with product/solutions to launch LLM-powered features with clear latency/cost/SLO targets and safety/guardrail checks

Run growth experiments that measure how the artifacts you build drive usage of the Weights & Biases product suite

Qualifications

Python · GenAI applications · RAG techniques · LLM evaluation · Agentic systems · Production systems · OSS contributions · Soft skills

Required

Software engineering: 6+ years building production systems; strong Python or TypeScript + system design, testing, CI/CD, observability

GenAI apps: shipped LLM-powered features (tools/agents/function calling), with measurable impact (latency/cost/reliability)

Agentic patterns: implemented planners/executors, tool orchestration, sandboxing, and failure taxonomies; familiarity with agent infra concerns

RAG: pragmatic mastery of chunking, embeddings, vector/hybrid search, rerankers; experience with vector DBs/search indices and retrieval policy design

Evaluation: designed LLM/RAG/agent evals (offline golden sets, counterfactuals, user studies, guardrail tests); stats literacy (variance, CIs, power)

Serving & productization: comfortable with queueing, caching, streaming, and cost controls; can debug latency at model, retrieval, and network layers

Public signal: 2+ substantial OSS repos/blog posts/talks/videos with adoption (stars, forks, downloads, views) and reproducible artifacts