NVIDIA · 7 hours ago
Senior Prompt and Benchmark Engineer, Evaluation of World Models
NVIDIA is at the forefront of generative AI engineering, pushing the boundaries of multimodal learning and intelligent simulation. The Senior Prompt and Benchmark Engineer will focus on developing benchmarks for evaluating world models and will leverage prompt engineering techniques to enhance model evaluation processes.
Artificial Intelligence (AI) · Consumer Electronics · GPU · Hardware · Software · Virtual Reality
Responsibilities
Develop detailed, domain-specific benchmarks for evaluating world foundation models, especially world models for generation and understanding that reason about video, simulation, and physical environments
Use sophisticated prompt engineering techniques to elicit structured, interpretable responses from a variety of foundation models
Build, refine, and maintain question banks, multiple-choice formats, and test suites to support both automated and human evaluation workflows
Employ multiple VLMs in parallel to explore ensemble evaluation methods such as majority voting, ranking agreement, and answer consensus
Make evaluation as automated and scalable as possible by encoding prompts and expected outputs into structured formats for downstream consumption
Interface directly with Cosmos researchers to translate their evaluation needs into scalable test cases
Collaborate with human annotators, providing clearly structured tasks, feedback loops, and quality control mechanisms to ensure dataset reliability
Meet regularly with domain experts in robotics, autonomous vehicles, and simulation to understand their internal benchmarks, derive transferable metrics, and co-develop standardized evaluation formats
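As an illustration of the ensemble evaluation and structured-format responsibilities listed above, here is a minimal Python sketch of majority voting over answers from several evaluator models, stored next to the expected answer in a JSON record. The `BenchmarkItem` schema, the model names, and the sample question are hypothetical and are not taken from any NVIDIA or Cosmos tooling. Ties are returned as no consensus so that they could be routed to human annotators.

```python
from __future__ import annotations

import json
from collections import Counter
from dataclasses import asdict, dataclass


@dataclass
class BenchmarkItem:
    """One multiple-choice question with its expected answer (hypothetical schema)."""
    item_id: str
    prompt: str
    choices: list[str]
    expected: str


def majority_vote(answers: list[str]) -> tuple[str | None, float]:
    """Return the most common answer and the fraction of evaluators that gave it.

    A tie for first place returns None as the answer, so the item can be
    flagged for human review instead of being scored automatically.
    """
    if not answers:
        return None, 0.0
    counts = Counter(answers).most_common()
    top_answer, top_count = counts[0]
    agreement = top_count / len(answers)
    if len(counts) > 1 and counts[1][1] == top_count:
        return None, agreement
    return top_answer, agreement


item = BenchmarkItem(
    item_id="physics-001",
    prompt="A ball is released on a ramp in the clip. Which way does it roll?",
    choices=["A", "B", "C"],
    expected="A",
)

# Hypothetical answers collected from three evaluator models.
model_answers = {"model_1": "A", "model_2": "A", "model_3": "B"}

consensus, agreement = majority_vote(list(model_answers.values()))

# Structured record combining the prompt, expected output, and ensemble result.
record = {
    **asdict(item),
    "answers": model_answers,
    "consensus": consensus,
    "agreement": round(agreement, 2),
    "correct": consensus == item.expected,
}
print(json.dumps(record, indent=2))
```

Keeping each record as plain JSON is one possible design choice: it lets automated scoring and human annotation tools read the same file.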
Qualifications
Required
Demonstrated experience with prompt engineering, including crafting, refining, and optimizing prompts
Strong attention to detail in designing natural language questions and formatting structured evaluations
Proven ability to reason about model capabilities, failure modes, and blind spots in real-world generative model deployments
Experience crafting or contributing to benchmarks or evaluation datasets, especially for multimodal or agentic systems
Familiarity with evaluating models via prompting, capturing structured outputs, and comparing across model families
Excellent communication and collaboration skills—you will regularly meet with researchers, annotators, and downstream users to iterate on benchmark design
A working understanding of how VLMs and foundation models function at inference time, including token-level outputs, autoregressive decoding, and model context windows
10+ years of experience in Machine Learning, NLP, Human-Computer Interaction, or related fields
BS, MS, or equivalent background. Prior experience in AI evaluation, annotation workflows, or research is highly valued
Preferred
Hands-on experience with multiple LLMs or VLMs (e.g., GPT, Claude, Gemini, Flamingo, Kosmos, IDEFICS, etc.) to compare outputs and engineer task-specific prompts
Prior work designing benchmarks for robotics, simulation, AV, or agentic tasks, especially in multimodal or video-based settings
Experience working with human annotation teams, building clear instructions and QA processes for large-scale labeling campaigns
Familiarity with using VLMs as evaluators, leveraging models for response scoring, ranking, or consensus aggregation
Deep curiosity about model behavior and a drive to test, interrogate, and stretch the limits of generative systems
Benefits
Equity
Company
NVIDIA
NVIDIA is a computing platform company operating at the intersection of graphics, HPC, and AI.
H1B Sponsorship
NVIDIA has a track record of offering H1B sponsorships. Please note that this does not guarantee sponsorship for this specific role. Additional information is provided below for your reference. (Data powered by US Department of Labor)
Distribution of Different Job Fields Receiving Sponsorship
Trends of Total Sponsorships
2025 (1877)
2024 (1355)
2023 (976)
2022 (835)
2021 (601)
2020 (529)
Funding
Current Stage
Public Company
Total Funding
$4.09B
Key Investors
ARPA-E · ARK Investment Management · SoftBank Vision Fund
2023-05-09 · Grant · $5M
2022-08-09 · Post-IPO Equity · $65M
2021-02-18 · Post-IPO Equity
Company data provided by Crunchbase