Sandia National Laboratories · 8 hours ago
Senior/Principal Data Scientist - Artificial Intelligence, CA/NM - Hybrid
Sandia National Laboratories is the nation’s premier science and engineering lab for national security and technology innovation, and they are seeking a Senior/Principal Data Scientist to join their artificial intelligence team. The role involves designing, implementing, and operating an AI-ready data ecosystem that supports the U.S. Department of Energy's next-generation AI Platform, transforming various data sources into governed datasets for AI models and workflows.
Government · Information Technology · National Security
Responsibilities
Build and operate an AI-Ready Lakehouse
Design and maintain a federated data lakehouse with full provenance/versioning, attribute-based access control, license/consent automation, and agent telemetry services
Implement automated, AI-mediated ingestion pipelines for heterogeneous sources (HPC simulation outputs, experimental instruments, robotics, sensor streams, satellite imagery, production logs)
Enforce Data Security & Assurance
Develop a Data Health & Threat program: dataset fingerprinting, watermarking, poisoning/anomaly detection, red-team sampling, and reproducible training manifests
Configure secure enclaves and egress processes for CUI, Restricted Data, and other sensitive corpora, with attestation and differential privacy where required
Define and Implement Data Governance
Establish FAIR-compliant metadata standards, data catalogs, and controlled-vocabulary ontologies
Automate lineage tracking, quality checks, schema validation, and leak controls at record-level granularity
Instrument AI Workflows with Standardized Telemetry
Deploy Agent Trace Schema (ATS) and Agent Run Record (ARR) frameworks to log tool calls, decision graphs, human hand-offs, and environment observations
Treat agent-generated artifacts (plans, memory, configurations) as first-class data objects
Collaborate Across Pillars
Work with Models and Interfaces teams to integrate data services into training, evaluation, and inference pipelines
Partner with Infrastructure engineers to optimize data movement, tiered storage, and high-bandwidth networking (ESnet) between HPC, cloud, and edge
Engage domain scientists and mission leads for agile deterrence, energy grid, and critical minerals use cases to curate problem-specific datasets
Support Continuous Acquisition & Benchmarking
Design edge-to-scale data acquisition systems with robotics and instrument integration
Develop data/AI benchmarks: datasets, tools, and metrics for pipeline performance, model evaluation, and mission KPIs
Author an AI-mediated parser for a new experimental instrument, automatically extracting and cataloging metadata
Implement an attribute-based policy that blocks unapproved data combinations in a classified enclave
Prototype a streaming pipeline that feeds live sensor data from a nuclear facility into an HPC training queue
Develop a dashboard that alerts on data drift, pipeline failures, or anomalous records
Collaborate with MLOps engineers to version datasets alongside model artifacts in CI/CD
Qualifications
Required
Bachelor's degree in Computer Science, Data Science, Statistics, or a related STEM field, plus five (5) years of directly relevant experience, or an equivalent combination of education and experience
Ability to acquire and maintain a DOE Q clearance
Preferred
Graduate degree (M.S. or Ph.D.) with a significant data research component where an independent research project was a graduation requirement (e.g., independent project, thesis, or dissertation)
Experience in developing software for enterprise and national security applications
Experience acquiring, preparing, and analyzing real-world data
Demonstrated software development skills and familiarity with modern software development practices
Proven ability to work and communicate effectively in a collaborative and interdisciplinary team environment, guiding technical decisions and mentoring junior staff
Graduate degree in Data Science, Informatics, Statistics, or a related STEM field with a significant data research component
Background in AI-mediated data curation: automated annotation, feature extraction, and dataset certification
Hands-on knowledge of data security and zero-trust principles, including secure enclaves, attribute-based access control, and data masking or differential privacy
Familiarity with FAIR (Findable, Accessible, Interoperable, Reusable) data practices
Curating and managing scientific or engineering datasets
Data architecture for HPC and edge-computing environments
Advanced data fusion techniques for heterogeneous and streaming data sources
Building data pipelines for feature stores, experiment tracking, and model drift monitoring
Designing and enforcing data policies for classified, export-controlled, or proprietary data
Collaborating on public-private partnerships or multi-lab federated data efforts
Demonstrated expertise in building and maintaining production data pipelines (ETL/ELT) and data warehouses or data lakes
Proficiency in programming languages such as Python and SQL, and experience with frameworks such as Apache Spark or Dask
Familiarity with cloud platforms (AWS, Azure, or GCP) and container orchestration (Kubernetes)
Benefits
Generous vacation
Strong medical and other benefits
Competitive 401k
Learning opportunities
Relocation assistance
Amenities aimed at creating a solid work/life balance
Company
Sandia National Laboratories
Sandia conducts research and development into the non-nuclear components of nuclear weapons.
Funding
Current Stage: Late Stage
Total Funding: $4.4M
Key Investors: US Department of Energy, ARPA-E
2023-09-21 · Grant · $0.5M
2023-07-27 · Grant
2023-01-10 · Grant · $3.7M
Company data provided by Crunchbase