Lawrence Berkeley National Lab

Select Current Research and Applied Projects

The Data Science and Technology (DST) department has a number of projects. We list some that demonstrate the depth and breadth of our department. For more recent information on the projects, please visit the group pages.

Distributed Dynamic Data Analytics Infrastructure for Collaborative Environments

Deduce (Distributed Dynamic Data Analytics Infrastructure for Collaborative Environments) will address the capability gap between current dynamic distributed data resource infrastructure and end-user data integration needs to support data analyses and products. Our research will focus on three areas: 1) Detecting and measuring the impact of data change - the algorithms needed to quantify changes in distributed data and the impact of those changes on data analyses and products; 2) Distributed data semantics - the user, metadata, and provenance information to support dynamic data integration; and 3) Dynamic data lifecycle management - the mechanisms required to identify, store, and track changes in distributed data resources. We will design and prototype a framework with appropriate algorithms and mechanisms to track data changes, additions, versions, metadata and provenance, and federation of resources. We will leverage our current engagements with scientific groups and use real-world use cases from environmental scientists, light source scientists, and biologists to derive requirements and test solutions. Deduce provides the foundation needed to build efficient and effective dynamic distributed resource management infrastructure for data integration.
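As a rough illustration of the first research area (detecting and measuring data change), the sketch below compares two snapshots of a keyed dataset by content hash and reports a simple change fraction. It is not Deduce's algorithm; the function names, the snapshot layout (a dictionary of record key to value), and the change metric are all assumptions made for this example.

```python
import hashlib


def fingerprint(records):
    """Map each record key to a SHA-256 digest of its content (hypothetical layout)."""
    return {
        key: hashlib.sha256(repr(value).encode("utf-8")).hexdigest()
        for key, value in records.items()
    }


def diff_snapshots(old, new):
    """Classify record keys as added, removed, or changed between two snapshots."""
    old_fp, new_fp = fingerprint(old), fingerprint(new)
    added = sorted(new_fp.keys() - old_fp.keys())
    removed = sorted(old_fp.keys() - new_fp.keys())
    changed = sorted(
        key for key in old_fp.keys() & new_fp.keys() if old_fp[key] != new_fp[key]
    )
    return {"added": added, "removed": removed, "changed": changed}


def change_fraction(old, new):
    """Fraction of all known keys that were added, removed, or changed (assumed metric)."""
    diff = diff_snapshots(old, new)
    touched = len(diff["added"]) + len(diff["removed"]) + len(diff["changed"])
    universe = len(set(old) | set(new))
    return touched / universe if universe else 0.0
```

A downstream analysis could use such a fraction as one crude signal for deciding whether a derived data product needs to be recomputed; quantifying the actual impact on an analysis would require domain-specific measures.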

Carbon Capture Simulation Initiative

The Carbon Capture Simulation Initiative (CCSI) is a partnership among national laboratories, industry and academic institutions that will develop and deploy state-of-the-art computational modeling and simulation tools to accelerate the commercialization of carbon capture technologies from discovery to development, demonstration, and ultimately the widespread deployment to hundreds of power plants. The CCSI Toolset will provide end users in industry with a comprehensive, integrated suite of scientifically validated models with uncertainty quantification, optimization, risk analysis and decision-making capabilities.

Usable Data Abstractions for Next-Generation Scientific Workflows

This project will create a fundamental shift in the design and development of tools for next-generation scientific data by focusing on efficiency and scientist productivity that will be important to tackle some of the nation’s biggest scientific challenges. Improved usability of tools to harness supercomputers is key to enabling scientific data understanding at scale and with efficiency. Traditional approaches to efficiency in supercomputers focus on the hardware and software of the machine and do not consider the user. This project takes a holistic approach to the development of tools and interfaces appropriate to the massive machines by employing a user-centered design process that will focus on developing a deep understanding of user workflows through empirical usability and ethnographic studies. The project will use insights from the user studies to develop new tools that are easier to use, helping the researchers focus on their science, rather than on the process of conducting that science. Users will be able to provide hints and requirements that are then used to ensure that the massive supercomputers operate at the highest efficiency. The initial focus areas for this project will be climate sciences and combustion physics.

Tigres: Template Interfaces for Agile Parallel Data-Intensive Science

The Tigres research project is developing a next generation of scientific workflow capabilities. This project is expanding on the idea of Hadoop and will design templates that fit high-level common computational patterns of scientific applications. The key challenges in developing these abstractions and underlying ecosystem are: identifying the right base abstractions to use, designing a portable API for the base abstractions and templates, implementing the API for key languages (e.g. Python) and execution environments (e.g. batch HPC, Hadoop/HDFS, SGE clusters), and tracking all aspects of the execution to enable building efficient and effective higher-level services like provenance tracking and fault-tolerance.

FRIEDA: Flexible Robust Intelligent Elastic Data Management

This project investigates an infrastructure strategy that seeks to combine the different usage modalities present in DOE communities under one model that encourages collaborative sharing, supports community growth, accommodates emergent usage patterns such as on-demand computing, and lowers the entry barrier to the use of DOE facilities from desktop to exascale. We adapt and extend recently emerged trends in cloud computing and virtualization to define, investigate, and develop tools allowing providers to turn a supercomputer into a cloud, and users to leverage the advantages that this model of provisioning offers, in particular elastic, auto-scaling, and integrated compute and storage resources. Such infrastructure is essential to providing support for a broad scientific community.

Materials Project Center for Functional Electronic Materials

The Materials Project is a high-throughput framework developed by MIT and LBNL and subsequently extended by collaborators at LBNL and the National Energy Research Scientific Computing Center (NERSC). This Center, funded by the DOE BES Predictive Theory and Modeling for Materials and Chemical Sciences program, will extend the Materials Project with high-throughput calculations, state-of-the-art electronic structure methods, and novel data mining algorithms for surface, defect, electronic and finite temperature property predictions, yielding an unparalleled materials design environment.

DALHIS: Data Analysis on Large-scale Heterogeneous Infrastructures for Science

The DALHIS associate team is a collaboration between the Myriads Inria project-team (Rennes, France) and the LBNL Data Science and Technology (DST) department (Berkeley, USA). The objective of the collaboration is to create a software ecosystem to facilitate seamless data analysis across desktops, HPC and cloud environments. Specifically, our goal is to build a dynamic software stack that is user-friendly, scalable, energy-efficient and fault-tolerant.

Towards an End-to-End Solution to Light Source Data

Recent improvements in detector resolution and speed and in source luminosity are yielding unprecedented data rates at BES's national light source and neutron source facilities. Their data rates exceed the capabilities of data analysis approaches and computing resources utilized in the past, and will continue to outpace Moore's law scaling for the foreseeable future. We are investigating a comprehensive redesign of the light source analysis environment that can provide an end-to-end solution for data access, management, and analysis; will seamlessly integrate with simulation codes; and will present easy-to-use web interfaces with single sign-on tied to ALS users' unique identity.

AmeriFlux Network Management Project

AmeriFlux datasets provide the crucial linkage between organisms, ecosystems, and process-scale studies at climate-relevant scales of landscapes, regions, and continents, which can be incorporated into biogeochemical and climate models. When viewed as a whole, the network observations enable scaling of trace gas fluxes (CO2, water vapor) across a broad spectrum of times (hours, days, seasons, years, and decades) and space. AmeriFlux observations have been instrumental in defining the relationships between environmental drivers and responses of whole ecosystems, which can be spatialized using machine learning methods like neural networks or genetic algorithms informed by remote sensing products. The AmeriFlux Network Management Project will fund core AmeriFlux sites and will establish data management and data QA/QC processes for those sites.

Berkeley Data Cloud

Advanced multi-sensor radiation detection systems for both airborne (ARES) and mobile terrestrial (RadMAP) platforms are being deployed by the applied nuclear physics programs. These systems employ highly sensitive radiation detectors that enable gamma-ray energy and imaging analysis. These data streams are coupled to synchronous LiDAR and 3D imagery to provide the basis for high-fidelity contextual annotation of the real world, similar to Google Street View. A state-of-the-art framework called the Berkeley Data Cloud (BDC) is being developed by DST. BDC gives users from different research communities easy and fast access to nuclear radiation data and analyses.

International Soil Carbon Network (ISCN)

The International Soil Carbon Network (ISCN) is a community of scientists from academia, government, and the private sector working towards a large-scale synthesis of soil C research in the United States. The principal goals of the Network are to produce databases, models, and maps that enable users to understand: 1) how much C is stored in soils around the world, 2) how long this C remains in the soil, and 3) the factors that make this soil C vulnerable to being lost (i.e., emitted to the atmosphere). Overarching these goals is the need for a spatially explicit approach, since measurements of soil C storage, turnover, and vulnerability all vary at spatial scales from several meters to thousands of kilometers, as well as across different depths within an individual soil profile. The ISCN is compiling soil sampling, archiving and analysis protocols; sharing scientific, analytical and logistical infrastructure; synthesizing products beneficial to stakeholders and scientists; and developing a community-accessible database.

FLUXNET Carbon Flux Data Analysis and Collaboration Infrastructure

FLUXNET is a global network of over 400 carbon flux measurement sensor towers that provide a wealth of long-term carbon, water and energy flux data and metadata. The data from these towers are critical to understanding climate change through cross-site, regional, ecosystem, and global-scale analyses. During this project we have developed a data server to support collaborative analysis of carbon-climate data. That data server now contains global carbon flux data in two major releases. The LaThuile dataset was released in 2007 and has been in use by several hundred paper teams. The FLUXNET2015 dataset was released at the end of 2015 and will continue to be incremented over the course of 2016. These datasets support the global FLUXNET synthesis activity.

Infrastructure Strategy to Support ATLAS Collaboration

The ASCR Collaboratories project "Infrastructure Strategy to Support Collaboration" provides a strategy to address the highly distributed nature of ATLAS computing. We are working to ensure the elastic data analysis use case is addressed through use of FRIEDA; we will provide the "Infrastructure Strategy" project with a set of virtual appliances that realize typical ATLAS data analysis clusters.

Discovery 2020: Big Data Analytics

Today, the tools available to the scientific community are undergoing a major revolution with a wide range of innovations that are enabling powerful new capabilities for knowledge discovery and analytics. The majority of these innovations are being motivated by major large-scale science research initiatives. The foundation for these innovations consists of multiple new information technology architectures, tools, techniques and platforms. In this project, we explore existing tools and techniques and conduct a week-long workshop exploring a range of topics, from clouds and analysis tools to software-defined networking.

Toward a Hardware/Software Co-Design Framework for Ensuring the Integrity of Exascale Scientific Data

This project takes a broad look at several aspects of security and scientific integrity issues in HPC systems. Using several case studies as exemplars, and working closely with both domain scientists and facility staff, we will examine several initial concepts in existing scientific computing workflows, analyze those models to better characterize integrity-related computational behavior, and explore ways in which future HPC hardware and software solutions can be co-designed with security and scientific computing integrity concepts built into as much of the stack from the outset as possible.

Cybersecurity via Inverter-Grid Automatic Reconfiguration (CIGAR)

This project is performing R&D to enable distribution grids to adapt to resist a cyber-attack by (1) developing adaptive control algorithms for DER, voltage regulation, and protection systems; and (2) analyzing new attack scenarios and developing associated defensive strategies.

UC-Lab Center for Electricity Distribution Cybersecurity

This project will bring together a multi-disciplinary UC-Lab team of cybersecurity and electricity infrastructure experts to investigate, through both cyber and physical modeling and physics-aware cybersecurity analysis, the impact and significance of cyberattacks on electricity distribution infrastructure.

Threat Detection and Response with Data Analytics

The goal of this project is to develop technologies and methodologies to protect the grid from advanced cyber and all-hazard threats through the collection of disparate data and the employment of advanced analytics for threat detection and response.

Integrated Multi Scale Machine Learning for the Power Grid

The goal of this project is to create an advanced, distributed data analytics capability that provides visibility and controllability to distribution grid operators.

Medical Science DMZ

We have defined a Medical Science DMZ as a method that allows data flows at scale while simultaneously addressing the HIPAA Security Rule and related regulations governing biomedical data and appropriately managing risk.

Trusted CI

The mission of Trusted CI, the National Science Foundation Cybersecurity Center of Excellence, is to improve the cybersecurity of NSF computational science and engineering projects, while allowing those projects to focus on their science endeavors.

Past Research Projects

The Systems Biology Knowledgebase (KBase) User Interface Design

KBase is a collaborative effort designed to accelerate our understanding of microbes, microbial communities, and plants. It will be a community-driven, extensible and scalable open-source software framework and application system. Our team is helping with design and development of the KBase user interfaces through a user-centered design process. Our focus is on the "narrative" interface that will be many users' gateway to KBase functionality.

ASCEM: Advanced Simulation Capability for Environmental Management

The Advanced Simulation Capability for Environmental Management (ASCEM) project will bring together significant National Laboratory expertise in environmental and computational systems science to develop an advanced modeling capability for resolving contaminant release, transport, fate, and remediation issues across the EM complex. The multidisciplinary, multi-institutional team will develop an open source modeling platform within a state-of-the-art high performance computing (HPC) framework to produce the next-generation simulation software needed to address the prediction, risk reduction, and decision support challenges faced by DOE Environmental Management (EM) sites.

Cloud Energy and Emissions Research (CLEER) Model

The CLEER (Cloud Energy and Emissions Research) Model is a comprehensive, user-friendly, open-access model for assessing the net energy and emissions implications of cloud services in different regions and at different levels of market adoption. The model aims to provide full transparency on calculations and input value assumptions so that its results can be replicated and its data and methods can be easily refined and improved by the research community.


PDG workspace

The Particle Data Group (PDG) is an international collaboration charged with summarizing Particle Physics, as well as related areas of Cosmology and Astrophysics. In 2008, the PDG consisted of 170 authors from 108 institutions in 20 countries. The summaries are published in even-numbered years as a now 1340-page book, the Review of Particle Physics. PDG distributes 16,000 copies of the book. The Review has been called the bible of particle physics; over the years, it has been cited in 30,000 papers. The Review is also published and maintained as a fully hyper-linked web site. The computing infrastructure supporting the PDG was conceived and built in the late eighties, and although it was modern for its time, it is no longer able to support the many participants in the Review and the process of creating it. We are working with the PDG group to research, design, and build a new interactive workspace that enables all the participants in the review process to collaborate and input data directly into the review infrastructure.

Scientific Cloud Computing

Cloud computing is an emerging paradigm for enabling on-demand computing capabilities. The goal is to discover whether cloud computing offerings are suitable for running any of the scientific applications currently running on mid-range compute resources. The project is working with a broad range of science disciplines and cloud architectures to evaluate the utility and trade-offs of these cloud architectures for different science disciplines and problems.


Stampede: Middleware for Monitoring and Troubleshooting of Large-Scale Applications on National Cyberinfrastructure

Many applications that use large distributed cyberinfrastructure resources involve large numbers of inter-dependent jobs that are best represented as scientific workflows. Examples include astronomers examining the structure of galaxies, bioinformaticians studying the underpinnings of complex diseases, and earthquake scientists simulating the impact of earthquakes in California. Although domain scientists have access to high-performance resources that allow them to scale up their applications to hundreds of thousands of jobs, the associated tasks of managing these applications, monitoring their progress, and troubleshooting problems as they occur remain difficult. This work provides robust and scalable workflow monitoring services that can be used to track the progress of applications as they are executing on the distributed cyberinfrastructure. New anomaly detection and troubleshooting services are being developed to alert users to problems with the application and cyberinfrastructure services and allow them to quickly navigate and mine the application execution records to locate the source of the problem.

Assembly likelihood estimation (ALE)

ALE is a probabilistic framework for determining the likelihood of an assembly given the data (raw reads) used to assemble it. It allows for the rapid discovery of errors and comparisons between similar assemblies. Currently ALE reports errors using a combined likelihood metric (the "ALE score"). This metric does not directly indicate which of a number of known, distinct types of assembly errors caused the outlier. We are working with the ALE developers at JGI to use machine learning techniques to predict the type of error (e.g. insertion or deletion) most likely to be associated with large ALE scores in a genome assembly.

Center for Enabling Distributed Petascale Science

The SciDAC-funded Center for Enabling Distributed Petascale Science (CEDPS) is focused on enabling high-performance distributed scientific applications to have fast, dependable data placement and the convenient construction of scalable services. CEDPS will also address the important problem of troubleshooting these and other related distributed activities.

Digital Hydrology and Water Resources Engineering using the California Data Cube

The science of hydrology was formerly driven by the practical needs of the engineering profession, where floods, droughts, and water quality could be estimated from prior but limited environmental data collected by various state and federal agencies. Population growth in the United States has placed demands on water resources that cannot be met by traditional means, and anticipated climate change is casting doubt on the prior assumption that historical data can be used to predict future conditions. The science of hydrology has shifted to more sensor-based systems for stream flows and spatially distributed meteorological conditions obtained from ground and satellite systems. The sensor data are coupled with weather and climate models at finer and finer spatial resolution. The net result is a vastly increased data collection rate and the generation of model simulations that require storage, retrieval and archiving on a scale not previously contemplated.

Knowledge Discovery and Dissemination - BLACKBOOK

The KDD BLACKBOOK is currently undergoing a significant redesign that will change its basic structure. The DST Department is helping to develop the testing framework for these new components.

Open Science Grid

Open Science Grid (OSG) is a national cyberinfrastructure for scientific computing that enables geographically distributed collaborations (virtual organizations) to share and aggregate resources to advance the scale and timeliness of deriving scientific results from massive datasets. OSG provides middleware and operational infrastructure for two dozen active virtual organizations (scientific collaborations) to do distributed computing around the U.S. on four dozen active sites, consuming an average of 20,000 CPUs over the past year.

Semantics and Metadata for Ecoinformatics

This project creates tools, techniques, designs and standards on the leading edge of semantics and metadata technologies. Demonstrations are in the area of ecoinformatics (health and the environment). Results are broadly useful in many areas, including energy, defense and intelligence. The focus of the effort is to advance capabilities to link the semantics expressed in concept systems, metadata and data. Example application areas include the semantic web, the use of semantics in scientific models, and more generally, semantic computing.

Integrated Modeling and Ontology Project

DOE NA-22 recently initiated planning for multi-lab assessment projects in the area of nuclear proliferation. The Ontology Development sub-team will develop and demonstrate approaches, techniques, methods and processes that utilize emerging technologies for the development of an ontological structure to support assessment and monitoring missions. The core ontology will have an overall structure that can be applied to the nuclear proliferation area, but the semiconductor device manufacturing industry will serve as a demonstration context with content that is not classified. LBNL is developing an initial ontology which will be released to successively larger audiences and iteratively developed as comments are received.

On-Demand Overlays for Scientific Applications

Scientific computations and collaborations increasingly rely on the network to provide high-speed data transfer, dissemination of results, access to instruments, support for computational steering, etc. Networks with redundant high-speed paths need algorithms to effectively allocate bandwidth across multiple paths for a single application. This capability is required for two important communication classes within the DOE scientific community: large point-to-point transfers and periodic data dissemination from a single sender to multiple receivers. The overall goal of this project was to perform the research and development necessary to create proof-of-concept on-demand overlays for scientific applications that make efficient and effective use of the available network resources.

Supernova Factory

The Nearby Supernova Factory (SNfactory) is an experiment to develop Type Ia supernovae as tools to measure the expansion history of the Universe and explore the nature of Dark Energy. It is the largest data volume supernova search currently in operation. The SNfactory is an international collaboration between several groups in the United States and France.

Deep Sky

Deep Sky is an astronomical image database of unprecedented depth, temporal breadth, and sky coverage. Image data are gathered from the Near Earth Asteroid Tracking (NEAT) project from the 3-CCD and Quest 112-CCD cameras on the Samuel Oschin telescope at the Palomar Observatory in San Diego County, California. Containing a total of eleven million images, or 70 terabytes of image data, Deep Sky covers nearly the entire northern sky.

IceCube Data Acquisition

IceCube is an NSF-funded international collaboration building a high-energy neutrino detector at the South Pole. Neutrinos are detected in the clear deep ice below the South Pole via digital optical modules, which send data to a surface counting house where a data-acquisition system processes the raw data through triggering algorithms into data files. These data files are further processed and filtered, then sent north via satellite links for further physics analysis. The goal of the project is to gain a better understanding of the Universe via high-energy neutrinos. The detector is currently two-thirds complete.


BOSS: Baryon Oscillation Spectroscopic Survey

The Baryon Oscillation Spectroscopic Survey (BOSS) will map out the baryon acoustic oscillation (BAO) signature with unprecedented accuracy and greatly improve the constraints on the acceleration of the expansion rate of the Universe. The BOSS survey will use all of the dark and grey time at the Apache Point Observatory (APO) 2.5-m telescope for five years, from 2009 to 2014, as part of Sloan Digital Sky Survey III.


Geospatial Image Verification and Validation

GSV is a project to create the theory, develop the methodology, and produce imagery for verifying and validating geospatial image processing algorithms. Geospatial imagery plays an important role in the detection and characterization of nuclear weapons proliferation. The amount of data produced by existing geospatial sensors overwhelms the abilities of human analysts, and future sensing capabilities will add to the torrent of data. The sheer magnitude of geospatial imagery has driven the requirements for automated image analysis, and subsequently, driven the development of geospatial image analysis algorithms. As the sophistication of these algorithms has increased, so has the need to verify and validate (V and V) the performance of the algorithms.


A Mathematical and Data-Driven Approach to Intrusion Detection for High-Performance Computing

The overall goals of the project were to develop mathematical and statistical methods to detect intrusions of high-performance computing systems. Our mathematical analysis is predicated on the fact that large HPC systems represent unique environments, quite unlike unspecialized systems or general Internet traffic. User behavior on HPC systems tends to be much more constrained (often driven by research deadlines and limited computational resources), and is generally limited to certain paradigms of computation (the set of codes performing the bulk of execution provides a rich source of information). In addition, the collaboration networks of users on an HPC system exhibit special characteristics that can be exploited to detect misuse or fraud. Notable outcomes of this project included the development and application of a technique for fingerprinting computation on HPC machines based on network theory and machine learning.


Application of Cyber Security Techniques in the Protection of Efficient Cyber-Physical Energy Generation Systems

In this project, we are designing and developing a security monitoring and analysis framework for control systems and smart grid technologies. This system is designed to enhance resiliency of the system by integrating traditional computer security and safety engineering techniques. The goal is to integrate the monitoring and analysis of IP network traffic, as well as serial communications and physical constraints within a single intrusion detection system (IDS) framework and provide capabilities for determining the physical safety of system operations by simultaneously examining behavior at multiple hierarchical layers and contexts.


Inferring Computing Activity Using Physical Sensors

This project involves using power data for monitoring use of computing systems, including supercomputers and large computing centers. By using power data, as opposed to data provided by the computing environment itself, the technology collects the data non-invasively.

Supporting Cyber Security of Power Distribution Systems by Detecting Differences Between Real-time Micro-Synchrophasor Measurements and Cyber-Reported SCADA

The power distribution grid, like many cyber-physical systems, was developed with careful consideration for safe operation. However, a number of features of the power system make it particularly vulnerable to cyber attacks via IP networks. Traditional IT security techniques tend to leave a gap in safety and protection when applied to cyber-physical devices because they do not consider physical information known about the cyber-physical device they are protecting. The goal of this project is to design and implement a measurement network that can detect and report the resultant impact of cyber security attacks on the distribution system network. To do this, we use micro phasor measurement units to capture information about the physical state of the power distribution grid and combine this with SCADA command monitoring in real time. The project will build models of safe and unsafe states of the distribution grid so that certain classes of cyber attacks can potentially be detected by their physical effects on the power distribution grid alone. The result will be a system that provides an independent, integrated picture of the distribution grid's physical state, which will be difficult for a cyber-attacker to subvert using data-spoofing techniques.
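The core idea of checking cyber-reported values against independent physical measurements can be sketched in a few lines. The example below is only an illustration under stated assumptions, not the project's detection method: the function name, the per-unit voltage readings keyed by bus name, and the 0.02 per-unit tolerance are all hypothetical.

```python
def flag_discrepancies(scada_reported, pmu_measured, tolerance=0.02):
    """Return bus names whose SCADA-reported voltage disagrees with the measurement.

    Both inputs map a bus name to a voltage magnitude in per-unit (assumed format).
    A bus is flagged when SCADA has no value for it, or when the absolute
    difference between reported and measured values exceeds `tolerance`.
    """
    flagged = []
    for bus, measured in pmu_measured.items():
        reported = scada_reported.get(bus)
        if reported is None or abs(reported - measured) > tolerance:
            flagged.append(bus)
    return sorted(flagged)


# Hypothetical readings: SCADA claims bus2 is nominal while the sensor sees a sag,
# and SCADA reports nothing at all for bus3.
scada = {"bus1": 1.00, "bus2": 1.00}
pmu = {"bus1": 1.01, "bus2": 0.90, "bus3": 1.00}
suspicious = flag_discrepancies(scada, pmu)
```

A fixed threshold is the simplest possible choice; the models of safe and unsafe grid states described above would presumably replace it with something that accounts for grid topology and normal operating variation.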

An Automated, Disruption Tolerant Key Management System for the Power Grid

Current key management architectures are not designed for machine-to-machine communication, are designed around an "always online" mentality, and are often burdensome to manage (key distribution, revocation lists, governance, etc.). This project is designing and developing a key management system to meet the unique requirements of electrical distribution systems (EDSs). Namely, it is disruption tolerant, scales well, is centrally managed, has policy enforcement and auditing, and automates key management services for devices.

Detecting Distributed Denial of Service Attacks on Wide-Area Networks

A large-scale distributed denial of service (DDoS) attack has the potential to impact not only the target site but also performance along the entire network path. Today, DDoS mitigation across DOE sites is largely handled at the site borders using a combination of heuristic and filtering techniques, manual changes, and commercial services that can “absorb” attacks against specific sites. With the scale of DDoS attacks increasing dramatically and showing little indication of slowing down, DDoS detection and mitigation across ESnet’s extensive wide area network (WAN) becomes a higher priority, with increased complexity of detection and execution.

NetSage - Network Measurement, Analysis and Visualization

NetSage is a network measurement, analysis and visualization service designed to address the needs of today's international networks.