Privacy-Preserving Data Analysis for Scientific Discovery

Project Summary

Organizations frequently do not share data because they consider it sensitive in some way. For example, laws or regulations may prohibit sharing because of personal privacy or national security concerns, or the organization that owns the data may consider it a proprietary trade secret. In any case, the data cannot or will not be released in raw form, so alternative approaches are needed if it is to be shared at all.

Today, data is often not shared at all. When it is shared, the people processing or analyzing it are often required to work in highly secured, non-networked environments set up to prevent data from being exfiltrated, whether physically from a building or over a network. These restrictions hinder much research. Sometimes data is shared after “anonymization,” in which values are typically either masked or generalized. Unfortunately, these techniques have repeatedly been shown to fail, typically because external information containing identities can be merged with quasi-identifiers in the dataset to re-identify “anonymized” records.
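The re-identification failure described above can be illustrated with a minimal sketch. All records, names, and field choices below are invented for illustration and are not drawn from any project dataset; the function `link_records` is a hypothetical helper, not part of any library.

```python
# Toy linkage attack: join an "anonymized" dataset with a public one
# on shared quasi-identifiers. All data here is hypothetical.

# "Anonymized" release: names removed, quasi-identifiers left intact.
anonymized_records = [
    {"zip": "94720", "birth_year": 1980, "sex": "F", "diagnosis": "asthma"},
    {"zip": "94720", "birth_year": 1975, "sex": "M", "diagnosis": "diabetes"},
    {"zip": "95616", "birth_year": 1980, "sex": "F", "diagnosis": "flu"},
]

# External information that includes identities (for example, a public roll).
public_records = [
    {"name": "Alice Example", "zip": "94720", "birth_year": 1980, "sex": "F"},
    {"name": "Bob Example", "zip": "94720", "birth_year": 1975, "sex": "M"},
    {"name": "Carol Example", "zip": "95616", "birth_year": 1980, "sex": "F"},
    {"name": "Dana Example", "zip": "95616", "birth_year": 1980, "sex": "F"},
]

QUASI_IDENTIFIERS = ("zip", "birth_year", "sex")


def link_records(anonymized, public, keys=QUASI_IDENTIFIERS):
    """Return {name: sensitive record} for each unique quasi-identifier match."""
    # Index the public data by its quasi-identifier tuple.
    index = {}
    for person in public:
        key = tuple(person[k] for k in keys)
        index.setdefault(key, []).append(person["name"])

    reidentified = {}
    for record in anonymized:
        names = index.get(tuple(record[k] for k in keys), [])
        # Only an unambiguous match re-identifies a record.
        if len(names) == 1:
            reidentified[names[0]] = record
    return reidentified


matches = link_records(anonymized_records, public_records)
for name, record in matches.items():
    print(name, "->", record["diagnosis"])
```

In this toy data, two of the three “anonymized” records are re-identified because their quasi-identifier combination is unique in the public list; the third is not, because two people share the same combination.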

This project aims to develop a method that leverages a variety of hardware and software approaches, in concert with privacy-preserving technologies such as differential privacy, for the scientific analysis of sensitive data. The goal is to give the owner of a sensitive dataset significantly greater confidence that the data will not be exposed or altered, and to reduce the data center's liability exposure to assertions of security negligence or insider attacks by providing an environment in which even the data center cannot access the raw data, all without significant negative impacts on usability or performance. The environment we envision is both secure and usable and protects against “insiders” such as system administrators; it leverages techniques that are relatively new and are just becoming practically useful for these purposes.
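As a rough illustration of what differential privacy means in practice, the following sketch applies the standard Laplace mechanism to a counting query. It is a textbook example, not the project's method; the function names, the `epsilon` value, and the sample data are assumptions chosen for illustration.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(values, predicate, epsilon, rng=None):
    """Return a noisy count of the values that satisfy predicate.

    A counting query has sensitivity 1 (adding or removing one record
    changes the count by at most 1), so Laplace noise with scale
    1 / epsilon gives epsilon-differential privacy for this one query.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical data: ages of study participants.
ages = [23, 35, 41, 52, 67, 29, 38, 45, 71, 33]
noisy = private_count(ages, lambda age: age >= 40, epsilon=0.5)
print(f"noisy count of participants aged 40+: {noisy:.2f}")
```

A smaller `epsilon` adds more noise and gives stronger privacy; a larger one gives more accurate answers. Repeated queries consume additional privacy budget, which this sketch does not track.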

This project is supported by Berkeley Lab Contractor Supported Research funding.

Principal Investigator:

Sean Peisert (PI; LBNL)

Senior Personnel

Chen-nee Chuah
Jane Macfarlane
Michael Zhang

Graduate Students

Ammar Haydari

Past Researchers

Hamdy Elgammal
Reinhard Gentz

Past Students:

Archit Garg
Chitrabhanu Gupta
Jinyue Song
Jayneel Vora

Press regarding this project:

Scientific Data Division Summer Students Tackle Data Privacy - Sept. 15, 2022

Summer Students Tackle COVID-19 - Oct. 21, 2020

Publications resulting from this project:

Ammar Haydari, Chen-Nee Chuah, Michael Zhang, Jane Macfarlane, and Sean Peisert, “Differentially Private Map Matching for Mobility Trajectories,” Proceedings of the 2022 Annual Computer Security Applications Conference (ACSAC), Austin, TX, December 5-9, 2022.

Hector Garcia Martin, Tijana Radivojevic, Jeremy Zucker, Kristofer Bouchard, Jess Sustarich, Sean Peisert, Dan Arnold, Nathan Hillson, Gyorgy Babnigg, Jose Manuel Marti, Christopher J. Mungall, Gregg T. Beckham, Lucas Waldburger, James Carothers, ShivShankar Sundaram, Deb Agarwal, Blake A. Simmons, Tyler Backman, Deepanwita Banerjee, Deepti Tanjore, Lavanya Ramakrishnan, and Anup Singh, “Perspectives for Self-Driving Labs in Synthetic Biology,” arXiv preprint arXiv:2210.09085, 14 Oct 2022.

Ammar Haydari, Michael Zhang, Chen-Nee Chuah, Jane Macfarlane, and Sean Peisert, “Adaptive Differential Privacy Mechanism for Aggregated Mobility Dataset,” arXiv preprint arXiv:2112.08487, 10 Dec 2021.

Luca Pion-Tonachini, Kristofer Bouchard, Hector Garcia Martin, Sean Peisert, W. Bradley Holtz, Anil Aswani, Dipankar Dwivedi, Haruko Wainwright, Ghanshyam Pilania, Benjamin Nachman, Babetta L. Marrone, Nicola Falco, Prabhat, Daniel Arnold, Alejandro Wolf-Yadlin, Sarah Powers, Sharlee Climer, Quinn Jackson, Ty Carlson, Michael Sohn, Petrus Zwart, Neeraj Kumar, Amy Justice, Claire Tomlin, Daniel Jacobson, Gos Micklem, Georgios V. Gkoutos, Peter J. Bickel, Jean-Baptiste Cazier, Juliane Müller, Bobbie-Jo Webb-Robertson, Rick Stevens, Mark Anderson, Ken Kreutz-Delgado, Michael W. Mahoney, James B. Brown, “Learning from Learning Machines: a New Generation of AI Technology to Meet the Needs of Science,” arXiv preprint arXiv:2111.13786, 27 Nov 2021.

Presentations:

Sean Peisert, “Trustworthy Scientific Cyberinfrastructure,” NASEM Cyber Resilience Forum Summer 2023 Meeting, San Francisco, CA, August 31, 2023.

Keynote: “Usable Computer Security and Privacy to Enable Data Sharing in High-Performance Computing Environments,” 3rd High-Performance Computing Security Workshop, NIST National Cybersecurity Center of Excellence (NCCoE), Rockville, MD, March 16, 2023. Workshop report: NIST IR 8476.

Keynote: “Usable Computer Security and Privacy to Enable Data Sharing in High-Performance Computing Environments,” Interdisciplinary Symposium on Responsible Innovation: Intersection of Privacy and Artificial Intelligence, Center for Data Science and AI Research (CeDAR), University of California, Davis, March 10, 2023.

“Responsible Innovation at the Intersection of Privacy and Artificial Intelligence (AI),” (panel; with Eric Dang, Darci Sears, moderators; Tom Kemp, and Richard Arney) Interdisciplinary Symposium on Responsible Innovation: Intersection of Privacy and Artificial Intelligence, Center for Data Science and AI Research (CeDAR), University of California, Davis, March 10, 2023.

“Securing Edge-to-Center Computing with Trustworthy Data Domains,” Monterey Data Workshop, April 21, 2022.

Sean Peisert, “Securing Edge-to-Center Computing with Trustworthy Data Domains,” 2022 AFRL/AFOSR/DOE Energy Cost of Information Workshop, February 18, 2022.

Venkatesh Akella and Sean Peisert, “Usable Computer Security and Privacy to Enable Data Sharing for Scientific Research,” Trusted Computing Center of Excellence (TCCOE) Summit, February 1–3, 2022.

Sean Peisert, “The Social Dilemma,” (panel; with Safiya Noble, UCLA; Gillian Hayes UCI; Bryan Cunningham, UCI; Pegah Parsi, UCSD; and Allison Henry, UC Berkeley), University of California Data Privacy Day, January 28, 2022.

Keynote: “Usable Computer Security and Privacy to Enable Data Sharing for Scientific Research,” Second International Silicon Valley Cybersecurity Conference (SVCC), December 3, 2021.

Sean Peisert, “Advancing Cybersecurity as an Enabling Capability in High-Performance Computing Environments,” HPC User Forum, Sept. 7–9, 2021.

Sean Peisert, “Cyber Privacy and Security Risks During the Pandemic” (panel - with Bart Preneel, KU Leuven; Kritika Bhardwaj, NLU Delhi; Margaret Bourdeaux, Harvard/Berkman Klein; Susan Landau, Tufts; and Smitha Prasad, NLU Delhi), Hewlett Foundation event hosted by the Fletcher School at Tufts University and the Centre for Communication Governance (CCG) at National Law University, Delhi, December 17, 2020.

Sean Peisert, “Fragility, Interdependence, and Tradeoffs — Cybersecurity and Privacy Lessons from the Pandemic,” Federal Cybersecurity R&D Interagency Working Group (CSIA IWG), NITRD, December 3, 2020.

Sean Peisert, “Scientific Computing and Sensitive Data,” DataLab Health Data Science and Systems Research and Learning Cluster, University of California, Davis, October 2, 2020.

Sean Peisert, “Privacy-Preserving Data Analysis in Scientific Computing Environments,” White House Office of Science & Technology Policy Workshop, Eisenhower Administration Building, Washington, D.C., Jan. 31, 2020.

Sean Peisert, “Privacy-Preserving Data Analysis for Energy Delivery Systems and Scientific Discovery,” Western Area Power Administration (WAPA), Golden, CO, November 5, 2019.

Sean Peisert, “Usable Computer Security and Privacy to Enable and Encourage Data Sharing for Scientific Research,” National Academies of Sciences, Engineering, and Medicine Committee on Science, Engineering, Medicine, and Public Policy (COSEMPUP) Meeting, Washington, D.C., November 8, 2018.

Sean Peisert, “Cybersecurity Challenges and Opportunities in High-Performance Computing Environments,” International Supercomputing Conference (ISC), Frankfurt, Germany, June 26, 2018.

More information is available on other Berkeley Lab R&D projects focusing on cybersecurity in general, as well as specifically on cybersecurity for scientific and high-performance computing.