Publications

2024

Baek, Jinyoung, Jonathan Lawson, and Vasiliki Rahimzadeh. 2024. “Investigating the Roles and Responsibilities of Institutional Signing Officials After Data Sharing Policy Reform for Federally Funded Research in the United States: National Survey.” JMIR Formative Research 8: e49822. https://doi.org/10.2196/49822.

BACKGROUND: New federal policies along with rapid growth in data generation, storage, and analysis tools are together driving scientific data sharing in the United States. At the same time, triangulating human research data from diverse sources can also create situations where data are used for future research in ways that individuals and communities may consider objectionable. Institutional gatekeepers, namely, signing officials (SOs), are therefore at the helm of compliant management and sharing of human data for research. Of those with data governance responsibilities, SOs most often serve as signatories for investigators who deposit, access, and share research data between institutions. Although SOs play important leadership roles in compliant data sharing, we know surprisingly little about their scope of work, roles, and oversight responsibilities.

OBJECTIVE: The purpose of this study was to describe existing institutional policies and practices of US SOs who manage human genomic data access, as well as how these may change in the wake of new Data Management and Sharing requirements for National Institutes of Health-funded research in the United States.

METHODS: We administered an anonymous survey to institutional SOs recruited from biomedical research institutions across the United States. Survey items probed where data generated from extramurally funded research are deposited, how researchers outside the institution access these data, and what happens to these data after extramural funding ends.

RESULTS: In total, 56 institutional SOs participated in the survey. We found that SOs frequently approve duplicate data deposits and impose stricter access controls when data use limitations are unclear or unspecified. In addition, 21% (n=12) of SOs knew where data from federally funded projects are deposited after project funding sunsets. As a consequence, most investigators deposit their scientific data into "a National Institutes of Health-funded repository" to meet the Data Management and Sharing requirements but also within the "institution's own repository" or a third-party repository.

CONCLUSIONS: Our findings inform 5 policy recommendations and best practices for US SOs to improve coordination and develop comprehensive and consistent data governance policies that balance the need for scientific progress with effective human data protections.

Simeon-Dubach, Daniel, Zisis Kozlakidis, Juhi Tayal, Shannon J McCall, Wohaib Hasan, Fay Betsou, Jonathan Lawson, and Dominic Allen. 2024. “Experts Speak Forum: Implementation of the FAIR Principles in Biobanking Needs Fair Incentives.” Biopreservation and Biobanking 22 (6): 557-62. https://doi.org/10.1089/bio.2024.0153.

While the FAIR (Findable, Accessible, Interoperable, and Reusable) principles are primarily concerned with data, samples can also be considered a distinct category of data. In light of these considerations, the FAIR principles represent a major challenge for biobanks, as discussed in detail in two recently published studies. We invited seven experts with diverse backgrounds to share their views on these studies and the FAIR principles in general. The contributions are written from different perspectives, including those from human biobanks operating globally, located in low- or middle-income countries or in high-income countries, as well as those from industrial or environmental biobanks. The last two contributions focused on technical feasibility and the necessary incentives. All authors agreed that while the FAIR principles present a challenge for biobanks, they also offer opportunities. Various useful instruments already exist, and more will follow. The key is to provide meaningful incentives.

2023

Lawson, Jonathan, Elena M Ghanaim, Jinyoung Baek, Harin Lee, and Heidi L Rehm. 2023. “Aligning NIH’s Existing Data Use Restrictions to the GA4GH DUO Standard.” Cell Genomics 3 (9): 100381. https://doi.org/10.1016/j.xgen.2023.100381.

It is widely accepted that large-scale genomic data (e.g., whole-genome sequencing, whole-exome sequencing, and genome-wide association study data) be shared through a controlled-access mechanism. This protects the privacy of research participants and ensures downstream uses of data align with participants' informed consent regarding future sharing of their data. In 2019, GA4GH approved the Data Use Ontology (DUO) standard to define data use terms with machine-readable representations to represent how a dataset can be used. We endeavored to determine the parity of existing data use restrictions ("Data Use Limitations" [DULs]) for datasets registered in the National Institutes of Health database for Genotypes and Phenotypes (dbGaP) with the DUO standard. We found substantial (93%) parity between the dbGaP DULs (n = 3,575) and DUO. This study demonstrates the comprehensiveness of the DUO standard and encourages data stewards to standardize data use restrictions in machine-readable formats to facilitate data sharing.

2022

Schatz, Michael C, Anthony A Philippakis, Enis Afgan, Eric Banks, Vincent J Carey, Robert J Carroll, Alessandro Culotti, et al. 2022. “Inverting the Model of Genomics Data Sharing With the NHGRI Genomic Data Science Analysis, Visualization, and Informatics Lab-Space.” Cell Genomics 2 (1). https://doi.org/10.1016/j.xgen.2021.100085.

The NHGRI Genomic Data Science Analysis, Visualization, and Informatics Lab-space (AnVIL; https://anvilproject.org) was developed to address a widespread community need for a unified computing environment for genomics data storage, management, and analysis. In this perspective, we present AnVIL, describe its ecosystem and interoperability with other platforms, and highlight how this platform and associated initiatives contribute to improved genomic data sharing efforts. The AnVIL is a federated cloud platform designed to manage and store genomics and related data, enable population-scale analysis, and facilitate collaboration through the sharing of data, code, and analysis results. By inverting the traditional model of data sharing, the AnVIL eliminates the need for data movement while also adding security measures for active threat detection and monitoring and provides scalable, shared computing resources for any researcher. We describe the core data management and analysis components of the AnVIL, which currently consists of Terra, Gen3, Galaxy, RStudio/Bioconductor, Dockstore, and Jupyter, and describe several flagship genomics datasets available within the AnVIL. We continue to extend and innovate the AnVIL ecosystem by implementing new capabilities, including mechanisms for interoperability and responsible data sharing, while streamlining access management. The AnVIL opens many new opportunities for analysis, collaboration, and data sharing that are needed to drive research and to make discoveries through the joint analysis of hundreds of thousands to millions of genomes along with associated clinical and molecular data types.

Rahimzadeh, Vasiliki, Jonathan Lawson, Greg Rushton, and Edward S Dove. 2022. “Leveraging Algorithms to Improve Decision-Making Workflows for Genomic Data Access and Management.” Biopreservation and Biobanking 20 (5): 429-35. https://doi.org/10.1089/bio.2022.0042.

Studies on the ethics of automating clinical or research decision making using artificial intelligence and other algorithmic tools abound. Less attention has been paid, however, to the scope for, and ethics of, automating decision making within regulatory apparatuses governing the access, use, and exchange of data involving humans for research. In this article, we map how the binary logic flows and real-time capabilities of automated decision support (ADS) systems may be leveraged to accelerate one rate-limiting step in scientific discovery: data access management. We contend that improved auditability, consistency, and efficiency of the data access request process using ADS systems have the potential to yield fairer outcomes in requests for data largely sourced from biospecimens and biobanked samples. This procedural justice rationale reinforces a broader set of participant and data subject rights that data access committees (DACs) indirectly protect. DACs protect the rights of citizens to benefit from science by bringing researchers closer to the data they need to advance that science. DACs also protect the informational dignities of individuals and communities by ensuring the data being accessed are used in ways consistent with participant values. We discuss the development of the Global Alliance for Genomics and Health Data Use Ontology standard as a test case of ADS for genomic data access management specifically, and we synthesize relevant ethical, legal, and social challenges to its implementation in practice. We conclude with an agenda of future research needed to thoughtfully advance strategies for computational governance that endeavor to instill public trust in, and maximize the scientific value of, health-related human data across data types, environments, and user communities.

2021

Rehm, Heidi L, Angela J H Page, Lindsay Smith, Jeremy B Adams, Gil Alterovitz, Lawrence J Babb, Maxmillian P Barkley, et al. 2021. “GA4GH: International Policies and Standards for Data Sharing across Genomic Research and Healthcare.” Cell Genomics 1 (2). https://doi.org/10.1016/j.xgen.2021.100029.

The Global Alliance for Genomics and Health (GA4GH) aims to accelerate biomedical advances by enabling the responsible sharing of clinical and genomic data through both harmonized data aggregation and federated approaches. The decreasing cost of genomic sequencing (along with other genome-wide molecular assays) and increasing evidence of its clinical utility will soon drive the generation of sequence data from tens of millions of humans, with increasing levels of diversity. In this perspective, we present the GA4GH strategies for addressing the major challenges of this data revolution. We describe the GA4GH organization, which is fueled by the development efforts of eight Work Streams and informed by the needs of 24 Driver Projects and other key stakeholders. We present the GA4GH suite of secure, interoperable technical standards and policy frameworks and review the current status of standards, their relevance to key domains of research and clinical care, and future plans of GA4GH. Broad international participation in building, adopting, and deploying GA4GH standards and frameworks will catalyze an unprecedented effort in data sharing that will be critical to advancing genomic medicine and ensuring that all populations can access its benefits.

Cabili, Moran N, Jonathan Lawson, Andrea Saltzman, Greg Rushton, Pearl O’Rourke, John Wilbanks, Laura Lyman Rodriguez, et al. 2021. “Empirical Validation of an Automated Approach to Data Use Oversight.” Cell Genomics 1 (2): 100031. https://doi.org/10.1016/j.xgen.2021.100031.

The current paradigm for data use oversight of biomedical datasets is onerous, extending the timescale and resources needed to obtain access for secondary analyses, thus hindering scientific discovery. For a researcher to utilize a controlled-access dataset, a data access committee must review her research plans to determine whether they are consistent with the data use limitations (DULs) specified by the informed consent form. The newly created GA4GH data use ontology (DUO) holds the potential to streamline this process by making data use oversight computable. Here, we describe an open-source software platform, the Data Use Oversight System (DUOS), that connects with DUO terminology to enable automated data use oversight. We analyze dbGaP data acquired since 2006, finding an exponential increase in data access requests, which will not be sustainable with current manual oversight review. We perform an empirical evaluation of DUOS and DUO on selected datasets from the Broad Institute's data repository. We were able to structure 118/123 of the evaluated DULs (96%) and 52/52 (100%) of research proposals using DUO terminology, and we find that DUOS' automated data access adjudication in all cases agreed with the DAC manual review. This first empirical evaluation of the feasibility of automated data use oversight demonstrates comparable accuracy to human-based data access oversight in real-world data governance.

Voisin, Craig, Mikael Linden, Stephanie O M Dyke, Sarion R Bowers, Pinar Alper, Maxmillian P Barkley, David Bernick, et al. 2021. “GA4GH Passport Standard for Digital Identity and Access Permissions.” Cell Genomics 1 (2). https://doi.org/10.1016/j.xgen.2021.100030.

The Global Alliance for Genomics and Health (GA4GH) supports international standards that enable a federated data sharing model for the research community while respecting data security, ethical and regulatory frameworks, and data authorization and access processes for sensitive data. The GA4GH Passport standard (Passport) defines a machine-readable digital identity that conveys roles and data access permissions (called "visas") for individual users. Visas are issued by data stewards, including data access committees (DACs) working with public databases, the entities responsible for the quality, integrity, and access arrangements for the datasets in the management of human biomedical data. Passports streamline management of data access rights across data systems by using visas that present a data user's digital identity and permissions across organizations, tools, environments, and services. We describe real-world implementations of the GA4GH Passport standard in use cases from ELIXIR Europe, National Institutes of Health, and the Autism Sharing Initiative. These implementations demonstrate that the Passport standard has provided transparent mechanisms for establishing permissions and authorizing data access across platforms.

Lawson, Jonathan, Moran N Cabili, Giselle Kerry, Tiffany Boughtwood, Adrian Thorogood, Pinar Alper, Sarion R Bowers, et al. 2021. “The Data Use Ontology to Streamline Responsible Access to Human Biomedical Datasets.” Cell Genomics 1 (2). https://doi.org/10.1016/j.xgen.2021.100028.

Human biomedical datasets that are critical for research and clinical studies to benefit human health also often contain sensitive or potentially identifying information of individual participants. Thus, care must be taken when they are processed and made available to comply with ethical and regulatory frameworks and informed consent data conditions. To enable and streamline data access for these biomedical datasets, the Global Alliance for Genomics and Health (GA4GH) Data Use and Researcher Identities (DURI) work stream developed and approved the Data Use Ontology (DUO) standard. DUO is a hierarchical vocabulary of human and machine-readable data use terms that consistently and unambiguously represents a dataset's allowable data uses. DUO has been implemented by major international stakeholders such as the Broad and Sanger Institutes and is currently used in annotation of over 200,000 datasets worldwide. Using DUO in data management and access facilitates researchers' discovery and access of relevant datasets. DUO annotations increase the FAIRness of datasets and support data linkages using common data use profiles when integrating the data for secondary analyses. DUO is implemented in the Web Ontology Language (OWL) and, to increase community awareness and engagement, hosted in an open, centralized GitHub repository. DUO, together with the GA4GH Passport standard, offers a new, efficient, and streamlined data authorization and access framework that has enabled increased sharing of biomedical datasets worldwide.
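
As a rough illustration of what a hierarchical, machine-readable data use vocabulary makes possible, the sketch below checks whether a requested research purpose falls under a dataset's permitted use. It is a minimal sketch under stated assumptions: the three-term hierarchy, the `PARENT` table, and the `ancestors` and `is_permitted` helpers are hypothetical and only loosely modeled on DUO term names. It does not read the DUO OWL file and is not how DUO or DUOS is implemented.

```python
# Toy, hypothetical hierarchy of data use terms (narrower term -> broader term).
# Labels are loosely modeled on DUO term names; this is NOT the DUO ontology itself.
PARENT = {
    "disease-specific research": "health/medical/biomedical research",
    "health/medical/biomedical research": "general research use",
    "general research use": None,
}


def ancestors(term):
    """Return the term followed by each progressively broader term."""
    chain = []
    while term is not None:
        chain.append(term)
        term = PARENT[term]
    return chain


def is_permitted(dataset_term, request_term):
    """Assumed rule: a request is allowed when the dataset's permitted use
    is the same as, or broader than, the requested purpose."""
    return dataset_term in ancestors(request_term)


if __name__ == "__main__":
    # A dataset open to general research use covers a disease-specific request.
    print(is_permitted("general research use", "disease-specific research"))
    # A disease-specific dataset does not cover a general research request.
    print(is_permitted("disease-specific research", "general research use"))
```

A real check would also need to compare the specific disease named in a disease-specific restriction and any additional modifiers; the sketch leaves those out to keep the hierarchy lookup visible.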