DNA sample contamination is a major issue in clinical and research applications of whole-genome and -exome sequencing. Even modest levels of contamination can substantially affect the overall quality of variant calls and lead to widespread genotyping errors. Currently, popular tools for estimating the contamination level use short-read data (BAM/CRAM files), which are expensive to store and manipulate and often not retained or shared widely. We propose a metric to estimate DNA sample contamination from variant-level whole-genome and -exome sequence data called CHARR, contamination from homozygous alternate reference reads, which leverages the infiltration of reference reads within homozygous alternate variant calls. CHARR uses a small proportion of variant-level genotype information and thus can be computed from single-sample gVCFs or callsets in VCF or BCF formats, as well as efficiently stored variant calls in Hail VariantDataset format. Our results demonstrate that CHARR accurately recapitulates results from existing tools with substantially reduced costs, improving the accuracy and efficiency of downstream analyses of ultra-large whole-genome and exome sequencing datasets.
Publications
2023
Adipocyte function is a major determinant of metabolic disease, warranting investigations of regulating mechanisms. We show at single-cell resolution that progenitor cells from four human brown and white adipose depots separate into two main cell fates, an adipogenic and a structural branch, developing from a common progenitor. The adipogenic gene signature contains mitochondrial activity genes, and associates with genome-wide association study traits for fat distribution. Based on an extracellular matrix and developmental gene signature, we name the structural branch of cells structural Wnt-regulated adipose tissue-resident (SWAT) cells. When stripped from adipogenic cells, SWAT cells display a multipotent phenotype by reverting towards progenitor state or differentiating into new adipogenic cells, dependent on media. Label transfer algorithms recapitulate the cell types in human adipose tissue datasets. In conclusion, we provide a differentiation map of human adipocytes and define the multipotent SWAT cell, providing a new perspective on adipose tissue regulation.
Polygenic risk scores (PRSs) are expected to play a critical role in precision medicine. Currently, PRS predictors are generally based on linear models using summary statistics, and more recently individual-level data. However, these predictors mainly capture additive relationships and are limited in data modalities they can use. We developed a deep learning framework (EIR) for PRS prediction which includes a model, genome-local-net (GLN), specifically designed for large-scale genomics data. The framework supports multi-task learning, automatic integration of other clinical and biochemical data, and model explainability. When applied to individual-level data from the UK Biobank, the GLN model demonstrated a competitive performance compared to established neural network architectures, particularly for certain traits, showcasing its potential in modeling complex genetic relationships. Furthermore, the GLN model outperformed linear PRS methods for Type 1 Diabetes, likely due to modeling non-additive genetic effects and epistasis. This was supported by our identification of widespread non-additive genetic effects and epistasis in the context of T1D. Finally, we constructed PRS models that integrated genotype, blood, urine, and anthropometric data and found that this improved performance for 93% of the 290 diseases and disorders considered. EIR is available at https://github.com/arnor-sigurdsson/EIR.
We meta-analyzed array data imputed with the TOPMed reference panel and whole-genome sequence (WGS) datasets and performed the largest, rare variant (minor allele frequency as low as 5×10-5) GWAS meta-analysis of type 2 diabetes (T2D) comprising 51,256 cases and 370,487 controls. We identified 52 novel variants at genome-wide significance (p<5 × 10-8), including 8 novel variants that were either rare or ancestry-specific. Among them, we identified a rare missense variant in HNF4A p.Arg114Trp (OR=8.2, 95% confidence interval [CI]=4.6-14.0, p = 1.08×10-13), previously reported as a variant implicated in Maturity Onset Diabetes of the Young (MODY) with incomplete penetrance. We demonstrated that the diabetes risk in carriers of this variant was modulated by a T2D common variant polygenic risk score (cvPRS) (carriers in the top PRS tertile [OR=18.3, 95%CI=7.2-46.9, p=1.2×10-9] vs carriers in the bottom PRS tertile [OR=2.6, 95% CI=0.97-7.09, p = 0.06]. Association results identified eight variants of intermediate penetrance (OR>5) in monogenic diabetes (MD), which in aggregate as a rare variant PRS were associated with T2D in an independent WGS dataset (OR=4.7, 95% CI=1.86-11.77], p = 0.001). Our data also provided support evidence for 21% of the variants reported in ClinVar in these MD genes as benign based on lack of association with T2D. Our work provides a framework for using rare variant imputation and WGS analyses in large-scale population-based association studies to identify large-effect rare variants and provide evidence for informing variant pathogenicity.
Sequencing has revealed hundreds of millions of human genetic variants, and continued efforts will only add to this variant avalanche. Insufficient information exists to interpret the effects of most variants, limiting opportunities for precision medicine and comprehension of genome function. A solution lies in experimental assessment of the functional effect of variants, which can reveal their biological and clinical impact. However, variant effect assays have generally been undertaken reactively for individual variants only after and, in most cases long after, their first observation. Now, multiplexed assays of variant effect can characterise massive numbers of variants simultaneously, yielding variant effect maps that reveal the function of every possible single nucleotide change in a gene or regulatory element. Generating maps for every protein encoding gene and regulatory element in the human genome would create an 'Atlas' of variant effect maps and transform our understanding of genetics and usher in a new era of nucleotide-resolution functional knowledge of the genome. An Atlas would reveal the fundamental biology of the human genome, inform human evolution, empower the development and use of therapeutics and maximize the utility of genomics for diagnosing and treating disease. The Atlas of Variant Effects Alliance is an international collaborative group comprising hundreds of researchers, technologists and clinicians dedicated to realising an Atlas of Variant Effects to help deliver on the promise of genomics.
Response to survey questionnaires is vital for social and behavioural research, and most analyses assume full and accurate response by participants. However, nonresponse is common and impedes proper interpretation and generalizability of results. We examined item nonresponse behaviour across 109 questionnaire items in the UK Biobank (N = 360,628). Phenotypic factor scores for two participant-selected nonresponse answers, 'Prefer not to answer' (PNA) and 'I don't know' (IDK), each predicted participant nonresponse in follow-up surveys (incremental pseudo-R2 = 0.056), even when controlling for education and self-reported health (incremental pseudo-R2 = 0.046). After performing genome-wide association studies of our factors, PNA and IDK were highly genetically correlated with one another (rg = 0.73 (s.e. = 0.03)) and with education (rg,PNA = -0.51 (s.e. = 0.03); rg,IDK = -0.38 (s.e. = 0.02)), health (rg,PNA = 0.51 (s.e. = 0.03); rg,IDK = 0.49 (s.e. = 0.02)) and income (rg,PNA = -0.57 (s.e. = 0.04); rg,IDK = -0.46 (s.e. = 0.02)), with additional unique genetic associations observed for both PNA and IDK (P < 5 × 10-8). We discuss how these associations may bias studies of traits correlated with item nonresponse and demonstrate how this bias may substantially affect genome-wide association studies. While the UK Biobank data are deidentified, we further protected participant privacy by avoiding exploring non-response behaviour to single questions, assuring that no information can be used to associate results with any particular respondents.
Linkage disequilibrium (LD) is the correlation among nearby genetic variants. In genetic association studies, LD is often modeled using large correlation matrices, but this approach is inefficient, especially in ancestrally diverse studies. In the present study, we introduce LD graphical models (LDGMs), which are an extremely sparse and efficient representation of LD. LDGMs are derived from genome-wide genealogies; statistical relationships among alleles in the LDGM correspond to genealogical relationships among haplotypes. We published LDGMs and ancestry-specific LDGM precision matrices for 18 million common variants (minor allele frequency >1%) in five ancestry groups, validated their accuracy and demonstrated order-of-magnitude improvements in runtime for commonly used LD matrix computations. We implemented an extremely fast multiancestry polygenic prediction method, BLUPx-ldgm, which performs better than a similar method based on the reference LD correlation matrix. LDGMs will enable sophisticated methods that scale to ancestrally diverse genetic association data across millions of variants and individuals.
Autism spectrum disorders (ASDs) have been linked to genes with enriched expression in the brain, but it is unclear how these genes converge into cell-type-specific networks. We built a protein-protein interaction network for 13 ASD-associated genes in human excitatory neurons derived from induced pluripotent stem cells (iPSCs). The network contains newly reported interactions and is enriched for genetic and transcriptional perturbations observed in individuals with ASDs. We leveraged the network data to show that the ASD-linked brain-specific isoform of ANK2 is important for its interactions with synaptic proteins and to characterize a PTEN-AKAP8L interaction that influences neuronal growth. The IGF2BP1-3 complex emerged as a convergent point in the network that may regulate a transcriptional circuit of ASD-associated genes. Our findings showcase cell-type-specific interactomes as a framework to complement genetic and transcriptomic data and illustrate how both individual and convergent interactions can lead to biological insights into ASDs.
Genetics have nominated many schizophrenia risk genes and identified convergent signals between schizophrenia and neurodevelopmental disorders. However, functional interpretation of the nominated genes in the relevant brain cell types is often lacking. We executed interaction proteomics for six schizophrenia risk genes that have also been implicated in neurodevelopment in human induced cortical neurons. The resulting protein network is enriched for common variant risk of schizophrenia in Europeans and East Asians, is down-regulated in layer 5/6 cortical neurons of individuals affected by schizophrenia, and can complement fine-mapping and eQTL data to prioritize additional genes in GWAS loci. A sub-network centered on HCN1 is enriched for common variant risk and contains proteins (HCN4 and AKAP11) enriched for rare protein-truncating mutations in individuals with schizophrenia and bipolar disorder. Our findings showcase brain cell-type-specific interactomes as an organizing framework to facilitate interpretation of genetic and transcriptomic data in schizophrenia and its related disorders.
Single-cell and single-nucleus RNA-sequencing (sxRNA-seq) measures gene expression in individual cells or nuclei enabling comprehensive characterization of cell types and states. However, isolation of cells or nuclei for sxRNA-seq releases contaminating RNA, which can distort biological signals, through, for example, cell damage and transcript leakage. Thus, identifying barcodes containing high-quality cells or nuclei is a critical analytical step in the processing of sxRNA-seq data. Here, we present valiDrops, an automated method to identify high-quality barcodes and flag dead cells. In valiDrops, barcodes are initially filtered using data-adaptive thresholding on community-standard quality metrics, and subsequently, valiDrops uses a novel clustering-based approach to identify barcodes with distinct biological signals. We benchmark valiDrops and show that biological signals from cell types and states are more distinct, easier to separate and more consistent after filtering by valiDrops compared to existing tools. Finally, we show that valiDrops can predict and flag dead cells with high accuracy. This novel classifier can further improve data quality or be used to identify dead cells to interrogate the biology of cell death. Thus, valiDrops is an effective and easy-to-use method to improve data quality and biological interpretation. Our method is openly available as an R package at www.github.com/madsen-lab/valiDrops.