Data within biobanks capture broad yet detailed indices of human variation, but biobank-wide insights can be difficult to extract due to complexity and scale. Here, using large-scale factor analysis, we distill hundreds of variables (diagnoses, assessments and survey items) into 35 latent constructs, using data from unrelated individuals with predominantly estimated European genetic ancestry in UK Biobank. These factors recapitulate known disease classifications, disentangle elements of socioeconomic status, highlight the relevance of psychiatric constructs to health and improve measurement of pro-health behaviours. We go on to demonstrate the power of this approach to clarify genetic signal, enhance discovery and identify associations between underlying phenotypic structure and health outcomes. In building a deeper understanding of ways in which constructs such as socioeconomic status, trauma, or physical activity are structured in the dataset, we emphasize the importance of considering the interwoven nature of the human phenome when evaluating public health patterns.
Publications
2024
Underrepresented populations are often excluded from genomic studies owing in part to a lack of resources supporting their analyses. The 1000 Genomes Project (1kGP) and Human Genome Diversity Project (HGDP), which have recently been sequenced to high coverage, are valuable genomic resources because of the global diversity they capture and their open data sharing policies. Here, we harmonized a high-quality set of 4094 whole genomes from 80 populations in the HGDP and 1kGP with data from the Genome Aggregation Database (gnomAD) and identified over 153 million high-quality SNVs, indels, and SVs. We performed a detailed ancestry analysis of this cohort, characterizing population structure and patterns of admixture across populations, analyzing site frequency spectra, and measuring variant counts at global and subcontinental levels. We also show substantial added value from this data set compared with the prior versions of the component resources, typically combined via liftOver and variant intersection; for example, we catalog millions of new genetic variants, mostly rare, compared with previous releases. In addition to unrestricted individual-level public release, we provide detailed tutorials for conducting many of the most common quality-control steps and analyses with these data in a scalable cloud-computing environment and publicly release this new phased joint callset for use as a haplotype resource in phasing and imputation pipelines. This jointly called reference panel will serve as a key resource to support research of diverse ancestry populations.
MOTIVATION: The Variant Call Format (VCF) is widely used in genome sequencing but scales poorly. For instance, we estimate a 150 000 genome VCF would occupy 900 TiB, making it costly and complicated to produce, analyze, and store. The issue stems from VCF's requirement to densely represent both reference-genotypes and allele-indexed arrays. These requirements lead to unnecessary data duplication and, ultimately, very large files.
RESULTS: To address these challenges, we introduce the Scalable Variant Call Representation (SVCR). This representation reduces file sizes by ensuring they scale linearly with samples. SVCR's linear scaling relies on two techniques, both necessary for linearity: local allele indices and reference blocks, which were first introduced by the Genomic Variant Call Format. SVCR is also lossless and mergeable, allowing for N + 1 and N + K incremental joint-calling. We present two implementations of SVCR: SVCR-VCF, which encodes SVCR in VCF format, and VDS, which uses Hail's native format. Our experiments confirm the linear scalability of SVCR-VCF and VDS, in contrast to the super-linear growth seen with standard VCF files. We also discuss the VDS Combiner, a scalable, open-source tool for producing a VDS from GVCFs and unique features of VDS which enable rapid data analysis. SVCR, and VDS in particular, ensure the scientific community can generate, analyze, and disseminate genetics datasets with millions of samples.
AVAILABILITY AND IMPLEMENTATION: https://github.com/hail-is/hail/.
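The two techniques named in the abstract above, local allele indices and reference blocks, can be illustrated with a minimal sketch. It is not the SVCR, SVCR-VCF, or VDS implementation: the function names and the toy data layout are hypothetical and chosen only for illustration. The sketch shows why a dense per-genotype array grows with the number of alleles at a site, and how storing only the alleles a sample carries, and collapsing runs of homozygous-reference calls, keeps per-sample data small.

```python
from typing import List, Sequence, Tuple


def n_diploid_genotypes(n_alleles: int) -> int:
    """Number of unordered diploid genotypes for n alleles (reference included)."""
    return n_alleles * (n_alleles + 1) // 2


def local_allele_indices(global_alleles: Sequence[str],
                         carried: Sequence[str]) -> List[int]:
    """Hypothetical helper: indices into the site-wide allele list that one
    sample needs. The reference allele (index 0) is always kept."""
    indices = {global_alleles.index(allele) for allele in carried}
    indices.add(0)
    return sorted(indices)


def reference_blocks(calls: Sequence[Tuple[int, bool]]) -> List[Tuple[int, int]]:
    """Collapse consecutive homozygous-reference positions into (start, end)
    blocks. `calls` is a position-sorted list of (position, is_hom_ref)."""
    blocks: List[Tuple[int, int]] = []
    start = prev = None
    for pos, is_ref in calls:
        if is_ref and prev is not None and pos == prev + 1:
            prev = pos  # extend the current block
        else:
            if start is not None:
                blocks.append((start, prev))  # close the previous block
            start, prev = (pos, pos) if is_ref else (None, None)
    if start is not None:
        blocks.append((start, prev))
    return blocks


if __name__ == "__main__":
    site_alleles = ["A", "C", "G", "T", "AT"]  # toy multi-allelic site
    local = local_allele_indices(site_alleles, ["G"])
    print("local allele indices:", local)
    print("dense genotype entries:", n_diploid_genotypes(len(site_alleles)))
    print("local genotype entries:", n_diploid_genotypes(len(local)))
    print("blocks:", reference_blocks([(1, True), (2, True), (3, False), (4, True)]))
```

In this toy case the per-sample array depends on the alleles the sample carries rather than on every allele seen across the cohort, which is the intuition behind the linear scaling described in the abstract.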
Although three-dimensional (3D) genome architecture is crucial for gene regulation, its role in disease remains elusive. We traced the evolution and malignant transformation of colorectal cancer (CRC) by generating high-resolution chromatin conformation maps of 33 colon samples spanning different stages of early neoplastic growth in persons with familial adenomatous polyposis (FAP). Our analysis revealed a substantial progressive loss of genome-wide cis-regulatory connectivity at early malignancy stages, correlating with nonlinear gene regulation effects. Genes with high promoter-enhancer (P-E) connectivity in unaffected mucosa were not linked to elevated baseline expression but tended to be upregulated in advanced stages. Inhibiting highly connected promoters preferentially represses gene expression in CRC cells compared to normal colonic epithelial cells. Our results suggest a two-phase model whereby neoplastic transformation reduces P-E connectivity from a redundant state to a rate-limiting one for transcriptional levels, highlighting the intricate interplay between 3D genome architecture and gene regulation during early CRC progression.
BACKGROUND: Cystic fibrosis (CF) is a rare and debilitating autosomal recessive disorder. It hampers the normal function of various organs and causes severe damage to the lungs and digestive system, leading to recurring pneumonia. CF also affects reproductive health and may eventually cause infertility. The disease manifests due to genetic aberrations in the cystic fibrosis transmembrane conductance regulator (CFTR) gene. This study aimed to screen for CFTR gene variants in Pakistani CF patients representing variable phenotypes.
METHODS: Clinical exome and Sanger sequencing were performed after clinical characterization of 25 suspected cases of CF (CF1-CF25). ACMG guidelines were followed to interpret the clinical significance of the identified variants.
RESULTS: Clinical investigations revealed common phenotypes such as pancreatic insufficiency, chest infections, and chronic liver and lung diseases. Some patients also displayed symptoms such as gastroesophageal reflux disease (GERD), neonatal cholestasis, acrodermatitis, diabetes mellitus, and abnormal malabsorptive stools. Genetic analysis of the 25 CF patients identified deleterious variants in the CFTR gene. Notably, 12% of patients showed compound heterozygous variants, while 88% had homozygous variants. The most prevalent variant was p.(Met1Thr or Met1?) at 24%, not previously reported in the Pakistani population. The second most common variant was p.(Phe508del) at 16%. Other variants, including p.(Leu218*), p.(Tyr569Asp), p.(Glu585Ter), and p.(Arg1162*), were also identified in the present study. Genetic analysis of one patient showed a pathogenic variant in G6PD in addition to CFTR.
CONCLUSION: The study reports both novel and previously reported variants in the CFTR gene in Pakistani CF patients with distinct phenotypes. It also emphasizes screening suspected Pakistani CF patients for the p.(Met1Thr) variant because of its high prevalence in the study. Moreover, the findings highlight the value of searching for additional pathogenic variants in the genome of CF patients, which may modify the phenotypes. The findings contribute valuable information for the diagnosis, genetic counseling, and potential therapeutic strategies for CF patients in Pakistan.
The phenotypic impact of compound heterozygous (CH) variation has not been investigated at the population scale. We phased rare variants (MAF ∼0.001%) in the UK Biobank (UKBB) exome-sequencing data to characterize recessive effects in 175,587 individuals across 311 common diseases. A total of 6.5% of individuals carry putatively damaging CH variants, 90% of which are only identifiable upon phasing rare variants (MAF < 0.38%). We identify six recessive gene-trait associations (p < 1.68 × 10⁻⁷) after accounting for relatedness, polygenicity, nearby common variants, and rare variant burden. Of these, just one is discovered when considering homozygosity alone. Using longitudinal health records, we additionally identify and replicate a novel association between bi-allelic variation in ATP2C2 and an earlier age at onset of chronic obstructive pulmonary disease (COPD) (p < 3.58 × 10⁻⁸). Genetic phase contributes to disease risk for gene-trait pairs: ATP2C2-COPD (p = 0.000238), FLG-asthma (p = 0.00205), and USH2A-visual impairment (p = 0.0084). We demonstrate the power of phasing large-scale genetic cohorts to discover phenome-wide consequences of compound heterozygosity.
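Why phasing matters for compound heterozygosity can be shown with a small sketch. It is not the pipeline used in the study: the function name, the category labels, and the tuple encoding of phased genotypes are assumptions made for illustration. Each genotype is a pair (haplotype 0, haplotype 1), with 1 marking a putatively damaging allele at one variant within a single gene.

```python
from typing import Sequence, Tuple

PhasedGenotype = Tuple[int, int]  # (allele on haplotype 0, allele on haplotype 1)


def classify_gene_carrier(genotypes: Sequence[PhasedGenotype]) -> str:
    """Hypothetical classifier for one individual's damaging variants in one gene.

    Returns one of:
      "homozygous"    - at least one variant is damaging on both haplotypes
      "compound_het"  - different variants hit each haplotype (in trans)
      "heterozygous"  - only one haplotype is hit, including variants in cis
      "non_carrier"   - no damaging allele on either haplotype
    """
    if any(gt == (1, 1) for gt in genotypes):
        return "homozygous"
    hit_hap0 = any(gt[0] == 1 for gt in genotypes)
    hit_hap1 = any(gt[1] == 1 for gt in genotypes)
    if hit_hap0 and hit_hap1:
        return "compound_het"
    if hit_hap0 or hit_hap1:
        return "heterozygous"
    return "non_carrier"


if __name__ == "__main__":
    # Two heterozygous variants look identical without phase information,
    # but only the in-trans configuration leaves no intact copy of the gene.
    print(classify_gene_carrier([(0, 1), (1, 0)]))  # variants in trans
    print(classify_gene_carrier([(0, 1), (0, 1)]))  # variants in cis
```

Unphased data would report both examples as "two heterozygous variants", which is why carriers of this kind are only identifiable after phasing.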
BACKGROUND: Bone infections with Staphylococcus aureus are notoriously difficult to treat and have high recurrence rates. Local antibiotic delivery systems hold the potential to achieve high in situ antibiotic concentrations, which are otherwise challenging to achieve via systemic administration. Existing solutions have been shown to confer suboptimal drug release and distribution. Here we present and evaluate an injectable in situ-forming depot system termed CarboCell. The CarboCell technology provides sustained and tuneable release of local high-dose antibiotics.
METHODS: CarboCell formulations of levofloxacin or clindamycin with or without antimicrobial adjuvants cis-2-decenoic acid or cis-11-methyl-2-dodecenoic acid were tested in experimental rodent and porcine implant-associated osteomyelitis models. In the porcine models, debridement and treatment with CarboCell-formulated antibiotics was carried out without systemic antibiotic administration. The bacterial burden was determined by quantitative bacteriology.
RESULTS: CarboCell formulations eliminated S. aureus in infected implant rat models. In the translational implant-associated pig model, surgical debridement and injection of clindamycin-releasing CarboCell formulations resulted in pathogen-free bone tissues and implants in 9 of 12 and full eradication in 5 of 12 pigs.
CONCLUSIONS: Sustained release of antimicrobial agents mediated by the CarboCell technology demonstrated promising therapeutic efficacy in challenging translational models and may be beneficial in combination with the current standard of care.
The operational building data presented in this paper were collected from six office rooms in a building used for research and education on the main campus of Aalborg University in Denmark. The dataset consists of measurements of occupancy, indoor environmental quality, and room-level and system-level heating, ventilation and lighting operation at a 5 min resolution. The indoor environmental quality and building system data were collected from the building management system. The occupancy level in each monitored room is established from computer vision-based analysis of wall-mounted camera footage of each office. The number of people present in the room is estimated using the YOLOv5s image recognition algorithm. The present dataset can be used for occupancy analysis, indoor environmental quality investigations, machine learning, and model predictive control.
Postnatal genomic regulation significantly influences tissue and organ maturation but is under-studied relative to existing genomic catalogs of adult tissues or prenatal development in mouse. The ENCODE4 consortium generated the first comprehensive single-nucleus resource of postnatal regulatory events across a diverse set of mouse tissues. The collection spans seven postnatal time points, mirroring human development from childhood to adulthood, and encompasses five core tissues. We identified 30 cell types, further subdivided into 69 subtypes and cell states across adrenal gland, left cerebral cortex, hippocampus, heart, and gastrocnemius muscle. Our annotations cover both known and novel cell differentiation dynamics ranging from early hippocampal neurogenesis to a new sex-specific adrenal gland population during puberty. We used an ensemble Latent Dirichlet Allocation strategy with a curated vocabulary of 2,701 regulatory genes to identify regulatory "topics," each of which is a gene vector, linked to cell type differentiation, subtype specialization, and transitions between cell states. We find recurrent regulatory topics in tissue-resident macrophages, neural cell types, endothelial cells across multiple tissues, and cycling cells of the adrenal gland and heart. Cell-type-specific topics are enriched in transcription factors and microRNA host genes, while chromatin regulators dominate mitosis topics. Corresponding chromatin accessibility data reveal dynamic and sex-specific regulatory elements, with enriched motifs matching transcription factors in regulatory topics. Together, these analyses identify both tissue-specific and common regulatory programs in postnatal development across multiple tissues through the lens of the factors regulating transcription.
Familial adenomatous polyposis (FAP) is a genetic disease causing hundreds of premalignant polyps in affected persons and is an ideal model to study transitions of early precancer states to colorectal cancer (CRC). We performed deep multiomic profiling of 93 samples, including normal mucosa, benign polyps and dysplastic polyps, from six persons with FAP. Transcriptomic, proteomic, metabolomic and lipidomic analyses revealed a dynamic choreography of thousands of molecular and cellular events that occur during precancerous transitions toward cancer formation. These involve processes such as cell proliferation, immune response, metabolic alterations (including amino acids and lipids), hormones and extracellular matrix proteins. Interestingly, activation of the arachidonic acid pathway was found to occur early in hyperplasia; this pathway is targeted by aspirin and other nonsteroidal anti-inflammatory drugs, a preventative treatment under investigation in persons with FAP. Overall, our results reveal key genomic, cellular and molecular events during the earliest steps in CRC formation and potential mechanisms of pharmaceutical prophylaxis.