Publications

2025

Bjune, Jan-Inge, Samantha Laber, Laurence Lawrence-Archer, Patrizia M C Nothnagel, Shuntaro Yamada, Xu Zhao, Pouda Panahandeh Strømland, et al. 2025. “IRX3 Controls a SUMOylation-Dependent Differentiation Switch in Adipocyte Precursor Cells.” Nature Communications 16 (1): 7248. https://doi.org/10.1038/s41467-025-62361-1.

Publisher's Version

IRX3 is linked to predisposition to obesity through the FTO locus and is upregulated during early adipogenesis in risk-allele carriers, shifting adipocyte fate toward fat storage. However, how this elevated IRX3 expression influences later developmental stages remains unclear. Here we show that IRX3 regulates adipocyte fate by modulating epigenetic reprogramming. ChIP-sequencing in preadipocytes identifies over 300 IRX3 binding sites, predominantly at promoters of genes involved in SUMOylation and chromatin remodeling. IRX3 knockout alters expression of SUMO pathway genes, increases global SUMOylation, and inhibits PPARγ activity and adipogenesis. Pharmacological SUMOylation inhibition rescues these effects. IRX3 KO also reduces SUMO occupancy at Wnt-related genes, enhancing Wnt signaling and promoting osteogenic fate in 3D cultures. This fate switch is partially reversible by SUMOylation inhibition. We identify IRX3 as a key transcriptional regulator of epigenetic programs, acting upstream of SUMOylation to maintain mesenchymal identity and support adipogenesis while suppressing osteogenesis in mouse embryonic fibroblasts.

Costanzo, Maria C, Beena Akolkar, Melina Claussnitzer, Jose C Florez, Anna L Gloyn, Struan F A Grant, Klaus H Kaestner, et al. 2025. “Accelerating Medicines Partnership in Type 2 Diabetes and Common Metabolic Diseases: Collaborating to Maximize the Value of Genetic and Genomic Data.” Diabetes 74 (7): 1089-98. https://doi.org/10.2337/db25-0042.

Publisher's Version

In the last two decades, significant progress has been made toward understanding the genetic basis of type 2 diabetes. An important supporter of this research has been the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK), most recently through the Accelerating Medicines Partnership Program for Type 2 Diabetes (AMP T2D) and Accelerating Medicines Partnership Program for Common Metabolic Diseases (AMP CMD). These public-private partnerships of the National Institutes of Health, multiple biopharmaceutical and life sciences companies, and nonprofit organizations, facilitated and managed by the Foundation for the National Institutes of Health, were designed to improve understanding of therapeutically relevant biological pathways for type 2 diabetes. On the occasion of NIDDK's 75th anniversary, we review the history of NIDDK support for these partnerships, which saw the convergence of research directions prioritized by academic consortia, the pharmaceutical industry, and government funders. Although the NIDDK was not the sole originator or funder of these efforts, its support and leadership have been pivotal to the partnerships' success and have enabled their research to be broadly accessible through the AMP Common Metabolic Diseases Knowledge Portal (CMDKP) and the AMP Common Metabolic Diseases Genome Atlas (CMDGA). Findings from AMP CMD align with NIDDK's mission to conduct research and share results with the goal of improving health and quality of life.

ARTICLE HIGHLIGHTS: The Accelerating Medicines Partnership Program for Type 2 Diabetes (AMP T2D) and Accelerating Medicines Partnership Program for Common Metabolic Diseases (AMP CMD) were created to accelerate the translation of genetic and genomic data into knowledge about the biology of disease. Their goal was to gain a better understanding of the mechanisms underlying types 1 and 2 diabetes and prediabetes, obesity, cardiovascular disease, kidney disease, and nonalcoholic steatohepatitis. This work identified multiple genes and pathways underlying these diseases. The findings of AMP T2D and AMP CMD have implications for drug development and improved risk prediction, diagnosis, and treatment for common metabolic diseases.

Wu, Jingyi, Nicolas Gonzalez Castro, Sofia Battaglia, Chadi A El Farran, Joshua P D’Antonio, Tyler E Miller, Mario L Suvà, and Bradley E Bernstein. 2025. “Evolving Cell States and Oncogenic Drivers During the Progression of IDH-Mutant Gliomas.” Nature Cancer 6 (1): 145-57. https://doi.org/10.1038/s43018-024-00865-3.

Publisher's Version

Isocitrate dehydrogenase (IDH) mutants define a class of gliomas that are initially slow-growing but inevitably progress to fatal disease. To characterize their malignant cell hierarchy, we profiled chromatin accessibility and gene expression across single cells from low-grade and high-grade IDH-mutant gliomas and ascertained their developmental states through a comparison to normal brain cells. We provide evidence that these tumors are initially fueled by slow-cycling oligodendrocyte progenitor cell-like cells. During progression, a more proliferative neural progenitor cell-like population expands, potentially through partial reprogramming of 'permissive' chromatin in progenitors. This transition is accompanied by a switch from methylation-based drivers to genetic ones. In low-grade IDH-mutant tumors or organoids, DNA hypermethylation appears to suppress interferon (IFN) signaling, which is induced by IDH or DNA methyltransferase 1 inhibitors. High-grade tumors frequently lose this hypermethylation and instead acquire genetic alterations that disrupt IFN and other tumor-suppressive programs. Our findings explain how these slow-growing tumors may progress to lethal malignancies and have implications for therapies that target their epigenetic underpinnings.

Javed, Nauman, Thomas Weingarten, Arijit Sehanobish, Adam Roberts, Avinava Dubey, Krzysztof Choromanski, and Bradley E Bernstein. 2025. “A Multi-Modal Transformer for Cell Type-Agnostic Regulatory Predictions.” Cell Genomics 5 (2): 100762. https://doi.org/10.1016/j.xgen.2025.100762.

Publisher's Version

Sequence-based deep learning models have emerged as powerful tools for deciphering the cis-regulatory grammar of the human genome but cannot generalize to unobserved cellular contexts. Here, we present EpiBERT, a multi-modal transformer that learns generalizable representations of genomic sequence and cell type-specific chromatin accessibility through a masked accessibility-based pre-training objective. Following pre-training, EpiBERT can be fine-tuned for gene expression prediction, achieving accuracy comparable to the sequence-only Enformer model, while also being able to generalize to unobserved cell states. The learned representations are interpretable and useful for predicting chromatin accessibility quantitative trait loci (caQTLs), regulatory motifs, and enhancer-gene links. Our work represents a step toward improving the generalization of sequence-based deep neural networks in regulatory genomics.

Yoshiji, Satoshi, Tianyuan Lu, Guillaume Butler-Laporte, Julia Carrasco-Zanini-Sanchez, Chen-Yang Su, Yiheng Chen, Kevin Liang, et al. 2025. “Integrative Proteogenomic Analysis Identifies COL6A3-Derived Endotrophin As a Mediator of the Effect of Obesity on Coronary Artery Disease.” Nature Genetics 57 (2): 345-57. https://doi.org/10.1038/s41588-024-02052-7.

Publisher's Version

Obesity strongly increases the risk of cardiometabolic diseases, yet the underlying mediators of this relationship are not fully understood. Given that obesity strongly influences circulating protein levels, we investigated proteins mediating the effects of obesity on coronary artery disease, stroke and type 2 diabetes. By integrating two-step proteome-wide Mendelian randomization, colocalization, epigenomics and single-cell RNA sequencing, we identified five mediators and prioritized collagen type VI α3 (COL6A3). COL6A3 levels were strongly increased by body mass index and increased coronary artery disease risk. Notably, the carboxyl terminus product of COL6A3, endotrophin, drove this effect. COL6A3 was highly expressed in disease-relevant cell types and tissues. Finally, we found that body fat reduction could reduce plasma levels of COL6A3-derived endotrophin, indicating a tractable way to modify endotrophin levels. In summary, we provide actionable insights into how circulating proteins mediate the effects of obesity on cardiometabolic diseases and prioritize endotrophin as a potential therapeutic target.

2024

Bernstein, Bradley E. 2024. “Helicase-Assisted Continuous Editing for Programmable Mutagenesis of Endogenous Genomes”. Science.

Publisher's Version

Barrès, Romain. 2024. “Exercise-Induced Crosstalk Between Immune Cells and Adipocytes in Humans: Role of Oncostatin-M”. Cell Rep Med.

Publisher's Version

Andersson, Robin. 2024. “MYC Activity at Enhancers Drives Prognostic Transcriptional Programs through an Epigenetic Switch”. Nat Genet.

Publisher's Version

Hrytsenko, Yana, Benjamin Shea, Michael Elgart, Nuzulul Kurniansyah, Genevieve Lyons, Alanna C Morrison, April P Carson, et al. 2024. “Machine Learning Models for Predicting Blood Pressure Phenotypes by Combining Multiple Polygenic Risk Scores.” Scientific Reports 14 (1): 12436. https://doi.org/10.1038/s41598-024-62945-9.

Publisher's Version

We construct non-linear machine learning (ML) prediction models for systolic and diastolic blood pressure (SBP, DBP) using demographic and clinical variables and polygenic risk scores (PRSs). We developed a two-model ensemble, consisting of a baseline model, where prediction is based on demographic and clinical variables only, and a genetic model, where we also include PRSs. We evaluate the use of a linear versus a non-linear model at both the baseline and the genetic model levels and assess the improvement in performance when incorporating multiple PRSs. We report the ensemble model's performance as percentage variance explained (PVE) on a held-out test dataset. A non-linear baseline model improved the PVEs from 28.1% to 30.1% (SBP) and from 14.3% to 17.4% (DBP) compared with a linear baseline model. Including seven PRSs in the genetic model, computed from the largest available GWAS of SBP/DBP, improved the genetic model PVE from 4.8% to 5.1% (SBP) and from 4.7% to 5% (DBP) compared with using a single PRS. Adding 14 additional PRSs computed from two independent GWASs further increased the genetic model PVE to 6.3% (SBP) and 5.7% (DBP). PVE differed across self-reported race/ethnicity groups, with primarily all non-White groups benefitting from the inclusion of additional PRSs. In summary, non-linear ML models improve BP prediction in models incorporating diverse populations.
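
The abstract above reports performance as percentage variance explained (PVE) on held-out data. As a rough illustration only, the sketch below computes PVE under the common definition 100 × (1 − residual sum of squares / total sum of squares); the paper's exact formula is not given here, so that definition is an assumption. The function name and the toy blood pressure values are hypothetical and are not data from the study.

```python
from statistics import fmean


def percent_variance_explained(y_true, y_pred):
    """Return 100 * (1 - SS_residual / SS_total) for paired observations.

    Assumed definition: PVE as R^2 expressed as a percentage.
    """
    mean_y = fmean(y_true)
    ss_total = sum((y - mean_y) ** 2 for y in y_true)
    ss_resid = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    return 100.0 * (1.0 - ss_resid / ss_total)


# Made-up held-out systolic blood pressure values (mmHg), for illustration only.
sbp_observed = [118.0, 131.0, 125.0, 142.0, 110.0]
# Hypothetical predictions from a baseline model (demographic/clinical variables only).
sbp_baseline = [122.0, 128.0, 124.0, 135.0, 116.0]
# Hypothetical predictions after also including polygenic risk scores.
sbp_with_prs = [120.0, 130.0, 124.0, 139.0, 112.0]

pve_baseline = percent_variance_explained(sbp_observed, sbp_baseline)
pve_with_prs = percent_variance_explained(sbp_observed, sbp_with_prs)
print(f"baseline PVE: {pve_baseline:.1f}%, with PRSs: {pve_with_prs:.1f}%")
```

Comparing the two PVE values on the same held-out individuals mirrors, in miniature, the kind of baseline-versus-genetic comparison the abstract describes.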

Pershad, Yash, Taralynn Mack, Hannah Poisner, Yasminka A Jakubek, Adrienne M Stilp, Braxton D Mitchell, Joshua P Lewis, et al. 2024. “Determinants of Mosaic Chromosomal Alteration Fitness.” Nature Communications 15 (1): 3800. https://doi.org/10.1038/s41467-024-48190-8.

Publisher's Version

Clonal hematopoiesis (CH) is characterized by the acquisition of a somatic mutation in a hematopoietic stem cell that results in a clonal expansion. These driver mutations can be single nucleotide variants in cancer driver genes or larger structural rearrangements called mosaic chromosomal alterations (mCAs). The factors that influence the variations in mCA fitness and ultimately result in different clonal expansion rates are not well understood. We used the Passenger-Approximated Clonal Expansion Rate (PACER) method to estimate clonal expansion rate as PACER scores for 6,381 individuals in the NHLBI TOPMed cohort with gain, loss, and copy-neutral loss of heterozygosity mCAs. Our mCA fitness estimates, derived by aggregating per-individual PACER scores, were correlated (R² = 0.49) with an alternative approach that estimated fitness of mCAs in the UK Biobank using population-level distributions of clonal fraction. Among individuals with JAK2 V617F clonal hematopoiesis of indeterminate potential or mCAs affecting the JAK2 gene on chromosome 9, PACER score was strongly correlated with erythrocyte count. In a cross-sectional analysis, genome-wide association study of estimates of mCA expansion rate identified a TCL1A locus variant associated with mCA clonal expansion rate, with suggestive variants in NRIP1 and TERT.

Recent Publications

Ghouse, Jonas, Gustav Ahlberg, Søren Albertsen Rand, Morten Salling Olesen, Bjarni Vilhjalmsson, Stefan Stender, and Henning Bundgaard. 2024. “Potential Influence of Risk Factor Control on the Association Between Lipoprotein(a) and Atherosclerotic Cardiovascular Disease.” Arteriosclerosis, Thrombosis, and Vascular Biology 44 (6): 1455-57. https://doi.org/10.1161/ATVBAHA.124.320990.

Publisher's Version

Albiñana, Clara, Zhihong Zhu, Nis Borbye-Lorenzen, Sanne Grundvad Boelt, Arieh S Cohen, Kristin Skogstrand, Naomi R Wray, et al. 2024. “Publisher Correction: Genetic Correlates of Vitamin D-Binding Protein and 25-Hydroxyvitamin D in Neonatal Dried Blood Spots.” Nature Communications 15 (1): 1741. https://doi.org/10.1038/s41467-024-46199-7.

Publisher's Version

Chen, Siwei, Laurent C Francioli, Julia K Goodrich, Ryan L Collins, Masahiro Kanai, Qingbo Wang, Jessica Alföldi, et al. 2024. “Author Correction: A Genomic Mutational Constraint Map Using Variation in 76,156 Human Genomes.” Nature 626 (7997): E1. https://doi.org/10.1038/s41586-024-07050-7.

Publisher's Version

Larsen, Janne Tidselbak, Zeynep Yilmaz, Cynthia M Bulik, Clara Albiñana, Bjarni Jóhann Vilhjálmsson, Preben Bo Mortensen, and Liselotte Vogdrup Petersen. 2024. “Diagnosed Eating Disorders in Danish Registers - Incidence, Prevalence, Mortality, and Polygenic Risk.” Psychiatry Research 337: 115927. https://doi.org/10.1016/j.psychres.2024.115927.

Publisher's Version

Eating disorders are a group of severe and potentially enduring psychiatric disorders associated with increased mortality. Compared to other severe mental illnesses, they have received relatively limited research attention. Epidemiological studies often only report relative measures despite these being difficult to interpret and having limited practical use. The aims of this study were to evaluate the incidence and prevalence of diagnosed anorexia nervosa (AN), bulimia nervosa, and eating disorder not otherwise specified recorded in Danish hospital registers and estimate both relative and absolute measures of subsequent mortality, both all-cause and cause-specific, in a general nationwide population of 1,667,374 individuals. In a smaller, genetically informed case-cohort sample, the prediction of polygenic scores for AN, body fat percentage, and body mass index on AN prevalence and severity was estimated. Although males are less likely to be diagnosed with an eating disorder, those who are diagnosed have significantly increased rates of mortality. AN prevalence was highest for individuals with high AN and low body fat percentage/body mass index polygenic scores.

Wimberley, Theresa, Isabell Brikell, Aske Astrup, Janne T Larsen, Liselotte Petersen V, Clara Albiñana, Bjarni J Vilhjálmsson, et al. 2024. “Shared Familial Risk for Type 2 Diabetes Mellitus and Psychiatric Disorders: A Nationwide Multigenerational Genetics Study.” Psychological Medicine 54 (11): 2976-85. https://doi.org/10.1017/S0033291724001053.

Publisher's Version

BACKGROUND: Psychiatric disorders and type 2 diabetes mellitus (T2DM) are heritable, polygenic, and often comorbid conditions, yet knowledge about their potential shared familial risk is lacking. We used family designs and T2DM polygenic risk score (T2DM-PRS) to investigate the genetic associations between psychiatric disorders and T2DM.
METHODS: We linked 659 906 individuals born in Denmark 1990-2000 to their parents, grandparents, and aunts/uncles using population-based registers. We compared rates of T2DM in relatives of children with and without a diagnosis of any or one of 11 specific psychiatric disorders, including neuropsychiatric and neurodevelopmental disorders, using Cox regression. In a genotyped sample (iPSYCH2015) of individuals born 1981-2008 (n = 134 403), we used logistic regression to estimate associations between a T2DM-PRS and these psychiatric disorders.
RESULTS: Among 5 235 300 relative pairs, relatives of individuals with a psychiatric disorder had an increased risk for T2DM with stronger associations for closer relatives (parents: hazard ratio = 1.38, 95% confidence interval 1.35-1.42; grandparents: 1.14, 1.13-1.15; and aunts/uncles: 1.19, 1.16-1.22). In the genetic sample, one standard deviation increase in T2DM-PRS was associated with an increased risk for any psychiatric disorder (odds ratio = 1.11, 1.08-1.14). Both familial T2DM and T2DM-PRS were significantly associated with seven of 11 psychiatric disorders, most strongly with attention-deficit/hyperactivity disorder and conduct disorder, and inversely with anorexia nervosa.
CONCLUSIONS: Our findings of familial co-aggregation and higher T2DM polygenic liability associated with psychiatric disorders point toward shared familial risk. This suggests that part of the comorbidity is explained by shared familial risks. The underlying mechanisms still remain largely unknown and the contributions of genetics and environment need further investigation.

Vilhjálmsson, Bjarni Jóhann. 2024. “Towards Fair and Clinically Relevant Polygenic Predictions.” Trends in Genetics: TIG 40 (5): 379-80. https://doi.org/10.1016/j.tig.2024.04.002.

Publisher's Version

Lennon et al. recently proposed a clinical polygenic score (PGS) pipeline as part of the Electronic Medical Records and Genomics (eMERGE) network initiative. In this spotlight article we discuss the broader context for the use of PGS in preventive medicine and highlight key limitations and challenges facing their inclusion in prediction models.

Chen, Siwei, Laurent C Francioli, Julia K Goodrich, Ryan L Collins, Masahiro Kanai, Qingbo Wang, Jessica Alföldi, et al. 2024. “A Genomic Mutational Constraint Map Using Variation in 76,156 Human Genomes.” Nature 625 (7993): 92-100. https://doi.org/10.1038/s41586-023-06045-0.

Publisher's Version

The depletion of disruptive variation caused by purifying natural selection (constraint) has been widely used to investigate protein-coding genes underlying human disorders, but attempts to assess constraint for non-protein-coding regions have proved more difficult. Here we aggregate, process and release a dataset of 76,156 human genomes from the Genome Aggregation Database (gnomAD), the largest public open-access human genome allele frequency reference dataset, and use it to build a genomic constraint map for the whole genome (genomic non-coding constraint of haploinsufficient variation (Gnocchi)). We present a refined mutational model that incorporates local sequence context and regional genomic features to detect depletions of variation. As expected, the average constraint for protein-coding sequences is stronger than that for non-coding regions. Within the non-coding genome, constrained regions are enriched for known regulatory elements and variants that are implicated in complex human diseases and traits, facilitating the triangulation of biological annotation, disease association and natural selection to non-coding DNA analysis. More constrained regulatory elements tend to regulate more constrained protein-coding genes, which in turn suggests that non-coding constraint can aid the identification of constrained genes that are as yet unrecognized by current gene constraint metrics. We demonstrate that this genome-wide constraint map improves the identification and interpretation of functional human genetic variation.

Koenig, Zan, Mary T Yohannes, Lethukuthula L Nkambule, Xuefang Zhao, Julia K Goodrich, Heesu Ally Kim, Michael W Wilson, et al. 2024. “A Harmonized Public Resource of Deeply Sequenced Diverse Human Genomes.” Genome Research 34 (5): 796-809. https://doi.org/10.1101/gr.278378.123.

Publisher's Version

Underrepresented populations are often excluded from genomic studies owing in part to a lack of resources supporting their analyses. The 1000 Genomes Project (1kGP) and Human Genome Diversity Project (HGDP), which have recently been sequenced to high coverage, are valuable genomic resources because of the global diversity they capture and their open data sharing policies. Here, we harmonized a high-quality set of 4094 whole genomes from 80 populations in the HGDP and 1kGP with data from the Genome Aggregation Database (gnomAD) and identified over 153 million high-quality SNVs, indels, and SVs. We performed a detailed ancestry analysis of this cohort, characterizing population structure and patterns of admixture across populations, analyzing site frequency spectra, and measuring variant counts at global and subcontinental levels. We also show substantial added value from this data set compared with the prior versions of the component resources, typically combined via liftOver and variant intersection; for example, we catalog millions of new genetic variants, mostly rare, compared with previous releases. In addition to unrestricted individual-level public release, we provide detailed tutorials for conducting many of the most common quality-control steps and analyses with these data in a scalable cloud-computing environment and publicly release this new phased joint callset for use as a haplotype resource in phasing and imputation pipelines. This jointly called reference panel will serve as a key resource to support research of diverse ancestry populations.

Carey, Caitlin E, Rebecca Shafee, Robbee Wedow, Amanda Elliott, Duncan S Palmer, John Compitello, Masahiro Kanai, et al. 2024. “Principled Distillation of UK Biobank Phenotype Data Reveals Underlying Structure in Human Variation.” Nature Human Behaviour 8 (8): 1599-1615. https://doi.org/10.1038/s41562-024-01909-5.

Publisher's Version

Data within biobanks capture broad yet detailed indices of human variation, but biobank-wide insights can be difficult to extract due to complexity and scale. Here, using large-scale factor analysis, we distill hundreds of variables (diagnoses, assessments and survey items) into 35 latent constructs, using data from unrelated individuals with predominantly estimated European genetic ancestry in UK Biobank. These factors recapitulate known disease classifications, disentangle elements of socioeconomic status, highlight the relevance of psychiatric constructs to health and improve measurement of pro-health behaviours. We go on to demonstrate the power of this approach to clarify genetic signal, enhance discovery and identify associations between underlying phenotypic structure and health outcomes. In building a deeper understanding of ways in which constructs such as socioeconomic status, trauma, or physical activity are structured in the dataset, we emphasize the importance of considering the interwoven nature of the human phenome when evaluating public health patterns.

Poterba, Timothy, Christopher Vittal, Daniel King, Daniel Goldstein, Jacqueline I Goldstein, Patrick Schultz, Konrad J Karczewski, Cotton Seed, and Benjamin M Neale. 2024. “The Scalable Variant Call Representation: Enabling Genetic Analysis Beyond One Million Genomes.” Bioinformatics (Oxford, England) 41 (1). https://doi.org/10.1093/bioinformatics/btae746.

Publisher's Version

MOTIVATION: The Variant Call Format (VCF) is widely used in genome sequencing but scales poorly. For instance, we estimate a 150 000 genome VCF would occupy 900 TiB, making it costly and complicated to produce, analyze, and store. The issue stems from VCF's requirement to densely represent both reference-genotypes and allele-indexed arrays. These requirements lead to unnecessary data duplication and, ultimately, very large files.
RESULTS: To address these challenges, we introduce the Scalable Variant Call Representation (SVCR). This representation reduces file sizes by ensuring they scale linearly with samples. SVCR's linear scaling relies on two techniques, both necessary for linearity: local allele indices and reference blocks, which were first introduced by the Genomic Variant Call Format. SVCR is also lossless and mergeable, allowing for N + 1 and N + K incremental joint-calling. We present two implementations of SVCR: SVCR-VCF, which encodes SVCR in VCF format, and VDS, which uses Hail's native format. Our experiments confirm the linear scalability of SVCR-VCF and VDS, in contrast to the super-linear growth seen with standard VCF files. We also discuss the VDS Combiner, a scalable, open-source tool for producing a VDS from GVCFs and unique features of VDS which enable rapid data analysis. SVCR, and VDS in particular, ensure the scientific community can generate, analyze, and disseminate genetics datasets with millions of samples.
AVAILABILITY AND IMPLEMENTATION: https://github.com/hail-is/hail/.
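
The two techniques named in the abstract above, local allele indices and reference blocks, can be pictured with a small toy sketch. This is not the Hail or SVCR implementation: the function names, the dictionary keys, and the record layout are hypothetical, and the sketch assumes every sample at a site shares the same reference allele.

```python
def merge_site(sample_records):
    """Merge per-sample calls at one site, keeping per-sample arrays short.

    sample_records maps sample -> (alleles, genotype, depths), where
    alleles[0] is the reference allele, genotype holds indices into that
    sample's own allele list, and depths has one entry per local allele.
    """
    first_alleles = next(iter(sample_records.values()))[0]
    site_alleles = [first_alleles[0]]  # shared reference allele (assumption)
    for alleles, _, _ in sample_records.values():
        for alt in alleles[1:]:
            if alt not in site_alleles:
                site_alleles.append(alt)

    entries = {}
    for sample, (alleles, genotype, depths) in sample_records.items():
        entries[sample] = {
            # Map each local allele to its index in the site-level list.
            "local_alleles": [site_alleles.index(a) for a in alleles],
            # Genotype and depths stay indexed by the local alleles, so their
            # length does not grow as other samples add alleles to the site.
            "local_genotype": genotype,
            "local_depths": depths,
        }
    return site_alleles, entries


def reference_blocks(hom_ref_positions):
    """Collapse sorted homozygous-reference positions into inclusive runs."""
    blocks = []
    for pos in hom_ref_positions:
        if blocks and pos == blocks[-1][1] + 1:
            blocks[-1] = (blocks[-1][0], pos)  # extend the current run
        else:
            blocks.append((pos, pos))  # start a new run
    return blocks


# Toy input: three samples at one site, two of them carrying different alts.
records = {
    "s1": (["A", "T"], (0, 1), [12, 9]),
    "s2": (["A", "G"], (1, 1), [0, 20]),
    "s3": (["A"], (0, 0), [30]),
}
site_alleles, entries = merge_site(records)
blocks = reference_blocks([100, 101, 102, 200, 201])
```

In a dense encoding, each sample's depth array at this site would need one slot per site-level allele. Here each sample keeps only the slots for alleles it actually carries, which is the intuition behind the size savings the abstract describes.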