This article has been translated into বাংলা, Català, Deutsch, Ελληνικά, Español, فارسی, Suomi, Français, עברית, हिन्दी, Magyar, Italiano, 日本語, Bahasa Melayu, Nederlands, Polski, Português, Русский, Српски, தமிழ், ייִדיש and 汉语.
March 02, 2021
Written by Minttu Marttila, Annika Faucon, Nirmal Vadgama, Shea Andrews, Brooke Wolford, and Kumar Veerapen on behalf of the COVID-19 HGI
Note: The COVID-19 Host Genetics Initiative (HGI) represents a consortium of over 2000 scientists from over 54 countries working collaboratively to share data and ideas, recruit patients, and disseminate our findings. For a primer on our study design, please read our inaugural blog post. Our research is iterative, and we summarize our new results via blog posts and on the results section of our website. Finally, if any vocabulary here is unfamiliar, please send us an email at hgi-faq@icda.bio; we’d be happy to update the information here to provide more clarity. In the coming weeks, additional information explaining concepts or terminology will be made available. In the interim, take a look at this resource to review the basics of genetics.
The scientific article for this release is now on medRxiv.
The COVID-19 HGI has iteratively shown robust genetic signals in our previous releases and represents the largest genome-wide association study (GWAS) in history, both in terms of study participants (>2 million individuals) and number of collaborators (>2,000 scientists). Here, we describe results from our latest data freeze, release 5. In our previous data freeze (release 4), we reported the identification of human genetic variants associated with severe COVID-19 (view our blog posts for non-scientists on data release 3 and release 4). We identified these variants through a GWAS in over 30,000 COVID-19 patients (i.e., cases) and 1.47 million non-COVID-19 patients (i.e., controls). In data freeze 5, we have further increased the sample size to almost 50,000 COVID-19 cases and over 2 million controls by combining data from 47 studies across 19 countries (Figure 1). Increasing the sample size also improves the confidence of our findings. In this data freeze, we also attempted to improve the diversity of populations. Studying genetics across populations of diverse genetic ancestries helps us better understand genetic variants affecting COVID-19 severity and their impact around the world. Of the 47 contributing studies, 19 included non-European populations.
Figure 1: List of COVID-19 HGI contributors for data freeze release 5. Of the 47 contributing studies, 19 included non-European populations. Adapted from Andrea Ganna’s presentation on January 25, 2021.
As in previous data freezes, we continue to examine three outcomes (Figure 2): A) being critically ill with COVID-19 (ending up on respiratory support or dying from COVID-19), B) being hospitalized for COVID-19, and C) being infected with SARS-CoV-2. These analyses aim to capture genetic features associated with both susceptibility to SARS-CoV-2 infection and severity of COVID-19. The last analysis (analysis C) aimed to detect genetic variants contributing to reported infection with SARS-CoV-2. This analysis included all cases, irrespective of the presence or severity of symptoms. The analysis outcomes, their case and control definitions, and their sample sizes are shown in Figure 2.
Figure 2: Definition of cases and controls for each of the analyses in data freeze 5. Note that SARS-CoV-2 is the virus that causes COVID-19. Adapted from Andrea Ganna’s presentation on January 25, 2021.
Upon collecting the genetic data produced by our contributors, we performed GWAS according to the definitions in Figure 2. Previously, in data freeze 4, we highlighted novel genetic signals in susceptibility and severity of COVID-19 in 7 chromosome regions. These regions point towards an etiology in innate immunity and lung dysfunction, in line with the leading clinical understanding of COVID-19 infections. In data freeze 5, we identified 15 genome-wide significant regions across all chromosomes: 1 chromosome region had genome-wide significance only in the critically ill analysis (analysis A); 11 chromosome regions had a larger effect in the severity analysis than in the reported infection analysis (analysis B); and 4 of these chromosome regions are specific to SARS-CoV-2 reported infection (analysis C). In Figure 3, we present a graphical representation of these results as a Miami plot: a two-panel version of a Manhattan plot in which the bottom panel mirrors the top, named after the way the Miami skyline is reflected on the water.
Figure 3. Miami plot of genome-wide association results for COVID-19. The top panel shows the results of the genome-wide association study of hospitalized COVID-19 cases and controls (analysis B), and the bottom panel shows the results for reported SARS-CoV-2 infection and controls (analysis C).
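To make the idea of a Miami plot a little more concrete, here is a minimal sketch of how each variant's position on the vertical axis could be computed. It assumes the conventional genome-wide significance threshold of p < 5e-8, which this post does not state explicitly, and the function names and p-values are hypothetical illustrations rather than part of the HGI analysis pipeline.

```python
import math

# Conventional genome-wide significance threshold (an assumption here;
# the post itself does not state the threshold that was used).
GENOME_WIDE_P = 5e-8


def miami_height(p_value, panel):
    """Height of a variant on a Miami plot.

    Variants are drawn at -log10(p). The bottom panel is mirrored, so its
    heights are negated and the two panels share one horizontal axis.
    """
    height = -math.log10(p_value)
    return height if panel == "top" else -height


def is_genome_wide_significant(p_value, threshold=GENOME_WIDE_P):
    """True when the association passes the chosen threshold."""
    return p_value < threshold


# Hypothetical p-values for a single variant in the two analyses.
p_hospitalized = 1e-12  # analysis B, drawn in the top panel
p_infection = 1e-3      # analysis C, drawn in the bottom panel

print(miami_height(p_hospitalized, "top"), is_genome_wide_significant(p_hospitalized))
print(miami_height(p_infection, "bottom"), is_genome_wide_significant(p_infection))
```

In this made-up example the variant would rise well above the significance line in the top panel while barely registering in the bottom one, which is the pattern expected for a region that matters for severity more than for infection.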
We understand that with many genetic studies, diversity in sample collection is a major concern (elaborated here). As such, we aimed to improve the diversity in our sample collection as our study grew (Figure 4). Our improved sample collection effort has led us to identify new genetic factors associated with COVID-19 (our previous results are discussed in blog posts on Release 3 and Release 4). With the concomitant identification of genetic risk factors using our analytical methods, we are able to observe genetic variants in or close to genes. So far, most of the genes we identified point towards cellular mechanisms, immune regulation, and cardiac function as contributors to risk. The identification of these risk factors can ultimately lead to treatments targeting the identified genes.
Figure 4. Overview of the studies contributing to the COVID-19 Host Genetics Initiative and composition by major ancestry groups in meta-analyses. In data freeze 5, 19 studies contributed non-European populations: 7 African American, 5 Admixed American, 4 East Asian, 2 South Asian and 1 Arab. Diamonds show the effective sample size received from different geographical locations. The effective sample size is a measure of a study's statistical power that accounts for the balance between cases and controls: a study with very few cases has less power than its total number of participants would suggest.
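As an illustration of effective sample size, the sketch below uses a formula commonly applied to case-control studies, 4 / (1/cases + 1/controls). Whether Figure 4 was produced with exactly this formula is an assumption, and the function name is hypothetical. The inputs are the rounded data freeze 5 totals quoted earlier in this post, used only as example numbers.

```python
def effective_sample_size(n_cases, n_controls):
    """Effective sample size of a case-control study.

    Uses the common formula 4 / (1/n_cases + 1/n_controls), which equals
    the total size of a perfectly balanced study with the same power.
    (Assumed formula; the post does not say how the figure was computed.)
    """
    return 4.0 / (1.0 / n_cases + 1.0 / n_controls)


# Rounded totals from this post, used purely as illustrative inputs.
cases = 50_000
controls = 2_000_000

print(round(effective_sample_size(cases, controls)))

# A balanced study: the effective size equals the total size.
print(round(effective_sample_size(1_000, 1_000)))
```

With these rounded inputs the effective sample size comes out at roughly 195,000, far below the more than 2 million total participants, because the number of cases is the limiting factor.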
We found 9 new chromosome regions associated with COVID-19. In analysis A, for critical illness, these include chromosome regions near two genes: LZTFL1 on chromosome 3 and TAC4 on chromosome 17. The protein LZTFL1 regulates protein trafficking to the ciliary membrane (the membrane of cilia). Cilia are hair-like structures that extend from the cell body. They are found in airways, lungs and many other organs. LZTFL1 also participates in immune responses. The TAC4 protein functions to regulate blood pressure and the immune system.
For analysis B, in patients hospitalized with COVID-19, we found associations in variants close to 4 genes. Firstly, we identified a chromosome region near THBS3 on chromosome 1. This gene codes for the protein THBS3, which is expressed in the heart and upregulated during cardiac diseases. Secondly, we identified a chromosome region near SCN1A on chromosome 2. Variations in the SCN1A gene have been shown to cause epilepsy and seizures. Thirdly, we identified a chromosome region near TMEM65 on chromosome 8. This gene codes for the TMEM65 protein, which has a role in cardiac development and in the regulation of cardiac conduction and function. It may also play a role in cell energy metabolism. Notably, the variant identified in our analysis for TMEM65 has a frequency of 12% in East Asian populations and 1% in European populations. Allele frequency describes how common a particular genetic variant is in a given population. Finally, we identified a chromosome region near KANSL1 on chromosome 17. It has been suggested that the protein coded by this gene, KANSL1, has a role in neuronal processes.
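To show what an allele frequency such as 12% or 1% means in practice, here is a small sketch that computes it from genotype counts. Every person carries two copies of each region of the genome, so the frequency is the number of copies of the variant divided by twice the number of people. The counts below are hypothetical, chosen only so that the results match the two percentages quoted above; they are not HGI data.

```python
def allele_frequency(n_no_copies, n_one_copy, n_two_copies):
    """Frequency of a variant allele in a group of people.

    n_no_copies:  people carrying zero copies of the variant
    n_one_copy:   people carrying one copy
    n_two_copies: people carrying two copies
    """
    total_alleles = 2 * (n_no_copies + n_one_copy + n_two_copies)
    variant_alleles = n_one_copy + 2 * n_two_copies
    return variant_alleles / total_alleles


# Hypothetical groups of 1,000 people each (made-up counts).
group_a = allele_frequency(776, 208, 16)  # 240 variant copies out of 2,000
group_b = allele_frequency(980, 20, 0)    # 20 variant copies out of 2,000

print(f"{group_a:.0%}", f"{group_b:.0%}")
```

A variant that is common in one population and rare in another can only be detected reliably when the first population is well represented, which is one reason the diversity of contributing studies matters.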
Finally, in analysis C, for reported SARS-CoV-2 infections, 3 new associations were found in regions close to three genes: ZBTB11 on chromosome 3, DNAH5 on chromosome 5, and PPP1R15A on chromosome 19. Firstly, we identified a region by the gene ZBTB11 on chromosome 3. This gene codes for the protein ZBTB11, which has been shown to regulate immune cell development. Secondly, we identified a chromosomal region in DNAH5 on chromosome 5. Genetic variations in DNAH5 have been shown to cause primary ciliary dyskinesia (defective movement of cilia), leading to recurrent chest infections, ear/nose/throat symptoms, bronchitis and infertility. Finally, we identified a chromosome region close to PPP1R15A on chromosome 19. This gene codes for the protein PPP1R15A, which has been shown to mediate growth arrest and cell death in response to DNA damage, negative growth signals, and incorrect protein structure.
Genes affecting the immune system play an important role in COVID-19 in our analyses. Genes involved in lung and cardiac function and in neuronal processes are also part of our findings. Cardiac diseases have previously been reported as a susceptibility factor for COVID-19, and neuronal symptoms have been reported as part of COVID-19 disease.
Risk factors identified in association studies may not point to a causal basis of COVID-19 susceptibility or severity. As such, we employed a method called Mendelian randomization (MR) that uses genomic information to infer causal associations. MR uses genetic variants known to influence a given exposure (e.g. BMI) to examine the causal effect of that exposure on disease outcomes. For a closer look, we described MR in a recent blog post (written for a scientific audience). Across the three COVID-19 phenotypes, we identified statistically significant causal associations between the three COVID-19 outcomes and 6 traits (out of 38 selected traits that we tested, Figure 5). We found that genetically predicted higher body mass index (BMI) was associated with a higher risk of SARS-CoV-2 infection and COVID-19 hospitalization. This result corroborates findings from observational studies, which have observed an increased risk of severe COVID-19 outcomes associated with increased BMI. Additionally, genetically predicted smoking was associated with an increased risk of COVID-19 hospitalization.
Figure 5: Genetic correlations and Mendelian randomization causal estimates between 38 traits and COVID-19 severity and SARS-CoV-2 reported infection. Traits are listed on the X-axis and COVID-19 phenotypes on the Y-axis. Blue represents negative genetic correlation and protective Mendelian randomization (MR) causal estimates, and red represents positive genetic correlation and risk MR causal estimates. Larger squares correspond to more significant correlations. Causal estimates that pass a statistical significance threshold are marked with an asterisk.
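For readers curious about how Mendelian randomization turns genetic associations into a causal estimate, the following is a minimal sketch of one widely used estimator, the inverse-variance weighted (IVW) method. This post does not say which MR estimators were used, so the choice of IVW is an assumption. The function names and all effect sizes are hypothetical.

```python
def wald_ratio(beta_exposure, beta_outcome):
    """Causal estimate from one variant: its effect on the outcome
    divided by its effect on the exposure (e.g. BMI)."""
    return beta_outcome / beta_exposure


def ivw_estimate(betas_exposure, betas_outcome, ses_outcome):
    """Inverse-variance weighted MR estimate across several variants.

    Equivalent to a weighted average of per-variant Wald ratios, where
    variants with more precise outcome estimates get more weight.
    """
    numerator = sum(
        bx * by / se**2
        for bx, by, se in zip(betas_exposure, betas_outcome, ses_outcome)
    )
    denominator = sum(
        bx**2 / se**2 for bx, se in zip(betas_exposure, ses_outcome)
    )
    return numerator / denominator


# Made-up effects of three variants on an exposure and on an outcome.
effects_on_exposure = [0.10, 0.20, 0.05]
effects_on_outcome = [0.05, 0.10, 0.025]
outcome_std_errors = [0.01, 0.02, 0.01]

print(ivw_estimate(effects_on_exposure, effects_on_outcome, outcome_std_errors))
```

In this toy example every variant changes the outcome by half as much as it changes the exposure, so the combined estimate is also one half. Real analyses additionally report uncertainty and check the assumptions that make genetic variants valid stand-ins for the exposure.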
In the current global crisis of the COVID-19 pandemic, these results demonstrate the power of a global effort of 47 diverse contributors. In total, we identified 15 genomic regions associated with COVID-19 susceptibility and severity. To investigate causality, we used statistical inference (i.e., Mendelian randomization) to identify 6 traits with statistically significant causal associations with the COVID-19 outcomes. Currently, we are finalizing our results into a scientific article. As we continue to soldier through the COVID-19 global pandemic, the COVID-19 HGI will iteratively produce genetic results. By working together, we can generate the robust findings required to better understand the biological factors and clinical presentation of COVID-19.
Thank you to Andrea Ganna, PhD for thoughtful feedback and comments.