In this post I critically evaluate the evidence for an association between rs2076205 and COVID-19 susceptibility, as previously reported here, here and here. There is a need to take stock because the sample sizes are as yet modest, and small differences in the details of the analysis of the UK Biobank data appear to produce different interpretations of the strength of evidence.
Statistical considerations that question the association include the observations that:
- The significance of the association goes down if individuals with close relatives in UK Biobank are included in the analysis.
- The signal appears specific to UK Biobank participants of white European ancestry.
- Power to detect a true association at a particular variant, if one were to exist, would be low because the sample size is modest.
Statistical considerations that support the association include the observations that:
- The direction and magnitude of the effect are compatible whether individuals with close relatives in UK Biobank are included in the analysis or not.
- The direction and magnitude of the effect are compatible when comparing COVID-19 +ve individuals to either (i) COVID-19 -ve individuals, as in the discovery analysis, or (ii) untested individuals, as in the quasi-replication analysis.
- In individuals of white European ancestry, the variant is not strongly stratified geographically nor does it appear confounded with any of a large number of variables recorded by UK Biobank.
Robustness to the inclusion of close relatives
Comparison of results with those of other groups in the COVID-19 Host Genetics Initiative revealed that the signal in XPNPEP2 varies depending on the details of the analysis.
Tomoko Nakanishi and Brent Richards have been helpful in picking this apart. The differences are driven by whether one excludes participants who have other close relatives (cousins or closer) in the analysis (as I do) or not (as many others do).
The table compares the effect of the minor (rarer) allele on the risk of a PCR positive for COVID-19, the standard error (uncertainty), and the -log10 p-value (bigger is more statistically interesting) in a logistic regression analysis controlling for sex, age, age*age, sex*age and 40 genetic principal components:
The reasons for including individuals with close relatives in UKB are:
- Increased sample size should reduce uncertainty and improve the ability to discover signals.
- There are no pairs of close relatives who have both received a PCR test result among English UKB participants, so far.
The reasons for excluding individuals with close relatives in UKB are:
- Pairs of close relatives may appear in the PCR test positive vs negative comparison in future and do currently feature in other comparisons (e.g. PCR test positive vs no test).
- Determining inclusion/exclusion criteria depending on outcome variables (e.g. do they currently have a PCR test result) potentially introduces undesired ascertainment effects.
- Close relatives are more likely to share exposure to unmeasured environmental risk factors, notably exposure to SARS-CoV-2 which is highly variable and prerequisite for disease. If this non-independence is not fully controlled (difficult even with sophisticated tools) the inclusion of outcomes among a group of genetically similar individuals can unduly influence results.
On balance one would think excluding close relatives is the cautious approach, but with modest sample sizes the argument that more is better has merit. What is unusual is that including more individuals reduces signal at this gene.
Signal in other ancestries
The signal at SNP rs2076205 in XPNPEP2 is not apparent in UKB participants of non-white European ancestry. This means that it does not appear to account for the strong increased risk of COVID-19 among individuals of non-white European ancestry in England.
Geographical stratification
In considering whether the signal could be an artefact, it is important to look at geographical stratification of the allele because risk factors, including exposure to SARS-CoV-2, vary geographically. The figure shows the frequency across England of the minor allele, which was the allele estimated to be protective, among those of white European ancestry only. The map suggests that the allele has broadly similar frequency across England, although there are some scattered local pockets of higher allele frequency. The map does not show a geographical trend (for example, increasing allele frequency moving from North to South). (Note that the density of points reflects UKB's sampling distribution, which was focused on particular urban centres.)
Robustness to measured confounders
Supposing rs2076205 had no effect on susceptibility to COVID-19, then what other explanations are there for the strong signal of association? To address this question Nicolas Arning and I have investigated the large number of variables recorded in UKB. In particular, Nick has helped to sift through the variables by applying machine learning methods, mainly XGBoost.
In the table, we have investigated the effect of including potential confounders on the association between rs2076205 and the COVID-19 PCR positive vs negative outcome, both in terms of the statistical significance (measured by the -log10 p-value) and the direction and magnitude of the effect itself. The focus is on individuals of white European ancestry, since that was where the signal was found.
The short story is that we have failed to make the signal go away by including any of a large number of covariates. Moreover, the direction and magnitude of the effect are remarkably consistent, indicating that rs2076205 is not strongly correlated with any of the >18,000 covariates available. We have included potential confounders directly in the logistic regression, including those we might expect to be important beforehand (geographical location, genotyping array), known risk factors (body mass index), variables that came out in early analyses (Townsend deprivation index) and variables suggested by the XGBoost analysis. We have also directly included composite predictors produced by XGBoost.
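As an illustration of the kind of check described above, the following minimal Python sketch fits a logistic regression by Newton-Raphson to simulated data, with and without one additional covariate, and compares the genotype coefficient. Everything in it is an assumption made for illustration: the data are simulated, the sample size, effect sizes and the `deprivation` covariate are hypothetical, and the 40 genetic principal components are omitted for brevity. It is not the pipeline used for the UK Biobank analysis.

```python
import math
import random

def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def fit_logistic(X, y, iters=15):
    """Return (coefficients, standard errors) from Newton-Raphson maximum likelihood."""
    k = len(X[0])
    beta = [0.0] * k
    for _ in range(iters):
        grad = [0.0] * k
        info = [[0.0] * k for _ in range(k)]
        for xi, yi in zip(X, y):
            p = 1.0 / (1.0 + math.exp(-sum(b * v for b, v in zip(beta, xi))))
            w = p * (1.0 - p)
            for a in range(k):
                grad[a] += (yi - p) * xi[a]
                for c in range(k):
                    info[a][c] += w * xi[a] * xi[c]
        beta = [b + s for b, s in zip(beta, solve(info, grad))]
    # Standard errors: square roots of the diagonal of the inverse information matrix
    # (evaluated at the penultimate iterate, effectively the optimum by then).
    se = [math.sqrt(solve(info, [float(i == j) for i in range(k)])[j]) for j in range(k)]
    return beta, se

random.seed(1)
TRUE_BETA = -0.5  # hypothetical protective log odds ratio per minor allele
base, extended, y = [], [], []
for _ in range(2000):
    g = sum(random.random() < 0.25 for _ in range(2))  # minor allele count, ~25% frequency
    sex = random.randint(0, 1)
    age = random.gauss(0.0, 1.0)          # standardized age
    deprivation = random.gauss(0.0, 1.0)  # hypothetical covariate, independent of genotype
    eta = -0.5 + TRUE_BETA * g + 0.2 * sex + 0.1 * age + 0.3 * deprivation
    y.append(1 if random.random() < 1.0 / (1.0 + math.exp(-eta)) else 0)
    row = [1.0, float(g), float(sex), age, age * age, sex * age]
    base.append(row)
    extended.append(row + [deprivation])

beta_base, se_base = fit_logistic(base, y)
beta_ext, se_ext = fit_logistic(extended, y)
print("genotype effect without covariate: %.3f (SE %.3f)" % (beta_base[1], se_base[1]))
print("genotype effect with covariate:    %.3f (SE %.3f)" % (beta_ext[1], se_ext[1]))
```

In this simulated setting the covariate is independent of genotype by construction, so the genotype coefficient barely moves when it is added. A covariate that was correlated with genotype and with the outcome would instead shift the coefficient, which is the pattern the robustness analysis looks for.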
Analysis of individuals with close relatives in UK Biobank
We investigated why inclusion or exclusion of individuals with close relatives in UK Biobank affects the signal, building on an earlier subgroup analysis.
Individuals with other close relatives in UK Biobank are not a random subset of all UKB participants. They are stratified geographically, because they are more likely to be found in the most densely-sampled urban areas, and they differ in other respects. Nick ran an XGBoost analysis to identify predictors of having a close relative in UKB. Prediction was weak, with specificity and sensitivity both roughly 60%. Nonetheless, the predictors with highest feature importance were:
- Number of full brothers (UKB code 1873)
- Number of full sisters (1883)
- Distance (Euclidean) to coast (24508)
- Time urine sample collected (20035)
- Country of birth (UK/elsewhere) (1647)
- Length of time at current address (699)
- Frequency of friend/family visits (1031)
I included Nick's XGBoost predictor of having UKB relatives in the logistic regression (see table above) but the predictor did not alter the signal of association at rs2076205, whether relatives were excluded (as shown) or not.
The take-home is that we don't understand why including versus excluding relatives in UKB affects the signal of association, but given the lack of evidence for stratification and confounding of rs2076205, it may be a combination of (i) re-weighting the contribution to the analysis of different regions or test centres and (ii) differences in interpretation or ascertainment of the PCR positive-versus-negative outcome by region or test centre. I have tried to investigate the latter by subgroup analysis, but the numbers are as yet too small.
Statistical evaluation of the signal of association at rs2076205
Here I summarize the statistical evidence favouring or disfavouring an association between rs2076205 and susceptibility to COVID-19. The figure shows the relative strength of evidence for the association (positive and bigger is stronger) as a function of the effect size in four logistic regression models (discovery vs quasi-replication, excluding vs including relatives) and two combined models (excluding vs including relatives). The idea is that one can weight the strength of evidence by one's prior expectation of observing an effect of a given magnitude in a common SNP (~25% frequency). To assist, known effect sizes discovered in other genome-wide association studies for susceptibility to infection are marked from Tian and colleagues (2017).
Briefly, the result of combining evidence across the discovery and quasi-replication analyses indicates that:
- The statistical evidence for an association between rs2076205 and susceptibility to COVID-19 just meets the stringent criterion for 'genome-wide' significance if one includes close relatives (orange line).
- The evidence substantially surpasses the 'genome-wide' significance criterion if one excludes close relatives (red line).
- The effect size of the SNP is broadly similar whether one excludes or includes relatives, and between discovery and quasi-replication analyses.
- The effect size is not outlandishly large compared to previous studies of infection.
In more detail, the y-axis shows the log10 likelihood ratio between the alternative hypothesis with the stated effect size and the null hypothesis with effect size zero. The log-likelihoods are maximized with respect to all other parameters, so they have the same degrees of freedom, making them directly comparable. The likelihood ratio is essentially acting as a Bayes factor between the alternative and null hypotheses.
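To give a feel for the shape of such a curve, here is a minimal Python sketch that approximates the log10 likelihood ratio as a function of effect size using a quadratic (Wald) approximation around the estimate. This is an assumption made for illustration: the curves in the figure come from maximizing the logistic regression likelihood itself, and the values of `beta_hat` and `se` below are hypothetical, not estimates from this post.

```python
import math

def log10_lr_curve(beta, beta_hat, se):
    """Approximate log10 likelihood ratio of effect size `beta` versus zero,
    using a quadratic (Wald) approximation to the profile log-likelihood."""
    ln_lr = 0.5 * ((beta_hat / se) ** 2 - ((beta - beta_hat) / se) ** 2)
    return ln_lr / math.log(10.0)

beta_hat, se = -0.5, 0.1  # hypothetical estimate and standard error

# Evaluate the curve on a grid of effect sizes from -1.0 to 0.0.
grid = [round(-1.0 + 0.05 * i, 2) for i in range(21)]
curve = [(b, log10_lr_curve(b, beta_hat, se)) for b in grid]

best = max(curve, key=lambda t: t[1])
print("peak at beta = %.2f, log10 LR = %.2f" % best)
print("log10 LR at beta = 0: %.2f" % log10_lr_curve(0.0, beta_hat, se))
```

Under this approximation the curve is zero at an effect size of zero, peaks at the estimate, and its height at the peak is the squared z-score divided by 2 ln 10.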
If one were testing a single candidate SNP, Jeffreys' interpretation of the y-axis would mean that a value above 0.5 represents substantial evidence in favour of the association. However, there are generally considered to be around 10^6 effective tests in a genome-wide association study, so the threshold becomes 6.5. This is equivalent to the 'genome-wide' significance threshold for p-values of 5×10^-8, indicated on the figure after transforming to the log-likelihood scale.
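The arithmetic behind these two thresholds can be checked with a short Python sketch. It assumes the p-value comes from a likelihood ratio test with one degree of freedom, so that the chi-squared statistic equals twice the natural log likelihood ratio; that assumption, and the function name `log10_lr_from_p`, are mine rather than taken from the analysis.

```python
import math
from statistics import NormalDist

def log10_lr_from_p(p):
    """Convert a two-sided p-value from a 1-degree-of-freedom likelihood ratio
    test to the corresponding maximized log10 likelihood ratio."""
    z = NormalDist().inv_cdf(p / 2.0)      # lower-tail quantile; z*z is the chi-squared statistic
    return z * z / (2.0 * math.log(10.0))  # chi-squared = 2 ln(LR), so log10(LR) = chi2 / (2 ln 10)

jeffreys_substantial = 0.5  # Jeffreys' 'substantial' level on the log10 scale
effective_tests = 1e6       # approximate number of effective tests in a GWAS

# Raising the bar by a factor equal to the number of tests adds log10(tests).
threshold = jeffreys_substantial + math.log10(effective_tests)
genome_wide = log10_lr_from_p(5e-8)

print("adjusted Jeffreys threshold: %.2f" % threshold)
print("log10 LR at p = 5e-8:        %.2f" % genome_wide)
```

Under these assumptions the two routes give closely matching thresholds, about 6.5 on the log10 likelihood ratio scale.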
Since the logistic regression is not quite identical to the SAIGE analysis, which is preferred because of its more sophisticated model of genetic covariance, I have indicated with stars the maximized likelihood and corresponding effect size estimate from SAIGE. The values at the grey and blue stars (analyses including relatives) were provided by Tomoko Nakanishi.
Biological considerations
As discussed previously, XPNPEP2 appears to be a very plausible candidate for a role in susceptibility to COVID-19. It belongs to a pathway in common with ACE2, the receptor by which SARS-CoV-2 gains entry to the cell. Moreover, its involvement in bradykinin signalling and its previous implication in angioedema lend support to a new and - from the perspective of treatment options - potentially important hypothesis that the kallikrein-kinin system, rather than the renin-angiotensin (RAS) system, may mediate the life-threatening pathophysiology of COVID-19:
Kallikrein-kinin blockade in patients with COVID-19 to prevent acute respiratory distress syndrome
F. L. van de Veerdonk, M. G. Netea, M. van Deuren, J. W. M van der Meer, Q. de Mast, R. J. Brüggemann and H. van der Hoeven (2020)
eLife 9: e57555
Conclusions
In summary, the statistical genetic evidence from the UK Biobank cohort is 'substantial' or 'decisive' depending on whether one includes or excludes close relatives, but the signal appears restricted to those of white European ancestry. The >1000-fold difference in strength of evidence is a concern for the robustness of the association. Yet detailed analyses of the SNP in question do not reveal stratification or confounding. The phenomenon may therefore reflect differences in the interpretation or ascertainment of the COVID-19 PCR positive and negative outcomes by region or testing centre. It would be desirable to replicate the signal in an independent population, taking care to consider the potential for COVID-19 test results and take-up to vary from place to place. There is an urgent need to find improved interventions for the ongoing pandemic, and the analyses here lend support to the kallikrein-kinin blockade hypothesis.