Bugbank News: May 2020

18 May 2020 COVID-19 PCR test results for UK Biobank released

1326 returned one or more positive tests, of whom 932 returned tests while a hospital inpatient*
3184 never returned a positive test, of whom 2254 returned tests while a hospital inpatient

* This definition of COVID-19 positive inpatient does not require that the positive test and the test while an inpatient necessarily coincide in participants with multiple tests. Requiring them to coincide on the same test reduces this number from 932 to 921.

Compare the totals to the 3 May 2020 release here.

Important notes regarding negative test results:

Public Health England has restructured its databases to manage large scale testing in response to implementation of the UK Government strategy's Pillar 2.
Consequently, new negative test results are absent for some laboratories from this release. We aim to incorporate the omitted test results in future releases.

Here is a new version of Figure 4 of our paper, updated to reflect this latest tranche of test results. For a daily updating summary of lab-confirmed cases in England, visit coronavirus.data.gov.uk.

Critical evaluation of the rs2076205 association with COVID-19 susceptibility in UK Biobank

In this post I critically evaluate the evidence for an association between rs2076205 and COVID-19 susceptibility, as previously reported here, here and here. There is a need to take stock because the sample sizes are as yet modest, and small differences in the details of the analysis of the UK Biobank data appear to produce different interpretations of the strength of evidence.

Statistical considerations that question the association include the observations that:

The significance of the association goes down if individuals with close relatives in UK Biobank are included in the analysis.
The signal appears specific to UK Biobank participants of white European ancestry.
Power to detect a true association at a particular variant, if one were to exist, would be low because the sample size is modest.

Statistical considerations that support the association include the observations that:

The direction and magnitude of the effect are compatible whether individuals with close relatives in UK Biobank are included in the analysis or not.
The direction and magnitude of effect are compatible when comparing COVID-19 +ve individuals to either (i) COVID-19 -ve individuals, as in the discovery analysis, or (ii) untested individuals, as in the quasi-replication analysis.
In individuals of white European ancestry, the variant is not strongly stratified geographically nor does it appear confounded with any of a large number of variables recorded by UK Biobank.

Robustness to the inclusion of close relatives

Comparison of results with other groups in the COVID-19 Host Genetics Initiative revealed that the signal in XPNPEP2 varies depending on the details of the analysis. Tomoko Nakanishi and Brent Richards have been helpful in picking this apart. The differences are driven by whether one excludes participants who have close relatives (cousins or closer) in UK Biobank from the analysis (as I do) or not (as many others do).

The table compares the effect of the minor (rarer) allele on the risk of a PCR positive for COVID-19, the standard error (uncertainty), and the -log₁₀ p-value (bigger is more statistically interesting) in a logistic regression analysis controlling for sex, age, age*age, sex*age and 40 genetic principal components:
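As a rough sketch of the covariate set just described, the function below assembles one participant's row of a logistic regression design matrix. The function name, argument names and the 0/1 coding of sex are assumptions made for illustration; they are not taken from the analysis code.

```python
def design_row(minor_allele_count, sex, age, genetic_pcs):
    """Build one row of a logistic regression design matrix.

    Hypothetical helper: sex is assumed to be coded 0/1 and genetic_pcs
    is assumed to hold 40 principal component scores.
    """
    if len(genetic_pcs) != 40:
        raise ValueError("expected 40 genetic principal components")
    age = float(age)
    sex = float(sex)
    return [
        1.0,                        # intercept
        float(minor_allele_count),  # SNP term whose coefficient is the effect of interest
        sex,
        age,
        age * age,                  # age*age
        sex * age,                  # sex*age
        *map(float, genetic_pcs),   # 40 genetic principal components
    ]


row = design_row(minor_allele_count=1, sex=1, age=60, genetic_pcs=[0.0] * 40)
print(len(row))  # intercept + SNP + 4 covariates + 40 PCs
```

The coefficient fitted for the second column is the log odds ratio reported as the effect of the minor allele.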

The reasons for including individuals with close relatives in UKB are:

Increased sample size should reduce uncertainty and improve the ability to discover signals.
There are no pairs of close relatives who have both received a PCR test result among English UKB participants, so far.

The reasons for excluding individuals with close relatives in UKB are:

Pairs of close relatives may appear in the PCR test positive vs negative comparison in future and do currently feature in other comparisons (e.g. PCR test positive vs no test).
Determining inclusion/exclusion criteria depending on outcome variables (e.g. do they currently have a PCR test result) potentially introduces undesired ascertainment effects.
Close relatives are more likely to share exposure to unmeasured environmental risk factors, notably exposure to SARS-CoV-2 which is highly variable and prerequisite for disease. If this non-independence is not fully controlled (difficult even with sophisticated tools) the inclusion of outcomes among a group of genetically similar individuals can unduly influence results.

On balance one would think excluding close relatives is the cautious approach, but with modest sample sizes the argument that more is better has merit. What is unusual is that including more individuals reduces signal at this gene.

Signal in other ancestries

The signal at SNP rs2076205 in XPNPEP2 is not apparent in UKB participants of non-white European ancestry. This means that it does not appear to be able to account for the strong increased risk of COVID-19 among individuals of non-white European ancestry in England.

Geographical stratification

In considering whether the signal could be an artefact, it is important to look at geographical stratification of the allele because risk factors including exposure to SARS-CoV-2 vary geographically. The figure shows the frequency across England of the minor allele, which was the allele estimated to be protective, among those of white European ancestry only. The map suggests that the allele has broadly similar frequency across England, although there are some scattered local pockets of higher allele frequency. The map does not show a geographical trend (for example increasing allele frequency moving from North to South). (Note the density of points reflects UKB's sampling distribution, which was focused on particular urban centres.)

Robustness to measured confounders

Supposing rs2076205 had no effect on susceptibility to COVID-19, then what other explanations are there for the strong signal of association? To address this question Nicolas Arning and I have investigated the large number of variables recorded in UKB. In particular, Nick has helped to sift through the variables by applying machine learning methods, mainly XGBoost.

In the table, we have investigated the effect of including potential confounders on the association between rs2076205 and COVID-19 PCR positive vs negative status, both in terms of the statistical significance (measured by the -log₁₀ p-value) and the direction and magnitude of the effect itself. The focus is on individuals of white European ancestry, since that was where the signal was found.

The short story is that we have failed to make the signal go away by including any of a large number of covariates. Moreover, the direction and magnitude of the effect are remarkably consistent, indicating that rs2076205 is not strongly correlated with any of the >18,000 covariates available. We have included potential confounders directly in the logistic regression, including those we might expect to be important beforehand (geographical location, genotyping array), known risk factors (body mass index), variables that came out in early analyses (Townsend deprivation index) and variables suggested by the XGBoost analysis. We have also directly included composite predictors produced by XGBoost.

Analysis of individuals with close relatives in UK Biobank

We investigated why inclusion or exclusion of individuals with close relatives in UK Biobank affects the signal, building on an earlier subgroup analysis.

Individuals with other close relatives in UK Biobank are not a random subset of all UKB participants. They are stratified geographically, because they are more likely to be found in the most densely-sampled urban areas, and they differ in other respects. Nick ran an XGBoost analysis to identify predictors of having a close relative in UKB. Prediction performance was low, with specificity and sensitivity of roughly 60%. Nonetheless, the predictors with highest feature importance were:

Number of full brothers (UKB code 1873)
Number of full sisters (1883)
Distance Euclidean to coast (24508)
Time urine sample collected (20035)
Country of birth UK elsewhere (1647)
Length of time at current address (699)
Frequency of friend family visits (1031)

I included Nick's XGBoost predictor of having UKB relatives in the logistic regression (see table above) but the predictor did not alter the signal of association at rs2076205, whether relatives were excluded (as shown) or not.

The take-home is that we don't understand why including versus excluding relatives in UKB affects the signal of association. Given the lack of evidence for stratification and confounding of rs2076205, it may be a combination of (i) re-weighting the contribution to the analysis of different regions or test centres and (ii) differences in interpretation or ascertainment of the PCR positive-versus-negative outcome by region or test centre. The latter I have tried to investigate by subgroup analysis but the numbers are as yet too small.

Statistical evaluation of the signal of association at rs2076205

Here I summarize the statistical evidence favouring or disfavouring an association between rs2076205 and susceptibility to COVID-19. The figure shows the relative strength of evidence for the association (positive and bigger is stronger) as a function of the effect size in four logistic regression models (discovery vs quasi-replication, excluding vs including relatives) and two combined models (excluding vs including relatives). The idea is that one can weight the strength of evidence by one's prior expectation of observing an effect of a given magnitude in a common SNP (~25% frequency). To assist, known effect sizes discovered in other genome-wide association studies for susceptibility to infection are marked from Tian and colleagues (2017).

Briefly, the result of combining evidence across the discovery and quasi-replication analyses indicates that:

The statistical evidence for an association between rs2076205 and susceptibility to COVID-19 just meets the stringent criterion for 'genome-wide' significance if one includes close relatives (orange line).
The evidence substantially surpasses the 'genome-wide' significance criterion if one excludes close relatives (red line).
The effect size of the SNP is broadly similar whether one excludes or includes relatives, and between discovery and quasi-replication analyses.
The effect size is not outlandishly large compared to previous studies of infection.

In more detail, the y-axis shows the log₁₀ likelihood ratio between the alternative hypothesis with the stated effect size and the null hypothesis with effect size zero. The likelihoods are maximized with respect to all other parameters, so the two hypotheses have the same degrees of freedom and their log-likelihoods are directly comparable. The likelihood ratio is essentially acting as a Bayes factor between the alternative and null hypotheses.
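To give a feel for the shape of such a curve, here is a minimal sketch that approximates the log₁₀ likelihood ratio from an effect estimate and its standard error. It assumes the log-likelihood is quadratic around the estimate, which is an approximation and not how the curves in the figure were computed. The function names are illustrative, and the example numbers are the BETA and SE columns of the quasi-replication SAIGE output quoted elsewhere on this page. Summing curves across analyses assumes the analyses are independent.

```python
import math


def log10_lr(beta, beta_hat, se):
    """Approximate log10 likelihood ratio of effect size `beta` versus zero.

    Assumes ln L(beta) is roughly ln L(beta_hat) - (beta - beta_hat)^2 / (2 se^2).
    """
    ln_lr = (beta_hat ** 2 - (beta - beta_hat) ** 2) / (2.0 * se ** 2)
    return ln_lr / math.log(10)


def combined_log10_lr(beta, estimates):
    """Sum the approximate curves of independent analyses at a common `beta`."""
    return sum(log10_lr(beta, beta_hat, se) for beta_hat, se in estimates)


# (BETA, SE) from the quasi-replication SAIGE output
quasi_replication = (-0.212939151738995, 0.0578228858969803)

peak = log10_lr(quasi_replication[0], *quasi_replication)
print(f"approximate peak log10 likelihood ratio: {peak:.2f}")
print(f"value at effect size zero: {log10_lr(0.0, *quasi_replication):.2f}")
```

By construction the curve is zero at effect size zero and peaks at the estimate, where it equals the squared z-score divided by 2 ln 10.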

If one were testing a single candidate SNP, Jeffreys' interpretation of the y-axis would mean above 0.5 represents substantial evidence in favour of the association. However, there are generally considered to be around 10⁶ effective tests in a genome-wide association study, so the threshold becomes 6.5. This is equivalent to the 'genome-wide' significance threshold for p-values of 5×10⁻⁸, indicated on the figure after transforming to the log-likelihood scale.
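The correspondence between the two thresholds can be checked with a short calculation. This sketch assumes the usual large-sample result that twice the natural log-likelihood ratio follows a chi-squared distribution with one degree of freedom under the null hypothesis. The function names are illustrative, and the quantile is found by simple bisection rather than a statistics library.

```python
import math


def chi2_1df_tail(x):
    """Upper tail probability of a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


def chi2_1df_quantile_from_tail(p, lo=0.0, hi=200.0):
    """Find x such that chi2_1df_tail(x) is approximately p, by bisection."""
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if chi2_1df_tail(mid) > p:
            lo = mid  # tail still too heavy: move right
        else:
            hi = mid
    return (lo + hi) / 2.0


def p_value_to_log10_lr(p):
    """Convert a p-value to the log10 likelihood ratio scale (2 ln LR ~ chi2, 1 df)."""
    return chi2_1df_quantile_from_tail(p) / (2.0 * math.log(10))


jeffreys_genome_wide = 0.5 + math.log10(1e6)    # single-SNP threshold plus 10^6 tests
p_value_genome_wide = p_value_to_log10_lr(5e-8)

print(f"Jeffreys-style threshold: {jeffreys_genome_wide:.2f}")
print(f"p = 5e-8 on the log10 likelihood ratio scale: {p_value_genome_wide:.2f}")  # about 6.45
```

Under these assumptions the two thresholds agree to within about 0.05 on the log₁₀ scale.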

Since the logistic regression is not quite identical to the SAIGE analysis, which is preferred because of a more sophisticated model of genetic covariance, I have indicated with stars the maximized likelihood and corresponding effect size estimate from SAIGE. The values at the grey and blue stars (analyses including relatives) were provided by Tomoko Nakanishi.

Biological considerations

As discussed previously, XPNPEP2 appears to be a very plausible candidate for a role in susceptibility to COVID-19. It belongs to a pathway in common with ACE2, the receptor by which SARS-CoV-2 gains entry to the cell. Moreover, its involvement in bradykinin signalling and its previous implication in angioedema lend support to a new and - from the perspective of treatment options - potentially important hypothesis that the kallikrein-kinin system, rather than the renin-angiotensin (RAS) system, may mediate the life-threatening pathophysiology of COVID-19:

Kallikrein-kinin blockade in patients with COVID-19 to prevent acute respiratory distress syndrome
F. L. van de Veerdonk, M. G. Netea, M. van Deuren, J. W. M van der Meer, Q. de Mast, R. J. Brüggemann and H. van der Hoeven (2020)
eLife 9: e57555

Conclusions

In summary, the statistical genetic evidence from the UK Biobank cohort is 'substantial' or 'decisive' depending on whether one includes or excludes close relatives, but the signal appears restricted to those of white European ancestry. The >1000-fold difference in strength of evidence is a concern for the robustness of the association. Yet detailed analyses of the SNP in question do not reveal stratification or confounding. The phenomenon may therefore reflect differences in the interpretation or ascertainment of the COVID-19 PCR positive and negative outcomes by region or testing centre. There is a desire to replicate the signal in an independent population, taking care to consider the potential for COVID-19 test results and take-up to vary from place-to-place. There is a need to urgently find improved interventions for the ongoing pandemic, and the analyses here lend support to the kallikrein-kinin blockade hypothesis.

Quasi-replication of a COVID-19 susceptibility locus in UK Biobank

To gain broad acceptance that a genetic variant is truly associated with a trait requires (i) a strong signal in the population in which it was discovered, after controlling for possible artefacts and (ii) a replication of the signal and direction of effect in an independent population.

Quasi-replication is a weaker form of replication in which a related trait in the same population is used in step (ii), instead of the same trait in an independent population: see e.g. this paper. The assumption is that the genetic variant affects both traits in the same way.

Quasi-replication supports the association between COVID-19 susceptibility and rs2076205 in XPNPEP2, a signal that already meets the step (i) condition. The related traits I have studied are:

Step 1 (discovery): Susceptibility to COVID-19. Specifically, whether individuals in UK Biobank that were tested for SARS-CoV-2 were positive (cases) or negative (controls).
Step 2 (quasi-replication): Susceptibility to severe COVID-19. Specifically, comparing UK Biobank participants positive for SARS-CoV-2 and requiring hospitalization (cases), versus UK Biobank participants not tested for SARS-CoV-2 (controls).

These definitions require some justification. The step 1 trait better controls for who has been exposed to SARS-CoV-2, but it does not differentiate infection severity. The step 2 trait does not control for whether people have been exposed to SARS-CoV-2, but it defines severity as COVID-19 requiring hospitalization. The untested population is a reasonable control here because only a minority of its members would require hospitalization if they were infected. The cases partially overlap between trait definitions, but the controls do not, so the tests are independent if the variant has no effect and artefacts are avoided.

The rs2076205 variant quasi-replicates in UK Biobank

To test whether the most significant variant in step 1 quasi-replicated, I used the sub-group analysis to predict the direction of effect and identify the subgroup in which the effect would most likely be strongest. This led to the specific hypothesis that

The rare allele at rs2076205 reduces the risk of COVID-19 requiring hospitalization among the white European subgroup of men and women, in an analysis that excludes close relatives.

In a replication study, the requirement for deeming a result statistically interesting is usually considered to be much less stringent. In this case, it could be argued that the single-tailed test would require a p-value of 0.01 or smaller.

This reduced stringency is convenient because the inability to know who has been exposed to SARS-CoV-2 probably makes the step 2 trait noisy, which hurts statistical power.

The result is that rs2076205 does quasi-replicate, with a one-tailed p-value of 0.00012, an estimated effect size of -0.21 and 95% confidence interval of (-0.33, -0.10). The estimated effect size is a log odds ratio, and implies that a male with a copy of the rare allele is 19% less likely to experience COVID-19 severe enough to require hospitalization, compared to a male with a copy of the common allele.
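The arithmetic behind these figures can be reproduced in a few lines. The sketch below uses the BETA, SE and p.value columns of the SAIGE output quoted further down this post. It assumes a normal approximation for the confidence interval, and it halves the two-sided p-value to get the one-tailed value, which is only valid because the estimate lies in the predicted direction. The function name is illustrative. Strictly, the percentage is a change in odds, which is read here as an approximate change in risk.

```python
import math


def summarize_log_odds(beta, se, two_sided_p):
    """Summarize a log odds ratio estimate (hypothetical helper)."""
    odds_ratio = math.exp(beta)
    return {
        "odds_ratio": odds_ratio,
        "percent_change_in_odds": 100.0 * (odds_ratio - 1.0),
        # Normal-approximation 95% confidence interval on the log odds scale
        "ci95": (beta - 1.96 * se, beta + 1.96 * se),
        # Valid only when the estimate is in the pre-specified direction
        "one_tailed_p": two_sided_p / 2.0,
    }


summary = summarize_log_odds(
    beta=-0.212939151738995,
    se=0.0578228858969803,
    two_sided_p=0.000230857982728242,
)

print(f"odds ratio: {summary['odds_ratio']:.2f}")
print(f"change in odds: {summary['percent_change_in_odds']:.0f}%")
print(f"95% CI: ({summary['ci95'][0]:.2f}, {summary['ci95'][1]:.2f})")
print(f"one-tailed p-value: {summary['one_tailed_p']:.5f}")
```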

The Manhattan plots, with the XPNPEP2 gene marked, are shown, first for step 1 (discovery):

And then for step 2 (quasi-replication):

The Manhattan plots show that the signal is centred on the XPNPEP2 gene in both discovery and quasi-replication subjects. More work would be needed to dissect the signal, determine whether it is one signal or multiple, and identify the possible genetic mechanism, particularly as rs2076205 occurs in an intron.

Subgroup analysis

Having tested the specific quasi-replication hypothesis, I repeated the analysis to understand whether the effect differs in different subgroups.

As in step 1 (discovery), the effect appears slightly stronger in males than females, and in genetically-identified white Europeans compared to other groupings. The signal is again diluted by failing to exclude close relatives, whose inclusion can affect analyses in unexpected ways. The specific analysis used for quasi-replication is marked with a red asterisk.

The SAIGE output for the quasi-replication analysis was:

CHR POS rsid SNPID Allele1 Allele2 AC_Allele2 AF_Allele2 imputationInfo N BETA SE Tstat p.value p.value.NA Is.SPA.converge varT varTstar AF.Cases AF.Controls

X 128893417 rs2076205 X:128893417_C_T C T 153134.094117643 0.270302781358809 0.990731443602318 283264 -0.212939151738995 0.0578228858969803 -64.2609437640745 0.000230857982728242 0.000216327322582023 1 301.780782159031 303.712925876132 0.205153495741731 0.270416828147468

The analysis included age, sex, age*age, age*sex and 20 genetic principal components.
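For readers who want to work with the output above, here is a minimal sketch that pairs the header with the values and recomputes a two-sided p-value from BETA and SE. The header and value strings are copied from the output. The variable names are illustrative, and the normal approximation for the z-score is an assumption of this sketch, not a description of SAIGE internals.

```python
import math

header = (
    "CHR POS rsid SNPID Allele1 Allele2 AC_Allele2 AF_Allele2 imputationInfo N "
    "BETA SE Tstat p.value p.value.NA Is.SPA.converge varT varTstar AF.Cases AF.Controls"
)
values = (
    "X 128893417 rs2076205 X:128893417_C_T C T 153134.094117643 0.270302781358809 "
    "0.990731443602318 283264 -0.212939151738995 0.0578228858969803 -64.2609437640745 "
    "0.000230857982728242 0.000216327322582023 1 301.780782159031 303.712925876132 "
    "0.205153495741731 0.270416828147468"
)

# Pair each column name with its value
record = dict(zip(header.split(), values.split()))

beta = float(record["BETA"])
se = float(record["SE"])
z = beta / se
# Two-sided p-value under a standard normal approximation
wald_p = math.erfc(abs(z) / math.sqrt(2.0))

print(f"{record['rsid']}: z = {z:.2f}, recomputed p = {wald_p:.2e}")
print(f"reported p.value = {float(record['p.value']):.2e}")
print(
    "Allele2 frequency in cases vs controls: "
    f"{float(record['AF.Cases']):.3f} vs {float(record['AF.Controls']):.3f}"
)
```

The lower frequency of Allele2 in cases than in controls matches the negative sign of BETA.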

There are some potential criticisms of the analysis, beyond using the same population to validate the association. In particular, the step 2 analysis is itself of direct interest, and having run a genome-wide association study (GWAS) and found nothing significant, is it reasonable to claim that a single constituent result from that GWAS provides quasi-replication of a separate GWAS? I would argue that falsifiability was possible: the effect could have been in the opposite direction and the signal strength could have been below that required for replication. Since there were two opportunities for falsification, the quasi-replication analysis does provide additional support that XPNPEP2 is a COVID-19 susceptibility locus.

XPNPEP2 is genome-wide significant in a COVID-19 +ve/-ve analysis of UK Biobank participants

On April 19 I wrote about a biologically interesting signal in XPNPEP2 in a comparison of COVID-19 positive and negative individuals who have participated in UK Biobank. In the latest data release, which represents a doubling in sample size for this analysis, the hit is genome-wide significant in an analysis of white Europeans. However, it is not genome-wide significant when analysing individuals with all ancestries.

The top hit is rs2076205, a single nucleotide polymorphism whose common allele occurs in 68% of COVID-19 negative individuals and 79% of COVID-19 positive individuals. It has a p-value of 1.1 in 100 million, which is below the widely-used threshold of 5 in 100 million that is often deemed statistically interesting.

In the analysis of individuals of all ancestries, the signal is substantially muted, and no variants are genome-wide significant. The reasons for the difference between the analyses are still unclear, but the SNP does not appear to be strongly population stratified.

XPNPEP2 is of interest because its product's normal function (Aminopeptidase P) includes degrading bradykinin, also degraded by ACE2. The involvement of XPNPEP2 variants in ACE inhibitor-associated angioedema has been proposed as combining with the drug to produce higher circulating bradykinin.

Bradykinin has already been suggested as an important mediator of COVID-19 because of clinical similarities between COVID-19 and ACE inhibitor-induced angioedema. Since SARS-CoV-2 gains entry to the cell via ACE2, the virus may inhibit its normal role and contribute to elevated bradykinin. The dry cough associated with COVID-19 has been likened to the 'bradykinin cough' associated with the use of ACE inhibitors.

Interestingly, bradykinin inhibitors such as icatibant already exist and are used in ACE inhibitor induced angioedema. However, to make the jump that such drugs might be useful against COVID-19 on this evidence alone is speculative, and there are several caveats to the association itself, including the requirement to replicate the effect in an independent population. There is a need to understand the robustness of the association to population stratification and other possible confounders.

Update

This post was updated a second time on 8 May to include inpatient status in the subgroup analysis.

To dig a little deeper into the result I have looked at effect sizes in different groups defined by

Ethnicity: British (self reported), euranc (white European genetically) or any (these are the only groupings so far with sufficient sample size)
Sex: female, male or any
Close relatives: whether they are excluded (as above) or included in the analysis (larger sample size but potential for family clusters to over-influence the results)
Inpatients: whether individuals were in (ever inpatients when tested) or nin (never inpatients when tested)

The results suggest that the effect is both larger in magnitude and stronger in statistical significance in males. But this difference depends on inpatient status: the signal appears strong and similar in magnitude among male and female inpatients. In non-inpatients, the signal is non-significant in almost every subgroup. Including individuals not identified genetically as white European slightly dilutes the signal, as does including close relatives.

The graph shows the effect (and 95% confidence interval) of the rarer allele: negative coefficients (on a log odds scale) indicate that the rare allele reduces the risk of returning a positive test. To illustrate, an effect size of -0.35 would imply that a male with the rare allele is 30% less likely to test positive than a male with the common allele.

Some technical details: the effect size analysis is based on a logistic regression, whereas the GWAS above uses the SAIGE tool. The covariates included are: sex, age, age*age, sex*age and genetic principal components (40 in the logistic regression, 20 in the GWAS). The results are being contributed to the COVID-19 Host Genetics Initiative.

Latest COVID-19 PCR test results for UK Biobank released

The latest tranche of COVID-19 PCR test result data for UK Biobank participants, collated from Public Health England data, has been released. This latest data covers the period March 16 - May 3. For details of how to access the data, see this post.

The updated data contains 5356 test results, 1806 of them positive. The tests correspond to 3002 participants, broken down as follows:

1073 returned one or more positive tests, of whom 825 returned tests while a hospital inpatient*
1929 never returned a positive test, of whom 1260 returned tests while a hospital inpatient