
Quasi-replication of a COVID-19 susceptibility locus in UK Biobank

Gaining broad acceptance that a genetic variant is truly associated with a trait requires (i) a strong signal in the population in which it was discovered, after controlling for possible artefacts, and (ii) a replication of the signal and direction of effect in an independent population.

Quasi-replication is a weaker form of replication in which a related trait in the same population is used in step (ii), instead of the same trait in an independent population: see e.g. this paper. The assumption is that the genetic variant affects both traits in the same way.

Quasi-replication supports the association between COVID-19 susceptibility and rs2076205 in XPNPEP2, a signal that already meets the step (i) condition. The related traits I have studied are:

  • Step 1 (discovery): Susceptibility to COVID-19. Specifically, whether individuals in UK Biobank that were tested for SARS-CoV-2 were positive (cases) or negative (controls).
  • Step 2 (quasi-replication): Susceptibility to severe COVID-19. Specifically, comparing UK Biobank participants positive for SARS-CoV-2 and requiring hospitalization (cases), versus UK Biobank participants not tested for SARS-CoV-2 (controls).
These definitions require some justification. The step 1 trait better controls for who has been exposed to SARS-CoV-2, but it does not differentiate infection severity. The step 2 trait does not control for whether people have been exposed to SARS-CoV-2, but it defines severity as COVID-19 requiring hospitalization. The untested population is a reasonable control here because only a minority of people would require hospitalization if they were infected. The cases partially overlap between trait definitions, but the controls do not, so the tests are independent if the variant has no effect and artefacts are avoided.

The rs2076205 variant quasi-replicates in UK Biobank

To test whether the most significant variant in step 1 quasi-replicated, I used the subgroup analysis to predict the direction of effect and identify the subgroup in which the effect would most likely be strongest. This led to the specific hypothesis that:
The rare allele at rs2076205 reduces the risk of COVID-19 requiring hospitalization among the white European subgroup of men and women, in an analysis that excludes close relatives.
In a replication study, the requirement for deeming a result statistically interesting is usually considered to be much less stringent than the genome-wide threshold applied at discovery. In this case, it could be argued that the single-tailed test would require a p-value of 0.01 or smaller.

This reduced stringency is convenient because the inability to know who has been exposed to SARS-CoV-2 probably makes the step 2 trait noisy, which hurts statistical power.

The result is that rs2076205 does quasi-replicate, with a one-tailed p-value of 0.00012, an estimated effect size of -0.21 and 95% confidence interval of (-0.33, -0.10). The estimated effect size is a log odds ratio, which corresponds to an odds ratio of exp(-0.21) ≈ 0.81. It implies that a male with a copy of the rare allele has roughly 19% lower odds of experiencing COVID-19 severe enough to require hospitalization, compared to a male with a copy of the common allele.
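The arithmetic behind these figures can be sketched in a few lines of Python. This is a minimal illustration that assumes a simple normal (Wald) approximation applied to the BETA and SE values reported in the SAIGE output later in this post; it is not necessarily how SAIGE itself computes its p-value, and the function name `wald_summary` is my own.

```python
import math

# Values copied from the SAIGE output for rs2076205 in this post.
beta = -0.212939151738995  # estimated log odds ratio
se = 0.0578228858969803    # standard error of the estimate


def wald_summary(beta, se, z_crit=1.959964):
    """Summarise an effect estimate under a normal approximation (an assumption).

    Returns the odds ratio, the 95% confidence interval on the log odds
    scale, and the one-tailed p-value for a protective (negative) effect.
    """
    odds_ratio = math.exp(beta)
    ci = (beta - z_crit * se, beta + z_crit * se)
    z = beta / se
    # Lower-tail probability P(Z <= z) for a standard normal variable.
    p_one_tailed = 0.5 * math.erfc(-z / math.sqrt(2))
    return odds_ratio, ci, p_one_tailed


odds_ratio, ci, p_one_tailed = wald_summary(beta, se)
print(f"odds ratio: {odds_ratio:.2f}")
print(f"95% CI (log odds): ({ci[0]:.2f}, {ci[1]:.2f})")
print(f"one-tailed p: {p_one_tailed:.5f}")
```

Under that approximation, the odds ratio rounds to 0.81 and the interval to (-0.33, -0.10), in line with the values quoted above.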

The Manhattan plots, with the XPNPEP2 gene marked, are shown, first for step 1 (discovery):
And then for step 2 (quasi-replication):
The Manhattan plots show that the signal is centred on the XPNPEP2 gene in both discovery and quasi-replication subjects. More work would be needed to dissect the signal, determine whether it is one signal or multiple, and identify the possible genetic mechanism, particularly as rs2076205 occurs in an intron.

Subgroup analysis

Having tested the specific quasi-replication hypothesis, I repeated the analysis to understand whether the effect differs in different subgroups.
As in step 1 (discovery), the effect appears slightly stronger in males than females, and in genetically-identified white Europeans compared to other groupings. The signal is again diluted by failing to exclude close relatives, whose inclusion can affect analyses in unexpected ways. The specific analysis used for quasi-replication is marked with a red asterisk.

The SAIGE output for the quasi-replication analysis was:
CHR POS rsid SNPID Allele1 Allele2 AC_Allele2 AF_Allele2 imputationInfo N BETA SE Tstat p.value p.value.NA Is.SPA.converge varT varTstar AF.Cases AF.Controls
X 128893417 rs2076205 X:128893417_C_T C T 153134.094117643 0.270302781358809 0.990731443602318 283264 -0.212939151738995 0.0578228858969803 -64.2609437640745 0.000230857982728242 0.000216327322582023 1 301.780782159031 303.712925876132 0.205153495741731 0.270416828147468
The analysis included age, sex, age*age, age*sex and 20 genetic principal components.
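To show how the one-tailed p-value relates to the SAIGE output above, here is a small Python sketch that reads the header and data row into a dictionary and converts the reported two-sided `p.value` into a one-sided value for the pre-specified direction (a negative BETA). The halving rule assumes a symmetric test statistic, and the helper name `one_sided_p` is hypothetical rather than part of SAIGE.

```python
# Header and data row copied verbatim from the SAIGE output in this post.
header = (
    "CHR POS rsid SNPID Allele1 Allele2 AC_Allele2 AF_Allele2 imputationInfo "
    "N BETA SE Tstat p.value p.value.NA Is.SPA.converge varT varTstar "
    "AF.Cases AF.Controls"
)
row = (
    "X 128893417 rs2076205 X:128893417_C_T C T 153134.094117643 "
    "0.270302781358809 0.990731443602318 283264 -0.212939151738995 "
    "0.0578228858969803 -64.2609437640745 0.000230857982728242 "
    "0.000216327322582023 1 301.780782159031 303.712925876132 "
    "0.205153495741731 0.270416828147468"
)

# Pair each column name with its value.
record = dict(zip(header.split(), row.split()))


def one_sided_p(two_sided_p, estimate, predicted_sign):
    """Convert a two-sided p-value to a one-sided one (symmetric test assumed).

    If the estimate has the predicted sign, the one-sided p-value is half
    the two-sided value; otherwise it is one minus that half.
    """
    half = two_sided_p / 2
    return half if estimate * predicted_sign > 0 else 1 - half


beta = float(record["BETA"])
p_one = one_sided_p(float(record["p.value"]), beta, predicted_sign=-1)

print(record["rsid"], f"one-tailed p = {p_one:.5f}")
print("passes the 0.01 replication threshold:", p_one <= 0.01)
```

The frequency of Allele2 is also lower in cases (`AF.Cases`) than in controls (`AF.Controls`), which is consistent with the negative sign of BETA.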

There are some potential criticisms of the analysis, beyond using the same population to validate the association. In particular, the step 2 analysis is itself of direct interest, and having run a genome-wide association study (GWAS) and found nothing significant, is it reasonable to claim that a single constituent result from that GWAS provides quasi-replication of a separate GWAS? I would argue that falsifiability was possible: the effect could have been in the opposite direction, and the signal strength could have been below that required for replication. Since there were two opportunities for falsification, the quasi-replication analysis does provide additional support that XPNPEP2 is a COVID-19 susceptibility locus.
