Bugbank

What happened to the XPNPEP2 signal?

The last two releases of COVID-19 test results for the UK Biobank (18 May and 31 May) have not reproduced the signal of genome-wide significance in XPNPEP2. By contrast, another signal in chromosome 3 has meanwhile been detected in an independent cohort and appears to be supported by the COVID-19 Host Genetics Initiative's May 15th meta-analysis.

So what is the current status of the signal of association in rs2076205 in XPNPEP2? This signal was unearthed in a comparison of COVID-19 PCR positive versus negative individuals of European ancestry. The rare allele was more common than expected among PCR negative individuals. Depending on how you performed the analysis, the signal was significant or very significant in the UK Biobank cohort. However, there was a difficult-to-explain sensitivity to whether individuals with close relatives in the rest of the cohort were included in the analysis. Even so, I was unable to explain away the signal with measured confounders or population stratification.

The signal is no longer genome-wide significant in UK Biobank, whether or not you exclude individuals with close relatives. Nor was its significance supported by the May 15th meta-analysis combining UK Biobank results with other international cohorts, notably Lifelines, FinnGen and the Netherlands Twin Register.

Therefore it is tempting to write off the association as noise in a relatively small sample. Without dismissing that as a possible explanation, here I do a little more digging to try to explain why else there might have been a signal that is now much reduced.

Perhaps surprisingly, the deviation of allele frequencies from their expectation under the null hypothesis (of no association, as judged against overall allele frequencies among individuals of European ancestry in the Biobank) was driven mainly by PCR negative individuals, rather than PCR positive individuals. PCR negative individuals showed an enrichment for the rare allele. While PCR positive individuals showed a corresponding depletion of the rare allele, the signal was weaker. The trend is still apparent in the data, but its magnitude - and therefore significance - is now reduced.

The graph shows that the significance of the association generally increased with sample size between mid-March and May 1st, before reversing. Why should that be?

Figure: Observed and expected genotype counts and statistical significance of the rare allele at rs2076205 in members of the UK Biobank with European ancestry, classified by SARS-CoV-2 test result: anypos.in (ever tested positive, ever tested while an inpatient), anypos.nin (ever tested positive, never tested while an inpatient), neg.in (never tested positive, ever tested while an inpatient), neg.nin (never tested positive, never tested while an inpatient). Red lines represent analyses including individuals with close relatives in UK Biobank. Blue lines represent analyses excluding individuals with close relatives in UK Biobank. Significance calculated crudely with a binomial test. Expected genotype frequencies calculated from all UK Biobank participants of European ancestry.

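As an illustration of the kind of crude binomial test mentioned in the caption, here is a minimal sketch in Python. It is not the code used to draw the graph. The function names are hypothetical, and the choice of an exact two-sided test (summing the probabilities of all outcomes no more likely than the one observed) is an assumption, since the post does not say which variant was used. The expected frequency p would be the rare allele frequency among all UK Biobank participants of European ancestry.

```python
from math import exp, lgamma, log


def binom_logpmf(k, n, p):
    """Log probability of exactly k successes in n trials with success probability p."""
    if p <= 0.0:
        return 0.0 if k == 0 else float("-inf")
    if p >= 1.0:
        return 0.0 if k == n else float("-inf")
    # Work in log space so that large n does not overflow.
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return log_choose + k * log(p) + (n - k) * log(1.0 - p)


def binom_test_two_sided(k, n, p):
    """Exact two-sided p-value: total probability of outcomes no more likely than k."""
    observed = binom_logpmf(k, n, p)
    tolerance = 1e-9  # guards against floating-point ties between equally likely outcomes
    total = 0.0
    for i in range(n + 1):
        log_prob = binom_logpmf(i, n, p)
        if log_prob <= observed + tolerance:
            total += exp(log_prob)
    return min(1.0, total)


# Toy numbers only (not Biobank data): 8 rare alleles observed among 10,
# when the expected rare allele frequency is 0.5.
p_value = binom_test_two_sided(k=8, n=10, p=0.5)
```

In practice k would be the observed rare allele count in one group (for example neg.in) and n the total number of alleles counted in that group.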
Given the reliance of the signal on the interpretation of PCR negative individuals, one possible explanation could be a change in inpatient testing at the end of April. The testing criteria before and after the end of April probably differed by hospital, and the date of any change in testing would have varied too, but my clinical colleagues in Oxford have characterized it as follows (and apologies if there are any errors in reproducing the account here):

  • Before circa 25th April, only individuals deemed likely to have COVID-19 were tested.
  • From around 25th April onwards, testing of inpatients was drastically broadened.

There is some evidence of an increased rate of testing in the graphs: the curves of allele counts for negative inpatients rise more steeply around the end of April.

The idea - and this is only an idea - is that PCR negative individuals before 25th April contained a sizeable subgroup of individuals exposed to SARS-CoV-2 who did not present detectable levels of virus, perhaps because they have a degree of resistance to infection. After 25th April, many individuals without true exposure to SARS-CoV-2 were also tested, diluting the signal.
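To make the dilution argument concrete, here is a rough worked sketch. All symbols are assumptions introduced for illustration: N is the number of alleles counted among PCR negative individuals, a fraction f of them come from the hypothesized exposed-but-resistant subgroup with rare allele frequency p_1, and the remainder have the background frequency p_0.

```latex
\begin{align*}
\bar{p} &= f\,p_1 + (1-f)\,p_0
  && \text{(rare allele frequency observed among negatives)} \\
\bar{p} - p_0 &= f\,(p_1 - p_0)
  && \text{(excess over expectation)} \\
z &\approx \frac{N(\bar{p} - p_0)}{\sqrt{N\,p_0(1-p_0)}}
   = \frac{N_1\,(p_1 - p_0)}{\sqrt{N\,p_0(1-p_0)}},
  \qquad N_1 = fN
  && \text{(approximate test statistic)}
\end{align*}
```

Under these assumptions, if the subgroup size N_1 stays roughly fixed while broader testing increases N, the statistic z shrinks in proportion to 1/sqrt(N), which is the sense in which the signal would be diluted.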

When UK Biobank releases more detailed data on hospital episodes, it may be possible to test this idea, for example by comparing individuals with and without a diagnosis of COVID-19.

There are other possible explanations, including unmeasured confounding, ascertainment bias and noise. The explanation offered above does not explain why the signal in positive inpatients (weaker though it was) also reversed. And there are alternative interpretations of negative inpatients. Another clinical colleague of mine has suggested that they contain a subgroup of individuals whose disease is more advanced (i.e. worse) at the time of admission: there may be a window of opportunity early after infection for detecting the virus from throat swabs, and that window was missed in this subgroup.

Whatever the explanation, there is excitement at the discovery of signals elsewhere in the genome by others (which appear to be replicated in independent cohorts including UK Biobank), and the enhanced prospects the discovery represents for finding new ways to tackle the disease.

31 May 2020 COVID-19 PCR test results for UK Biobank released

The latest tranche of COVID-19 PCR test result data for UK Biobank participants, collated from Public Health England data, has been released. This latest data covers the period March 16 - May 31. For details of how to access the data, see this post.
  • 1474 returned one or more positive tests, of whom 991 returned tests while a hospital inpatient*
  • 4643 never returned a positive test, of whom 3457 returned tests while a hospital inpatient
* This definition of COVID-19 positive inpatient does not require that the positive test and the test while an inpatient necessarily coincide in participants with multiple tests. Requiring them to coincide on the same test reduces this number from 991 to 976.
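The two definitions of a positive inpatient described in the footnote can be sketched as follows. This is a minimal illustration, not the code used to produce the counts above. It assumes each test is a record with an `origin` field (1 for inpatient, 0 otherwise), a participant identifier called `eid` and a `result` field (1 for positive, 0 for negative). The `eid` and `result` names and codings are assumptions, and the function names are hypothetical.

```python
from collections import defaultdict


def classify_participants(rows):
    """Assign each participant to anypos.in, anypos.nin, neg.in or neg.nin."""
    ever_positive = defaultdict(bool)
    ever_inpatient = defaultdict(bool)
    for row in rows:
        eid = row["eid"]
        # "Ever" definitions: any one test is enough to set the flag.
        ever_positive[eid] |= row["result"] == 1
        ever_inpatient[eid] |= row["origin"] == 1
    return {
        eid: ("anypos" if ever_positive[eid] else "neg")
        + (".in" if ever_inpatient[eid] else ".nin")
        for eid in ever_positive
    }


def strict_positive_inpatients(rows):
    """Participants with at least one test that is both positive and inpatient."""
    return {row["eid"] for row in rows if row["result"] == 1 and row["origin"] == 1}


# Toy records only (not Biobank data).
toy_rows = [
    {"eid": 1, "result": 1, "origin": 0},  # positive as a non-inpatient...
    {"eid": 1, "result": 0, "origin": 1},  # ...and negative as an inpatient
    {"eid": 2, "result": 1, "origin": 1},
    {"eid": 3, "result": 0, "origin": 0},
    {"eid": 4, "result": 0, "origin": 1},
]
groups = classify_participants(toy_rows)
strict = strict_positive_inpatients(toy_rows)
```

In the toy data, participant 1 counts as anypos.in under the looser definition but is excluded by the stricter one, which is the kind of case that accounts for the difference between the two counts.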

Compare the totals to the 18 May 2020 release here.

Important notes regarding negative test results:
  • Public Health England has restructured its databases to manage large scale testing in response to implementation of the UK Government strategy's Pillar 2.
  • Consequently, new negative test results are absent for some laboratories from this release. We aim to incorporate the omitted test results in future releases.

Changes to download mechanism:

UK Biobank have slightly changed the method for downloading the covid19_result table. I have updated the steps here.


SARS-CoV-2 negative inpatient identification issue

Internal data quality checks have highlighted an issue with the identification of inpatient versus non-inpatient status among SARS-CoV-2 negative test results at some laboratories. This issue affects the PCR-based tests that we have linked to UK Biobank from Public Health England (field 40100).

For some laboratories, we report SARS-CoV-2 negative inpatients as non-inpatients (value 0 in the origin column).

The issue became apparent because we report large numbers of negative test results for some laboratories, none (or a very small number) of which are identified as inpatients. Notable laboratories include

  • Northern General Hospital (Sheffield)
  • St George’s Hospital Tooting
  • Leeds General Infirmary
  • Oxford John Radcliffe
  • Darent Valley Hospital Dartford
  • Royal Liverpool University Hospital

Additionally, we appear to have intermittently reported COVID-19 negative inpatients as non-inpatients for Sunderland Royal Infirmary. The identity of the laboratory is coded in the laboratory column. UK Biobank codes the seven laboratories (the six listed above, in that order, followed by Sunderland Royal Infirmary) as 38, 11, 31, 40, 19, 56 and 63 respectively.
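The check that exposed the issue (many negative results at a laboratory, with few or none flagged as inpatient) can be sketched roughly as below. This is an illustration only, not the internal data quality check itself. The `result` column name and its coding (0 for negative) are assumptions, the thresholds are arbitrary placeholders, and the function name is hypothetical. The `laboratory` and `origin` columns are as described above.

```python
from collections import Counter


def suspicious_labs(rows, min_negatives=100, max_inpatient_fraction=0.01):
    """Return laboratory codes with many negatives but almost no negative inpatients."""
    negatives = Counter()
    negative_inpatients = Counter()
    for row in rows:
        if row["result"] == 0:
            negatives[row["laboratory"]] += 1
            if row["origin"] == 1:
                negative_inpatients[row["laboratory"]] += 1
    return sorted(
        lab
        for lab, count in negatives.items()
        if count >= min_negatives
        and negative_inpatients[lab] / count <= max_inpatient_fraction
    )
```

A laboratory flagged by a check like this is not necessarily affected, since some laboratories may genuinely test few inpatients, so the output is a prompt for investigation rather than a diagnosis.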

We have tracked down the source of the issue to heterogeneity in the processing of negative results between laboratories. All laboratories report positive results to the Second Generation Surveillance System (SGSS). Usually only positive microbiological results are reported to SGSS. Exceptionally for SARS-CoV-2, negative results are also reported. This happens in two ways:
  1. Directly to SGSS, with a pseudo-code indicating the organism (SARS-CoV-2) and test result (negative), where usually an organism alone is reported. Most laboratories have taken this approach.
  2. Directly to Respiratory DataMart, a system set up to monitor influenza. The affected laboratories have taken this approach. We access the negative results for these laboratories separately, and do not import data required to identify inpatient status.
We are investigating a remedy for the issue. Meanwhile we advise researchers to be aware of the potential for artefacts, for example when comparing positive inpatients to negative inpatients.
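One cautious way for researchers to guard against such artefacts in the meantime is to treat inpatient status as unknown, rather than as non-inpatient, for negative results from the seven laboratories listed above. The sketch below is a suggestion under assumptions, not an official remedy. The `result` column name and its coding (0 for negative) are assumed, and the function name is hypothetical. The laboratory codes are those given above.

```python
# Laboratory codes for the seven laboratories named above.
AFFECTED_LABS = frozenset({38, 11, 31, 40, 19, 56, 63})


def mask_unreliable_origin(rows, affected=AFFECTED_LABS):
    """Return copies of the records, with origin set to None where it is unreliable.

    Only negative results recorded as non-inpatient (origin 0) at an affected
    laboratory are masked; all other records are copied unchanged.
    """
    cleaned = []
    for row in rows:
        row = dict(row)  # copy so the caller's records are not modified
        if (
            row["result"] == 0
            and row["origin"] == 0
            and row["laboratory"] in affected
        ):
            row["origin"] = None
        cleaned.append(row)
    return cleaned
```

Downstream analyses then need to decide explicitly how to handle the masked records, for example by excluding them from comparisons of positive inpatients with negative inpatients.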

(Strictly speaking, the coding in UK Biobank is correct:
  • 0: No explicit evidence in microbiological record that the participant was an inpatient
  • 1: Evidence from microbiological record that the participant was an inpatient
So the issue is one of heterogeneity of evidence, which the careful wording allows for.)