
Updated preprint: linking COVID-19 tests to UK Biobank

My bugbank colleagues and I have posted a substantially revised preprint describing our work linking COVID-19 test results between Public Health England's microbiology database SGSS and UK Biobank, the first data from which were released on Friday.

I recommend this short paper to anyone interpreting these test results, because it describes the scientific rationale, the data limitations, and the basic epidemiological characteristics of the COVID-19 positive UK Biobank cohort.

Armstrong, J., Rudkin, J. K., Allen, N., Crook, D. W., Wilson, D. J., Wyllie, D. H. and A.-M. O'Connell (2020)
Dynamic linkage of COVID-19 test results between Public Health England's Second Generation Surveillance System and UK Biobank
Figshare doi:10.6084/m9.figshare.12091455 (abstract preprint)

Suggestive hits for COVID-19 positive versus negative individuals

Brent Richards of McGill University and colleagues took a different approach from the one I previously advocated to defining cases (COVID-19 susceptible individuals) and controls. They analysed only UK Biobank participants who had a test result, defining cases as those with at least one positive test, and controls as anyone whose test results were all negative (whether one test or multiple).

They and Andrea Ganna suggested I take a look at this definition, and the new analysis throws up at least one very suggestive finding: a non-coding variant on the X chromosome in the XPNPEP2 gene. X-linked genes are of special interest because of the known elevated risk in males (who carry only one X chromosome) and the co-location of the ACE2 gene on chromosome X. ACE2 is a transmembrane protein by which SARS-CoV-2 (the COVID-19 virus) gains entry to cells.

Interestingly, variation in XPNPEP2 is said to confer susceptibility to ACE-inhibitor-induced angioedema, an inflammatory reaction which causes swelling and can lead to respiratory distress. It is also interesting that ACE inhibitor drugs forge a connection between the X-linked ACE2 and XPNPEP2 genes.

Here are details of the analyses. I performed two analyses: one restricted to white Europeans and one including individuals of any ancestry. The case:control counts were 387:522 and 535:636 respectively. I controlled for age, sex, the interaction between age and sex, and the first 10 or 40 principal components of genetic variation respectively. I controlled for population structure using SAIGE.

The Manhattan plot - where bigger numbers mean greater strength of evidence - for white Europeans shows two distinct peaks, on chromosomes 3 and X. The most significant variant in each has a p-value of about one in 10 million, which falls short of the conventionally agreed threshold for statistically interesting signals by a factor of two. However, sample sizes are set to increase.

The hit on chromosome 3 (rs7637558) occurs in the PTPRG gene, apparently involved in tumour suppression in some tissues. I have not yet looked into this signal in any detail. Brent and colleagues previously noted this hit on Twitter at a similar level of significance. They did not analyse the X chromosome.
Here is a close-up of the X chromosome, with ACE2 marked by a grey vertical line. The peak of blue points is in XPNPEP2, with the most significant occurring at variant rs2076205.
I have made initial efforts to check the results by re-analysing each variant using a simple logistic regression, which shows an expected dosage effect whereby individuals homozygous for the risk allele have a higher risk of being COVID-19 positive than heterozygotes.
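
To illustrate the kind of check described above, here is a minimal sketch of a logistic regression of case status on allele dosage (0, 1 or 2 copies of an allele). It is not the code used for the analysis. The genotype counts are hypothetical numbers invented for illustration, not UK Biobank data, and the function name fit_logistic is my own. The fit uses plain Newton-Raphson so that no external libraries are needed.

```python
import math

def fit_logistic(groups, iterations=25):
    """Fit logit(P(case)) = b0 + b1 * dosage by Newton-Raphson.

    groups is a list of (dosage, n_cases, n_total) tuples.
    Returns (b0, b1); exp(b1) is the odds ratio per extra allele copy.
    """
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, cases, total in groups:
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            resid = cases - total * p      # observed minus expected cases
            w = total * p * (1.0 - p)      # binomial variance weight
            g0 += resid
            g1 += resid * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        # Solve the 2x2 Newton system (information matrix times step = score).
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# HYPOTHETICAL counts (dosage, cases, total), for illustration only.
example_groups = [(0, 30, 100), (1, 45, 100), (2, 60, 100)]
intercept, slope = fit_logistic(example_groups)
print("odds ratio per allele copy:", math.exp(slope))
```

An additive model like this assumes the log odds change by the same amount for each extra copy of the allele. Comparing observed case fractions in heterozygotes and homozygotes against the fitted values is one way to see whether that assumption looks reasonable.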

Reassuringly, the top hit variant has a similar signal when analysing individuals of any ancestry. 

These results are as yet preliminary. They will be subject to scrutiny via meta-analysis of different studies in the COVID-19 Hg consortium, and as the number of cases in the UK Biobank cohort continues to rise, as regrettably it will.

For aficionados, the SAIGE output for the top two hits is here:
SAIGE results, white Europeans:
CHR POS rsid SNPID Allele1 Allele2 AC_Allele2 AF_Allele2 imputationInfo N BETA SE Tstat p.value p.value.NA Is.SPA.converge varT varTstar AF.Cases AF.Controls

X 128893417 rs2076205 X:128893417_C_T C T 490.125490196078 0.269595979205764 0.990731443602318 909 -0.467221760698161 0.0880842712142245 -59.6888627031124 1.1313178853627e-07 1.28548802524349e-07 1 127.752745535483 130.654411688048 0.188812889496884 0.329486890541657

03 61708608 rs7637558 3:61708608_A_G A G 516.086274509803 0.283875838564248 0.989045376208952 909 -0.552085056361661 0.103915060855678 -50.6246662875149 1.07924297975498e-07 1.24547474669322e-07 1 91.6972225641108 93.7799545311091 0.218731316816132 0.332172639170611
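
For readers who want to put these numbers on a more familiar scale, the sketch below converts BETA and SE into an odds ratio with an approximate 95% interval, and expresses each p-value as a multiple of the genome-wide threshold. Two assumptions are mine rather than stated above: that BETA is a log odds ratio per copy of Allele2, and that the conventional threshold is 5e-8. The function name summarise is also my own.

```python
import math

# Conventional genome-wide significance threshold (assumed value).
GENOME_WIDE_THRESHOLD = 5e-8

# (rsid, BETA, SE, p.value) copied from the SAIGE rows above.
top_hits = [
    ("rs2076205", -0.467221760698161, 0.0880842712142245, 1.1313178853627e-07),
    ("rs7637558", -0.552085056361661, 0.103915060855678, 1.07924297975498e-07),
]

def summarise(beta, se, p_value):
    """Return (odds ratio, CI lower, CI upper, p-value / threshold)."""
    odds_ratio = math.exp(beta)
    ci_low = math.exp(beta - 1.96 * se)
    ci_high = math.exp(beta + 1.96 * se)
    fold_short = p_value / GENOME_WIDE_THRESHOLD
    return odds_ratio, ci_low, ci_high, fold_short

for rsid, beta, se, p_value in top_hits:
    odds_ratio, ci_low, ci_high, fold_short = summarise(beta, se, p_value)
    print(f"{rsid}: OR={odds_ratio:.2f} ({ci_low:.2f}-{ci_high:.2f}), "
          f"p is {fold_short:.1f}x the threshold")
```

Under those assumptions, both p-values come out at roughly twice the threshold, which is the "factor of two" mentioned earlier.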

First pass analysis of human genetic susceptibility to severe COVID-19

We have performed a preliminary analysis using UK Biobank COVID-19 data to test for genes or genetic variants that increase the risk of severe COVID-19. We have done several analyses but the conclusion is the same at this stage - the sample size is currently too small to distinguish true signals from noise. However, this may change over the next few weeks as (i) unfortunately, the number of cases will rise and (ii) the results from the UK Biobank cohort are combined with other cohorts through the COVID-19 Hg initiative using meta-analysis.

I will summarize just one of the analyses for brevity. The first analysis I tried was restricted to white European participants. Restricting the analysis this way is one approach to account for possible correlations between susceptibility to severe COVID-19 and genetic ancestry. I started with white Europeans because this is the largest group in the UK Biobank. I compared 330 cases of severe COVID-19 (identified as hospital inpatients with positive tests) to a control population of 283,722 participants with no known COVID-19, as described here. I excluded individuals who did not live in England at the time of recruitment and individuals no longer followed up by UK Biobank. I also excluded individuals with genetic data that did not pass standard quality controls.

In the analyses so far, I have not yet accounted for most important epidemiological variables. Instead I have followed the COVID-19 Hg standard analysis plan which adjusts only for age and sex. In future iterations, we will address this limitation.



The figure shows a Manhattan plot summarizing the location of signals of susceptibility to infection in the human genome. The bigger the peak, the stronger the evidence that genes in that region of the genome are associated with severe disease. No signal yet meets the stringent threshold for deeming it statistically interesting. The strongest signal so far is on chromosome 2. The closest gene is called KLHL29, which is involved in a wide variety of traits including obesity according to various studies in the GWAS catalog. The next strongest signal so far is on chromosome 14, near a gene called NRXN3 which is also involved in diverse traits, also including obesity according to the GWAS catalog. If these turn out to be statistically interesting signals, that would support predictions that the analysis is likely to pick up genetic susceptibility to predisposing factors, of which obesity is one. This underlines the importance of controlling for such mediating risk factors in future analyses.

Interpreting UK Biobank COVID-19 test data

Following the press release earlier this week, UK Biobank have released the COVID-19 lab-confirmed test results for English participants that we helped link from Public Health England's Second Generation Surveillance System. In this post I describe how the data are obtained and make some recommendations for interpreting the test results.

Registering for the data

To access the data, you need to be a registered researcher attached to an approved project. (This can be a lengthy process.)

UK Biobank emailed existing researchers this week with information on how to access the new data: "If you wish to receive the primary care data restricted for COVID-19 research purposes, or the more frequent updates of other health outcome data, then please go to UK Biobank's Access Management System (AMS) and go to the Data tab of your Project/Application where you will find a button that takes you to a sign-up page for requesting these data. 

"Researchers requiring access to the primary care data will be asked to confirm that they will only use them for COVID-19 related research. The Application's Principal Investigator on an approved project will need to sign-up to apply for these data before they will be released."

Downloading the data

UK Biobank have created instructions explaining how researchers with existing projects who have successfully registered for additional access to COVID-19 data can download it.

The data are downloaded through the data portal, which requires some knowledge of SQL. To download (updated 5 June 2020):
  • Log in as usual at bbams.ndph.ox.ac.uk 
  • Navigate to your Project
  • View/Update it
  • Click the Data tab
  • Select the Go to Showcase to refresh or download data button
  • Select the Data Portal tab
  • Click Connect
  • At the bottom of the page, click the Table Download tab
  • In the Name of table box, enter covid19_result and click Fetch Data
  • Click the link produced to download the file.

Interpreting the data

NB: a description of the data format has been provided by UKB. The data format and definitions may change.

The data look like this:
eid      specdate    spectype  laboratory  origin  result
XXXXXXX  29/03/2020  1         9           1       1
XXXXXXX  29/03/2020  2         10          1       0
XXXXXXX  31/03/2020  2         10          0       1

The first thing to notice is that each row of the data refers to a COVID-19 test. So most users will need to reinterpret that table at the level of an individual participant, using the eid column.

Here I will explain how my colleagues and I envisage the data being used, which will be explained in more detail in draft 2 of the preprint. We believe the data are suited for addressing the question "Why do some people suffer from severe COVID-19?" For statistical analyses, we are defining:
  • A case as anyone with severe COVID-19: any person with at least one positive test while an inpatient. This means anyone with one or more test results with origin=1 (inpatient) AND result=1 (infected).
  • A control as anyone not known to have COVID-19: any person with no positive test results at all. This means anyone with no tests or anyone whose tests were all negative (result=0).
  • An excluded person as anyone with COVID-19 of unknown severity: any person with at least one positive test but never while an inpatient.
Cases are defined this way because UK policy since 16 March 2020 has been to admit to hospital only patients with severe disease. To help define severe cases only, no test results have been included before this date. Many individuals with positive tests who were not hospital inpatients at the time of the test are likely to be healthcare workers, so they did not necessarily have severe disease. We are assuming that members of the control group (anyone not known to have COVID-19) have not been exposed, or have not suffered severe disease.
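
The definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the column names follow the table shown earlier, the file is assumed to be tab-delimited (the real download format may differ), the eids in the toy data are made up, and the function name classify_participants is my own.

```python
import csv
import io
from collections import defaultdict

def classify_participants(rows):
    """Collapse per-test rows to one status per eid.

    case:     at least one positive test while an inpatient
    excluded: at least one positive test, but never as an inpatient
    control:  all tests negative
    """
    tests = defaultdict(list)
    for row in rows:
        tests[row["eid"]].append((int(row["origin"]), int(row["result"])))

    status = {}
    for eid, results in tests.items():
        if any(origin == 1 and result == 1 for origin, result in results):
            status[eid] = "case"
        elif any(result == 1 for _, result in results):
            status[eid] = "excluded"
        else:
            status[eid] = "control"
    return status

# Toy data with made-up eids, assumed tab-delimited.
toy_file = io.StringIO(
    "eid\tspecdate\tspectype\tlaboratory\torigin\tresult\n"
    "A\t29/03/2020\t1\t9\t1\t1\n"
    "B\t29/03/2020\t2\t10\t1\t0\n"
    "B\t31/03/2020\t2\t10\t0\t1\n"
    "C\t31/03/2020\t2\t10\t0\t0\n"
)
statuses = classify_participants(csv.DictReader(toy_file, delimiter="\t"))
print(statuses)
```

Participants who were never tested do not appear in the results file at all, so they have to be added to the control group separately from the main UK Biobank participant list.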

In the first tranche of data, using the above definitions, there were 572 cases, 97 excluded participants, and 805 participants who appear in the results file but never tested positive and therefore qualify as controls. [This article previously incorrectly said that all 97+805=902 non-cases in the results file were excluded participants.]

Note that only test results in England are reported. One way of excluding individuals who cannot be cases because they do not live in England is to assume participants live in the same country as the recruitment centre they originally attended. Assessment centre at recruitment is field 54, and the following are the codes for English recruitment centres:
11012  Barts
11021  Birmingham
11011  Bristol
11008  Bury
11024  Cheadle
11020  Croydon
11018  Hounslow
11010  Leeds
11016  Liverpool
11001  Manchester
11017  Middlesborough
11009  Newcastle
11013  Nottingham
11002  Oxford
11007  Reading
11014  Sheffield
10003  Stockport
11006  Stoke
11025  Cheadle
11026  Reading
11027  Newcastle
11028  Bristol
For me, this removed 56,650 participants, none of whom had test results (as expected). Excluding other participants who cannot be cases, such as those lost to follow-up, is also recommended.
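
As a rough sketch of that filter, the snippet below keeps only participants whose recruitment centre (field 54) is one of the English codes listed above. The dictionary of participants is hypothetical, with made-up eids, and the names ENGLISH_CENTRES and recruited_in_england are my own. How you obtain each participant's field 54 value depends on how your UK Biobank extract is stored.

```python
# Codes for English recruitment centres, copied from the list above.
ENGLISH_CENTRES = {
    11012, 11021, 11011, 11008, 11024, 11020, 11018, 11010,
    11016, 11001, 11017, 11009, 11013, 11002, 11007, 11014,
    10003, 11006, 11025, 11026, 11027, 11028,
}

def recruited_in_england(centre_code):
    """True if the field 54 code is one of the English centres."""
    return int(centre_code) in ENGLISH_CENTRES

# HYPOTHETICAL eid -> field 54 value; 11004 is a code not in the list above.
centre_by_eid = {"1000001": 11012, "1000002": 11004, "1000003": "11002"}

english_eids = sorted(
    eid for eid, code in centre_by_eid.items() if recruited_in_england(code)
)
print(english_eids)
```

This relies on the same assumption as the text: that participants still live in the country where they were recruited.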

UK Biobank announces COVID-19 initiative

UK Biobank have announced that results of COVID-19 tests for UK Biobank participants, including confirmed cases, are being provided through Public Health England, and will shortly be available for research: https://www.ukbiobank.ac.uk/2020/04/covid

Other data will follow:

  • GP (primary care) data on a monthly basis for COVID-19 related research. It will be provided via GP system suppliers EMIS Health and TPP, which cover about 95% of GP practices in England. Similar updates from Wales and Scotland are expected;
  • Hospital episodes (HES) data and death data on a monthly basis;
  • Intensive care (ICNARC) data.