This document describes the steps I followed to impute the T1D GWAS data using the Michigan imputation server.


Data Preparation

The Michigan Imputation Server requires sorted, per-chromosome VCF files compressed with bgzip. I used Will Rayner’s toolbox to prepare the data. First, I created an allele frequency file from the GWAS data. I then ran the HRC-1000G-check-bim.pl script, which checks strands, alleles, positions, Ref/Alt assignments and frequency differences between the GWAS data and the reference data (the HRC reference panel). The script produces a set of plink commands to update or remove SNPs. Specifically, it removes: A/T and G/C SNPs with \(MAF > 0.4\), SNPs with differing alleles, SNPs with an allele frequency difference \(> 0.2\) between the data sets, and SNPs not present in the reference panel.
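The removal rules can be paraphrased as a small decision function. This is my own sketch of the logic, not the script's actual code; the function name, signature and parameter layout are all assumptions for illustration.

```python
# A paraphrase (not the script's actual code) of the removal rules
# HRC-1000G-check-bim.pl applies; names and signature are my own.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def should_remove(a1, a2, gwas_freq, ref_alleles, ref_freq):
    """Return True if the check script would drop this SNP.

    gwas_freq   -- frequency of a1 in the GWAS data
    ref_alleles -- set of alleles at this site in the HRC panel,
                   or None if the SNP is absent from the panel
    ref_freq    -- frequency of the same allele in the panel
    """
    if ref_alleles is None:
        return True                                 # not in reference panel
    maf = min(gwas_freq, 1 - gwas_freq)
    if COMPLEMENT[a1] == a2 and maf > 0.4:          # A/T or G/C SNP
        return True                                 # strand unresolvable
    flipped = {COMPLEMENT[a1], COMPLEMENT[a2]}
    if {a1, a2} != ref_alleles and flipped != ref_alleles:
        return True                                 # differing alleles
    if abs(gwas_freq - ref_freq) > 0.2:             # frequency difference
        return True
    return False
```

The A/T and G/C ("palindromic") case is dropped only near \(MAF = 0.5\) because there the strand genuinely cannot be resolved from frequency alone.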

For the ORPS data, we started with 296,250 SNPs in the .bim file. The method removes 3,180 SNPs, updates the position of 1 SNP, flips the alleles of 2,966 SNPs and uses the --a2-allele flag on the remaining SNPs to ensure correct allele labelling when converting to VCF.
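The two kinds of update are easy to illustrate. These helpers are my own sketch of what the generated plink commands achieve, not plink code:

```python
# Illustrative sketch (my own helpers, not plink code) of the two fixes
# the generated plink commands apply: strand flips and --a2-allele.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def flip_strand(a1, a2):
    """What plink --flip does to a SNP: complement both alleles."""
    return COMPLEMENT[a1], COMPLEMENT[a2]

def force_a2(a1, a2, ref_allele):
    """What --a2-allele achieves: make the panel's reference allele A2,
    so REF/ALT come out correctly when the data are exported to VCF."""
    if a2 == ref_allele:
        return a1, a2
    if a1 == ref_allele:
        return a2, a1
    raise ValueError("reference allele absent: SNP needs flipping or removal")
```

Forcing A2 matters because plink otherwise sets the major allele as A2, which need not match the panel's REF allele and would produce mislabelled VCF records.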

I then used the VcfCooker tool to generate per-chromosome VCF files from the plink files and compressed these with bgzip.
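The per-chromosome split amounts to grouping VCF body lines by the CHROM column while repeating the header. VcfCooker did the real work here; this Python sketch just illustrates the transformation. Note that the server requires bgzip (BGZF) compression, which is not the same as plain gzip.

```python
# Illustrative only -- VcfCooker + bgzip produced the actual files.
from collections import defaultdict

def split_vcf_by_chrom(vcf_lines):
    """Group the body of a VCF per chromosome, repeating the header,
    mirroring the per-chromosome files VcfCooker produced."""
    header, by_chrom = [], defaultdict(list)
    for line in vcf_lines:
        if line.startswith("#"):
            header.append(line)
        else:
            by_chrom[line.split("\t", 1)[0]].append(line)
    return {chrom: header + body for chrom, body in by_chrom.items()}
```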


Michigan imputation server

I transferred the bgzipped VCF files to the Michigan imputation server. The server uses Eagle for phasing and runs some basic QC steps (see below for the report from the ORPS data, where 450 SNPs were excluded and 292,620 remained):

Statistics:
Alternative allele frequency > 0.5 sites: 83,417
Reference Overlap: 100.00 %
Match: 292,620
Allele switch: 0
Strand flip: 0
Strand flip and allele switch: 0
A/T, C/G genotypes: 0
Filtered sites:
Filter flag set: 0
Invalid alleles: 0
Multiallelic sites: 0
Duplicated sites: 0
NonSNP sites: 0
Monomorphic sites: 449
Allele mismatch: 1
SNPs call rate < 90%: 0
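As a sanity check (my own arithmetic, not part of the server output), the filtered sites in this report exactly account for the gap between the SNPs uploaded after data preparation and the SNPs that matched the reference panel:

```python
# Consistency check on the server's QC report for the ORPS data.
uploaded = 296_250 - 3_180   # .bim SNPs minus the data-preparation removals
filtered = 449 + 1           # monomorphic sites + allele mismatch
matched = 292_620
assert uploaded - filtered == matched   # 293,070 - 450 = 292,620
```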

The imputation job then runs on their server; when it finishes, the results can be downloaded as per-chromosome .dose.vcf.gz files (imputed genotype dosages) and .info.gz files (per-variant imputation statistics).


Post-imputation QC

The imputation server gives you an option to trim SNPs based on a user-defined \(r^2\) imputation quality threshold. I decided not to trim at this stage, with the plan to trim by \(r^2\) value later on.
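The planned trim can be sketched against the server's .info files. This assumes the Minimac info-file layout (tab-separated, with "SNP" and "Rsq" columns) and an illustrative threshold of 0.3; the header should be checked against the actual files before use.

```python
# Sketch of the planned r^2 trim using the server's .info files.
# Assumes tab-separated files with "SNP" and "Rsq" columns (verify
# against the real header first); threshold 0.3 is illustrative.
import csv
import io

def snps_passing_rsq(info_text, threshold=0.3):
    """Return the IDs of SNPs whose imputation Rsq meets the threshold."""
    reader = csv.DictReader(io.StringIO(info_text), delimiter="\t")
    return [row["SNP"] for row in reader if float(row["Rsq"]) >= threshold]

# Tiny made-up example, not real ORPS data:
sample = "SNP\tRsq\n1:1000\t0.95\n1:2000\t0.10\n1:3000\t0.45\n"
```

The surviving SNP IDs could then be used to subset the dose VCFs.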

I then used Will Rayner’s IC software to perform post-imputation checking. The results are summarised here (https://github.com/annahutch/PhD/blob/master/ORPS.html) and look pretty good.

In particular, this poster describes the plots. Note that the red (rather than blue) points on the chr 2 plot of information score per chromosome show that there are 2 sections of >1 Mb without variants, indicating possible imputation failures. The bar graphs summarising information score counts show that we should definitely trim by information score, as there are lots of SNPs with very low information scores.

In all, we now have imputed genotypes for 40,355,712 SNPs in 192 individuals in the ORPS data set (but we will lose a good chunk of these when I filter by information score).


Future plans