3. T1D GWAS

We have non-QC’ed data in PLINK format for 3 data sets: ORPS (200 samples), AdDIT (1363 samples) and OXFORD (708 novel samples genotyped by Thermo Fisher using the Axiom UKBB V2 array). The ORPS and AdDIT data are present in two separate directories: GWASdata from UVA2016 and GWASdata from USB_NeilWalker. The former contains imputed, QC’ed data, whereas the latter contains the raw data. I propose that we use the raw data and follow our own QC and imputation protocol. I plan to use Snakemake for this project (a useful example of automating GWAS QC with Snakemake: https://github.com/pmonnahan/DataPrep).


QC

This will be useful: https://github.com/MareesAT/GWA_tutorial

Sample QC:

  • Remove samples whose recorded sex does not match the genetically inferred sex (if sex is missing but can be inferred, use the inferred sex).
  • Remove samples with \(>5\%\) missingness.
  • Remove samples with extreme heterozygosity values (plot the heterozygosity F statistic and determine a threshold value; https://discuss.hail.is/t/filtering-samples-with-extreme-heterozygosity-in-hail/1277).
  • Filter out related samples (based on pairwise IBS/IBD estimates).
  • Extract non-European samples and store these separately.
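
As a rough illustration of the heterozygosity step, the sketch below reads the text of a PLINK .het report (as written by --het, with an F column) and flags samples whose F statistic lies far from the sample mean. The function name and the default cut-off of 3 standard deviations are assumptions for illustration only; the actual threshold should come from the plot.

```python
import statistics


def flag_het_outliers(het_text, n_sd=3.0):
    """Return (FID, IID) pairs whose F statistic lies more than n_sd
    sample standard deviations from the mean F across all samples.

    het_text is the full text of a PLINK .het report, header included.
    n_sd is a placeholder cut-off (assumption), not an agreed threshold.
    """
    rows = [line.split() for line in het_text.strip().splitlines()]
    header, records = rows[0], rows[1:]
    f_col = header.index("F")  # locate the F column by name

    f_values = [float(record[f_col]) for record in records]
    mean_f = statistics.mean(f_values)
    sd_f = statistics.stdev(f_values)  # needs at least two samples

    return [
        (record[0], record[1])
        for record, f in zip(records, f_values)
        if abs(f - mean_f) > n_sd * sd_f
    ]
```

The flagged IDs could be written to a two-column file and passed to PLINK's --remove. With very few samples a single outlier inflates the standard deviation, so the plot remains the primary guide.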

SNP QC:

  • Remove SNPs with \(>5\%\) missingness.
  • Remove low-frequency SNPs (MAF \(<0.01\); this may be too stringent for small sample sizes, e.g. ORPS).
  • Remove SNPs deviating from HWE (threshold to be discussed).
  • Remove SNPs with heterozygous haploid calls (those that remain after the pseudo-autosomal region of X has been split off as Chr 25).
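
The SNP filters above could be assembled into a single PLINK call along the following lines. This is only a sketch: the helper names and file prefixes are hypothetical, the HWE threshold is left as a required argument because it has not been agreed yet, and the second helper assumes the three-column layout (FID, IID, SNP ID) of PLINK's .hh report.

```python
def snp_qc_command(bfile, out, hwe, geno=0.05, maf=0.01):
    """Build a PLINK argument list applying the SNP-level filters.

    bfile / out are PLINK file prefixes (hypothetical names).
    hwe has no default: the threshold is still to be discussed.
    """
    return [
        "plink",
        "--bfile", bfile,
        "--geno", str(geno),  # drop SNPs with missingness above geno
        "--maf", str(maf),    # drop SNPs with MAF below maf
        "--hwe", str(hwe),    # drop SNPs with HWE p-value below hwe
        "--make-bed",
        "--out", out,
    ]


def hh_snp_ids(hh_text):
    """Collect the unique SNP IDs listed in the text of a PLINK .hh
    report (one heterozygous haploid call per line: FID IID SNP)."""
    ids = {line.split()[2] for line in hh_text.splitlines() if line.strip()}
    return sorted(ids)
```

The argument list can be handed to subprocess.run, and the IDs from hh_snp_ids written to a file for PLINK's --exclude.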

[Note: I can compare our QC results for ORPS and AdDIT with those obtained using Rany’s script.]


Imputation

Use Will Rayner’s script for data preparation. This produces a set of PLINK commands that remove problematic palindromic SNPs, SNPs whose alleles differ from the reference panel, SNPs with an allele frequency difference of \(>0.2\) between the GWAS data and the reference panel (this value can be adjusted), and SNPs not in the reference panel. The commands then update strand, alleles, positions and Ref/Alt assignment.
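
To make two of these checks concrete, here is a minimal sketch of how a palindromic SNP and an excessive allele frequency difference can be detected. It is not Will Rayner’s script; the function names are hypothetical, and the frequency check assumes both frequencies refer to the same allele on the same strand.

```python
# Watson-Crick complements used to detect strand-ambiguous SNPs.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_palindromic(allele1, allele2):
    """True for A/T and C/G SNPs, whose alleles read the same on
    both strands, so a strand flip cannot be detected from alleles."""
    return COMPLEMENT.get(allele1.upper()) == allele2.upper()


def exceeds_freq_diff(study_freq, ref_freq, max_diff=0.2):
    """True if the allele frequency in the GWAS data differs from the
    reference panel frequency by more than max_diff (adjustable)."""
    return abs(study_freq - ref_freq) > max_diff
```

For instance, an A/T SNP is palindromic while an A/G SNP is not, and study and reference frequencies of 0.10 and 0.35 differ by more than the 0.2 default.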

Use the Michigan Imputation Server to impute against the HRC reference panel (monomorphic SNPs are lost at this point, but we can add them back in later).

Use Will Rayner’s IC software for post-imputation checking.


Queries

  • Should some QC steps be run on the combined data sets rather than on each data set separately?
  • I’ll have to re-run all of these QC steps after imputation. Is that correct?
  • Which threshold values are appropriate for our small sample sizes (e.g. MAF)?
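
To inform the MAF question, it may help to translate a frequency threshold into the number of minor allele copies it implies in each data set. The sketch below assumes autosomal diploid SNPs with no missing genotypes; the function name is hypothetical and the sample sizes are those quoted above.

```python
def minor_allele_copies(n_samples, maf):
    """Expected number of minor allele copies at a given MAF,
    assuming diploid autosomes and complete genotyping."""
    return 2 * n_samples * maf


# Sample sizes as stated for the three data sets.
SAMPLE_SIZES = {"ORPS": 200, "AdDIT": 1363, "OXFORD": 708}

for name, n in SAMPLE_SIZES.items():
    copies = minor_allele_copies(n, 0.01)
    print(f"{name}: MAF 0.01 corresponds to about {copies:.1f} copies")
```

In ORPS, 200 samples carry 400 chromosomes, so a MAF of 0.01 corresponds to only about 4 copies of the minor allele. Frequency estimates near the threshold are therefore very noisy in that data set.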