Investigating correlations
When we decided to leverage log-transformed \(p\)-values in our related-trait simulations, we found that the correlation between the simulated traits was too high. Specifically, the Pearson correlation between log-transformed \(p\)-values for the related traits (\(q^k\)) jumped to \(\approx 0.2\) (from \(\approx 0.07\) when using raw \(p\)-values) and the FDR was no longer controlled.
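As a quick check, both correlations can be computed directly from the simulated \(p\)-value vectors. A minimal sketch, assuming `p` and `qk` hold the \(p\)-values for the two related traits in one simulation and taking \(-\log_{10}\) as the log transform (variable names are mine, not from the simulation code):

```r
# Pearson correlation on the raw p-values
cor(p, qk, method = "pearson")

# Pearson correlation on the log-transformed p-values;
# this is the correlation reported as ~0.2 above
cor(-log10(p), -log10(qk), method = "pearson")
```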
In my old simulations, \(p,q^k\) shared causal variants in 4 out of the 24 LD blocks, and \(q^k,q^l\) also shared causal variants in 4 out of the 24 LD blocks.
In my new simulations, \(p,q^k\) share causal variants in 2 out of the 24 LD blocks, and \(q^k,q^l\) share causal variants in 1 out of the 24 LD blocks.
This makes the correlations slightly more representative of those we see in real data (although the correlations in Tom’s PID data are tiny).
In the plots below, “real data” refers to data I collected from Guille’s realm for T1D (Cooper et al. 2017) and related traits: RA (Ishigaki et al. 2020), UC (de Lange et al. 2017), CD (de Lange et al. 2017), PSC (Ji et al. 2017), PBC (Cordell et al. 2015) and CEL (Trynka et al. 2011). I also extracted GWAS \(p\)-values for T2D (Mahajan et al. 2018) to use as a negative control (shown in blue). I removed chromosome 6 so that the extremely significant \(p\)-values in the MHC do not bias the results.
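A minimal sketch of this preprocessing, assuming the summary statistics for each trait have been read into data frames with `snp`, `chr` and `p` columns (column names are hypothetical, shown here for T1D and RA):

```r
# Drop chromosome 6 so that the MHC does not dominate the correlations
t1d <- t1d[t1d$chr != 6, ]
ra  <- ra[ra$chr != 6, ]

# Align the two traits on shared SNPs, then compute the same raw and
# log-scale Pearson correlations as in the sketch above
shared <- merge(t1d, ra, by = "snp", suffixes = c("_t1d", "_ra"))
cor(shared$p_t1d, shared$p_ra, method = "pearson")
cor(-log10(shared$p_t1d), -log10(shared$p_ra), method = "pearson")
```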
The first column of plots is for raw \(p\)-values, the second for log-transformed \(p\)-values and the third for Z-scores.
However, using my new, less correlated simulations, empirical cFDR performs very badly. [Note that the Z-score results are based on fewer simulations, hence the larger error bars, and in these simulations (bottom panel) empirical cFDR was only run on SNPs with \(p<0.01\).]
Why is this?
- I have re-run empirical cFDR using only those SNPs with \(p<0.01\), as recommended in the software, but this doesn’t help.
- Perhaps it’s because there are a few simulations where the correlation is negative and empirical cFDR doesn’t know what to do? Although this can’t be causing the pattern we see, as it holds over most simulations (a quick check is sketched below).
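One way to check the second point is to tabulate the sign of the \(p\)-\(q\) correlation across simulations. A minimal sketch, assuming `sims` is a list with one element per simulation, each containing the `p` and `q` vectors for that run (object and element names are mine):

```r
# Spearman correlation between p and q^k in each simulation
rho <- sapply(sims, function(s) cor(s$p, s$q, method = "spearman"))

# How many simulations have a negative p-q correlation,
# and how negative do they get?
table(rho < 0)
summary(rho)
```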
Comments and queries
Tom is re-running his results to see how many points get left censored in a real GWAS analysis (he has \(\approx 5\) million SNPs).
In Tom’s analysis he uses (i) a `check_indep_cor` flag to stop the check that the sign of the Spearman’s correlation between the whole data set and the independent subset is the same, and (ii) an `enforce_p_q_cor` flag to stop the flipping of \(q\) if the Spearman’s correlation is negative. I’m wary about these flags and not sure whether to use them in my simulations.
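For reference, my reading of the two behaviours those flags switch off is sketched below. This is a conceptual sketch of my understanding, not Tom’s code or the package implementation; the variable names (`p`, `q`, `indep`) are hypothetical.

```r
# (i) check that the sign of the Spearman correlation between p and q
#     agrees in the full data set and in the independent SNP subset
rho_all   <- cor(p, q, method = "spearman")
rho_indep <- cor(p[indep], q[indep], method = "spearman")
if (sign(rho_all) != sign(rho_indep)) {
  warning("p-q correlation changes sign in the independent subset")
}

# (ii) if p and q are negatively correlated, flip q so that small q
#      again indicates shared signal (my understanding of the q-flip)
if (rho_all < 0) {
  q <- 1 - q
}
```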