I iterate our method 5 times under 4 different simulation scenarios:
Independent uniform \(q\): \(q\) is simulated from the standard uniform distribution independent of \(p\).
Dependent uniform \(q\): \(q\) is simulated from the standard uniform distribution and then undergoes up to 2000 passes (stopping early once Spearman's correlation with \(p\) exceeds 0.1) in which adjacent elements are compared and swapped if they are out of order relative to \(p\) (similar to the bubble sort algorithm), making \(q\) dependent on \(p\).
Independent mixture \(q\): \(q\) is simulated from different mixture normal distributions each iteration that are independent of \(p\).
Dependent mixture \(q\): \(q\) is simulated from different mixture normal distributions each iteration; which of the pre-selected component distributions is sampled from for each data point is chosen based on whether a simulated standard uniform variable is greater than or equal to the corresponding \(p\) value.
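As a rough illustration, the dependent-uniform construction could be sketched as follows (a minimal NumPy sketch: the 2000-pass limit and the 0.1 Spearman threshold come from the description above, while `n`, the seed and the exact swap rule are my assumptions):

```python
import numpy as np

def spearman(a, b):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    ra = np.argsort(np.argsort(a))
    rb = np.argsort(np.argsort(b))
    return np.corrcoef(ra, rb)[0, 1]

def make_dependent_uniform(p, rng, max_passes=2000, target_rho=0.1):
    """Bubble-sort-style passes over q: swap adjacent elements of q whose
    ordering disagrees with that of the corresponding elements of p,
    stopping early once Spearman's correlation with p exceeds target_rho."""
    q = rng.uniform(size=len(p))
    for _ in range(max_passes):
        for i in range(len(q) - 1):
            if (p[i] < p[i + 1]) != (q[i] < q[i + 1]):
                q[i], q[i + 1] = q[i + 1], q[i]
        if spearman(p, q) > target_rho:
            break
    return q

rng = np.random.default_rng(1)
p = rng.uniform(size=500)           # stand-in for the p-values
q = make_dependent_uniform(p, rng)  # still marginally uniform, now rank-correlated with p
```

Because the passes only permute \(q\), its marginal distribution stays standard uniform; only its joint distribution with \(p\) changes.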
These simulations take between 40 mins and 1 hour to complete.
For the simulations using \(q\) generated from the uniform distribution, we do not have the problem of sparse data for low \(q\), so we do not need left-censoring (I set `gridp = 0`).
Encouragingly, after each iteration the resultant \(v\)-values are of the same distribution as the original \(p\) values:
Again, I set `gridp = 0` since we do not need left-censoring for uniform \(q\).
Encouragingly, after each iteration the resultant \(v\)-values appear to follow a (slightly) different distribution to the original \(p\) values:
I generate continuous auxiliary data that is simulated from various mixture normal distributions of the form:
\(w N(\mu_1, 1) + (1-w) N(\mu_2, 1)\)
where \(w \in \{0.3, 0.4, 0.5, 0.6, 0.7\}\) and \(\mu_1, \mu_2 \in \{-3,-2,-1,0,1,2,3\}\).
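A sketch of how one iteration's independent mixture \(q\) might be drawn (the grids for \(w\), \(\mu_1\) and \(\mu_2\) are as given above; the sampling code itself, including drawing distinct means, is an assumption):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000

# parameter grids from the text; a fresh draw is made each iteration
w = rng.choice([0.3, 0.4, 0.5, 0.6, 0.7])
mu1, mu2 = rng.choice([-3, -2, -1, 0, 1, 2, 3], size=2, replace=False)

# draw from w*N(mu1, 1) + (1 - w)*N(mu2, 1), independently of p
component = rng.uniform(size=n) < w
q = np.where(component, rng.normal(mu1, 1.0, n), rng.normal(mu2, 1.0, n))
```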
Encouragingly, after each iteration the resultant \(v\)-values are of the same distribution as the original \(p\) values:
I generate dependent continuous auxiliary data that is simulated from various mixture normal distributions with \(\mu \in \{-3,-2,-1,0,1,2,3\}\). Which normal distribution is sampled from for each data point in each iteration depends on whether a simulated standard uniform value is greater than or equal to the corresponding \(p\) value.
Note that I can increase the dependency (the correlation between \(p\) and \(q\)) by using a mixture normal distribution with modes further apart (e.g. the correlation is 0.2 when choosing a mixture normal with means 0 and 1, but is 0.5 when choosing a mixture normal with means -3 and 1). This may be interesting to look at for more simulation results.
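A sketch of the dependent construction, illustrating that wider mode separation gives stronger correlation (the means follow the example above; which component corresponds to which branch of the uniform comparison is my assumption):

```python
import numpy as np

def spearman(a, b):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    ra = np.argsort(np.argsort(a))
    rb = np.argsort(np.argsort(b))
    return np.corrcoef(ra, rb)[0, 1]

rng = np.random.default_rng(3)
n = 5000
p = rng.uniform(size=n)
u = rng.uniform(size=n)  # the comparison uniforms

def dependent_mixture_q(mu_a, mu_b):
    """Sample q from N(mu_a, 1) when u < p and from N(mu_b, 1) otherwise,
    so the component choice (and hence q) depends on p."""
    return np.where(u < p, rng.normal(mu_a, 1.0, n), rng.normal(mu_b, 1.0, n))

# correlation with p grows as the modes move apart
rho_narrow = spearman(p, dependent_mixture_q(1, 0))   # means 0 and 1
rho_wide = spearman(p, dependent_mixture_q(1, -3))    # means -3 and 1
```

With unit-variance components, the separation of the means controls how much of the variance of \(q\) is explained by the component choice, and therefore by \(p\).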
Encouragingly, after each iteration the distribution of the resultant \(v\)-values changes relative to the original \(p\) values.
Key things to show:
Iterating over independent \(q\) adds noise.
As \(q\) becomes more dependent on \(p\), the \(p\) values are systematically re-weighted.
Ideas:
QQ-plot of the original \(p\) values against the \(v_5\) values for the 4 simulation types.
As the correlation between \(p\) and \(q\) increases, the \(p\)-values are reweighted more. To show this, I sample \(q\) from various mixture normal distributions where the normal distribution sampled from for each data point is chosen based on whether a simulated standard uniform variable is greater than or equal to the corresponding \(p\) value (no iterations: \(p\) is kept constant and only \(q\) changes). I plot the difference between \(p\) and \(v\) against the Spearman's correlation between \(p\) and \(q\) in that iteration. [currently running simulations to plot]
Method:
a.) Download (i) partitioned LD scores from the baseline LD model v2.2 (available for 1,190,321 SNPs), (ii) regression weight LD scores (available for 1,187,349 SNPs) and (iii) allele frequencies for available variants in the 1000 Genomes phase 3 data set.
b.) Use the `munge_sumstats.py` script in the `ldsc` Python package to convert the Asthma GWAS summary statistics to the correct format for use in the ldsc software. The ldsc documentation recommends keeping only HapMap3 SNPs, so we downloaded a list of HapMap3 SNPs and used the `--merge-alleles` flag to keep only these SNPs. 1,045,340 of the 2,001,256 SNPs in the Asthma data set overlapped with HapMap3 SNPs and were retained for the S-LDSC analysis.
c.) Run S-LDSC with the `--print-coefficients` flag to generate the `.result` file containing the \(\hat{\tau}_C\) values required for FINDOR. In the S-LDSC method, regression weight LD scores are read in for 1,187,349 variants; 1,034,758 of these remain after merging with reference panel SNP LD scores, and 1,032,395 SNPs remain after merging with regression SNP LD scores. The Lambda GC value was 1.0255, the mean \(\chi^2\) value was 1.0717 and the intercept was 0.9398 (0.0106).
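For reference, the Lambda GC quoted above is the ratio of the median observed \(\chi^2\) statistic to the median of the theoretical \(\chi^2_1\) distribution (approximately 0.4549). A sketch with simulated null Z-scores (the real value comes from the ldsc log):

```python
import numpy as np

CHI2_1_MEDIAN = 0.4549364  # median of the chi-squared distribution with 1 df

def lambda_gc(z_scores):
    """Genomic inflation factor: median observed chi^2 over its null median."""
    chi2 = np.asarray(z_scores) ** 2
    return np.median(chi2) / CHI2_1_MEDIAN

rng = np.random.default_rng(4)
z_null = rng.normal(size=200_000)  # null Z-scores, so lambda should be near 1
print(round(lambda_gc(z_null), 3))
```

A value near 1 (as here, 1.0255) indicates little genome-wide inflation of the test statistics.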
a.) Download the relevant PLINK files and annotation data for baselineLD model v2.2 and follow the LD Score Estimation Tutorial on the LDSC github page to generate partitioned LD scores. Partitioned LD scores could be generated for 1,976,361 of the 2,001,256 SNPs in the Asthma data set (due to missing PLINK files and annotation data).
a.) After generating a file containing the GWAS data (for all 2,001,256 variants), including columns for sample sizes, SNP IDs and Z-scores, we used this file, along with the computed partitioned LD scores and the `.result` file from S-LDSC, to obtain re-weighted \(p\)-values for 1,976,361 SNPs using FINDOR.
The results clearly show the effect of the binning step that FINDOR implements. The default number of expected \(\chi^2\) bins is set to 100 in the FINDOR software, which means here that there are only 57 unique weights for 1,976,361 SNPs. Running FINDOR with the `--output_bin_info` flag gives the predicted tag variance (PTV; \(\hat{\chi^2}\)) values, bin assignments, \(\hat{\pi}_0\) (the estimated proportion of null SNPs) and weights.
The weight is the ratio of the estimated proportion of alternative SNPs (\(\hat{\pi_1}\)) and null SNPs (\(\hat{\pi_0}\)), so that SNPs in bins with a greater proportion of alternative SNPs will be given greater weights and have their \(p\) value down-weighted (nominal \(p\)-values are divided by weights).
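A toy illustration of the ratio-and-divide arithmetic described above (the per-bin \(\hat{\pi}_0\) values are made up, and FINDOR's actual weights may involve further normalisation across bins that is not shown):

```python
import numpy as np

pi0_hat = np.array([0.99, 0.95, 0.80])    # assumed per-bin estimated null proportions
weights = (1 - pi0_hat) / pi0_hat         # pi1_hat / pi0_hat, as described
p_nominal = np.array([1e-4, 1e-4, 1e-4])  # the same nominal p-value in each bin
p_reweighted = p_nominal / weights        # bins richer in alternatives get a larger
                                          # weight, hence a relatively smaller p
```

Here the third bin, with the largest estimated proportion of alternative SNPs, receives the largest weight and so the smallest re-weighted \(p\)-value of the three.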
The manuscript states “we can recommend the use of 100 bins for a wide range of SNP densities” and shows that this parameter value is robust on both smaller data sets (e.g. HapMap3; 1.2 million SNPs) and larger data sets (e.g. UKBB; 9.6 million SNPs), so I do not investigate using more bins here.
Comments and queries
I specified the maximum number of dimensions to be 50 in the PCAmix analysis (to save time); is this ok?
FINDOR figure:
Is a possible disadvantage of FINDOR that it requires a relatively small number of bins for the large number of data points, meaning that tens to hundreds of thousands of SNPs are given the same weight? Supplementary figure 2 below (https://www.cell.com/cms/10.1016/j.ajhg.2018.11.008/attachment/c4f28917-8a52-4604-848d-b80395570f84/mmc1) shows that the estimation of the proportion of null SNPs fails \(>75\%\) of the time when the number of bins is increased to even 0.05% of the total number of data points. I think that they keep the number of bins small to avoid overfitting the data and producing false positives, since they make multiple passes over the data (“likely due to the small number of global parameters estimated (hundred) relative to the large number of hypothesis tests performed (millions)”). Their method requires a sufficiently coarse partitioning of the data.