Simulations


1. Independent uniform simulations



Comparing functional and empirical cFDR for independent uniform \(q\):


3. Independent simulations


I now move to continuous \(q\) vectors that are not supported by empirical cFDR. Here, \(q\) is randomly sampled from either distribution A (\(N(-2, 0.5^2)\)) or distribution B (\(N(3, 2^2)\)).


4. Weakly informative simulations


Functionals (CVs + SNPs within 10,000 bp) are sampled from distribution A (\(N(-2, 0.5^2)\)), non-functionals from distribution B (\(N(3, 2^2)\)).
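The two-distribution setup above can be sketched as follows (a minimal illustration, not the actual simulation code; the number of SNPs and the fraction that are functional are arbitrary choices here):

```python
import numpy as np

rng = np.random.default_rng(0)

n_snps = 10_000
# Hypothetical: ~10% of SNPs are "functional" (CVs + SNPs within 10,000 bp)
functional = rng.random(n_snps) < 0.1

# Functionals sampled from distribution A, non-functionals from distribution B
q = np.where(functional,
             rng.normal(-2.0, 0.5, n_snps),   # distribution A: N(-2, 0.5^2)
             rng.normal(3.0, 2.0, n_snps))    # distribution B: N(3, 2^2)

# Low q should flag functional SNPs
print(q[functional].mean(), q[~functional].mean())
```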


First, I show that the method re-weights p-values appropriately when the data become dependent.


There is also a gain in power:

without a loss of specificity:


We also need to use these simulations to show that repeatedly iterating over the same \(q\) (or \(q\) capturing the same information, as in these simulations) is problematic. Here, I repeatedly simulate from the same mixture distribution (i.e. the \(q\) are not independent; correlation \(\approx 0.05\)).

[Note here that we need to use the segment removal method (optional parameter ?) to prevent values getting further and further from 1].

Note that the correlation between p and q in each iteration increases as we keep sampling the same SNPs from the same extreme distribution:


Perhaps the “badness” of iterating over the same \(q\) is shown by the specificity dropping? This doesn’t happen in the probabilistic \(q\) simulations below.


5. Probabilistic functional simulations


As above, but I vary the weights: functionals are sampled from distribution A with weight 0.9 (the weight changes each iteration).
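A minimal sketch of the varying-weight scheme (again illustrative; the weights and SNP counts are not taken from the actual simulations):

```python
import numpy as np

rng = np.random.default_rng(1)
n_snps = 10_000
functional = rng.random(n_snps) < 0.1   # hypothetical functional fraction

def sample_q(weight):
    """Functionals drawn from distribution A with probability `weight`,
    otherwise (and for all non-functionals) from distribution B."""
    from_A = functional & (rng.random(n_snps) < weight)
    return np.where(from_A,
                    rng.normal(-2.0, 0.5, n_snps),   # distribution A
                    rng.normal(3.0, 2.0, n_snps))    # distribution B

# A larger weight makes q more informative about functional status
for w in (0.9, 0.5, 0.1):
    print(w, sample_q(w)[functional].mean())
```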

One example: This also shows nicely how the p-values are reweighted according to how informative q is (i.e. how large the weight is).


Asthma application


For the Asthma application, I restrict the analysis to SNPs present in all of the datasets: the Demenais “discovery” data, the UKBB “replication” data, and the 1000 Genomes phase 3 data [used (1) as reference genotype data for LD clumping and (2) to restrict to SNPs with baseline-LD annotations (for dimensionality reduction in cFDR and to generate partitioned LD scores for FINDOR)]. There are 1,968,651 such SNPs.


To define independent loci, I follow the method used in the FINDOR manuscript - “We conservatively define independent loci using PLINK’s LD-clumping algorithm with a 5 Mb window and an r2 threshold of 0.01”…

I use the European 1000 Genomes Phase 3 data as my reference genotype data. (approx. 10-15 SNPs lost per chromosome when matching).
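The clumping step corresponds to a PLINK call along these lines (a sketch; the file names are placeholders, and the thresholds are those quoted from the FINDOR manuscript):

```shell
# LD clumping: 5 Mb window, r^2 threshold 0.01, index-variant p < 1e-4,
# using European 1000 Genomes Phase 3 as the reference panel
plink --bfile 1000G_EUR_phase3 \
      --clump demenais_sumstats.txt \
      --clump-p1 0.0001 \
      --clump-r2 0.01 \
      --clump-kb 5000 \
      --out demenais_clumps
```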

Clumping using Demenais P-values: 152 clumps (with index-variant \(p<0.0001\))

Clumping using UKBB P-values: 415 clumps (with index-variant \(p<0.0001\))


There will be two comparisons:

  1. “GenoCanyon”: GenoWAP and functional cFDR leveraging GenoCanyon scores

  2. “Baseline-LD”: FINDOR and functional cFDR leveraging baseline LD model annotations


GenoCanyon


[More problematic because GenoWAP outputs PPs.]

SNP-level

For the 4246 SNPs that reached genome-wide significance in the larger UKBB GWAS, we compared the rank of the posterior scores from GenoWAP and the v-values from functional cFDR with the rank of the original \(p\)-value in the smaller Asthma GWAS data set. When using GenoWAP, 60.7% had an improved or equal rank compared to 64.7% when using functional cFDR.

The Venn diagram below shows the number of UKBB-significant SNPs that received an improved or equal rank after applying each of the two methods.


Similarly, for the 1,964,405 SNPs that were not genome-wide significant in UKBB, 44.8% had a decreased or equal rank after applying GenoWAP, compared to 48.3% when applying functional cFDR.
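The SNP-level rank comparison can be sketched as below (toy data; in the real analysis `p_disc` would be the Demenais p-values and `score` the GenoWAP negative posteriors or the cFDR v-values):

```python
import numpy as np

def frac_improved_or_equal(p_disc, score, significant):
    """Fraction of `significant` SNPs whose rank under `score`
    (smaller = better) is improved or equal vs their rank under `p_disc`."""
    rank_p = np.argsort(np.argsort(p_disc))   # 0 = best rank
    rank_s = np.argsort(np.argsort(score))
    return np.mean(rank_s[significant] <= rank_p[significant])

rng = np.random.default_rng(2)
p_disc = rng.random(1000)
# A score correlated with p but perturbed: some SNPs move up, some down
score = p_disc + rng.normal(0, 0.05, 1000)
significant = p_disc < 0.05               # stand-in for "GWS in UKBB"
print(frac_improved_or_equal(p_disc, score, significant))
```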


Locus-level

There are 18 significant loci in the Demenais data. The same 18 loci are significant after applying functional cFDR (after checking whether any SNPs in each locus have become the new index SNP).

To compare GenoWAP PPs, I follow the FINDOR manuscript (“Rank SNPs by PP and count the number of independent loci in the top K SNPs, where K is the total number of genome-wide significant SNPs reported by functional cFDR”). Here, K is 662 and the number of independent loci among the top 662 PPs is 21: 18 are the same as before, and the 3 new ones have Demenais p-values close to genome-wide significance (9e-08, 1e-07, 1e-07). I think this corresponds to GenoWAP pushing 3 more loci over the significance threshold.


Following what they do in the GenoWAP paper:

A total of 297 loci passed the genome-wide significance level in the validation UKBB data set. We ranked 1,978,017 SNPs in the Demenais study based on their P-values (Demenais), v-values (functional cFDR) and posterior scores (GenoWAP). Then, within each of the 297 loci, we compared the rank (in Demenais) of the lowest P-value (in UKBB) to the rank of the lowest v-value and the rank of the largest posterior score.

  • GenoWAP: 209/297 (70.4%) had an improved rank, 88/297 (29.6%) had a reduced rank (no ranks stayed the same).

  • Functional cFDR: 216/297 (72.7%) had an improved rank, 78/297 (26.3%) had a reduced rank (3/297 stayed the same).


Summary of GenoCanyon results

  • The results when applying GenoWAP and functional cFDR (both leveraging GenoCanyon scores) are largely comparable and are validated in the larger UKBB dataset.

  • SNP level: We ranked 1,978,017 SNPs in the Demenais study based on their P-values (Demenais), v-values (functional cFDR) and posterior scores (GenoWAP). Then, for each of the 4247 SNPs that passed genome-wide significance in the UKBB dataset, we compared the rank of the unweighted P-value (Demenais) with (1) the rank of the (negative) posterior score from GenoWAP and (2) the rank of the v-value from functional cFDR. 60.7% had an equal or improved rank after applying GenoWAP, compared to 64.7% after applying functional cFDR. 82.8% (2411/2912) of the SNPs with an improved or equal rank were shared between the two methods.

  • Locus level: We ranked 1,978,017 SNPs in the Demenais study based on their P-values (Demenais), v-values (functional cFDR) and posterior scores (GenoWAP). A total of 297 loci passed the genome-wide significance level in the validation UKBB data set. Within each of the 297 loci, we identified the SNP with the lowest UKBB P-value (the “lead-SNP”) and compared the rank of its unweighted P-value (Demenais) with (1) the rank of its (negative) posterior score from GenoWAP and (2) the rank of its v-value from functional cFDR. 70.4% had an improved or equal rank after applying GenoWAP compared to 73.7% when applying functional cFDR. 86.9% (199/229) of the loci with an improved or equal rank were shared between the two methods.


FINDOR (preliminary results)


SNP-level

[NOTE THAT THESE ARE PRELIMINARY RESULTS - I HAVEN’T ITERATED USING OUR METHOD YET]

Following the comparison used in the FINDOR manuscript (although they do this at the locus level), I define 3 classes of SNPs:

  1. GWS only in Demenais

  2. GWS only after adjustment

  3. GWS before and after adjustment

and plot the -log10(P) values from the replication data (UKBB) for each of the groups.

We see that the SNPs that functional cFDR boosts over the significance threshold (25 SNPs) have lower P-values in the UKBB data set than those that FINDOR boosts over the threshold (116 SNPs).


Can also compare results in terms of sensitivity and specificity (kind of):

  • “Sensitivity”: the proportion of significant SNPs in the UKBB data that are significant (1) in the Demenais data, (2) after applying FINDOR, (3) after applying functional cFDR.
## [1] 0.1533208
## [1] 0.1745172
## [1] 0.1519077
  • “Specificity”: the proportion of non-significant SNPs in the UKBB data that are non-significant (1) in the Demenais data, (2) after applying FINDOR, (3) after applying functional cFDR.
## [1] 0.9999975
## [1] 0.9999929
## [1] 0.9999975
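A sketch of the computation behind these proportions (toy boolean significance vectors; in the real analysis these come from thresholding the UKBB, Demenais, FINDOR and cFDR p-values at genome-wide significance):

```python
import numpy as np

def sens_spec(sig_ukbb, sig_method):
    """'Sensitivity': fraction of UKBB-significant SNPs that the method also
    calls significant; 'specificity': fraction of UKBB non-significant SNPs
    that the method also calls non-significant."""
    sens = np.mean(sig_method[sig_ukbb])
    spec = np.mean(~sig_method[~sig_ukbb])
    return sens, spec

# Toy example: 3 UKBB-significant SNPs, 3 not
sig_ukbb   = np.array([True, True, True, False, False, False])
sig_method = np.array([True, False, True, False, False, True])
print(sens_spec(sig_ukbb, sig_method))
```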

“Replication slopes” used in FINDOR manuscript:

“To quantify replication, we computed the replication slope, defined as the slope resulting from a regression of the standardized effect sizes in the replication data versus the discovery data.”

[Note - in the FINDOR manuscript they do this at the locus level, but I won’t be able to, as I have far fewer SNPs, so there would be only a couple of loci in each stratification below].

For SNPs detected (1) only in Demenais, (2) only after adjustment or (3) both before and after adjustment, I report results of a regression of standardised effect sizes (\(Z/\sqrt{N}\)) of SNPs in the UKBB data (replication) vs Demenais (discovery).

                Slope     SE   Num SNPs
Demenais only    -1.7   2.13         16
FINDOR only       3.3   1.17        116
Both             0.39   0.04        640

I do the same for functional cFDR:

                Slope     SE   Num SNPs
Demenais only    -3.6   5.87         31
cFDR only        -2.6   4.20         25
Both             0.41   0.04        625
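The replication slope and its standard error come from a simple least-squares regression; a sketch (assuming vectors of standardized effect sizes for discovery and replication):

```python
import numpy as np

def replication_slope(z_disc, z_rep):
    """Slope (and its standard error) from regressing replication
    standardized effect sizes on discovery ones, with an intercept."""
    X = np.column_stack([np.ones_like(z_disc), z_disc])
    beta, *_ = np.linalg.lstsq(X, z_rep, rcond=None)
    resid = z_rep - X @ beta
    sigma2 = resid @ resid / (len(z_rep) - 2)      # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return beta[1], np.sqrt(cov[1, 1])

# Toy check: replication effects are a noisy 0.4x copy of discovery effects
rng = np.random.default_rng(4)
z_disc = rng.normal(size=500)
z_rep = 0.4 * z_disc + rng.normal(0, 0.5, 500)
print(replication_slope(z_disc, z_rep))
```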

Locus-level

For each SNP clump, I look to see whether any SNPs exceed genome-wide significance (note that I can’t base this analysis on lead-SNPs alone, because the lead-SNP may change after applying the adjustment methods). We see that FINDOR uniquely identifies 3 additional loci as GWS, whilst functional cFDR finds one additional locus.

The UKBB p-values for the lead-SNPs in the new loci that FINDOR finds are 4.7e-31, 1.2e-08 and 2e-02. But the lead-SNP (the SNP with the smallest p-value) in a locus after applying FINDOR may not be the lead-SNP in that locus using UKBB p-values; the smallest UKBB p-values for SNPs in the newly identified loci are 4.77794e-31, 4.59214e-09 and 2.94344e-41.

The p-value in the UKBB data for the lead-SNP in the new loci that functional cFDR picks out is 4.6e-24 (the smallest UKBB p-value in this locus is 1.2172e-24).


Iterating


Question: How best to utilize the 96 annotations in iterative cFDR.

Dimensionality reduction: We first decided that the q vectors we iterate over should be statistically independent (or at least orthogonal), to ensure that we are not using the same information for re-weighting more than once. I therefore used PCAmix, which outputs a lower-dimensional decomposition made up of orthogonal vectors. To decide which of these dimensions to iterate over, in each iteration I regressed log(p) on the dimensions (removing those already used) and picked the most significant dimension as q for that iteration. Problem: the dimensions are not necessarily monotonic in p (i.e. low p can be enriched at both low and high q). This is because PCA often describes a gradient from one “thing” to another “thing”.

To try and eliminate this non-monotonicity, I looked at applying non-negative PCA and non-negative matrix factorisation, which should hopefully just pick out one “thing” (rather than comparing two “things”). However, these didn’t give informative results.

After some consideration, we decided that the dimensions to iterate over need not be statistically independent (or orthogonal in the PCA sense), because if one SNP gives a high value in one assay and a high value in a replicate assay, then we ought to use this data and believe more strongly in the high value. Stas and Kath suggested looking at mutual information, i.e. how much information one random variable gives us about another, which seems to make sense in our context. I therefore investigated independent component analysis (ICA), which aims to find (non-orthogonal) components that are mutually independent in the complete statistical sense (http://compneurosci.com/wiki/images/4/42/Intro_to_PCA_and_ICA.pdf). However, this is likely to lead to the same non-monotonic relationships we saw with PCA, as it is still comparing one “thing” with another “thing”.


Regression: We consider regularisation methods to select the relevant features (annotations). Lasso may be particularly promising as it picks out one “thing” from a correlated set of “things”. Ridge regression is not applicable here, as it only shrinks coefficients and never eliminates them completely. Elastic net, a combination of lasso and ridge, groups the parameters associated with correlated variables and either keeps them all in the model (shrunk) or removes them all at once. Note that the non-monotonicity problem is less likely to occur here, because we will just be measuring one annotation at a time (rather than combining or comparing multiple annotations).

Note that there are correlations between the annotations:

I run a lasso regression of log(p) on the annotations [using an independent subset of SNPs]. I choose the lambda value (tuning parameter) that minimises the MSE of prediction.
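A sketch of the lasso step (a plain numpy coordinate-descent implementation for illustration, not the actual pipeline; the toy data, where only the first annotation predicts the response, is hypothetical):

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Lasso by cyclic coordinate descent, minimising
    ||y - Xb||^2 / (2n) + lam * ||b||_1 (columns of X roughly standardised)."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ b + X[:, j] * b[j]        # partial residual without j
            rho = X[:, j] @ r / n
            # soft-thresholding update zeroes out weak coefficients
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0) / col_sq[j]
    return b

# Toy check: only the first "annotation" truly predicts the response
rng = np.random.default_rng(3)
X = rng.normal(size=(500, 10))
y = 2.0 * X[:, 0] + rng.normal(0, 0.5, 500)
b = lasso_cd(X, y, lam=0.2)
print(b.round(2))   # first coefficient large, the rest shrunk to zero
```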

But when I look at the correlation of each of these annotations with p, it is very low, and the correlation sign often differs between the full data set and the independent subset of SNPs. This means that my method throws an error.


I investigate iterating over the most highly correlated non-binary annotations. These are “background selection statistic” and “MAF_Adj_LLD_AFR”.


Comments and queries