23/1/19

1. The problem

Plots from many, many simulations.

2. Altering GAM models

NAs in results when fitting GAMs perhaps causing systematic bias.
These occured when trying to fit GAMs where all 5000 simulated CSs included CV (covered=1) except a handful of points, in which case a GAM cannot be fit to covered vrs claimed as claimed coverage provides no information to covered.
This obviously happens more in higher thresholds (more likely to contain CV), higher maxz0 and higher OR (more association –> will be at front in ordering step –> will be present in cred set).
Consequently, if there are NaNs in the GAM result or R reports an error when fitting the GAM, the code has been altered so that the predicted coverage for that SNP causal is \(mean(covered)\) (i.e. does not utilise information on claim0).
In these simulations, the number of simulated credible sets for each SNP being considered causal has been increased from 5000 to 10000 (only increases running time by a couple of minutes).
Have also ran simulations using a logit(covered)~logit(claimed) GAM model for comparison. Both seem good.

Figure: Claimed and corrected coverage for simulations using logit(covered)~claimed (left) or logit(covered)~logit(claimed) (right) GAM models. mean(covered) used if the GAM model cannot be fitted for that SNP considered causal. The number of simulated credible sets for each SNP considered as causal is 10000. The true mu value was used.

The method works well when we know the true value of \(\mu\).
Useful to check that the method works at all ORs and all maximum z0s –> Facet above plot by OR and maxz0.

Faceting by OR and maxz0

Figure: Claimed and corrected coverage for simulations using logit(covered)~claimed.cov GAM models. mean(covered) used if the GAM model cannot be fitted for that SNP considered causal. The number of simulated credible sets for each SNP considered as causal is 10000. The true mu value was used. Faceted by OR and maxz0.

Figure: Claimed and corrected coverage for simulations using logit(covered)~logit(claimed) GAM models. mean(covered) used if the GAM model cannot be fitted for that SNP considered causal. The number of simulated credible sets for each SNP considered as causal is 10000. The true mu value was used. Faceted by OR and maxz0.

Green is better than red, showing that our method works at all OR and all maxz0.
Problem boils down to trying to find a representative value of \(\mu\).

3. Results

Simulations were ran for a variety of values for \(\mu\) using the normal model and also the logit model.
The best results seem to be from the \(\mu=maxz0\) logit model method and the \(\mu=sum(pp*abs(z0))\).

Figure: Claimed and corrected coverage for simulations using logit(covered)~claimed (left) or logit(covered)~logit(claimed) (right) GAM models. mean(covered) used if the GAM model cannot be fitted for that SNP considered causal. The number of simulated credible sets for each SNP considered as causal is 10000. mu=sum(pp*abs(z0)) was used.

4. More mu estimates

Now that we know the method works for known true \(\mu\), we simply need to find an appropriate estimate of \(\mu\).

Key (for some - can explain the rest):

maxz: max(abs(z0))
z0.pp: sum(z0*pp)
z0.abspp: sum(abs(z0)*pp)
ph0.maxz0: (1-ph0)*max(z0) (where ph0 is probability of null model)
ph0.maxabsz0: (1-ph0)*max(abs(z0))
z0.abs.pp0dash: sum(abs(z0)*pp0’) (pp0’ normalised including null using fixed prior method)
z0.abs.pp0dd: sum(abs(z0)*pp0’‘) (pp0’’ calculated using learnt ph0)

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

It seems that the z0.abs.pp0dash method is best. Here, \(abs(z0)\) is multiplied by \(pp0'\) where \(pp0'\) are the posterior probabilities obtained when we normalise including the null model of no genetic association (i.e. \(sum(pp0')<1\)). To obtain these, we fix a prior.
Within our methods we use two approaches to obtain different posterior probabilities:
1. pp0' and ph0: Here, we include the null model of no genetic association. We fix a prior for the null (0.98) and for each of the SNPs (0.0001) to calculate the posterior probabilities for the null (ph0) and the SNPs (pp0').
2. pp0'' and ph0.learnt: Here, we include the null model of no genetic association. We learn a prior for the null, which is the probability of observing something as extreme, or more extreme, given the null. Hence, we simulate lots of z-scores under the null (\(mu=0\)) and see which proportion of these is greater than maxz0 (ph0.learnt). To find the posterior probabilities for the SNPs, we calculate pp0''=pp0(1-ph0.learnt) where pp0 is the original posterior probabilties for the SNPs which sum to 1.
Note that in the latter method we find \[ph0.learnt=P(Z\geq maxz0|H_0)\] while we want, \[ph0.ideal=P(H_0|Z=maxz0)=\frac{P(Z=maxz0|H_0)P(H_0)}{P(Z=maxz0)}\]
3rd on top row and 3rd on second row look the same as the only difference is the posterior probabilities, in the latter we multiply the posterior probabilities used for the top right by \((1-ph0.learnt)\) but often ph0.learnt=0 since no simulated \(zstar~N(0,1)\) have a maximum value larger than maxz0. Hence, wehn ph0.learnt=0, these two estimates of \(\mu\) are identical.

Clearly something goes wrong for \(mu<4\), but would \(mu<4\) in genetic fine-mapping studies?
Below shows a plot of the maximum absolute z score vrs true mu for \(mu<4\), i.e. those regions in the above plots where all the estimators fail. A red line indicates \(maxz0=5\), i.e. approximately the genome-wide significance threshold.
The points above this red line (~3%) are the ones that would be considered for genetic fine-mapping.
Hence, although none of the estimators are good for \(mu<4\), only ~3% of these lie above the genome-wide significance threshold.

5. Switching alleles problem

LD matrices seem bias towards positive \(r\), but not exclusively positive.
This is why \(sum(pp\times z0)\) is not a good estimate of \(\mu\) and we have tried \(sum(pp\times abs(z0))\) instead.

	LD1	LD2	LD3	LD4
< -0.5	0.13%	0.26%	0.40%	0.80%
> 0.5	0.76%	1.99%	2.37%	2.91%

Questions

How are associations found in GWAS? Is it the single-SNP regression approach (association looked for at each tag-SNP independently??)
Are summary statistics marginal z scores and the LD matrix? Or OR, MAF and p-values? Are sufficient summary statistics the above plus sample size, variance of phenotype and LD matrix? (Have read conflicting literature)