NAs in results when fitting GAMs perhaps causing systematic bias.
These occured when trying to fit GAMs where all 5000 simulated CSs included CV (covered=1) except a handful of points, in which case a GAM cannot be fit to covered vrs claimed as claimed coverage provides no information to covered.
This obviously happens more in higher thresholds (more likely to contain CV), higher maxz0 and higher OR (more association –> will be at front in ordering step –> will be present in cred set).
Consequently, if there are NaNs in the GAM result or R reports an error when fitting the GAM, the code has been altered so that the predicted coverage for that SNP causal is \(mean(covered)\) (i.e. does not utilise information on claim0).
In these simulations, the number of simulated credible sets for each SNP being considered causal has been increased from 5000 to 10000 (only increases running time by a couple of minutes).
Have also ran simulations using a logit(covered)~logit(claimed) GAM model for comparison. Both seem good.
The method works well when we know the true value of \(\mu\).
Useful to check that the method works at all ORs and all maximum z0s –> Facet above plot by OR and maxz0.
Green is better than red, showing that our method works at all OR and all maxz0.
Problem boils down to trying to find a representative value of \(\mu\).
Simulations were ran for a variety of values for \(\mu\) using the normal model and also the logit model.
The best results seem to be from the \(\mu=maxz0\) logit model method and the \(\mu=sum(pp*abs(z0))\).
Key (for some - can explain the rest):
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
It seems that the z0.abs.pp0dash
method is best. Here, \(abs(z0)\) is multiplied by \(pp0'\) where \(pp0'\) are the posterior probabilities obtained when we normalise including the null model of no genetic association (i.e. \(sum(pp0')<1\)). To obtain these, we fix a prior.
Within our methods we use two approaches to obtain different posterior probabilities:
pp0'
and ph0
: Here, we include the null model of no genetic association. We fix a prior for the null (0.98) and for each of the SNPs (0.0001) to calculate the posterior probabilities for the null (ph0
) and the SNPs (pp0'
).
pp0''
and ph0.learnt
: Here, we include the null model of no genetic association. We learn a prior for the null, which is the probability of observing something as extreme, or more extreme, given the null. Hence, we simulate lots of z-scores under the null (\(mu=0\)) and see which proportion of these is greater than maxz0 (ph0.learnt
). To find the posterior probabilities for the SNPs, we calculate pp0''=pp0(1-ph0.learnt)
where pp0
is the original posterior probabilties for the SNPs which sum to 1.
Note that in the latter method we find \[ph0.learnt=P(Z\geq maxz0|H_0)\] while we want, \[ph0.ideal=P(H_0|Z=maxz0)=\frac{P(Z=maxz0|H_0)P(H_0)}{P(Z=maxz0)}\]
3rd on top row and 3rd on second row look the same as the only difference is the posterior probabilities, in the latter we multiply the posterior probabilities used for the top right by \((1-ph0.learnt)\) but often ph0.learnt=0 since no simulated \(zstar~N(0,1)\) have a maximum value larger than maxz0. Hence, wehn ph0.learnt=0, these two estimates of \(\mu\) are identical.
Clearly something goes wrong for \(mu<4\), but would \(mu<4\) in genetic fine-mapping studies?
Below shows a plot of the maximum absolute z score vrs true mu for \(mu<4\), i.e. those regions in the above plots where all the estimators fail. A red line indicates \(maxz0=5\), i.e. approximately the genome-wide significance threshold.
The points above this red line (~3%) are the ones that would be considered for genetic fine-mapping.
Hence, although none of the estimators are good for \(mu<4\), only ~3% of these lie above the genome-wide significance threshold.
LD matrices seem bias towards positive \(r\), but not exclusively positive.
This is why \(sum(pp\times z0)\) is not a good estimate of \(\mu\) and we have tried \(sum(pp\times abs(z0))\) instead.
LD1 | LD2 | LD3 | LD4 | |
---|---|---|---|---|
< -0.5 | 0.13% | 0.26% | 0.40% | 0.80% |
> 0.5 | 0.76% | 1.99% | 2.37% | 2.91% |
How are associations found in GWAS? Is it the single-SNP regression approach (association looked for at each tag-SNP independently??)
Are summary statistics marginal z scores and the LD matrix? Or OR, MAF and p-values? Are sufficient summary statistics the above plus sample size, variance of phenotype and LD matrix? (Have read conflicting literature)