A common problem with Bayesian fine-mapping approaches is the large number of variants contained in the credible set. Moreover, we have identified instances of over- and under-coverage, whereby fewer or more variants, respectively, are needed in the credible set to accurately achieve the claimed \(\alpha\) coverage (often \(\alpha=95\%\)). We hope to remedy this problem by developing a correction factor which, when applied to the data, rectifies instances of over- and under-coverage.
Schaid et al. (2018) state that the expected posterior probability depends on the effect size of the causal SNP on the trait (OR), the sample size (N), the number of SNPs (nsnps) and the SNP correlation structure (LD). Since experimenters know the sample size and the number of SNPs analysed, and have some knowledge of the correlation structure, I wish to find a way for experimenters to quantify the information the OR provides to the system without actually having to know the OR. This could then be used in the correction factor to improve coverage estimates in Bayesian fine-mapping experiments.
Firstly, low, medium and high power systems were constructed to understand the shape of posterior probability systems and how the OR affects this shape.
Secondly, possible measures of ‘entropy’ are considered: measures that I hope will quantify the shape of the posterior probability system and the information that the OR provides. These are calculated and plotted for the model systems.
Thirdly, a simulation dataset is created with N, nsnps, thr, MAF and LD fixed, so that any variation in power is due only to variation in the OR. The entropy measures are calculated for each simulated system, and logistic regression and random forest methods are used to assess how well the entropy measures predict coverage.
Finally, I simulate posterior probability systems with everything fixed except OR. I look at the distribution of the entropy measures for these simulations.
Model posterior probability systems were produced for low, medium and high power systems.
# x.low <- ref()
# test.low <- simdata_x(x.low, OR=1, N0=50, N1=50) # get 100 systems
# t.low <- test.low[[1]][order(-test.low[[1]]$PP),]
setwd("/Users/anna/PhD")
t.low <- read.table("t.low")
head(t.low)
## snp pvalues MAF PP CV
## s76 SNP.76 0.02214234 0.251 0.01819772 FALSE
## s14 SNP.14 0.04174950 0.249 0.01565847 FALSE
## s52 SNP.52 0.03708935 0.133 0.01414266 FALSE
## s17 SNP.17 0.03914875 0.106 0.01343483 FALSE
## s21 SNP.21 0.09245567 0.295 0.01329824 FALSE
## s15 SNP.15 0.08572478 0.240 0.01326113 FALSE
The plots below show the pdf and the cdf of the posterior probabilities for this low power system.
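Plots along these lines can be generated with base R (a sketch; the dashed line marks the \(\alpha=0.95\) credible set threshold on the cdf):

```r
# Sketch: pdf and cdf of the ordered posterior probabilities for the
# low power system (t.low is already ordered by decreasing PP).
par(mfrow = c(1, 2))
plot(t.low$PP, type = "h", xlab = "SNP rank", ylab = "PP", main = "pdf")
plot(cumsum(t.low$PP), type = "s", xlab = "SNP rank",
     ylab = "cumulative PP", main = "cdf")
abline(h = 0.95, lty = 2)  # alpha = 0.95 credible set threshold
```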
# test.med <- simdata_x(x.low, OR=1.1, N0=700, N1=700) # get 100 systems
# t.med <- test.med[[1]][order(-test.med[[1]]$PP),]
setwd("/Users/anna/PhD")
t.med <- read.table("t.med")
head(t.med)
## snp pvalues MAF PP CV
## s13 SNP.13 0.007335678 0.090 0.07256723 FALSE
## s39 SNP.39 0.011852949 0.296 0.06153209 FALSE
## s92 SNP.92 0.013167979 0.133 0.05476122 FALSE
## s99 SNP.99 0.014588567 0.307 0.05254165 FALSE
## s87 SNP.87 0.025801453 0.212 0.03503086 FALSE
## s4 SNP.4 0.019619915 0.049 0.03316034 FALSE
The plots below show the pdf and the cdf of the posterior probabilities for this medium power system.
# test.high <- simdata_x(x.low, OR=1.3, N0=1000, N1=1000) # get 100 systems
# t.high <- test.high[[1]][order(-test.high[[1]]$PP),]
setwd("/Users/anna/PhD")
t.high <- read.table("t.high")
head(t.high)
## snp pvalues MAF PP CV
## s36 SNP.36 5.293425e-06 0.313 0.9816200810 TRUE
## s44 SNP.44 1.311857e-02 0.333 0.0014266890 FALSE
## s84 SNP.84 3.083094e-02 0.048 0.0007577625 FALSE
## s50 SNP.50 3.402391e-02 0.184 0.0007321616 FALSE
## s25 SNP.25 3.618281e-02 0.143 0.0007191907 FALSE
## s64 SNP.64 4.918353e-02 0.089 0.0006014134 FALSE
The plots below show the pdf and the cdf of the posterior probabilities for this high power system.
The following figure shows all three model systems on the same plot. Note the scale on the y-axis: the first SNP in the high power system lies off the top of the plot.
A green dashed line has been added to illustrate the case where all posterior probabilities are equal (the base-level, lowest power system). We could consider using this line to quantify the differences in the shapes of posterior probability systems, with low power systems lying close to this line and high power systems far from it.
Weighted exponential curves were fitted to the pdfs of each system. The weight for the first SNP (the one with the highest posterior probability) was set to 1000 to ensure the curve incorporated this point into the fit.
set.seed(2)
we <- rep(1,100)
we[1] <- 1000 # set weight of first snp to 1000
snps <- seq(1,100,1)
# fit exp model: low
exp.low <- lm(log(t.low$PP)~snps, weights=we)
# fit exp model: med
exp.med <- lm(log(t.med$PP)~snps, weights=we)
# fit exp model: high
exp.high <- lm(log(t.high$PP)~snps, weights = we)
The plot below shows the points and the fitted exponential curves.
Perhaps the equations of these curves could be used to quantify the disorder of the system, and the information the OR provides to it.
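For reference, the slope and intercept measures used below can be read off these fits; on the original scale the fitted curve is \(PP \approx e^{B_0}e^{B_1 \cdot rank}\) (a sketch, reusing the `exp.low` fit above):

```r
# Extract the fitted coefficients; exp(B0) gives the intercept of the
# curve on the PP scale and B1 its (negative) decay rate.
b <- coef(exp.low)
slope.low <- b["snps"]              # B1
int.low   <- exp(b["(Intercept)"])  # exp(B0)
```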
Different measures were considered to try to capture the disorder of the system.
‘Entropy’: \(-mean(log(pp))\)
‘Max’: \(max(pp)\)
‘Median’: \(median(pp)\)
‘AUCDF’: Area under the cdf
‘KL divergence’: \(mean(log(pp/q))\) where q=rep(1/nsnps,nsnps).
‘Slope’: The slope of the weighted exponential curve fitted to the pdf of the posterior probabilities.
‘Intercept’: The intercept of the weighted exponential curve fitted to the pdf of the posterior probabilities (\(exp(B_0)\)).
‘Hellinger’: The Hellinger distance between the posterior probability distribution (P) and q (Q), \(\frac{1}{\sqrt{2}}||\sqrt{P}-\sqrt{Q}||_2\).
‘Kolmogorov’: The sup-distance between the cdfs of the posterior probability distribution and q.
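As a minimal sketch (the helper name and the discrete approximations to AUCDF and the Kolmogorov distance are my own), these measures can be computed from a vector of posterior probabilities as follows:

```r
# pp: vector of posterior probabilities (assumed to sum to 1);
# q is the uniform reference distribution rep(1/nsnps, nsnps).
disorder_measures <- function(pp) {
  nsnps  <- length(pp)
  q      <- rep(1 / nsnps, nsnps)
  pp.ord <- sort(pp, decreasing = TRUE)
  list(
    entropy    = -mean(log(pp)),
    max        = max(pp),
    median     = median(pp),
    AUCDF      = mean(cumsum(pp.ord)),  # discrete area under the cdf
    KL         = mean(log(pp / q)),
    hellinger  = sqrt(sum((sqrt(pp) - sqrt(q))^2)) / sqrt(2),
    kolmogorov = max(abs(cumsum(pp.ord) - cumsum(q)))
  )
}
```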
The measures seem good at distinguishing between low, medium and high power systems.
A new simulation dataset (20000 entries) was made with fixed N = 1000, nsnps = 100, LD1 = 0.3, LD2 = 0.2 and thr = 0.8.
library(rsq)
set.seed(2)
baseline.noord <- glm(covered~logitsize, data=new.noord, family="binomial")
rsq(baseline.noord,adj=TRUE)
## [1] 0.009792076
summary(baseline.noord)
##
## Call:
## glm(formula = covered ~ logitsize, family = "binomial", data = new.noord)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8496 0.4868 0.6732 0.6825 0.6867
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.42966 0.11677 3.679 0.000234 ***
## logitsize 0.64551 0.07429 8.689 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 9709.9 on 9999 degrees of freedom
## Residual deviance: 9580.4 on 9998 degrees of freedom
## AIC: 9584.4
##
## Number of Fisher Scoring iterations: 6
set.seed(2)
baseline.ord <- glm(covered~logitsize, data=new.ord, family="binomial")
rsq(baseline.ord,adj=TRUE)
## [1] 0.000273348
summary(baseline.ord)
##
## Call:
## glm(formula = covered ~ logitsize, family = "binomial", data = new.ord)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1277 0.6960 0.6971 0.6978 0.6986
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.9868 0.1529 6.454 1.09e-10 ***
## logitsize 0.2158 0.1056 2.043 0.0411 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10405 on 9999 degrees of freedom
## Residual deviance: 10400 on 9998 degrees of freedom
## AIC: 10404
##
## Number of Fisher Scoring iterations: 4
The logitsize-only predictor model performs better for the non-ordered sets than for the ordered sets (higher adjusted r² for the non-ordered sets), although there is a lot of noise.
set.seed(2)
ord.OR <- glm(covered~logitsize+OR, data=new.ord, family="binomial")
rsq(ord.OR,adj=TRUE)
## [1] 0.1015145
summary(ord.OR)
##
## Call:
## glm(formula = covered ~ logitsize + OR, family = "binomial",
## data = new.ord)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3361 0.3677 0.4833 0.8088 1.0746
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -10.7912 0.4062 -26.567 <2e-16 ***
## logitsize -0.1731 0.0920 -1.882 0.0598 .
## OR 11.4103 0.3775 30.227 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10404.9 on 9999 degrees of freedom
## Residual deviance: 9319.3 on 9997 degrees of freedom
## AIC: 9325.3
##
## Number of Fisher Scoring iterations: 5
By incorporating our new measures of entropy, we hope to decrease the AIC and increase the adjusted r².
set.seed(2)
ord.ent <- glm(covered~logitsize+entropy, data=new.ord, family="binomial")
rsq(ord.ent,adj=TRUE)
## [1] 0.0138588
summary(ord.ent)
##
## Call:
## glm(formula = covered ~ logitsize + entropy, family = "binomial",
## data = new.ord)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5425 0.6231 0.6544 0.6923 1.9394
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.91417 0.25424 11.462 < 2e-16 ***
## logitsize 1.58634 0.19525 8.125 4.48e-16 ***
## entropy -0.76415 0.06784 -11.264 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10405 on 9999 degrees of freedom
## Residual deviance: 10276 on 9997 degrees of freedom
## AIC: 10282
##
## Number of Fisher Scoring iterations: 5
set.seed(2)
ord.max <- glm(covered~logitsize+max_pp, data=new.ord, family="binomial")
rsq(ord.max,adj=TRUE)
## [1] 0.01374452
summary(ord.max)
##
## Call:
## glm(formula = covered ~ logitsize + max_pp, family = "binomial",
## data = new.ord)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6962 0.6267 0.6503 0.6887 1.0527
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.3226 0.2131 1.514 0.13
## logitsize 0.9025 0.1568 5.757 8.59e-09 ***
## max_pp -1.6012 0.1420 -11.277 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10405 on 9999 degrees of freedom
## Residual deviance: 10276 on 9997 degrees of freedom
## AIC: 10282
##
## Number of Fisher Scoring iterations: 5
set.seed(2)
ord.KL <- glm(covered~logitsize+KL, data=new.ord, family="binomial")
rsq(ord.KL,adj=TRUE)
## [1] 0.0138588
summary(ord.KL)
##
## Call:
## glm(formula = covered ~ logitsize + KL, family = "binomial",
## data = new.ord)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5425 0.6231 0.6544 0.6923 1.9394
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.60488 0.25753 -2.349 0.0188 *
## logitsize 1.58634 0.19525 8.125 4.48e-16 ***
## KL 0.76415 0.06784 11.264 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10405 on 9999 degrees of freedom
## Residual deviance: 10276 on 9997 degrees of freedom
## AIC: 10282
##
## Number of Fisher Scoring iterations: 5
set.seed(2)
ord.AUCDF <- glm(covered~logitsize+AUCDF, data=new.ord, family="binomial")
rsq(ord.AUCDF,adj=TRUE)
## [1] 0.01374434
summary(ord.AUCDF)
##
## Call:
## glm(formula = covered ~ logitsize + AUCDF, family = "binomial",
## data = new.ord)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6962 0.6267 0.6503 0.6887 1.0527
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.3066 0.2135 1.436 0.151
## logitsize 0.9025 0.1568 5.757 8.58e-09 ***
## AUCDF -1.6012 0.1420 -11.277 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10405 on 9999 degrees of freedom
## Residual deviance: 10276 on 9997 degrees of freedom
## AIC: 10282
##
## Number of Fisher Scoring iterations: 5
set.seed(2)
ord.slope <- glm(covered~logitsize+slope, data=new.ord, family="binomial")
rsq(ord.slope,adj=TRUE)
## [1] 0.0001876722
summary(ord.slope)
##
## Call:
## glm(formula = covered ~ logitsize + slope, family = "binomial",
## data = new.ord)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1320 0.6915 0.6968 0.6982 0.7132
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.986769 0.152889 6.454 1.09e-10 ***
## logitsize 0.215198 0.105653 2.037 0.0417 *
## slope -0.008469 0.026365 -0.321 0.7481
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10405 on 9999 degrees of freedom
## Residual deviance: 10400 on 9997 degrees of freedom
## AIC: 10406
##
## Number of Fisher Scoring iterations: 4
set.seed(2)
ord.int <- glm(covered~logitsize+int, data=new.ord, family="binomial")
rsq(ord.int,adj=TRUE)
## [1] 0.0002370621
summary(ord.int)
##
## Call:
## glm(formula = covered ~ logitsize + int, family = "binomial",
## data = new.ord)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1341 0.6953 0.6964 0.6973 0.8262
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.9888 0.1531 6.457 1.07e-10 ***
## logitsize 0.2183 0.1059 2.062 0.0393 *
## int -0.5396 0.7030 -0.767 0.4428
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10405 on 9999 degrees of freedom
## Residual deviance: 10399 on 9997 degrees of freedom
## AIC: 10405
##
## Number of Fisher Scoring iterations: 4
set.seed(2)
ord.hell <- glm(covered~logitsize+hellinger, data=new.ord, family="binomial")
rsq(ord.hell,adj=TRUE)
## [1] 0.01319722
summary(ord.hell)
##
## Call:
## glm(formula = covered ~ logitsize + hellinger, family = "binomial",
## data = new.ord)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5868 0.5983 0.6517 0.7059 1.1071
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.8663 0.1936 4.474 7.67e-06 ***
## logitsize 0.8243 0.1483 5.560 2.69e-08 ***
## hellinger -2.1807 0.1987 -10.978 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10405 on 9999 degrees of freedom
## Residual deviance: 10281 on 9997 degrees of freedom
## AIC: 10287
##
## Number of Fisher Scoring iterations: 5
set.seed(2)
ord.kol <- glm(covered~logitsize+kolmogorov, data=new.ord, family="binomial")
rsq(ord.kol,adj=TRUE)
## [1] 0.01433593
summary(ord.kol)
##
## Call:
## glm(formula = covered ~ logitsize + kolmogorov, family = "binomial",
## data = new.ord)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5679 0.6183 0.6500 0.6949 1.1487
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.4770 0.2047 2.330 0.0198 *
## logitsize 0.8329 0.1509 5.518 3.43e-08 ***
## kolmogorov -2.1300 0.1856 -11.476 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10405 on 9999 degrees of freedom
## Residual deviance: 10272 on 9997 degrees of freedom
## AIC: 10278
##
## Number of Fisher Scoring iterations: 5
set.seed(2)
ord.med <- glm(covered~logitsize+median, data=new.ord, family="binomial")
rsq(ord.med,adj=TRUE)
## [1] 0.01172334
summary(ord.med)
##
## Call:
## glm(formula = covered ~ logitsize + median, family = "binomial",
## data = new.ord)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4922 0.5941 0.6545 0.7105 0.9927
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.5286 0.2407 -2.196 0.0281 *
## logitsize 0.6633 0.1378 4.814 1.48e-06 ***
## median 157.1883 15.2126 10.333 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10405 on 9999 degrees of freedom
## Residual deviance: 10294 on 9997 degrees of freedom
## AIC: 10300
##
## Number of Fisher Scoring iterations: 5
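Collecting the single-measure fits above into one comparison table (a sketch reusing the model objects already defined):

```r
# Side-by-side AIC and adjusted r^2 for the logitsize + measure models.
models <- list(entropy = ord.ent, max_pp = ord.max, KL = ord.KL,
               AUCDF = ord.AUCDF, slope = ord.slope, int = ord.int,
               hellinger = ord.hell, kolmogorov = ord.kol, median = ord.med)
data.frame(AIC    = sapply(models, AIC),
           adj.r2 = sapply(models, function(m) rsq(m, adj = TRUE)))
```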
Unfortunately, none of our measures of entropy improve the model substantially; all fall well short of the improvement gained by including the OR directly.
set.seed(2)
library(glmnet)
x_ord <- as.matrix(new.ord[,c(1:3,5:15)])
cv.ord <- cv.glmnet(x_ord, y=new.ord$covered, alpha=1, family="binomial")
coef(cv.ord,s=cv.ord$lambda.1se) # r2 is 0.1757218
## 15 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -10.86689438
## thr .
## logitsize .
## nvar 0.01942373
## entropy .
## max_pp .
## KL .
## AUCDF .
## slope .
## int .
## hellinger .
## kolmogorov .
## median .
## N .
## OR 10.27669890
set.seed(2)
library(glmnet) # r2 is 0.001173106
x_ord1 <- as.matrix(new.ord[,c(1:3,5:14)]) # design matrix
cv.ord1 <- cv.glmnet(x_ord1, y=new.ord$covered, alpha=1, family="binomial")
coef(cv.ord1,s=cv.ord1$lambda.1se) # yep, all dropped
## 14 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 1.104698911
## thr .
## logitsize .
## nvar 0.003632111
## entropy .
## max_pp .
## KL .
## AUCDF .
## slope .
## int .
## hellinger .
## kolmogorov .
## median .
## N .
All the entropy measures get dropped, with or without OR in the model. Notice that the r² is much higher for the model where OR is available (0.176 vs 0.001).
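The r² values quoted in the code comments can, I believe, be recovered from the cv.glmnet fits as the fraction of deviance explained at the chosen lambda (a sketch):

```r
# Fraction of null deviance explained at lambda.1se; the lambda sequence
# of glmnet.fit matches cv.ord$lambda, so the index carries over.
i <- which(cv.ord$lambda == cv.ord$lambda.1se)
cv.ord$glmnet.fit$dev.ratio[i]
```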
Perhaps the relationship with coverage is non-linear? Unfortunately, this does not seem to be the case: restricted cubic splines improve the logitsize-only model only slightly, and nowhere near as much as rcs(OR) does.
NB: the R^2 column in this table is incorrect.
covered~ | R^2 | AIC |
---|---|---|
logitsize | 0.0026 | 10404 |
logitsize+rcs(OR) | 0.1424 | 9265 |
logitsize+rcs(entropy) | 0.0076 | 10284 |
logitsize+rcs(max_pp) | 0.0074 | 10287 |
logitsize+rcs(KL) | 0.0076 | 10284 |
logitsize+rcs(AUCDF) | 0.0074 | 10287 |
logitsize+rcs(slope) | 0.0058 | 10408 |
logitsize+rcs(int) | 0.0087 | 10315 |
logitsize+rcs(hellinger) | 0.0079 | 10278 |
logitsize+rcs(kolmogorov) | 0.0072 | 10274 |
logitsize+rcs(median) | 0.0078 | 10283 |
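The spline fits summarised above can be produced along these lines (a sketch assuming the rms package; the datadist call is required by rms before fitting):

```r
library(rms)
dd <- datadist(new.ord); options(datadist = "dd")
# logistic regressions with restricted cubic splines on each predictor
spline.OR  <- lrm(covered ~ logitsize + rcs(OR),      data = new.ord)
spline.ent <- lrm(covered ~ logitsize + rcs(entropy), data = new.ord)
AIC(spline.OR)   # cf. 9265 in the table
AIC(spline.ent)  # cf. 10284 in the table
```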
set.seed(2)
library(randomForest)
new.ord$covered <- as.factor(new.ord$covered)
# note: covered ~ . (not new.ord$covered ~ .) so that covered itself
# is excluded from the predictor set
output.forest <- randomForest(covered ~ ., data = new.ord, importance = TRUE)
varImpPlot(output.forest)
print(output.forest)
pred <- predict(output.forest,type="prob")
What if we do not allow OR or N (fixed anyway) in the model?
set.seed(2)
output.forest1 <- randomForest(covered ~ slope+int+entropy+KL+max_pp+AUCDF+nvar+thr+hellinger+kolmogorov+median,
data = new.ord, importance=TRUE)
varImpPlot(output.forest1)
print(output.forest1)
It seems that because there is so much noise in the model, none of the entropy measures are good at predicting coverage.
None of the entropy measures seem helpful in predicting coverage. Both linear and non-linear effects have been studied for each measure, but neither the AIC nor the BIC comes close to that of the logitsize+OR model.
I have also tried a simulated dataset with everything fixed (as in ‘new’), but this time with OR values up to 1.4; the entropy measures still do not improve the model. [See new1 for the OR up to 1.4 code.]
1000 posterior probability systems were simulated for 6 different OR values to understand how the OR affects the systems with everything else fixed (N0=N1=1000, LD1=0.3, LD2=0.3, nsnps=100).
Lowest: OR=1
Low: OR=1.1
Mid: OR=1.2
Midhigh: OR=1.25
High: OR=1.3
Highest: OR=1.4
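These grids can be simulated by reusing the helpers from earlier, whose signatures I take from the commented-out code above (a sketch; ref and simdata_x are the simulation functions used throughout):

```r
# Six OR values, everything else fixed (N0 = N1 = 1000, nsnps = 100,
# LD1 = LD2 = 0.3), 1000 systems per OR value.
or.grid <- c(lowest = 1, low = 1.1, mid = 1.2,
             midhigh = 1.25, high = 1.3, highest = 1.4)
# x <- ref()
# sims <- lapply(or.grid, function(or)
#   simdata_x(x, OR = or, N0 = 1000, N1 = 1000))
```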
Summaries for the p-values and posterior probabilities are given below:
The following histograms show the distribution of the entropy measures for these simulations.
Explanation: Entropy is low for low power systems, as most posterior probabilities will lie close to the y=1/nsnps line. Entropy increases for higher power systems but perhaps not as much as we’d have liked.
Explanation: The maximum moves from very close to 0 to very close to 1. Again, this seems like an appropriate measure, however it does not distinguish well between low OR systems.
Explanation: The median moves closer to 0 for higher OR systems, as there is one pp holding all the probability, and all the others have to share the small amount that is left. It seems that this measure would struggle for OR values between 1 and 1.25 (most likely OR values).
Explanation: Seems like a good measure to capture entropy, struggles for lower OR values.
Explanation: Seems like a good measure to capture entropy, struggles for lower OR values.
Explanation: Seems like a good measure to capture entropy, struggles for lower OR values.