A common problem with Bayesian fine-mapping approaches is the large number of variants contained in the credible set. Moreover, we have identified instances of over- and under-coverage, whereby fewer or more variants, respectively, are needed in the credible set to accurately achieve the claimed \(\alpha\) coverage (often \(\alpha=95\%\)). We hope to remedy this problem by developing a correction factor which, when applied to the data, rectifies instances of over- and under-coverage.
Schaid et al. (2018) state that the expected posterior probability depends on the effect size of the causal SNP on the trait (OR), the sample size (N), the number of SNPs (nsnps) and the SNP correlation structure (LD). Since experimenters know the sample size and the number of SNPs analysed, and have some knowledge of the correlation structure, I wish to find a way for experimenters to quantify the information the OR provides to the system without actually having to know the OR. This could then be used in the correction factor to improve coverage estimates in Bayesian fine-mapping experiments.
Firstly, low, medium and high power systems were constructed to understand the shape of posterior probability systems and how the OR affects this shape.
Secondly, possible measures of ‘entropy’ are considered: measures that I hope will quantify the shape of the posterior probability system and the information that the OR provides. These are calculated and plotted for the model systems.
Thirdly, a simulation dataset is created with N, nsnps, thr, MAF and LD fixed, so that any variation in power is due only to variation in the OR. The entropy measures are calculated for each simulated system, and logistic regression and random forest methods are used to assess how well the entropy measures predict coverage.
Finally, I simulate posterior probability systems with everything fixed except OR. I look at the distribution of the entropy measures for these simulations.
Model posterior probability systems were produced for low, medium and high power systems.
# x.low <- ref()
# test.low <- simdata_x(x.low, OR=1, N0=50, N1=50) # get 100 systems
# t.low <- test.low[[1]][order(-test.low[[1]]$PP),]
setwd("/Users/anna/PhD")
t.low <- read.table("t.low")
head(t.low)
## snp pvalues MAF PP CV
## s76 SNP.76 0.02214234 0.251 0.01819772 FALSE
## s14 SNP.14 0.04174950 0.249 0.01565847 FALSE
## s52 SNP.52 0.03708935 0.133 0.01414266 FALSE
## s17 SNP.17 0.03914875 0.106 0.01343483 FALSE
## s21 SNP.21 0.09245567 0.295 0.01329824 FALSE
## s15 SNP.15 0.08572478 0.240 0.01326113 FALSE
The plots below show the pdf and the cdf of the posterior probabilities for this low power system.
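Plots along these lines can be generated with base R (a sketch; the dashed line marks the \(\alpha=0.95\) credible set threshold on the cdf):

```r
# Sketch: pdf and cdf of the ordered posterior probabilities for the
# low power system (t.low is already ordered by decreasing PP).
par(mfrow = c(1, 2))
plot(t.low$PP, type = "h", xlab = "SNP rank", ylab = "PP", main = "pdf")
plot(cumsum(t.low$PP), type = "s", xlab = "SNP rank",
     ylab = "cumulative PP", main = "cdf")
abline(h = 0.95, lty = 2)  # alpha = 0.95 credible set threshold
```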
# test.med <- simdata_x(x.low, OR=1.1, N0=700, N1=700) # get 100 systems
# t.med <- test.med[[1]][order(-test.med[[1]]$PP),]
setwd("/Users/anna/PhD")
t.med <- read.table("t.med")
head(t.med)
## snp pvalues MAF PP CV
## s13 SNP.13 0.007335678 0.090 0.07256723 FALSE
## s39 SNP.39 0.011852949 0.296 0.06153209 FALSE
## s92 SNP.92 0.013167979 0.133 0.05476122 FALSE
## s99 SNP.99 0.014588567 0.307 0.05254165 FALSE
## s87 SNP.87 0.025801453 0.212 0.03503086 FALSE
## s4 SNP.4 0.019619915 0.049 0.03316034 FALSE
The plots below show the pdf and the cdf of the posterior probabilities for this medium power system.
# test.high <- simdata_x(x.low, OR=1.3, N0=1000, N1=1000) # get 100 systems
# t.high <- test.high[[1]][order(-test.high[[1]]$PP),]
setwd("/Users/anna/PhD")
t.high <- read.table("t.high")
head(t.high)
## snp pvalues MAF PP CV
## s36 SNP.36 5.293425e-06 0.313 0.9816200810 TRUE
## s44 SNP.44 1.311857e-02 0.333 0.0014266890 FALSE
## s84 SNP.84 3.083094e-02 0.048 0.0007577625 FALSE
## s50 SNP.50 3.402391e-02 0.184 0.0007321616 FALSE
## s25 SNP.25 3.618281e-02 0.143 0.0007191907 FALSE
## s64 SNP.64 4.918353e-02 0.089 0.0006014134 FALSE
The plots below show the pdf and the cdf of the posterior probabilities for this high power system.
The following figure shows all three model systems on the same plot. Note the scale on the y-axis: the first SNP in the high power system lies off the top of the plot.
A green dashed line has been added to illustrate the case where all posterior probabilities are equal (the base-level, lowest power system). We could consider using this line to quantify the differences in the shapes of posterior probability systems, with low power systems lying close to this line and high power systems far from it.
Weighted exponential curves were fitted to the pdfs of each system. The weight for the first SNP (the one with the highest posterior probability) was set to 1000 to ensure the curve incorporated this point into the fit.
set.seed(2)
we <- rep(1,100)
we[1] <- 1000 # set weight of first snp to 1000
snps <- seq(1,100,1)
# fit exp model: low
exp.low <- lm(log(t.low$PP)~snps, weights=we)
# fit exp model: med
exp.med <- lm(log(t.med$PP)~snps, weights=we)
# fit exp model: high
exp.high <- lm(log(t.high$PP)~snps, weights = we)
The plot below shows the points and the fitted exponential curves.
Perhaps the equations of these curves could be used to quantify the disorder of the system, and the information the OR provides to it.
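For reference, the slope and intercept measures used below can be read off these fits; on the original scale the fitted curve is \(PP \approx e^{B_0}e^{B_1 \cdot rank}\) (a sketch, reusing the `exp.low` fit above):

```r
# Extract the fitted coefficients; exp(B0) gives the intercept of the
# curve on the PP scale and B1 its (negative) decay rate.
b <- coef(exp.low)
slope.low <- b["snps"]              # B1
int.low   <- exp(b["(Intercept)"])  # exp(B0)
```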
Different measures were considered to try to capture the disorder of the system.
‘Entropy’: \(-mean(log(pp))\)
‘Max’: \(max(pp)\)
‘Median’: \(median(pp)\)
‘AUCDF’: Area under the cdf
‘KL divergence’: \(mean(log(pp/q))\) where q=rep(1/nsnps,nsnps).
‘Slope’: The slope of the weighted exponential curve fitted to the pdf of the posterior probabilities.
‘Intercept’: The intercept of the weighted exponential curve fitted to the pdf of the posterior probabilities (\(exp(B_0)\)).
‘Hellinger’: The Hellinger distance between the posterior probability distribution (P) and q (Q), \(\frac{1}{\sqrt{2}}||\sqrt{P}-\sqrt{Q}||_2\).
‘Kolmogorov’: The sup-distance between the cdfs of the posterior probability distribution and q.
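As a minimal sketch (the helper name and the discrete approximations to AUCDF and the Kolmogorov distance are my own), these measures can be computed from a vector of posterior probabilities as follows:

```r
# pp: vector of posterior probabilities (assumed to sum to 1);
# q is the uniform reference distribution rep(1/nsnps, nsnps).
disorder_measures <- function(pp) {
  nsnps  <- length(pp)
  q      <- rep(1 / nsnps, nsnps)
  pp.ord <- sort(pp, decreasing = TRUE)
  list(
    entropy    = -mean(log(pp)),
    max        = max(pp),
    median     = median(pp),
    AUCDF      = mean(cumsum(pp.ord)),  # discrete area under the cdf
    KL         = mean(log(pp / q)),
    hellinger  = sqrt(sum((sqrt(pp) - sqrt(q))^2)) / sqrt(2),
    kolmogorov = max(abs(cumsum(pp.ord) - cumsum(q)))
  )
}
```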
The measures seem good at distinguishing between low, medium and high power systems.
A new simulation dataset (20000 entries) was made with fixed N = 1000, nsnps = 100, LD1 = 0.3, LD2 = 0.2 and thr = 0.8.
library(rsq)
set.seed(2)
baseline.noord <- glm(covered~logitsize, data=new.noord, family="binomial")
rsq(baseline.noord,adj=TRUE)
## [1] 0.009792076
summary(baseline.noord)
##
## Call:
## glm(formula = covered ~ logitsize, family = "binomial", data = new.noord)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8496 0.4868 0.6732 0.6825 0.6867
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.42966 0.11677 3.679 0.000234 ***
## logitsize 0.64551 0.07429 8.689 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 9709.9 on 9999 degrees of freedom
## Residual deviance: 9580.4 on 9998 degrees of freedom
## AIC: 9584.4
##
## Number of Fisher Scoring iterations: 6
set.seed(2)
baseline.ord <- glm(covered~logitsize, data=new.ord, family="binomial")
rsq(baseline.ord,adj=TRUE)
## [1] 0.000273348
summary(baseline.ord)
##
## Call:
## glm(formula = covered ~ logitsize, family = "binomial", data = new.ord)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1277 0.6960 0.6971 0.6978 0.6986
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.9868 0.1529 6.454 1.09e-10 ***
## logitsize 0.2158 0.1056 2.043 0.0411 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10405 on 9999 degrees of freedom
## Residual deviance: 10400 on 9998 degrees of freedom
## AIC: 10404
##
## Number of Fisher Scoring iterations: 4
The logitsize-only predictor model performs better for the non-ordered sets than for the ordered sets (higher adjusted r² for the non-ordered sets), although there is a lot of noise.
set.seed(2)
ord.OR <- glm(covered~logitsize+OR, data=new.ord, family="binomial")
rsq(ord.OR,adj=TRUE)
## [1] 0.1015145
summary(ord.OR)
##
## Call:
## glm(formula = covered ~ logitsize + OR, family = "binomial",
## data = new.ord)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3361 0.3677 0.4833 0.8088 1.0746
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -10.7912 0.4062 -26.567 <2e-16 ***
## logitsize -0.1731 0.0920 -1.882 0.0598 .
## OR 11.4103 0.3775 30.227 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10404.9 on 9999 degrees of freedom
## Residual deviance: 9319.3 on 9997 degrees of freedom
## AIC: 9325.3
##
## Number of Fisher Scoring iterations: 5
By incorporating our new measures of entropy, we hope to decrease the AIC and increase the adjusted r².
set.seed(2)
ord.ent <- glm(covered~logitsize+entropy, data=new.ord, family="binomial")
rsq(ord.ent,adj=TRUE)
## [1] 0.0138588
summary(ord.ent)
##
## Call:
## glm(formula = covered ~ logitsize + entropy, family = "binomial",
## data = new.ord)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5425 0.6231 0.6544 0.6923 1.9394
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.91417 0.25424 11.462 < 2e-16 ***
## logitsize 1.58634 0.19525 8.125 4.48e-16 ***
## entropy -0.76415 0.06784 -11.264 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10405 on 9999 degrees of freedom
## Residual deviance: 10276 on 9997 degrees of freedom
## AIC: 10282
##
## Number of Fisher Scoring iterations: 5
set.seed(2)
ord.max <- glm(covered~logitsize+max_pp, data=new.ord, family="binomial")
rsq(ord.max,adj=TRUE)
## [1] 0.01374452
summary(ord.max)
##
## Call:
## glm(formula = covered ~ logitsize + max_pp, family = "binomial",
## data = new.ord)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6962 0.6267 0.6503 0.6887 1.0527
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.3226 0.2131 1.514 0.13
## logitsize 0.9025 0.1568 5.757 8.59e-09 ***
## max_pp -1.6012 0.1420 -11.277 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10405 on 9999 degrees of freedom
## Residual deviance: 10276 on 9997 degrees of freedom
## AIC: 10282
##
## Number of Fisher Scoring iterations: 5
set.seed(2)
ord.KL <- glm(covered~logitsize+KL, data=new.ord, family="binomial")
rsq(ord.KL,adj=TRUE)
## [1] 0.0138588
summary(ord.KL)
##
## Call:
## glm(formula = covered ~ logitsize + KL, family = "binomial",
## data = new.ord)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5425 0.6231 0.6544 0.6923 1.9394
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.60488 0.25753 -2.349 0.0188 *
## logitsize 1.58634 0.19525 8.125 4.48e-16 ***
## KL 0.76415 0.06784 11.264 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10405 on 9999 degrees of freedom
## Residual deviance: 10276 on 9997 degrees of freedom
## AIC: 10282
##
## Number of Fisher Scoring iterations: 5
set.seed(2)
ord.AUCDF <- glm(covered~logitsize+AUCDF, data=new.ord, family="binomial")
rsq(ord.AUCDF,adj=TRUE)
## [1] 0.01374434
summary(ord.AUCDF)
##
## Call:
## glm(formula = covered ~ logitsize + AUCDF, family = "binomial",
## data = new.ord)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6962 0.6267 0.6503 0.6887 1.0527
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.3066 0.2135 1.436 0.151
## logitsize 0.9025 0.1568 5.757 8.58e-09 ***
## AUCDF -1.6012 0.1420 -11.277 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10405 on 9999 degrees of freedom
## Residual deviance: 10276 on 9997 degrees of freedom
## AIC: 10282
##
## Number of Fisher Scoring iterations: 5
set.seed(2)
ord.slope <- glm(covered~logitsize+slope, data=new.ord, family="binomial")
rsq(ord.slope,adj=TRUE)
## [1] 0.0001876722
summary(ord.slope)
##
## Call:
## glm(formula = covered ~ logitsize + slope, family = "binomial",
## data = new.ord)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1320 0.6915 0.6968 0.6982 0.7132
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.986769 0.152889 6.454 1.09e-10 ***
## logitsize 0.215198 0.105653 2.037 0.0417 *
## slope -0.008469 0.026365 -0.321 0.7481
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10405 on 9999 degrees of freedom
## Residual deviance: 10400 on 9997 degrees of freedom
## AIC: 10406
##
## Number of Fisher Scoring iterations: 4
set.seed(2)
ord.int <- glm(covered~logitsize+int, data=new.ord, family="binomial")
rsq(ord.int,adj=TRUE)
## [1] 0.0002370621
summary(ord.int)
##
## Call:
## glm(formula = covered ~ logitsize + int, family = "binomial",
## data = new.ord)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1341 0.6953 0.6964 0.6973 0.8262
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.9888 0.1531 6.457 1.07e-10 ***
## logitsize 0.2183 0.1059 2.062 0.0393 *
## int -0.5396 0.7030 -0.767 0.4428
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10405 on 9999 degrees of freedom
## Residual deviance: 10399 on 9997 degrees of freedom
## AIC: 10405
##
## Number of Fisher Scoring iterations: 4
set.seed(2)
ord.hell <- glm(covered~logitsize+hellinger, data=new.ord, family="binomial")
rsq(ord.hell,adj=TRUE)
## [1] 0.01319722
summary(ord.hell)
##
## Call:
## glm(formula = covered ~ logitsize + hellinger, family = "binomial",
## data = new.ord)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5868 0.5983 0.6517 0.7059 1.1071
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.8663 0.1936 4.474 7.67e-06 ***
## logitsize 0.8243 0.1483 5.560 2.69e-08 ***
## hellinger -2.1807 0.1987 -10.978 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10405 on 9999 degrees of freedom
## Residual deviance: 10281 on 9997 degrees of freedom
## AIC: 10287
##
## Number of Fisher Scoring iterations: 5
set.seed(2)
ord.kol <- glm(covered~logitsize+kolmogorov, data=new.ord, family="binomial")
rsq(ord.kol,adj=TRUE)
## [1] 0.01433593
summary(ord.kol)
##
## Call:
## glm(formula = covered ~ logitsize + kolmogorov, family = "binomial",
## data = new.ord)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5679 0.6183 0.6500 0.6949 1.1487
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.4770 0.2047 2.330 0.0198 *
## logitsize 0.8329 0.1509 5.518 3.43e-08 ***
## kolmogorov -2.1300 0.1856 -11.476 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10405 on 9999 degrees of freedom
## Residual deviance: 10272 on 9997 degrees of freedom
## AIC: 10278
##
## Number of Fisher Scoring iterations: 5
set.seed(2)
ord.med <- glm(covered~logitsize+median, data=new.ord, family="binomial")
rsq(ord.med,adj=TRUE)
## [1] 0.01172334
summary(ord.med)
##
## Call:
## glm(formula = covered ~ logitsize + median, family = "binomial",
## data = new.ord)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4922 0.5941 0.6545 0.7105 0.9927
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.5286 0.2407 -2.196 0.0281 *
## logitsize 0.6633 0.1378 4.814 1.48e-06 ***
## median 157.1883 15.2126 10.333 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10405 on 9999 degrees of freedom
## Residual deviance: 10294 on 9997 degrees of freedom
## AIC: 10300
##
## Number of Fisher Scoring iterations: 5
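Collecting the single-measure fits above into one comparison table (a sketch reusing the model objects already defined):

```r
# Side-by-side AIC and adjusted r^2 for the logitsize + measure models.
models <- list(entropy = ord.ent, max_pp = ord.max, KL = ord.KL,
               AUCDF = ord.AUCDF, slope = ord.slope, int = ord.int,
               hellinger = ord.hell, kolmogorov = ord.kol, median = ord.med)
data.frame(AIC    = sapply(models, AIC),
           adj.r2 = sapply(models, function(m) rsq(m, adj = TRUE)))
```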
Unfortunately, none of our measures of entropy improve the model substantially; all fall well short of the improvement gained by including the OR directly.
set.seed(2)
library(glmnet)
x_ord <- as.matrix(new.ord[,c(1:3,5:15)])
cv.ord <- cv.glmnet(x_ord, y=new.ord$covered, alpha=1, family="binomial")
coef(cv.ord,s=cv.ord$lambda.1se) # r2 is 0.1757218
## 15 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -10.86689438
## thr .
## logitsize .
## nvar 0.01942373
## entropy .
## max_pp .
## KL .
## AUCDF .
## slope .
## int .
## hellinger .
## kolmogorov .
## median .
## N .
## OR 10.27669890
set.seed(2)
library(glmnet) # r2 is 0.001173106
x_ord1 <- as.matrix(new.ord[,c(1:3,5:14)]) # design matrix
cv.ord1 <- cv.glmnet(x_ord1, y=new.ord$covered, alpha=1, family="binomial")
coef(cv.ord1,s=cv.ord1$lambda.1se) # yep, all dropped
## 14 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 1.104698911
## thr .
## logitsize .
## nvar 0.003632111
## entropy .
## max_pp .
## KL .
## AUCDF .
## slope .
## int .
## hellinger .
## kolmogorov .
## median .
## N .
All the entropy measures get dropped, with or without OR in the model. Notice that the r² is much higher for the model where OR is available (0.176 vs 0.001).
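The r² values quoted in the code comments can, I believe, be recovered from the cv.glmnet fits as the fraction of deviance explained at the chosen lambda (a sketch):

```r
# Fraction of null deviance explained at lambda.1se; the lambda sequence
# of glmnet.fit matches cv.ord$lambda, so the index carries over.
i <- which(cv.ord$lambda == cv.ord$lambda.1se)
cv.ord$glmnet.fit$dev.ratio[i]
```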
Perhaps the relationship with coverage is non-linear? Unfortunately, this does not seem to be the case: restricted cubic splines improve the logitsize-only model only slightly, and nowhere near as much as rcs(OR) does.
NB: the R^2 column in this table is incorrect.
covered~ | R^2 | AIC |
---|---|---|
logitsize | 0.0026 | 10404 |
logitsize+rcs(OR) | 0.1424 | 9265 |
logitsize+rcs(entropy) | 0.0076 | 10284 |
logitsize+rcs(max_pp) | 0.0074 | 10287 |
logitsize+rcs(KL) | 0.0076 | 10284 |
logitsize+rcs(AUCDF) | 0.0074 | 10287 |
logitsize+rcs(slope) | 0.0058 | 10408 |
logitsize+rcs(int) | 0.0087 | 10315 |
logitsize+rcs(hellinger) | 0.0079 | 10278 |
logitsize+rcs(kolmogorov) | 0.0072 | 10274 |
logitsize+rcs(median) | 0.0078 | 10283 |
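The spline fits summarised above can be produced along these lines (a sketch assuming the rms package; the datadist call is required by rms before fitting):

```r
library(rms)
dd <- datadist(new.ord); options(datadist = "dd")
# logistic regressions with restricted cubic splines on each predictor
spline.OR  <- lrm(covered ~ logitsize + rcs(OR),      data = new.ord)
spline.ent <- lrm(covered ~ logitsize + rcs(entropy), data = new.ord)
AIC(spline.OR)   # cf. 9265 in the table
AIC(spline.ent)  # cf. 10284 in the table
```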
set.seed(2)
library(randomForest)
new.ord$covered <- as.factor(new.ord$covered)
# note: covered ~ . (not new.ord$covered ~ .) so that covered itself
# is excluded from the predictor set
output.forest <- randomForest(covered ~ ., data = new.ord, importance = TRUE)
varImpPlot(output.forest)
print(output.forest)
pred <- predict(output.forest,type="prob")
What if we do not allow OR or N (fixed anyway) in the model?
set.seed(2)
output.forest1 <- randomForest(covered ~ slope+int+entropy+KL+max_pp+AUCDF+nvar+thr+hellinger+kolmogorov+median,
data = new.ord, importance=TRUE)
varImpPlot(output.forest1)
print(output.forest1)
It seems that because there is so much noise in the model, none of the entropy measures are good at predicting coverage.
None of the entropy measures seem helpful in predicting coverage. Both linear and non-linear effects have been studied for each measure, but neither the AIC nor the BIC comes close to that of the logitsize+OR model.
I have also tried a simulated dataset with everything fixed (as in ‘new’), but this time with OR values up to 1.4; the entropy measures still do not improve the model. [See new1 for the OR up to 1.4 code.]
1000 posterior probability systems were simulated for 6 different OR values to understand how the OR affects the systems with everything else fixed (N0=N1=1000, LD1=0.3, LD2=0.3, nsnps=100).
Lowest: OR=1
Low: OR=1.1
Mid: OR=1.2
Midhigh: OR=1.25
High: OR=1.3
Highest: OR=1.4
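These grids can be simulated by reusing the helpers from earlier, whose signatures I take from the commented-out code above (a sketch; ref and simdata_x are the simulation functions used throughout):

```r
# Six OR values, everything else fixed (N0 = N1 = 1000, nsnps = 100,
# LD1 = LD2 = 0.3), 1000 systems per OR value.
or.grid <- c(lowest = 1, low = 1.1, mid = 1.2,
             midhigh = 1.25, high = 1.3, highest = 1.4)
# x <- ref()
# sims <- lapply(or.grid, function(or)
#   simdata_x(x, OR = or, N0 = 1000, N1 = 1000))
```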
Summaries for the p-values and posterior probabilities are given below:
The following histograms show the distribution of the entropy measures for these simulations.
Explanation: Entropy is low for low power systems, as most posterior probabilities will lie close to the y=1/nsnps line. Entropy increases for higher power systems but perhaps not as much as we’d have liked.
Explanation: The maximum moves from very close to 0 to very close to 1. Again, this seems like an appropriate measure, however it does not distinguish well between low OR systems.
Explanation: The median moves closer to 0 for higher OR systems, as there is one pp holding all the probability, and all the others have to share the small amount that is left. It seems that this measure would struggle for OR values between 1 and 1.25 (most likely OR values).
Explanation: Seems like a good measure to capture entropy, struggles for lower OR values.
Explanation: Seems like a good measure to capture entropy, struggles for lower OR values.
Explanation: Seems like a good measure to capture entropy, struggles for lower OR values.