(See the just_ave directory on the HPC.)
Now that we know that a GAM is not being fitted, our method is very simple.
We estimate the true effect at the causal variant and then simulate many credible sets from the same distribution. We then find the proportion of these which contain the true causal variant. We do this for each variant considered causal and weight each proportion by that variant’s posterior probability.
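A minimal sketch of this weighting step, assuming we already have, for each variant \(j\) considered causal, a vector of 0/1 indicators recording whether each simulated credible set contained variant \(j\), and `pp0` the posterior probabilities of causality (object names here are illustrative, not the actual implementation):

# covered is a list, one element per variant considered causal, each a 0/1 vector over
# the simulated credible sets; pp0 is the corresponding vector of posterior probabilities
prop <- sapply(covered, mean)            # proportion of simulated sets containing the "causal" variant
corrected <- sum(prop * pp0) / sum(pp0)  # weight each proportion by that variant's posterior probability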
I am investigating how fast I can get this method. The slow step is forming credible sets. Chris has written a quicker function in C++ to do this, which I am now using.
The results below suggest that the correction doesn’t change much whether we use 1,000 (~55 secs) or 10,000 (~8 mins) simulated credible sets. Even the error bars don’t change much, which is somewhat surprising.
Aim: Obtain an estimated required threshold to use to get a credible set with the desired coverage.
We now have 200 GAMs, one for each SNP considered causal. Each GAM is built from `nrep*nsnp` data points of the logit(cumsum) of the pps and a binary covered indicator; note that each simulated credible set now contributes `nsnp` data points to the GAM fit, rather than 1. `nrep` is currently set to 700 for speed, resulting in `200*700=140000` data points to fit a GAM to. Many of these points sit at the top end, so I cut out those with `cumsum>0.999`, which substantially reduces the number of data points the GAM is fitted to.
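As a rough sketch of how one of these GAMs might be fitted (column names follow the notation used elsewhere in these notes, but the exact code is an assumption):

# d has one row per (simulated credible set, SNP) pair, with 'cpp' the cumulative sum of
# the ordered pps and 'covered' a 0/1 indicator of whether the CV is in the set at that point
library(mgcv)
d <- d[d$cpp <= 0.999, ]                 # cut out the pile-up of points with cumsum > 0.999
d$logitcpp <- log(d$cpp / (1 - d$cpp))   # logit(cumsum) of the pps
g <- mgcv::gam(covered ~ s(logitcpp), family = binomial, data = d)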
Sometimes a GAM cannot be fitted. For example, if the CV holds most of the posterior probability then it is always the first variant in the set, and so the `covered` column is all 1s. I have decided to discard these values as they only contribute one point, (1,1). Note that most of these are variants with very low pp0 values, so the result would be down-weighted significantly anyway.
We want to find the corresponding \(x\) value (size) for \(y=0.8\) (I am focusing on the case where users want a credible set with 80% coverage of the causal variant). To do this, I use the `approx` function in R as follows.
approx(x = gam$fitted, y = seq(0.0001, 0.999999, by = 0.01), xout = 0.8)$y
For each GAM, get the corresponding \(x\) coordinate for the desired coverage. Weight these by the pp and sum. This is what I am investigating in `idea1.R` in `backward_method` on the HPC.
If a GAM cannot be fitted to the 700 simulations for a SNP considered causal, then report NA for that SNP. Hence we weight the final threshold estimate as:
final <- sum(thr.est[!is.na(thr.est)] * pp0[!is.na(thr.est)]) / sum(pp0[!is.na(thr.est)])
Note that the denominator often equals 1 to within R’s floating point precision (the pps of the discarded variants are tiny).
Here, I have fixed the threshold at 0.8. The key outputs are:
- `orig_thr`: The original threshold (fixed at 0.8 for now)
- `orig_claim`: The original claimed coverage
- `orig_corr`: The original corrected coverage using `thr=0.8` and the original claimed coverage
- `orig_true_cov`: An empirical estimate of the true coverage (proportion of simulated credible sets with `thr=0.8` that contain the CV)
- `req_thr`: The required threshold found using our algorithm that is needed to get corrected coverage ~0.8. Would expect this to be \(>0.8\) for undercoverage and \(<0.8\) for overcoverage
- `new_claim`: The new claimed coverage of the credible set obtained using the new required threshold
- `new_corr`: The corrected coverage of the new credible set; would expect this to be approximately 0.8
- `new_true_cov`: An empirical estimate of the true coverage (proportion of simulated credible sets with `thr=req_thr` that contain the CV)
## N0 OR true.mu mu.est orig_thr orig_claim orig_corr
## 1: 5000 1.05 1.503374 2.037050 0.8 0.8011111 0.8590549
## 2: 5000 1.05 1.721908 1.777859 0.8 0.8013501 0.8475791
## 3: 5000 1.15 3.288080 4.134414 0.8 0.8516945 0.8816808
## 4: 5000 1.15 4.085090 3.932386 0.8 0.8022689 0.8855833
## 5: 20000 1.10 4.925089 4.882388 0.8 0.9830833 0.9925537
## 6: 40000 1.10 2.890369 3.723554 0.8 0.8067902 0.9158112
## orig_true_cov CV block req.thr new_claim new_corr
## 1: 0.7026 pos49799030 23 0.6786318 0.6832395 0.7481347
## 2: 0.6354 pos49308332 23 0.7117121 0.7118624 0.7666809
## 3: 0.8924 pos43095895 18 0.5635178 0.5981582 0.6699855
## 4: 0.9116 pos27788357 8 0.5569146 0.5905778 0.6950391
## 5: 0.9960 pos18114036 2 0.2802518 0.6400096 0.5965600
## 6: 0.9126 pos18065099 2 0.4570743 0.4745464 0.6258927
## new_true_cov max_pp X2nd_max
## 1: 0.5730 0.02712818 0.02112361
## 2: 0.5414 0.13389410 0.02998073
## 3: 0.7182 0.18530376 0.11038765
## 4: 0.7212 0.13946552 0.13497144
## 5: 0.5918 0.64000960 0.34307372
## 6: 0.6708 0.23727319 0.23727319
dim(y)
## [1] 280 16
dim(y[y$true.mu>5,])
## [1] 124 16
This method results in lots of credible sets with corrected/empirical coverage below the threshold. Ideally, our backward method wouldn’t lead to any credible sets with corrected/empirical coverage below the threshold, as users will not be happy if we replace their credible set with, say, 85% coverage with one with, say, 75% coverage.
BUT, let’s see what happens if we only look at those simulations where the true effect is greater than 5 (those likely to be considered for fine-mapping).
Consider an overall average method whereby we take the average of all the GAMs and predict from this. (See `/backward_method/idea2+3.R` on the HPC.)
Predict \(y\) values for the same \(x\) values (`x <- seq(0.0001, 0.999999, by = 0.01)`) from each of the GAMs. Average the predicted \(y\) values for each \(x\) value. These averaged \(y\) values and the corresponding \(x\) values form a new data set from which we can predict the required threshold.
BUT, this does not incorporate any weighting of the GAMs based on pps; a rough sketch of both the unweighted and pp-weighted versions is given below.
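A sketch of ideas 2 and 3 under these assumptions (`gams` is the list of per-SNP GAMs, `pp0` the corresponding posterior probabilities, and the GAMs use a `logitcpp` predictor; object names are illustrative):

x <- seq(0.0001, 0.999999, by = 0.01)
newdat <- data.frame(logitcpp = log(x / (1 - x)))
preds <- sapply(gams, function(g) predict(g, newdata = newdat, type = "response"))
y_avg <- rowMeans(preds)                      # idea 2: simple average of the GAM predictions
y_wt  <- as.vector(preds %*% pp0) / sum(pp0)  # idea 3: pp-weighted average
# read off the required threshold for 80% coverage from each averaged curve
req_thr_avg <- approx(x = y_avg, y = x, xout = 0.8)$y
req_thr_wt  <- approx(x = y_wt,  y = x, xout = 0.8)$y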
I ran simulations implementing ideas 2 and 3 above. The output is:
- `orig_thr`: The original threshold (fixed at 0.8 for now)
- `orig_claim`: The original claimed coverage
- `orig_corr`: The original corrected coverage using `thr=0.8` and the original claimed coverage
- `orig_true_cov`: An empirical estimate of the true coverage (proportion of simulated credible sets with `thr=0.8` that contain the CV)
- `req.thr`: The required threshold found using our algorithm that is needed to get corrected coverage ~0.8. Would expect this to be \(>0.8\) for undercoverage and \(<0.8\) for overcoverage (using the average GAM method, not normalised by pps)
- `new_claim`: The new claimed coverage of the credible set obtained using the new required threshold (using the average GAM method, not normalised by pps)
- `new_cs_corr`: The corrected coverage of the new credible set; would expect this to be approximately 0.8 (using the average GAM method, not normalised by pps)
- `new_true_cov`: An empirical estimate of the true coverage (proportion of simulated credible sets with `thr=req_thr` that contain the CV)
- `max_pp`: The maximum pp in the system
- `X2nd_max`: The second largest pp in the system
- `trueCV_GAM_reqthr`: The required threshold read off from the GAM built by considering the true CV as causal
- `maxpp_GAM_reqthr`: The required threshold read off from the GAM built by considering the variant with the maximum posterior probability as causal
- `maxz_GAM_reqthr`: The required threshold read off from the GAM built by considering the variant with the maximum absolute Z score as causal
- `req.thr.weighted`: The required threshold found using our WEIGHTED algorithm that is needed to get corrected coverage ~0.8
- `new_claim_weighted`: The new claimed coverage of the credible set obtained using the new required threshold
- `new_cs_cov_weighted`: The corrected coverage of the new credible set
- `new_true_cov_weighted`: An empirical estimate of the true coverage
## N0 OR true.mu mu.est orig_thr orig_claim orig_corr
## 1: 20000 1.1 5.902253 6.685043 0.8 0.9094746 0.8606052
## 2: 5000 1.2 6.428665 6.673968 0.8 0.9975304 0.9875717
## 3: 40000 1.2 17.944435 18.852497 0.8 0.9211821 0.9309324
## 4: 40000 1.1 6.108856 5.010136 0.8 0.8262557 0.8408722
## 5: 20000 1.2 5.579133 4.063874 0.8 0.8017712 0.8359460
## 6: 20000 1.2 12.803091 13.352036 0.8 0.9901787 0.9198079
## orig_true_cov CV block req.thr new_claim new_cs_cov
## 1: 0.8868333 pos46996137 21 0.7543082 0.9094746 0.8202562
## 2: 0.9831667 pos27447383 8 0.7347931 0.9975304 0.9894700
## 3: 0.9345000 pos27796846 8 0.7925602 0.9211821 0.9237940
## 4: 0.8081667 pos34802037 13 0.6852369 0.7258909 0.7352787
## 5: 0.8608333 pos46652063 21 0.6203681 0.6287460 0.6954543
## 6: 0.9103333 pos49337881 23 0.7831868 0.9901787 0.9148675
## new_true_cov req.thr.weighted new_claim_weighted new_cs_cov_weighted
## 1: 0.8413333 0.7894359 0.9094746 0.8515612
## 2: 0.9811667 0.5565362 0.9975304 0.9867905
## 3: 0.9316667 0.7833951 0.9211821 0.9252737
## 4: 0.6906667 0.7438827 0.7772831 0.7881810
## 5: 0.7713333 0.6928091 0.6950363 0.7576621
## 6: 0.9023333 0.8107345 0.9901787 0.9249987
## new_true_cov_weighted
## 1: 0.8755000
## 2: 0.9743333
## 3: 0.9275000
## 4: 0.7508333
## 5: 0.8130000
## 6: 0.9140000
dim(y)
## [1] 161 18
dim(y[y$true.mu>5,])
## [1] 121 18
LEFT: The first ‘original cs’ box plot shows the corrected coverage for the original credible sets; the next two show the corrected coverage values for the new credible sets obtained using the required threshold derived from averaging the GAMs (idea 2) and from averaging and weighting the GAMs (idea 3), respectively. I would expect the latter two to be closer to 0.8, as this was our aim.
RIGHT: As above, but the empirical estimate of the true coverage.
Idea 3 seems best because it has a smaller IQR and most points fall above the threshold. Idea 2 is problematic as it often produces credible sets with corrected coverage (and true coverage) below the threshold, which we really don’t want.
I am running more simulations to compare ideas 1, 2 and 3.
I have written a function, `new_cs`, which takes the parameters `desired_cov, z, f, N0, N1, Sigma, W = 0.2` and outputs a new credible set with approximately the desired coverage, the corrected coverage of this new credible set, and the new threshold used to obtain it. (In `backward_method/final_idea3/` on the HPC.)
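A hypothetical call might look like the following (the return element names are illustrative assumptions, not the actual structure): `z` are the marginal Z scores, `f` the minor allele frequencies, `N0`/`N1` the control/case sample sizes and `Sigma` the SNP correlation matrix.

res <- new_cs(desired_cov = 0.8, z = z, f = maf, N0 = 5000, N1 = 5000, Sigma = LD, W = 0.2)
res$credset   # the new credible set with approximately the desired coverage
res$corr_cov  # the corrected coverage of this new credible set
res$req_thr   # the new threshold used to obtain it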
Currently running more simulations to further investigate the accuracy of idea 3.
- Weight each model by its likelihood? Calculate the likelihood ratio of the most likely model against each of the others.
- Model stacking approaches.
- Could try a ‘permutation’ method whereby I sample points from each of the models based on some prior of how likely that model is, so I would sample more points from the models that assume a variant with high posterior probability is causal and fewer from those where an unlikely variant is considered causal. These sampled values could form a new data set to build a GAM from, repeating until I get a distribution of some sort (a rough sketch is given after this list).
- Try GLMs instead of GAMs. I have tried this, but it is no better (simulations in `\backward_method\logit\` on the HPC).
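A rough sketch of what the ‘permutation’ idea might look like (all object names are illustrative): `simdata` is the list of per-SNP GAM training sets and `pp0` the prior weight attached to each model.

library(mgcv)
nsamp <- 50000
# decide how many points to draw from each model, in proportion to its prior weight
model_id <- sample(seq_along(simdata), nsamp, replace = TRUE, prob = pp0)
sampled <- do.call(rbind, lapply(seq_along(simdata), function(j) {
  d <- simdata[[j]]
  d[sample(nrow(d), sum(model_id == j), replace = TRUE), ]
}))
# fit a single GAM to the pooled sample; repeating gives a distribution of fitted curves
g <- mgcv::gam(covered ~ s(logitcpp), family = binomial, data = sampled)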
# Model stacking attempt using https://www.analyticsvidhya.com/blog/2017/02/introduction-to-ensembling-along-with-implementation-in-r/
####### ensemble
library('caret')

# Split into training and test sets (data[[1]] holds the simulated data points; the outcome
# 'cov' must be a factor, e.g. with levels c("N", "Y"), for classProbs and the $Y columns below to work)
index <- sample(dim(data[[1]])[1], 100000)
trainSet <- as.data.frame(data[[1]][index, ])
testSet <- as.data.frame(data[[1]][-index, ])

# Defining the training control
fitControl <- trainControl(
  method = "cv",
  number = 10,
  savePredictions = 'final', # to save out-of-fold predictions for the best parameter combination
  classProbs = T             # to save the class probabilities of the out-of-fold predictions
)

# Defining the predictors and outcome
predictors <- "logitcpp"
outcomeName <- "cov"

# Training the base GAM model
model_1 <- train(cov ~ logitcpp, data = trainSet, method = 'gam',
                 trControl = fitControl, tuneLength = 3)

# Out-of-fold predictions for the training data, reordered to match trainSet rows
trainSet$OOF_pred_1 <- model_1$pred$Y[order(model_1$pred$rowIndex)]

# Predicting probabilities for the test data
testSet$OOF_pred_1 <- predict(model_1, testSet["logitcpp"], type = 'prob')$Y

# Predictors for the top-layer model
predictors_top <- c('OOF_pred_1') #, 'OOF_pred_knn', 'OOF_pred_lr')

# GAM as the top-layer model (drop = FALSE keeps the single predictor as a data frame)
model_gam <- train(trainSet[, predictors_top, drop = FALSE], trainSet[, outcomeName],
                   method = 'gam', trControl = fitControl, tuneLength = 3)

# Predict from the GAM top-layer model on the test set
testSet$glm_stacked <- predict(model_gam, testSet[, predictors_top, drop = FALSE])
(`nsnp`) GAMs from `500*200` data points (`nrep*nsnp`):

- Compare ideas 1, 2 and 3 when the simulations finish.
- Update the package so that it includes a function that makes a new credible set with the desired coverage (likely parameters: `desired_cov, z, f, N0, N1, Sigma, W = 0.2`), using whichever idea wins.
- Update the package such that the corrected coverage function (parameters: `z, maf, N0, N1, Sigma, thr, W = 0.2`) uses the simple averaging method (rather than a GAM).
- Should I merge the above two functions to make a mega function that provides a new credible set but also reports the corrected coverage value of the credible set obtained if the original threshold had been used?