(See the just_ave directory on the HPC.)
Now that we know that a GAM is not being fitted, our method is very simple.
We estimate the true effect at the causal variant and then simulate many credible sets from the same distribution. We then find the proportion of these which contain the true causal variant. We do this for each variant considered causal and weight each proportion by that variant’s posterior probability.
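A minimal sketch of this weighting step, assuming we already have, for each variant \(j\) considered causal, a vector of 0/1 indicators recording whether each simulated credible set contained variant \(j\), and `pp0` the posterior probabilities of causality (object names here are illustrative, not the actual implementation):

# covered is a list, one element per variant considered causal, each a 0/1 vector over
# the simulated credible sets; pp0 is the corresponding vector of posterior probabilities
prop <- sapply(covered, mean)            # proportion of simulated sets containing the "causal" variant
corrected <- sum(prop * pp0) / sum(pp0)  # weight each proportion by that variant's posterior probability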
I am investigating how fast I can get this method. The slow step is forming credible sets. Chris has written a quicker function in C++ to do this, which I am now using.
The results below suggest that the correction doesn’t change much whether we use 1,000 (~55 secs) or 10,000 (~8 mins) simulated credible sets. Even the error bars don’t change much, which is somewhat surprising.
Aim: Obtain an estimated required threshold to use to get a credible set with the desired coverage.
We now have 200 GAMs, one for each SNP considered causal. Each GAM is built from `nrep*nsnp` data points of the logit(cumsum) of the pps and a binary covered indicator; note that each simulated credible set now contributes `nsnp` data points to the GAM fit, rather than 1. `nrep` is currently set to 700 for speed, resulting in `200*700=140000` data points to fit a GAM to. Many of these points sit at the top end, so I cut out those with `cumsum>0.999`, which substantially reduces the number of data points the GAM is fitted to.
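As a rough sketch of how one of these GAMs might be fitted (column names follow the notation used elsewhere in these notes, but the exact code is an assumption):

# d has one row per (simulated credible set, SNP) pair, with 'cpp' the cumulative sum of
# the ordered pps and 'covered' a 0/1 indicator of whether the CV is in the set at that point
library(mgcv)
d <- d[d$cpp <= 0.999, ]                 # cut out the pile-up of points with cumsum > 0.999
d$logitcpp <- log(d$cpp / (1 - d$cpp))   # logit(cumsum) of the pps
g <- mgcv::gam(covered ~ s(logitcpp), family = binomial, data = d)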
Sometimes a GAM cannot be fitted. For example, if the CV holds most of the posterior probability then it is always the first variant in the set, and so the `covered` column is all 1s. I have decided to discard these values as they only contribute one point, (1,1). Note that most of these are variants with very low pp0 values, so the result would be down-weighted significantly anyway.
We want to find the corresponding \(x\) value (size) for \(y=0.8\) (I am focusing on the case where users want a credible set with 80% coverage of the causal variant). To do this, I use the `approx` function in R as follows.
approx(x = gam$fitted, y = seq(0.0001, 0.999999, by = 0.01), xout = 0.8)$y
For each GAM, get the corresponding \(x\) coordinate for the desired coverage. Weight these by the pp and sum. This is what I am investigating in `idea1.R` in `backward_method` on the HPC.
If a GAM cannot be fitted to the 700 simulations for a SNP considered causal, then report NA for that SNP. Hence we weight the final threshold estimate as:
final <- sum(thr.est[!is.na(thr.est)] * pp0[!is.na(thr.est)]) / sum(pp0[!is.na(thr.est)])
Note that the denominator often equals 1 to within R’s floating point precision (the pps of the discarded variants are tiny).
Here, I have fixed the threshold at 0.8. The key outputs are:
- `orig_thr`: The original threshold (fixed at 0.8 for now)
- `orig_claim`: The original claimed coverage
- `orig_corr`: The original corrected coverage using `thr=0.8` and the original claimed coverage
- `orig_true_cov`: An empirical estimate of the true coverage (proportion of simulated credible sets with `thr=0.8` that contain the CV)
- `req_thr`: The required threshold found using our algorithm that is needed to get corrected coverage ~0.8. Would expect this to be \(>0.8\) for undercoverage and \(<0.8\) for overcoverage
- `new_claim`: The new claimed coverage of the credible set obtained using the new required threshold
- `new_corr`: The corrected coverage of the new credible set; would expect this to be approximately 0.8
- `new_true_cov`: An empirical estimate of the true coverage (proportion of simulated credible sets with `thr=req_thr` that contain the CV)
## N0 OR true.mu mu.est orig_thr orig_claim orig_corr
## 1: 5000 1.05 1.503374 2.037050 0.8 0.8011111 0.8590549
## 2: 5000 1.05 1.721908 1.777859 0.8 0.8013501 0.8475791
## 3: 5000 1.15 3.288080 4.134414 0.8 0.8516945 0.8816808
## 4: 5000 1.15 4.085090 3.932386 0.8 0.8022689 0.8855833
## 5: 20000 1.10 4.925089 4.882388 0.8 0.9830833 0.9925537
## 6: 40000 1.10 2.890369 3.723554 0.8 0.8067902 0.9158112
## orig_true_cov CV block req.thr new_claim new_corr
## 1: 0.7026 pos49799030 23 0.6786318 0.6832395 0.7481347
## 2: 0.6354 pos49308332 23 0.7117121 0.7118624 0.7666809
## 3: 0.8924 pos43095895 18 0.5635178 0.5981582 0.6699855
## 4: 0.9116 pos27788357 8 0.5569146 0.5905778 0.6950391
## 5: 0.9960 pos18114036 2 0.2802518 0.6400096 0.5965600
## 6: 0.9126 pos18065099 2 0.4570743 0.4745464 0.6258927
## new_true_cov max_pp X2nd_max
## 1: 0.5730 0.02712818 0.02112361
## 2: 0.5414 0.13389410 0.02998073
## 3: 0.7182 0.18530376 0.11038765
## 4: 0.7212 0.13946552 0.13497144
## 5: 0.5918 0.64000960 0.34307372
## 6: 0.6708 0.23727319 0.23727319
dim(y)
## [1] 280 16
dim(y[y$true.mu>5,])
## [1] 124 16
This method results in lots of credible sets with corrected/empirical coverage below the threshold. Ideally, our backward method wouldn’t lead to any credible sets with corrected/empirical coverage below the threshold, as users will not be happy if we replace their credible set with, say, 85% coverage with one with, say, 75% coverage.
BUT, let’s see what happens if we only look at those simulations where the true effect is greater than 5 (those likely to be considered for fine-mapping).
Consider an overall average method whereby we take the average of all the GAMs and predict from this. (See `/backward_method/idea2+3.R` on the HPC.)
Predict \(y\) values for the same \(x\) values (`x <- seq(0.0001, 0.999999, by = 0.01)`) from each of the GAMs. Average the predicted \(y\) values for each \(x\) value. These averaged \(y\) values and the corresponding \(x\) values form a new data set from which we can predict the required threshold.
BUT, this does not incorporate any weighting of the GAMs based on pps; a rough sketch of both the unweighted and pp-weighted versions is given below.
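A sketch of ideas 2 and 3 under these assumptions (`gams` is the list of per-SNP GAMs, `pp0` the corresponding posterior probabilities, and the GAMs use a `logitcpp` predictor; object names are illustrative):

x <- seq(0.0001, 0.999999, by = 0.01)
newdat <- data.frame(logitcpp = log(x / (1 - x)))
preds <- sapply(gams, function(g) predict(g, newdata = newdat, type = "response"))
y_avg <- rowMeans(preds)                      # idea 2: simple average of the GAM predictions
y_wt  <- as.vector(preds %*% pp0) / sum(pp0)  # idea 3: pp-weighted average
# read off the required threshold for 80% coverage from each averaged curve
req_thr_avg <- approx(x = y_avg, y = x, xout = 0.8)$y
req_thr_wt  <- approx(x = y_wt,  y = x, xout = 0.8)$y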
I ran simulations implementing ideas 2 and 3 above. The output is:
- `orig_thr`: The original threshold (fixed at 0.8 for now)
- `orig_claim`: The original claimed coverage
- `orig_corr`: The original corrected coverage using `thr=0.8` and the original claimed coverage
- `orig_true_cov`: An empirical estimate of the true coverage (proportion of simulated credible sets with `thr=0.8` that contain the CV)
- `req.thr`: The required threshold found using our algorithm that is needed to get corrected coverage ~0.8. Would expect this to be \(>0.8\) for undercoverage and \(<0.8\) for overcoverage (using the average GAM method, not normalised by pps)
- `new_claim`: The new claimed coverage of the credible set obtained using the new required threshold (using the average GAM method, not normalised by pps)
- `new_cs_corr`: The corrected coverage of the new credible set; would expect this to be approximately 0.8 (using the average GAM method, not normalised by pps)
- `new_true_cov`: An empirical estimate of the true coverage (proportion of simulated credible sets with `thr=req_thr` that contain the CV)
- `max_pp`: The maximum pp in the system
- `X2nd_max`: The second largest pp in the system
- `trueCV_GAM_reqthr`: The required threshold read off from the GAM built by considering the true CV as causal
- `maxpp_GAM_reqthr`: The required threshold read off from the GAM built by considering the variant with the maximum posterior probability as causal
- `maxz_GAM_reqthr`: The required threshold read off from the GAM built by considering the variant with the maximum absolute Z score as causal
- `req.thr.weighted`: The required threshold found using our WEIGHTED algorithm that is needed to get corrected coverage ~0.8
- `new_claim_weighted`: The new claimed coverage of the credible set obtained using the new required threshold
- `new_cs_cov_weighted`: The corrected coverage of the new credible set
- `new_true_cov_weighted`: An empirical estimate of the true coverage
## N0 OR true.mu mu.est orig_thr orig_claim orig_corr
## 1: 20000 1.1 5.902253 6.685043 0.8 0.9094746 0.8606052
## 2: 5000 1.2 6.428665 6.673968 0.8 0.9975304 0.9875717
## 3: 40000 1.2 17.944435 18.852497 0.8 0.9211821 0.9309324
## 4: 40000 1.1 6.108856 5.010136 0.8 0.8262557 0.8408722
## 5: 20000 1.2 5.579133 4.063874 0.8 0.8017712 0.8359460
## 6: 20000 1.2 12.803091 13.352036 0.8 0.9901787 0.9198079
## orig_true_cov CV block req.thr new_claim new_cs_cov
## 1: 0.8868333 pos46996137 21 0.7543082 0.9094746 0.8202562
## 2: 0.9831667 pos27447383 8 0.7347931 0.9975304 0.9894700
## 3: 0.9345000 pos27796846 8 0.7925602 0.9211821 0.9237940
## 4: 0.8081667 pos34802037 13 0.6852369 0.7258909 0.7352787
## 5: 0.8608333 pos46652063 21 0.6203681 0.6287460 0.6954543
## 6: 0.9103333 pos49337881 23 0.7831868 0.9901787 0.9148675
## new_true_cov req.thr.weighted new_claim_weighted new_cs_cov_weighted
## 1: 0.8413333 0.7894359 0.9094746 0.8515612
## 2: 0.9811667 0.5565362 0.9975304 0.9867905
## 3: 0.9316667 0.7833951 0.9211821 0.9252737
## 4: 0.6906667 0.7438827 0.7772831 0.7881810
## 5: 0.7713333 0.6928091 0.6950363 0.7576621
## 6: 0.9023333 0.8107345 0.9901787 0.9249987
## new_true_cov_weighted
## 1: 0.8755000
## 2: 0.9743333
## 3: 0.9275000
## 4: 0.7508333
## 5: 0.8130000
## 6: 0.9140000
dim(y)
## [1] 161 18
dim(y[y$true.mu>5,])
## [1] 121 18
LEFT: The first ‘original cs’ box plot shows the corrected coverage for the original credible sets; the next two show the corrected coverage values for the new credible sets obtained using the required threshold derived from averaging the GAMs (idea 2) and from averaging and weighting the GAMs (idea 3), respectively. I would expect the latter two to be closer to 0.8, as this was our aim.
RIGHT: As above, but the empirical estimate of the true coverage.
Idea 3 seems best because it has a smaller IQR and most points fall above the threshold. Idea 2 is problematic as it often produces credible sets with corrected coverage (and true coverage) below the threshold, which we really don’t want.
I am running more simulations to compare ideas 1, 2 and 3.
I have written a function, `new_cs`, which takes the parameters `desired_cov, z, f, N0, N1, Sigma, W = 0.2` and outputs a new credible set with approximately the desired coverage, the corrected coverage of this new credible set, and the new threshold used to obtain it. (In `backward_method/final_idea3/` on the HPC.)
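A hypothetical call might look like the following (the return element names are illustrative assumptions, not the actual structure): `z` are the marginal Z scores, `f` the minor allele frequencies, `N0`/`N1` the control/case sample sizes and `Sigma` the SNP correlation matrix.

res <- new_cs(desired_cov = 0.8, z = z, f = maf, N0 = 5000, N1 = 5000, Sigma = LD, W = 0.2)
res$credset   # the new credible set with approximately the desired coverage
res$corr_cov  # the corrected coverage of this new credible set
res$req_thr   # the new threshold used to obtain it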
Currently running more simulations to further investigate the accuracy of idea 3.
- Weight each model by its likelihood? Calculate the likelihood ratio of the most likely model against each of the others.
- Model stacking approaches.
- Could try a ‘permutation’ method whereby I sample points from each of the models based on some prior of how likely that model is, so I would sample more points from the models that assume a variant with high posterior probability is causal and fewer from those where an unlikely variant is considered causal. These sampled values could form a new data set to build a GAM from, repeating until I get a distribution of some sort (a rough sketch is given after this list).
- Try GLMs instead of GAMs. I have tried this, but it is no better (simulations in `\backward_method\logit\` on the HPC).
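A rough sketch of what the ‘permutation’ idea might look like (all object names are illustrative): `simdata` is the list of per-SNP GAM training sets and `pp0` the prior weight attached to each model.

library(mgcv)
nsamp <- 50000
# decide how many points to draw from each model, in proportion to its prior weight
model_id <- sample(seq_along(simdata), nsamp, replace = TRUE, prob = pp0)
sampled <- do.call(rbind, lapply(seq_along(simdata), function(j) {
  d <- simdata[[j]]
  d[sample(nrow(d), sum(model_id == j), replace = TRUE), ]
}))
# fit a single GAM to the pooled sample; repeating gives a distribution of fitted curves
g <- mgcv::gam(covered ~ s(logitcpp), family = binomial, data = sampled)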
# Model stacking attempt using https://www.analyticsvidhya.com/blog/2017/02/introduction-to-ensembling-along-with-implementation-in-r/
####### ensemble
library('caret')

# Split into training and test sets (data[[1]] holds the simulated data points; the outcome
# 'cov' must be a factor, e.g. with levels c("N", "Y"), for classProbs and the $Y columns below to work)
index <- sample(dim(data[[1]])[1], 100000)
trainSet <- as.data.frame(data[[1]][index, ])
testSet <- as.data.frame(data[[1]][-index, ])

# Defining the training control
fitControl <- trainControl(
  method = "cv",
  number = 10,
  savePredictions = 'final', # to save out-of-fold predictions for the best parameter combination
  classProbs = T             # to save the class probabilities of the out-of-fold predictions
)

# Defining the predictors and outcome
predictors <- "logitcpp"
outcomeName <- "cov"

# Training the base GAM model
model_1 <- train(cov ~ logitcpp, data = trainSet, method = 'gam',
                 trControl = fitControl, tuneLength = 3)

# Out-of-fold predictions for the training data, reordered to match trainSet rows
trainSet$OOF_pred_1 <- model_1$pred$Y[order(model_1$pred$rowIndex)]

# Predicting probabilities for the test data
testSet$OOF_pred_1 <- predict(model_1, testSet["logitcpp"], type = 'prob')$Y

# Predictors for the top-layer model
predictors_top <- c('OOF_pred_1') #, 'OOF_pred_knn', 'OOF_pred_lr')

# GAM as the top-layer model (drop = FALSE keeps the single predictor as a data frame)
model_gam <- train(trainSet[, predictors_top, drop = FALSE], trainSet[, outcomeName],
                   method = 'gam', trControl = fitControl, tuneLength = 3)

# Predict from the GAM top-layer model on the test set
testSet$glm_stacked <- predict(model_gam, testSet[, predictors_top, drop = FALSE])
(`nsnp`) GAMs from `500*200` data points (`nrep*nsnp`):

- Compare ideas 1, 2 and 3 when the simulations finish.
- Update the package so that it includes a function that makes a new credible set with the desired coverage (likely parameters: `desired_cov, z, f, N0, N1, Sigma, W = 0.2`), using whichever idea wins.
- Update the package such that the corrected coverage function (parameters: `z, maf, N0, N1, Sigma, thr, W = 0.2`) uses the simple averaging method (rather than a GAM).
- Should I merge the above two functions to make a mega function that provides a new credible set but also reports the corrected coverage value of the credible set obtained if the original threshold had been used?