Apply our method to the 44 regions in table 1.
For the new credible set method, accuracy was set to 0.0001 and maximum iterations to 50.
Only able to do 38/44 regions:
Cautious about row 35 (SNP rs917911).
rs2476601 (second to last row) has OR = 1.89 and \(p<10^{-100}\). Notice that the corrected coverage of the new credible set is only 0.94; this is because the top two SNPs have posterior probabilities (pps) of 0.8640309 (rs2476601) and 0.1359691 (rs6679677), which sum to 1, while the remaining SNPs have pps on the order of \(10^{-90}\). The maximum corrected coverage you can get with a credible set containing just these two variants is ~95%, and our method cannot add any more variants to increase this coverage because of R's floating-point precision and how long the algorithm would take to reach these variants. E.g. after 50 iterations it got to "thr: 0.999999999883585 , cov: 0.926728061790903".
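A minimal sketch of the precision point: the first two pps account for essentially all of the probability mass, and a threshold close enough to 1 to force a \(10^{-90}\)-sized variant into the set is not representable in double precision.
pp <- c(0.8640309, 0.1359691, rep(1e-90, 198))
cumsum(pp)[1:3]   # ~0.864, ~1, ~1: the remaining pps barely move the cumulative sum
(1 - 1e-90) == 1  # TRUE: a threshold within 1e-90 of 1 cannot be represented as a double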
Their credible sets are different from mine because they combine SIBs and trio data. It is interesting that mine are almost always smaller. For this reason, in my paper I will compare my credible sets obtained using the standard Bayesian method with those obtained using the new required threshold method.
Investigating the median size of the credible set (as they do in the paper), it seems that applying my correction actually increases the size of the credible set - suggesting that there are more instances where the standard Bayesian approach has undercoverage (does not actually reach 0.99) - but this might just be because there is 'more room' below 99% than above it. Also, when we do add variants to the set we often add quite a lot relative to how many we remove when shrinking it (because coverage can fall further below 99% than it can rise above it).
I have added notch = TRUE to the boxplots, which provides roughly a 95% confidence interval about the median (\(\text{median} \pm 1.58 \times IQR/\sqrt{n}\)). If the notches of two groups do not overlap, this suggests that their medians differ significantly.
“The upper whisker extends from the hinge to the largest value no further than \(1.5 \times IQR\) from the hinge. The lower whisker extends from the hinge to the smallest value at most \(1.5 \times IQR\) of the hinge.”
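For reference, a minimal sketch of the kind of notched boxplot call used (simulated placeholder data, not the T1D results):
set.seed(1)
res <- data.frame(method = rep(c("standard", "corrected"), each = 38),
                  nvar   = c(rpois(38, 20), rpois(38, 25)))
boxplot(nvar ~ method, data = res, notch = TRUE,
        ylab = "Number of variants in credible set")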
These all look promising. It seems that our correction method performs well at 'pulling down' the coverage of the credible set to what we desire, thereby removing variants and adding uncertainty (so why is our median nvar higher in the T1D data?).
In particular, our method decreases the variability between the relative error of estimates.
To convince myself that there is no further way to decrease the variability within my method, I compare simulations using 1000 replications with simulations using 10,000 replications. There is almost no difference in the results and the variability does not decrease with more replicates, so I stick with 1000 replications (~5 secs).
Does the method help at all if the single-CV assumption fails? To investigate this, I simulate systems with 2 CVs and find the coverage estimates for at least 1 CV being in the credible set. As usual, the claimed coverage is the size of the credible set, but the corrected coverage is now the proportion of simulations for which at least 1 CV is in the set.
muhat is calculated in the normal way, using the est_mu(z0, maf, NN, NN) function. Note: this provides one muhat estimate - should there be one for each CV? The estimate is usually closer to that of the CV with the largest \(Z\)-score.

I use the CV2.credset function, which outputs list(credset = o[1:wh], claimed.cov = size, covered = any(CV %in% o[1:wh]), nvar = wh, each.contained = CV %in% o[1:wh]).

The corrected coverage estimate involves assuming each CV is causal in turn (NOT each combination of 2 CVs being causal - far too many to try; for 200 SNPs this would mean 19,900 combinations!) - so I am not sure how effective this will be.
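For reference, a rough sketch of what a CV2.credset-style function does, reconstructed from the output list above; the internals shown are my assumption rather than the exact implementation.
CV2.credset.sketch <- function(pp, CV, thr = 0.99) {
  o <- order(pp, decreasing = TRUE)      # variants ranked by posterior probability
  cpp <- cumsum(pp[o])
  wh <- which(cpp > thr)[1]              # number of variants needed to exceed thr
  list(credset = o[1:wh],
       claimed.cov = cpp[wh],            # the 'size' of the set
       covered = any(CV %in% o[1:wh]),   # at least one of the 2 CVs captured
       nvar = wh,
       each.contained = CV %in% o[1:wh]) # which of the 2 CVs are captured
}
The empirical "at least 1 CV" coverage is then the mean of covered across simulated systems.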
It is promising that our method still seems to work when the 1 CV assumption is dropped, because in the real world this assumption does not often hold - so our method still has some credibility.
However, it seems that there are more cases where the claimed coverage (and our correction) are higher than the true coverage (relative error > 0), leading to non-conservative estimates which we were trying to avoid.
The plot below shows that the est_mu function estimates the value of \(\mu\) at the CV with the biggest effect (highest true mu), i.e. muhat estimates the largest true mu.
The relative error is the true mu at either the CV with the maximum true mu (CV_max_mu) or the CV with the minimum true mu (CV_min_mu) minus the muhat estimate.
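A tiny made-up illustration of this calculation (the numbers are hypothetical, just to show what is being plotted):
true.mu <- c(CV_max_mu = 5.2, CV_min_mu = 3.4)   # hypothetical true mu at the two CVs
muhat <- 5.0                                     # the single estimate from est_mu
true.mu - muhat                                  # errors w.r.t. each CV; muhat tracks CV_max_mu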
Perhaps we should investigate the role of MAF and LD. I fix an LD matrix, the MAFs, the OR (1.1 for both CVs) and NN (10000). The SNP correlation matrix looks like this (recall that we chose 100 SNPs at either end of the chromosome). Perhaps this has something to do with why our correction seems to perform well.
I consider five scenarios with 2 CVs:
CVs in different blocks, both with low MAFs: maf1 = 0.1556467, maf2 = 0.1567046, LD = -0.0094
CVs in different blocks, both with high MAFs: maf1 = 0.6034118, maf2 = 0.6085692, LD = -0.01213
CVs in different blocks, one high and one low MAF: maf1 = 0.8501719, maf2 = 0.13885216, LD = 0.00677
CVs in the same block, high MAFs and high LD: maf1 = 0.8501719, maf2 = 0.8508331, LD = 0.993243
CVs in the same block, low MAFs and low LD: maf1 = 0.1697964, maf2 = 0.1699286, LD = -0.2027
I run the script 100 times for each scenario and average the claimed and corrected coverage values.
We see that the correction factor works well when the 2 CVs come from different LD structures (the two squares in the correlation plot above). To investigate the role of the LD structure in our correction factor, I run two more simulations:
Re-run T1D with full LD data (so not cutting out SNPs) and position info for missing SNPs.
Think about attaching some uncertainty - even just comparing how accurate the corrected coverage is relative to the claimed coverage? We could bootstrap (resample with replacement and re-do everything).
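A minimal sketch of that bootstrap idea, using a placeholder covered indicator (1 if the CV was in the set for a given simulated system; not real results):
set.seed(1)
covered <- rbinom(1000, 1, 0.99)                            # placeholder data
boot.cov <- replicate(1000, mean(sample(covered, replace = TRUE)))
quantile(boot.cov, c(0.025, 0.975))                         # rough 95% interval for the coverage estimate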
In the original credset function, we find the number of variants needed such that their posterior probabilities exceed the threshold. Hence, it fails for thr = 1. However, in the credsetC function (utilised in the corrected_cs function to find the required threshold), thr = 1 is allowed. Why?
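A sketch of the mechanism I suspect is behind this, assuming the R credset function uses a strict inequality on the cumulative pps (the exact internals may differ):
pp <- rep(1/8, 8)             # eight equal pps summing to exactly 1
which(cumsum(pp) > 0.99)[1]   # 8: fine for any thr < 1
which(cumsum(pp) > 1)[1]      # NA: the cumulative sum never strictly exceeds 1, so the set is undefined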
#include <Rcpp.h>
#include <algorithm>
using namespace Rcpp;

// For each row of pp (one simulated system per row), build the smallest credible
// set of variants whose summed posterior probabilities exceed `threshold`, and
// record its size (nvar), its summed pp (claimed coverage) and whether the CV is in it.
// [[Rcpp::export]]
List credsetmat(NumericMatrix pp, NumericVector iCV, double threshold) {
  int n = pp.nrow();
  int m = pp.ncol();
  NumericVector nvar(n);     // number of variants in each credible set
  NumericVector setsize(n);  // summed pp (claimed coverage) of each set
  NumericVector covered(n);  // 1 if the CV makes it into the set; starts at 0 (CV not in empty set)
  for (int rep = 0; rep < n; rep++) {
    int iiCV = iCV(rep) - 1;            // convert R's 1-based CV index to 0-based
    double csum = 0.0;
    NumericVector ppv = pp(rep, _);     // posterior probabilities for this simulated system
    IntegerVector V = seq(0, m - 1);    // variant indices 0..m-1
    std::sort(V.begin(), V.end(),       // sort indices by decreasing posterior probability
              [&](int a, int b) { return ppv(a) > ppv(b); });
    for (int i = 0; i < m; i++) {
      csum += ppv(V(i));
      if (iiCV == V(i))
        covered(rep) = 1;
      if (csum > threshold) {           // stop once the cumulative pp exceeds the threshold
        nvar(rep) = i + 1;
        setsize(rep) = csum;
        break;
      }
    }
    // if csum never strictly exceeds threshold (e.g. threshold = 1 when the pps sum
    // to exactly 1), nvar and setsize are left at 0 rather than an error being thrown
  }
  return List::create(
    _["nvar"]    = nvar,
    _["claimed"] = setsize,
    _["covered"] = covered
  );
}
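A hypothetical usage sketch: compile the function above with Rcpp (the file name is a placeholder) and call it on a small matrix of simulated pps, one row per simulated system.
library(Rcpp)
sourceCpp("credsetmat.cpp")             # assumes the C++ above is saved in this file
pp <- matrix(rexp(5 * 100), nrow = 5)   # 5 simulated systems of 100 SNPs
pp <- pp / rowSums(pp)                  # normalise each row to sum to 1
credsetmat(pp, iCV = rep(1, 5), threshold = 0.95)   # nvar, claimed and covered per row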
Why is W commonly a fixed value, rather than a vector of values for each SNP?
Should I make just one key function in the package which has optional arguments? E.g. users could specify Z-scores and MAFs, or betas and SEs, rather than using two separate functions?
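A rough sketch of what such a single interface could look like (names and logic hypothetical), converting betas and SEs to Z-scores internally:
corrcov <- function(z = NULL, maf = NULL, beta = NULL, se = NULL, ...) {
  if (is.null(z)) {
    if (is.null(beta) || is.null(se)) stop("supply either z and maf, or beta and se")
    z <- beta / se   # convert effect estimates and standard errors to Z-scores
  }
  # ... then carry on with the Z-score and MAF based method
  z
}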
Does our estimate of mu “estimate the effect size at the causal SNP”/“estimate the true genetic effect at the causal variant”?
Does it matter that in all my simulations CVs are randomly chosen, i.e. not with \(maf>0.1\) (representing SNPs)?
Look further into the corrected coverage estimate for 2 CVs.
Check that the method (1 CV) still works when we only consider the first 100 SNPs in our system (i.e. not two sets of SNPs from opposite sides of the chromosome).