Apply our method to the 44 regions in table 1.
For the new credible set method, accuracy was set to 0.0001 and maximum iterations to 50.
Only able to do 38/44 regions:
Cautious about row 35 (SNP rs917911).
rs2476601 (second to last row) has OR = 1.89 and \(p<10^{-100}\). Notice that the corrected coverage of the new credible set is only 0.94; this is because the top two SNPs have posterior probabilities (pps) of 0.8640309 (rs2476601) and 0.1359691 (rs6679677), which sum to 1, while the remaining SNPs have pps on the order of \(10^{-90}\). The maximum corrected coverage you can get with a credible set containing just these two variants is ~95%, and our method cannot add any more variants to increase this coverage because of R's floating-point precision and how long the algorithm would take to reach these variants. E.g. after 50 iterations it got to "thr: 0.999999999883585 , cov: 0.926728061790903".
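A minimal sketch of the precision point: the first two pps account for essentially all of the probability mass, and a threshold close enough to 1 to force a \(10^{-90}\)-sized variant into the set is not representable in double precision.
pp <- c(0.8640309, 0.1359691, rep(1e-90, 198))
cumsum(pp)[1:3]   # ~0.864, ~1, ~1: the remaining pps barely move the cumulative sum
(1 - 1e-90) == 1  # TRUE: a threshold within 1e-90 of 1 cannot be represented as a double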
Their credible sets are different from mine because they combine SIBs and trio data. It is interesting that mine are almost always smaller. For this reason, in my paper I will compare my credible sets obtained using the standard Bayesian method with those obtained using the new required threshold method.
Investigating the median size of the credible set (as they do in the paper), it seems that applying my correction actually increases the size of the credible set - suggesting that there are more instances where the standard Bayesian approach has undercoverage (does not actually reach 0.99) - but this might just be because there is 'more room' below 99% than above it. Also, when we do add variants to the set we often add quite a lot relative to how many we remove when shrinking it (because coverage can fall further below 99% than it can rise above it).
I have added notch = TRUE to the boxplots, which provides roughly a 95% confidence interval about the median (\(\text{median} \pm 1.58 \times IQR/\sqrt{n}\)). If the notches of two groups do not overlap, this suggests that their medians differ significantly.
“The upper whisker extends from the hinge to the largest value no further than \(1.5 \times IQR\) from the hinge. The lower whisker extends from the hinge to the smallest value at most \(1.5 \times IQR\) of the hinge.”
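For reference, a minimal sketch of the kind of notched boxplot call used (simulated placeholder data, not the T1D results):
set.seed(1)
res <- data.frame(method = rep(c("standard", "corrected"), each = 38),
                  nvar   = c(rpois(38, 20), rpois(38, 25)))
boxplot(nvar ~ method, data = res, notch = TRUE,
        ylab = "Number of variants in credible set")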
These all look promising. It seems that our correction method performs well at 'pulling down' the coverage of the credible set to what we desire, thereby removing variants and adding uncertainty (so why is our median nvar higher in the T1D data?).
In particular, our method decreases the variability between the relative error of estimates.
To convince myself that there is no further way to decrease the variability within my method, I compare simulations using 1000 replications with simulations using 10,000 replications. There is almost no difference in the results and the variability does not decrease with more replicates, so I stick with 1000 replications (~5 secs).
Does the method help at all if the single-CV assumption fails? To investigate this, I simulate systems with 2 CVs and find the coverage estimates for at least 1 CV being in the credible set. As usual, the claimed coverage is the size of the credible set, but the corrected coverage is now the proportion of simulations for which at least 1 CV is in the set.
muhat is calculated in the normal way, using the est_mu(z0, maf, NN, NN) function. Note: this provides one muhat estimate - should there be one for each CV? The estimate is usually closer to that of the CV with the largest \(Z\)-score.

I use the CV2.credset function, which outputs list(credset = o[1:wh], claimed.cov = size, covered = any(CV %in% o[1:wh]), nvar = wh, each.contained = CV %in% o[1:wh]).

The corrected coverage estimate involves assuming each CV is causal in turn (NOT each combination of 2 CVs being causal - far too many to try; for 200 SNPs this would mean 19,900 combinations!) - so I am not sure how effective this will be.
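For reference, a rough sketch of what a CV2.credset-style function does, reconstructed from the output list above; the internals shown are my assumption rather than the exact implementation.
CV2.credset.sketch <- function(pp, CV, thr = 0.99) {
  o <- order(pp, decreasing = TRUE)      # variants ranked by posterior probability
  cpp <- cumsum(pp[o])
  wh <- which(cpp > thr)[1]              # number of variants needed to exceed thr
  list(credset = o[1:wh],
       claimed.cov = cpp[wh],            # the 'size' of the set
       covered = any(CV %in% o[1:wh]),   # at least one of the 2 CVs captured
       nvar = wh,
       each.contained = CV %in% o[1:wh]) # which of the 2 CVs are captured
}
The empirical "at least 1 CV" coverage is then the mean of covered across simulated systems.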
It is promising that our method still seems to work when the 1 CV assumption is dropped, because in the real world this assumption does not often hold - so our method still has some credibility.
However, it seems that there are more cases where the claimed coverage (and our correction) are higher than the true coverage (relative error > 0), leading to non-conservative estimates which we were trying to avoid.
The plot below shows that the est_mu function estimates the value of \(\mu\) at the CV with the biggest effect (highest true mu), i.e. muhat estimates the largest true mu.
The relative error is the true mu at either the CV with the maximum true mu (CV_max_mu) or the CV with the minimum true mu (CV_min_mu) minus the muhat estimate.
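A tiny made-up illustration of this calculation (the numbers are hypothetical, just to show what is being plotted):
true.mu <- c(CV_max_mu = 5.2, CV_min_mu = 3.4)   # hypothetical true mu at the two CVs
muhat <- 5.0                                     # the single estimate from est_mu
true.mu - muhat                                  # errors w.r.t. each CV; muhat tracks CV_max_mu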
Perhaps we should investigate the role of MAF and LD. I fix an LD matrix, the MAFs, the OR (1.1 for both CVs) and NN (10000). The SNP correlation matrix looks like this (recall that we chose 100 SNPs at either end of the chromosome). Perhaps this has something to do with why our correction seems to perform well.
I consider five scenarios with 2 CVs:
CVs in different blocks, both with low MAFs: maf1 = 0.1556467, maf2 = 0.1567046, LD = -0.0094
CVs in different blocks, both with high MAFs: maf1 = 0.6034118, maf2 = 0.6085692, LD = -0.01213
CVs in different blocks, one high and one low MAF: maf1 = 0.8501719, maf2 = 0.13885216, LD = 0.00677
CVs in the same block, high MAFs and high LD: maf1 = 0.8501719, maf2 = 0.8508331, LD = 0.993243
CVs in the same block, low MAFs and low LD: maf1 = 0.1697964, maf2 = 0.1699286, LD = -0.2027
I run the script 100 times for each scenario and average the claimed and corrected coverage values.
We see that the correction factor works well when the 2 CVs come from different LD structures (the two squares in the correlation plot above). To investigate the role of the LD structure in our correction factor, I run two more simulations:
Re-run T1D with full LD data (so not cutting out SNPs) and position info for missing SNPs.
Think about attaching some uncertainty - even just comparing how accurate the corrected coverage is relative to the claimed coverage? We could bootstrap (resample with replacement and re-do everything).
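A minimal sketch of that bootstrap idea, using a placeholder covered indicator (1 if the CV was in the set for a given simulated system; not real results):
set.seed(1)
covered <- rbinom(1000, 1, 0.99)                            # placeholder data
boot.cov <- replicate(1000, mean(sample(covered, replace = TRUE)))
quantile(boot.cov, c(0.025, 0.975))                         # rough 95% interval for the coverage estimate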
In the original credset function, we find the number of variants needed such that their posterior probabilities exceed the threshold. Hence, it fails for thr = 1. However, in the credsetC function (utilised in the corrected_cs function to find the required threshold), thr = 1 is allowed. Why?
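A sketch of the mechanism I suspect is behind this, assuming the R credset function uses a strict inequality on the cumulative pps (the exact internals may differ):
pp <- rep(1/8, 8)             # eight equal pps summing to exactly 1
which(cumsum(pp) > 0.99)[1]   # 8: fine for any thr < 1
which(cumsum(pp) > 1)[1]      # NA: the cumulative sum never strictly exceeds 1, so the set is undefined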
#include <Rcpp.h>
#include <algorithm>
using namespace Rcpp;

// For each row of pp (one simulated system per row), build the smallest credible
// set of variants whose summed posterior probabilities exceed `threshold`, and
// record its size (nvar), its summed pp (claimed coverage) and whether the CV is in it.
// [[Rcpp::export]]
List credsetmat(NumericMatrix pp, NumericVector iCV, double threshold) {
  int n = pp.nrow();
  int m = pp.ncol();
  NumericVector nvar(n);     // number of variants in each credible set
  NumericVector setsize(n);  // summed pp (claimed coverage) of each set
  NumericVector covered(n);  // 1 if the CV makes it into the set; starts at 0 (CV not in empty set)
  for (int rep = 0; rep < n; rep++) {
    int iiCV = iCV(rep) - 1;            // convert R's 1-based CV index to 0-based
    double csum = 0.0;
    NumericVector ppv = pp(rep, _);     // posterior probabilities for this simulated system
    IntegerVector V = seq(0, m - 1);    // variant indices 0..m-1
    std::sort(V.begin(), V.end(),       // sort indices by decreasing posterior probability
              [&](int a, int b) { return ppv(a) > ppv(b); });
    for (int i = 0; i < m; i++) {
      csum += ppv(V(i));
      if (iiCV == V(i))
        covered(rep) = 1;
      if (csum > threshold) {           // stop once the cumulative pp exceeds the threshold
        nvar(rep) = i + 1;
        setsize(rep) = csum;
        break;
      }
    }
    // if csum never strictly exceeds threshold (e.g. threshold = 1 when the pps sum
    // to exactly 1), nvar and setsize are left at 0 rather than an error being thrown
  }
  return List::create(
    _["nvar"]    = nvar,
    _["claimed"] = setsize,
    _["covered"] = covered
  );
}
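A hypothetical usage sketch: compile the function above with Rcpp (the file name is a placeholder) and call it on a small matrix of simulated pps, one row per simulated system.
library(Rcpp)
sourceCpp("credsetmat.cpp")             # assumes the C++ above is saved in this file
pp <- matrix(rexp(5 * 100), nrow = 5)   # 5 simulated systems of 100 SNPs
pp <- pp / rowSums(pp)                  # normalise each row to sum to 1
credsetmat(pp, iCV = rep(1, 5), threshold = 0.95)   # nvar, claimed and covered per row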
Why is W commonly a fixed value, rather than a vector of values for each SNP?
Should I make just one key function in the package which has optional arguments? E.g. users could specify Z-scores and MAFs, or betas and SEs, rather than using two separate functions?
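A rough sketch of what such a single interface could look like (names and logic hypothetical), converting betas and SEs to Z-scores internally:
corrcov <- function(z = NULL, maf = NULL, beta = NULL, se = NULL, ...) {
  if (is.null(z)) {
    if (is.null(beta) || is.null(se)) stop("supply either z and maf, or beta and se")
    z <- beta / se   # convert effect estimates and standard errors to Z-scores
  }
  # ... then carry on with the Z-score and MAF based method
  z
}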
Does our estimate of mu “estimate the effect size at the causal SNP”/“estimate the true genetic effect at the causal variant”?
Does it matter that in all my simulations CVs are randomly chosen, i.e. not with \(maf>0.1\) (representing SNPs)?
Look further into the corrected coverage estimate for 2 CVs.
Check that the method (1 CV) still works when we only consider the first 100 SNPs in our system (i.e. not two sets of SNPs from opposite sides of the chromosome).