Aim: Obtain a required threshold (\(x\)) from a desired coverage (\(y\)).
Idea 1: Obtain the predicted threshold from each GAM (logit(covered) ~ logit(cumsum)), normalise each with pp and sum.
Idea 2: Predict \(y\) for \(x \in (0,1)\) from each GAM, average these predictions and use them to predict the required threshold.
Idea 3: As above, but normalise each prediction with pp (a sketch is given below).
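A minimal sketch of Idea 3, assuming `gams` is a list of fitted mgcv GAMs (one per SNP considered causal, each of the form logit(covered) ~ s(logit(cumsum)) with the predictor named logit.cumsum) and `pp` is the vector of posterior probabilities used as weights; these names are placeholders rather than the objects used above:
library(mgcv)
idea3_threshold <- function(gams, pp, desired.cov = 0.8) {
  grid <- seq(0.01, 0.99, by = 0.01)                 # candidate thresholds (x)
  newdat <- data.frame(logit.cumsum = qlogis(grid))  # predict on the logit scale
  # predicted coverage curve (y) from each GAM, back-transformed to (0,1)
  curves <- sapply(gams, function(g) plogis(predict(g, newdata = newdat)))
  ave <- as.vector(curves %*% (pp / sum(pp)))        # pp-weighted average curve
  grid[which(ave >= desired.cov)[1]]                 # first threshold reaching the desired coverage
}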
The plots above show that these methods do bring the coverage of the new credible sets closer to 0.8. Idea 2 can be ruled out as it results in too many credible sets with coverage below the desired value.
Idea 3 only:
Apply correction factor repeatedly with different thresholds to find the required threshold to obtain the desired corrected coverage.
The slow step is simulating posterior probabilities. This does not need to be done every time we call the function. In fact, we now simulate the pps FIRST and then input them (a list of matrices) into the quick_corrcov function, along with the threshold and the original posterior probabilities. This function uses the C++ credsetmat function to quickly form credible sets and finds the proportion of these containing the CV, weighted by the posterior probabilities. This is VERY quick (~4 secs for 1000 simulations for each SNP considered causal = 2e+05 credible sets).
The quick_corrcov function takes approximately 4 seconds to report the corrected coverage estimate of the credible set, using a specified threshold.
## "if you use this threshold, the resultant credible set has this corrected coverage"
quick_corrcov <- function(thr, simulated.pps, pp){
n_pps <- length(simulated.pps)
args <- 1:nsnps
d5 <- lapply(1:n_pps, function(x) {
credset3(simulated.pps[[x]], CV = rep(args[x], dim(simulated.pps[[x]])[1]), thr = thr)
})
prop_cov <- lapply(d5, pred_na) %>% unlist()
sum(prop_cov * pp)
}
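For example, with the simulated pps computed once up front (the pps list of matrices used later in these notes) and the original posterior probabilities pp0, different thresholds can then be compared cheaply:
quick_corrcov(0.8, pps, pp0)
quick_corrcov(0.9, pps, pp0)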
The quick_corrcov_cs function takes approximately 4 seconds to report a list of the corrected coverage estimate, the new credible set, the threshold used and the size.
## "if you use this threshold, the resultant credible set is this and has this corrected coverage"
quick_corrcov_cs <- function(thr, simulated.pps, pp){
n_pps <- length(simulated.pps)
args <- 1:nsnps
d5 <- lapply(1:n_pps, function(x) {
credset3(simulated.pps[[x]], CV = rep(args[x], dim(simulated.pps[[x]])[1]), thr = thr)
})
prop_cov <- lapply(d5, pred_na) %>% unlist()
corr_cov <- sum(prop_cov * pp)
o <- order(pp, decreasing = TRUE)
cumpp <- cumsum(pp[o])
wh <- which(cumpp > thr)[1]
list(credset = names(pp)[o[1:wh]], corr.cov = corr_cov, thr = thr, size = cumpp[wh])
}
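For example, quick_corrcov_cs(0.9, pps, pp0) returns the credible set formed at threshold 0.9 together with its corrected coverage, the threshold used and its size (again assuming the pps and pp0 objects used in these notes).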
BUT the aim of all of these 'ideas' is to provide the user with the threshold they should be using to obtain a credible set with the desired coverage. Of course, they could keep running quick_corrcov_cs above until the corrected coverage is where they want it to be, but we need to automate this.
We could use the following function to obtain a credible set with corrected coverage within some accuracy of the desired coverage. This uses C++ code to generate the credible sets from the simulated pps; do we really need to transfer the whole thing to C? We also need to decide when to break out of the function, i.e. if it is impossible to get within this accuracy because one CV holds all the pp. A counter is included for this; note that it must be incremented before each next, otherwise the loop never stops (this was why it wasn't stopping after 4 goes).
new_cs <- function(desired.cov, acc = 0.05, simulated.pps, pp){
  corr.cov <- quick_corrcov(desired.cov, simulated.pps, pp)
  is_within_tolerance <- FALSE
  thr1 <- desired.cov
  counter <- 1
  while (!is_within_tolerance && counter < 4) { # give up after 4 attempts
    if(dplyr::between(corr.cov, desired.cov, desired.cov + acc)){
      is_within_tolerance <- TRUE
    }
    # if corr.cov is too high
    if(corr.cov > desired.cov + acc){ # then run again with lower threshold (remove variants)
      thr1 <- thr1 - (corr.cov - desired.cov)
      corr.cov <- quick_corrcov(thr1, simulated.pps, pp)
      counter <- counter + 1 # increment before next, otherwise the loop never stops
      next
    }
    # if corr.cov is too low
    if(corr.cov < desired.cov){ # then run again with higher threshold (add variants)
      thr1 <- thr1 + (desired.cov - corr.cov)
      corr.cov <- quick_corrcov(thr1, simulated.pps, pp)
      counter <- counter + 1
      next
    }
  }
  o <- order(pp, decreasing = TRUE) # order index for true pp
  cumpp <- cumsum(pp[o])            # cumulative sums of the ordered pps
  wh <- which(cumpp > thr1)[1]      # how many variants needed to exceed thr
  list(credset = o[1:wh], corr.cov = corr.cov, thr = thr1, counter = counter)
}
new_cs(0.8, 0.05, pps, pp0)
A better (more robust) approach is the method below, which systematically removes/adds variants from the credible set until we have a new credible set with corrected coverage within some accuracy of the desired coverage.
Odd scenarios I had to consider are handled by the warnings in the function: a credible set containing only one variant (which cannot be made any smaller), and the case where even the smallest credible set with corrected coverage above the desired coverage overshoots the accuracy band.
desiredcov_cs <- function(desired.cov, acc = 0.05, simulated.pps, pp){
  n_pps <- length(simulated.pps)
  args <- 1:length(pp)
  thr <- desired.cov
  d5 <- lapply(1:n_pps, function(x) {
    credset3(simulated.pps[[x]], CV = rep(args[x], dim(simulated.pps[[x]])[1]), thr = thr)
  })
  prop_cov <- lapply(d5, pred_na) %>% unlist()
  corr_cov <- sum(prop_cov * pp)
  o <- order(pp, decreasing = TRUE)
  cumpp <- cumsum(pp[o])
  wh <- which(cumpp > thr)[1] # number of variants in the initial credible set
  is_within_tolerance <- FALSE
  while (!is_within_tolerance) {
    if(dplyr::between(corr_cov, desired.cov, desired.cov + acc)){
      is_within_tolerance <- TRUE
    }
    if(wh == 1){
      warning("Only 1 variant in the credible set, so can't make it any smaller")
      is_within_tolerance <- TRUE
    }
    # if corr_cov is too high
    if(wh != 1 & corr_cov > desired.cov + acc){
      wh <- wh - 1 # remove a variant from the credible set
      # new thr: midway between the cumulative pp of wh-1 and wh variants (taking the cumulative pp of 0 variants as 0),
      # so that the new credible set contains exactly wh variants
      thr <- (c(0, cumpp)[wh] + cumpp[wh])/2
      corr_cov <- quick_corrcov(thr, simulated.pps, pp)
      next
    }
    # if corr_cov is too low
    if(wh != 1 & corr_cov < desired.cov){
      wh <- wh + 1 # add a variant to the credible set
      thr <- (cumpp[wh - 1] + cumpp[wh])/2 # new thr: midway between the cumulative pp of wh-1 and wh variants
      corr_cov <- quick_corrcov(thr, simulated.pps, pp)
      if(corr_cov > desired.cov + acc){
        warning("This is the smallest credible set with corrected coverage above the desired coverage")
        is_within_tolerance <- TRUE
      }
      next
    }
  }
  list(credset = names(pp)[o[1:wh]], corr_cov = corr_cov, thr = thr, size = cumpp[wh], nvar = wh)
}
I've tried this with several datasets and it looks really promising: very quick and accurate.
Note that this function can be used to find the smallest possible credible set with corrected coverage greater than the desired coverage, by setting acc to a very small value.
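For example (a sketch, again assuming the pps and pp0 objects used above):
desiredcov_cs(0.8, acc = 1e-4, pps, pp0) # smallest credible set with corrected coverage above 0.8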
It appears that there isn't much difference between the choices of nrep. I further investigate a region where there doesn't seem to be a smooth gradient (e.g. OR = 1.1, thr = 0.99, N0 = N1 = 5000) by running many more simulations from this region (this is in just_ave/check_nrep/ on the HPC) and find that there is a smooth gradient and that nrep = 1000 is best. I use nrep = 1000 going forward as it doesn't take long and seems the best.
The following C++ function reports the cumsum of a vector of pps:
Rcpp::cppFunction('List f2(NumericVector pp) {
  int n = pp.size();
  NumericVector out(n);
  List ret(1);
  out[0] = pp[0];
  // cumulative sum of the posterior probabilities
  for(int i = 1; i < n; ++i) {
    out[i] = out[i - 1] + pp[i];
  }
  ret[0] = out;
  return(ret);
}')
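As a quick sanity check (hypothetical input), f2(c(0.1, 0.2, 0.3))[[1]] should match cumsum(c(0.1, 0.2, 0.3)).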
The corrcov function and its helper corrected_cov:
corrected_cov <- function(mu, V, Sigma, pp0, thresh, size, nrep = 1000) {
  nsnps <- length(pp0)
  # form joint z-score vectors
  temp <- diag(x = mu, nrow = nsnps, ncol = nsnps)
  zj <- lapply(seq_len(nrow(temp)), function(i) temp[i, ]) # nsnps zj vectors, one for each snp considered causal
  # simulate pp systems
  pps <- mapply(zj_pp, zj, V, MoreArgs = list(nrep = nrep, Sigma = Sigma), SIMPLIFY = FALSE)
  # consider a different CV as causal in each list
  n_pps <- length(pps)
  args <- 1:nsnps
  # obtain the credible set for each simulation and the proportion containing the CV
  prop_cov <- lapply(1:n_pps, function(x) {
    tmp = credset3(pps[[x]], CV = rep(args[x], nrep), thr = thresh)
    sum(tmp[[2]])/length(tmp[[2]])
  }) %>% unlist()
  # final corrected coverage value
  sum(prop_cov * pp0)
}
corrcov <- function(z, f, N0, N1, Sigma, thr, W = 0.2) {
  ph0.tmp <- z0_pp(z, f, type = "cc", N = N0 + N1, s = N1/(N0 + N1), W = W)
  ph0 <- ph0.tmp[1]      # prob of the null
  pp0dash <- ph0.tmp[-1] # pps of the variants
  varbeta <- coloc:::Var.data.cc(f, N0 + N1, N1/(N0 + N1)) # variance of the estimated effect size
  pp0 <- ppfunc(z, V = varbeta, W = W) # posterior probabilities of the system
  muhat <- est_mu(z, f, N0, N1)
  corrected_cov(mu = muhat, V = varbeta, Sigma = Sigma, pp0 = pp0, thresh = thr)
}
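A hypothetical call, assuming z (marginal Z scores), f (minor allele frequencies) and LD (the SNP correlation matrix) are already in the workspace, with sample sizes and threshold as in the example region above:
corrcov(z, f, N0 = 5000, N1 = 5000, Sigma = LD, thr = 0.99)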
On speeding up the simulation step that feeds quick_corrcov: .rmvnorm already uses C source code by default, so I'm not sure converting this to C would help too much (profiling shows that this is the slow stage). It takes ~13 secs for 1000 simulations for each of the 200 SNPs considered causal. The zj_pp function used to simulate the pps is:
zj_pp <- function(Zj, V, nrep = 5000, W = 0.2, Sigma) {
  exp.zm = Zj %*% Sigma # find E(Z_m)
  zstar = mvtnorm::rmvnorm(nrep, exp.zm, Sigma) # nrep rows, nsnps cols
  r <- W^2/(W^2 + V)
  bf = 0.5 * (log(1 - r) + (r * zstar^2)) # log ABFs
  denom <- coloc:::logsum(bf) # logsum(x) = max(x) + log(sum(exp(x - max(x)))) so the sum is not infinite
  pp.tmp <- exp(bf - denom) # convert back from the log scale
  pp.tmp/rowSums(pp.tmp) # normalise each row (simulation) to sum to 1
}
pps <- mapply(zj_pp, zj, V, MoreArgs = list(nrep = nrep, Sigma = LD), SIMPLIFY = FALSE)
At the moment, corrcov uses the following parameters: z, f, N0, N1, Sigma, thr, W = 0.2. In terms of replacing the \(Z\) scores and minor allele frequencies with \(\hat\beta\) and \(V\), I don't think this will work since the mafs are needed to find \(V\) (Var.data.cc) and the ABFs (approx.bf.p) within the function body.
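For reference, the case-control approximation used by Var.data.cc is roughly \(V \approx 1/\{2Nf(1-f)s(1-s)\}\), where \(N = N_0 + N_1\) and \(s = N_1/(N_0+N_1)\), so \(V\) cannot be computed without the maf \(f\).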
Can I get going with the real T1D data?
In terms of finding the required threshold, the desiredcov_cs function above is working very well. I need to consider other scenarios where it may not work.
Any ideas where I can get my poster printed at A0 - do I have to pay for this?
Tips for poster presentation? (I am going to print off A4 versions to hand out at the conference if anyone is interested.)
Where can I get the data from the T1D paper - how shall I get going with this?
Is my desiredcov_cs function good enough to put into my package? I'll include a vignette on using this too.