Contents

  1. Bugs in Code
  2. Error in Estimating mu
  3. Final Plots
    • Relative Error Plots
    • 2 CV Plot
    • Using Reference Panel
  4. Difference in nvar
  5. T1D Results
  6. Microbenchmarking
  7. Posterior Probabilities of Null Model
  8. Comments and To Do

1. Bugs in Code


We found a couple of bugs in the code. I have corrected these and re-run the simulations. The ppfunc function below was correct when z was a numeric vector,

ppfunc <- function(z, V, W = 0.2) {
    stopifnot(class(z)=="numeric")
    r = W^2/(W^2 + V)
    bf = 0.5 * (log(1 - r) + (r * z^2)) # log Bayes factors
    denom = logsum(bf) # normalising constant, calculated on the log scale
    exp(bf - denom)  # convert back from log scale; PPs now sum to 1
}

but it was incorrect when z was a matrix. How?

  1. Our (log) BFs were wrong because we were multiplying r by z column-wise rather than row-wise (see the sketch after this list).

  2. I was calculating the logsum of the whole matrix of BFs (obtaining one value) rather than for each row separately, to obtain one denominator for each simulation (although this didn't matter, as we normalised the PPs afterwards).
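
A minimal sketch of the recycling issue, using hypothetical values: in R, multiplying a matrix by a shorter vector recycles the vector down the columns, so to scale each column j of a matrix z by r[j] the multiplication must be applied to the transpose.

r <- c(0.1, 0.2, 0.3)              # hypothetical per-SNP values (3 SNPs)
z <- matrix(1, nrow = 2, ncol = 3) # 2 simulations x 3 SNPs, all ones

r * z       # WRONG: r is recycled down the columns
            # row 1: 0.1 0.3 0.2
            # row 2: 0.2 0.1 0.3

t(r * t(z)) # RIGHT: each column j is scaled by r[j]
            # row 1: 0.1 0.2 0.3
            # row 2: 0.1 0.2 0.3

The corrected function below uses this transpose trick to apply int.r row-wise: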

.zj_pp = function(Zj, int.Sigma, int.nrep, int.ERR, int.r) {
  stopifnot(class(Zj)=="numeric")
  stopifnot(class(int.ERR)=="matrix")
  stopifnot(class(int.r)=="numeric")
  exp.zm = Zj %*% int.Sigma # expected Z-scores
  mexp.zm = matrix(exp.zm, int.nrep, length(Zj), byrow = TRUE)  # expected Z-scores replicated in each row
  zstar = mexp.zm + int.ERR # MATRIX of simulated Z-scores, one replicate per row
  bf = 0.5 * t(log(1 - int.r) + (int.r * t(zstar^2))) # log BFs; transpose applies int.r row-wise
  denom = apply(bf, 1, logsum) # separate denominator for each replicate (row)
  exp(bf - denom)  # convert back from log scale; denom has length nrow(bf), so recycling subtracts the right denominator from each row
}
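
Both functions rely on a logsum helper, which is not shown in this post. A standard log-sum-exp implementation (similar to the one in the coloc package, whose details I am assuming here) would be:

logsum <- function(x) {
  my.max <- max(x)                   # subtract the maximum for numerical stability
  my.max + log(sum(exp(x - my.max))) # log(sum(exp(x))) without overflow
}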

2. Error in Estimating mu


It seems that estimator E (smallest absolute error) or estimator F (avoiding negative bias) is best. Having investigated how these perform in our correction, we decided to use muhatE = sum(abs(z0) * pp0), as this never estimates the coverage to be too high (it is better to be conservative).
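
For concreteness, a minimal sketch of this estimator, reusing ppfunc from section 1 (here z0 is assumed to be the vector of observed Z-scores in the region and V the variance of the estimated effect sizes):

pp0 <- ppfunc(z = z0, V = V)  # per-SNP posterior probabilities
muhatE <- sum(abs(z0) * pp0)  # posterior-weighted mean of |Z|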


3. Final Plots


Relative Error Plots





How much better is our corrected coverage estimate than the claimed coverage at genome-wide significance?

  • Zoom in on the above plots at the particularly interesting regions (close to genome-wide significance).


2 CV Plot



Using Reference Panel


What happens when we use the 1000 Genomes reference panel for our UK10K simulations? It seems that our correction is now slightly more likely to overestimate the coverage. However, note that up to 15 SNPs are removed in these simulations because they cannot be matched. Moreover, the SNPs removed are not random: they are more likely to be those with small MAFs, for example.

Perhaps it would be worth using only the simulations where at most 5 SNPs were dropped from the analysis? That is, data <- data[data$nsnps > 195, ] (note that only 1594/4000 simulations remain).

  • Why is our corrected coverage estimate too high in high-powered simulations using a reference panel?

4. Difference in nvar


It is useful to see how the number of variants in the credible set changes pre- and post-correction. Below is a plot of the difference in nvar between the credible sets, with the median value labelled. Negative values show where variants have been removed from the credible set.
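
As a rough sketch of the quantity plotted, with hypothetical column names (nvar_pre and nvar_post for the original and corrected credible set sizes):

d_nvar <- data$nvar_post - data$nvar_pre # negative values: variants removed by the correction
median(d_nvar)                           # the median difference labelled on the plot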

Fairly often, variants are removed from the credible set (improving the resolution), but as we move to extremely high-powered simulations the credible sets already contain a single variant and stay the same.



5. T1D Results



rs34536443: Not corrected because the 95% credible set contains only the index SNP with PP = 0.9985 (credible set cannot be made smaller).


rs72928038: Not corrected because the 95% credible set contains only the index SNP with PP = 0.9810 (credible set cannot be made smaller).

Note: The plots for all the regions are available here.



6. Microbenchmarking


How long does the corrcov function take to run?
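
A minimal benchmarking sketch using the microbenchmark package; the inputs (z, maf, N0, N1, LD) and the exact corrcov argument names are my assumptions for illustration, not taken from the package documentation:

library(microbenchmark)

# Time 10 evaluations of the corrected coverage estimate for one region
microbenchmark(
  corrcov(z = z, f = maf, N0 = N0, N1 = N1, Sigma = LD),
  times = 10
)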


7. Posterior Probabilities of Null Model


For my own interest, I investigate the relationship between \(\mu\) and the posterior probability of the null model of no genetic effect.

Note that the PP of the null model becomes negligible for \(\mu>6\).
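
A minimal sketch of how this PP can be computed, assuming the null model is given the same prior weight as each SNP and enters with a Bayes factor of 1 (log-BF of 0), alongside the per-SNP log-BFs from ppfunc:

ppnull <- function(z, V, W = 0.2) {
  r <- W^2/(W^2 + V)
  lbf <- 0.5 * (log(1 - r) + (r * z^2)) # per-SNP log Bayes factors
  lbf0 <- 0                             # log-BF of the null model (BF = 1)
  exp(lbf0 - logsum(c(lbf0, lbf)))      # PP of the null: normalise over null + all SNPs
}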


8. Comments and To Do