We found a couple of bugs in the code. I have corrected these and re-run the simulations.
ppfunc <- function(z, V, W = 0.2) {
stopifnot(is.numeric(z))
r = W^2/(W^2 + V)
bf = 0.5 * (log(1 - r) + (r * z^2)) # log BFs
denom = logsum(bf) # log of the normalising denominator (the BFs are on the log scale, so we cannot simply sum them)
exp(bf - denom) # convert back from the log scale
}
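ppfunc calls a logsum helper that is not shown here. A minimal sketch, assuming it is the standard numerically stable log-sum-exp (computing log(sum(exp(x))) after factoring out the maximum):

```r
# Hypothetical minimal logsum helper (assumption: this matches the helper
# that ppfunc uses). Factoring out max(x) avoids overflow when the log BFs
# are large.
logsum <- function(x) {
  mx <- max(x)
  mx + log(sum(exp(x - mx)))
}

logsum(log(c(1, 2, 3)))  # equals log(6), computed stably
```

Without the max trick, `log(sum(exp(x)))` overflows to `Inf` for large log BFs (e.g. `x = c(1000, 1001)`), whereas this version stays finite.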
but it was incorrect when z is a matrix. How? Our (log) BFs were wrong because we were multiplying r by z column-wise rather than row-wise. I was also calculating the logsum of the whole matrix of BFs (obtaining one value) rather than for each row separately to obtain one value per simulation (although this didn't matter, as we normalised the PPs afterwards).
BF <- 0.5 * (log(1 - r) + (r * zstar^2))
where zstar is an nsim*nsnp matrix of Z-scores and r is a vector of length nsnp of the shrinkage factors. So r is recycled into a matrix to do the multiplications, but it is recycled column-wise rather than row-wise! This was identified because Chris tried to match my results by applying ppfunc over each row of zstar and we were getting different results. To fix this, I have created a .zj_pp utility function that is used within all the corrected coverage functions (note the difference in calculating the BFs, which now recycles r row-wise):

.zj_pp = function(Zj, int.Sigma, int.nrep, int.ERR, int.r) {
stopifnot(is.numeric(Zj))
stopifnot(is.matrix(int.ERR))
stopifnot(is.numeric(int.r))
exp.zm = Zj %*% int.Sigma # expected Z-scores
mexp.zm = matrix(exp.zm, int.nrep, length(Zj), byrow = TRUE) # expected Z-scores replicated in each row
zstar = mexp.zm + int.ERR # MATRIX of simulated Z-scores
bf = 0.5 * t(log(1 - int.r) + (int.r * t(zstar^2))) # log BFs, recycling int.r row-wise
denom = apply(bf, 1, logsum) # a separate denominator for each replicate
exp(bf - denom) # convert back from the log scale
}
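The recycling issue can be seen with a toy example (hypothetical values below). R recycles a vector down the columns of a matrix, so multiplying per-SNP shrinkage factors into an nsim x nsnp matrix of Z-scores needs the double transpose (or a sweep):

```r
r     <- c(0.1, 0.2, 0.3)          # hypothetical per-SNP shrinkage factors (length nsnp)
zstar <- matrix(1:6, nrow = 2)     # toy 2 x 3 matrix: nsim rows, nsnp columns

wrong <- r * zstar^2               # r is recycled down the columns: SNPs get mixed up
right <- t(r * t(zstar^2))         # transpose so recycling runs along SNPs, then transpose back

# sweep() states the intent directly: multiply column j by r[j]
all.equal(right, sweep(zstar^2, 2, r, `*`))  # TRUE
```

`sweep(zstar^2, 2, r, `*`)` is perhaps the clearer way to write the row-wise multiplication, but the double-transpose form used in `.zj_pp` is equivalent.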
The code for estimating \(\mu\) was incorrect and has now been amended. We find that our four standard estimates of \(\mu\) are no longer accurate, where:
A = max(abs(z))
B = sum(abs(z) * ppdash)
C = (1 - ppdash[h0]) * max(abs(z))
D = mean(c(B, C))
Let’s consider some alternative estimators:
muhatA = max(abs(z))
muhatE = sum(abs(z0) * pp0)
muhatF = (muhatA+muhatE)/2
muhatG = muhatE * pph0 + muhatA * (1-pph0)
It seems that either E (smallest absolute error) or F (avoiding negative bias) is best. Having investigated how these perform in our correction, we decide to use muhatE = sum(abs(z0) * pp0), as this never estimates the coverage to be too high (better to be conservative).
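Since muhatE is a weighted average of the absolute Z-scores with weights summing to 1, it can never exceed muhatA = max(abs(z)), which is why it errs on the conservative side. A toy check, with entirely hypothetical values for z0, pp0 and pph0:

```r
z0   <- c(1.2, 4.8, 2.1)           # hypothetical Z-scores
pp0  <- c(0.05, 0.85, 0.10)        # hypothetical posterior probabilities (sum to 1)
pph0 <- 0.02                       # hypothetical PP of the null model

muhatA <- max(abs(z0))
muhatE <- sum(abs(z0) * pp0)
muhatF <- (muhatA + muhatE) / 2
muhatG <- muhatE * pph0 + muhatA * (1 - pph0)

muhatE <= muhatA  # always TRUE: a convex combination cannot exceed the maximum
```

By the same argument, muhatF and muhatG always lie between muhatE and muhatA, so all three are bounded above by the (upwardly biased) maximum.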
How much better is our corrected coverage estimate than the claimed coverage at genome-wide significance?
What happens when we use the 1000 Genomes reference panel for our UK10K simulations? It seems that our correction is now slightly more likely to overestimate the coverage. However, note that up to 15 SNPs are removed in these simulations as they cannot be matched. Moreover, the SNPs removed are not random - they are more likely to be those with small MAFs etc.
Perhaps it would be worth only using the simulations where at most 5 SNPs were dropped from the analysis, i.e. data <- data[data$nsnps > 195, ] (note only 1594/4000 simulations are left).
It is useful to see how the number of variants in the credible set changes pre- and post-correction. Below is a plot of the difference in nvar in the credible sets, with the median value labelled. Negative values show where variants have been removed from the credible set.
Fairly often, variants are removed from the credible set (improving the resolution), but as we move to extremely high-powered simulations, the credible sets contain a single variant and stay the same.
Black - corrected credible set with fewer variants
Blue - corrected credible set stayed the same
Grey - credible set not corrected (either because its coverage was accurate or because a smaller credible set could not be found - see below)
rs34536443: Not corrected because the 95% credible set contains only the index SNP with PP = 0.9985 (credible set cannot be made smaller).
rs72928038: Not corrected because the 95% credible set contains only the index SNP with PP = 0.9810 (credible set cannot be made smaller).
Note: The plots for all the regions are available here.
How long does the corrcov function take to run?
For my own interest, I investigated the relationship between \(\mu\) and the posterior probability of the null model of no genetic effect.
Note that the PP for the null model becomes negligible for \(\mu>6\).
Simulations to plot:
To discuss with Chris: Paper and figure plan.