Correcting Coverage




Ideas:

  1. Would any of these methods improve if we considered 1000-SNP models? When I set up the big simulation, I’d like to change from nsnps=100 to nsnps=1000.

  2. Could try all methods again using \(z_0\) for the joint z scores as opposed to \(\max(z_0)\) (see the sketch after this list).

  3. Each GAM model is currently built using 5000 simulations; perhaps this could be upped to 10000. We often get the following warning: ‘Iteration limit reached without full convergence - check carefully’.

  4. Other ways to measure similarity between the true pp and the simulated pp? We could find the KL divergence between each simulated pp vector and the true pp, then in each CV simulation select only the simulations with the lowest KL (cusum could be used to choose these), so that each GAM model is built using, say, the 2000 simulations which most closely reflect the true pp. We would still end up with 100 GAM models for the 100 possible CVs.

  • Areas that could change:
  1. The zj specification.
  2. Which simulations are used to build the GAM models.
  3. Which predictions are used from the GAM models.
  4. The normalisation of these predictions/corrected coverage estimates.
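
A minimal sketch of idea 2, assuming z0 as defined in the Code section below, and under the (assumed) reading that "using \(z_0\)" means placing the observed z score of the putative causal SNP on the diagonal of the joint z-score specification rather than \(\max(z_0)\). nsnps is just length(z0), so the same code covers the switch from 100 to 1000 SNPs in idea 1.

nsnps <- length(z0) # 100 at present; 1000 in the bigger simulation (idea 1)

# current specification: every causal scenario uses max(z0) at the causal SNP
zj.max <- do.call(c, apply(diag(x = max(z0), nsnps, nsnps), 1, list))

# alternative (idea 2, assumed reading): scenario i uses the observed z0[i] at SNP i
zj.obs <- lapply(seq_len(nsnps), function(i) {
  zj <- numeric(nsnps)
  zj[i] <- z0[i]
  zj
})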

Choosing a method:

  • We want an unbiased method with minimal variance.

  • Run simulations for thr=0.9, 0.95, 0.99.

  • At each threshold, find the mean and sd of all of the above measures, and report a summary of the number of variants in each credible set across the thresholds (a sketch of this summary step follows).
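
A minimal sketch of this summary step, assuming each run saves a results data frame like the one written at the end of the Code section into /Users/anna/results, and that (hypothetically) a thr column recording the threshold is added to each results file:

library(data.table)

files <- list.files("/Users/anna/results", pattern = "\\.RDS$", full.names = TRUE)
res <- rbindlist(lapply(files, readRDS)) # one row per simulation run

res[, lapply(.SD, mean), by = thr] # mean of each measure at each threshold
res[, lapply(.SD, sd),   by = thr] # sd of each measure at each threshold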



Queries

Warning messages: 1: In newton(lsp = lsp, X = G$X, y = G$y, Eb = G$Eb, UrS = G$UrS, L = G$L, : Iteration limit reached without full convergence - check carefully
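
One possible mitigation (a sketch only, not yet tested on these data) is to raise mgcv's iteration limit via gam.control when fitting each GAM, alongside building each model from more simulations (idea 3):

library(mgcv)
# 'x' is assumed to be one of the per-CV data frames with columns covered and claimed.cov
m <- gam(covered ~ s(claimed.cov), data = x, family = "binomial",
         control = gam.control(maxit = 400)) # raise maxit from its default of 200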


Code

library(mvtnorm)
library(magrittr)
library(data.table)
library(simGWAS) 
library(mgcv)
library(snpStats)
library(coloc)

##########################################################################
### Read in data and find variance (sigma squared hat)

data <- readRDS("/Users/anna/newdata/data2.RDS")
freq <- data$freq
maf <- data$maf
sigma <- data$LD
snps <- data$snps
CV <- data$CV
iCV <- data$iCV

# variance function from coloc
Var.data.cc <- function(f, N, s) {
  1 / (2 * N * f * (1 - f) * s * (1 - s))
}

# find sigma hat
var <- Var.data.cc(maf,N=10000,0.5) # N0=N1=5000
sd <- sqrt(var)

##########################################################################
### Find the z score and claimed coverage we want to correct (first row)
## Extra simulations ran to give an estimate of the true coverage

zsim.tmp <- simulated_z_score(N0=5000, # number of controls
                              N1=5000, # number of cases
                              snps=snps,
                              W=CV, # causal variants, subset of snps
                              gamma.W=log(1.1), # log odds ratios
                              freq=freq, # reference haplotypes
                              nrep=5000 # remainder will be used to get an estimate of true coverage
)

z0 <- zsim.tmp[1,] # treat first row as the true system we want to correct

# quick function: z scores --> pp
ppfunc <- function(z, V, omega=0.2^2) {
  r <- omega/(omega + V)
  bf = 0.5 * (log(1 - r) + (r * z^2))
  denom <- coloc:::logsum(bf)
  pp.tmp <- exp(bf - denom) # convert back from log scale
  pp.tmp / sum(pp.tmp)
}

truepp <- ppfunc(z0,var)

credset <- function(pp,CV=iCV, thr=0.6) {
  o <- order(pp,decreasing=TRUE) # order index for true pp
  cumpp <- cumsum(pp[o]) # cum sums of ordered pps
  wh <- which(cumpp > thr)[1] # how many needed to exceed thr
  size <- cumpp[wh] 
  contained=as.numeric(CV %in% o[1:wh])
  data.frame(claimed.cov=size, covered=contained)
}

cs <- credset(truepp,iCV)
cov <- cs$covered

# this is all the info experimenters would have and would want to correct

claim0 <- cs$claimed.cov # coverage we wish to correct
# z0 # z scores of the true system
# truepp # pps from the true system

##########################################################################
### Find an estimate of the true coverage so we can see how accurate we are

zsim.true <- zsim.tmp[-1,] # remove row of the 'true z'

pps <- t(apply(zsim.true, 1, ppfunc, V=var))

tmp <- apply(pps, 1, credset, iCV) %>% rbindlist()

true.cov.est <- mean(tmp$covered)

##########################################################################
### Method to find corrected coverage

# specify joint z scores for each snp causal
temp <- diag(x = max(z0), nrow = 100, ncol = 100) # nsnps = 100
zj <- do.call(c, apply(temp, 1, list)) # list of 100 vectors: max(z0) at the causal SNP, 0 elsewhere

Sigma <- sigma

avgdiff <- function(Zj,nrep=5000) {
  exp.zm=Zj %*% Sigma
  zstar=rmvnorm(nrep,exp.zm,Sigma)
  d <- colMeans((t(zstar) - z0)^2) # d is distance between our simulated z and the true z for each sim
  mean(d) # this is the mean diff for that scenario of SNP i being causal
}
d <- lapply(zj,avgdiff)  %>% unlist()

plot(truepp,d, main="Distance between z scores vs true pp")
abline(v=truepp[iCV])

###################

# quick function: simulated z scores --> pps for a given causal-SNP scenario
bf.func <- function(Zj, V, nrep=5000, omega=0.2^2, Sigma){
  exp.zm=Zj%*%Sigma # find E(Z_m) (length nsnps, for this SNP being causal)
  zstar=rmvnorm(nrep,exp.zm,Sigma) # nrep is rows, nsnps is cols
  ## same ABF --> pp calculation as ppfunc, vectorised over rows
  r <- omega/(omega + V)
  bf = 0.5 * (log(1 - r) + (r * zstar^2))
  denom <- coloc:::logsum(bf)
  pp.tmp <- exp(bf - denom) # convert back from log scale
  pp.tmp / rowSums(pp.tmp) # normalise each row of pps
}

# simulated pps for each causal-SNP scenario; V is the full variance vector, as in ppfunc
pps <- lapply(zj, bf.func, V=var, Sigma=sigma)

# different CV as causal in each list
n_pps <- length(pps)
args <- 1:100

d5 <- lapply(1:n_pps, function(x){
  apply(pps[[x]],1,credset,args[x]) %>% rbindlist()
})

invlogit <- function(x) exp(x)/(1+exp(x))

##### get 100 GAM models, one for each SNP being causal. Use each of these to get a prediction of the corrected coverage

pred.func <- function(x){
  m <- gam(covered~s(claimed.cov), data=x, family="binomial")
  invlogit(predict(m,newdata=data.frame("claimed.cov"=claim0)))
}

# get vector of predicted coverage values
final <- lapply(d5,pred.func) %>% unlist()

o <- order(d)

# final[o] is vector of final predictions ordered by closeness of zstar to z0

cmean <- cumsum(final[o])/(1:length(final)) # running mean of the ordered predictions
cvar <- cumsum((final[o]-cmean)^2)/(1:length(final)) # approximate running variance of the ordered predictions
par(mfrow=c(2,2))
## plot(truepp,final)
plot(d,final) # d is average distance of each set of sims from the true z, final is the final coverage estimates
plot(d[o],cmean)
plot(d[o],cvar)
plot(d[o],sqrt(cvar)/cmean)

corr.cov <- mean(final) # unweighted mean over the 100 causal-SNP scenarios
corr.cov.norm <- sum(final*truepp) # weighted by each SNP's true pp of being causal (use truepp, not pp)
corr.cov.cv <- final[ iCV ] # prediction from the model with the true CV causal (not available in practice)

################ how can we objectively choose which points to use?

library(bda)
y <- final[o] # predictions ordered by distance from the true z
x <- d[o]
ans <- cusum(y) # change-point analysis on the ordered predictions
w <- min(ans$ubrk,ans$lbrk,na.rm=TRUE) # earliest detected break-point: keep only the first w predictions

par(mfrow=c(2,2))
## plot(truepp,final)
plot(x,y,main="raw",col=ifelse(1:length(x) <= w, "red","black"))
plot(d[o],cmean,main="cmean",col=ifelse(1:length(x) <= w, "red","black"))
## plot(d[o],cvar,main="cvar")
plot(d[o],sqrt(cvar)/cmean,main="ccov",col=ifelse(1:length(x) <= w, "red","black"))

corr.cov.cusum <- sum(final[o][1:w]*truepp[o][1:w]) / sum(truepp[o][1:w]) # truepp-weighted mean of the first w predictions (use truepp, not pp)
corr.cov.cusum2 <- mean(final[o][1:w]) # unweighted mean of the first w predictions

############### KL method to limit pps

# KL between true pp and simulated pp 

PP <- unname(truepp)
KL <- function(p) sum(p*log(p/PP))

KLs.tmp <- lapply(pps, function(x) apply(x,1,KL))

# keep the ones with smallest KL
KLs.lim <- lapply(KLs.tmp, function(x) which(rank(x)<2001))

for (ii in 1:length(pps)) {
  pps[[ii]] <- pps[[ii]][KLs.lim[[ii]], ]
}

# different CV as causal in each list
n_pps <- length(pps)
args <- 1:100

d5 <- lapply(1:n_pps, function(x){
  apply(pps[[x]],1,credset,args[x]) %>% rbindlist()
})

##### get 100 GAM models, one for each SNP being causal. Now built using only the 2000 lowest-KL simulations each

pred.func <- function(x){
  m <- gam(covered~s(claimed.cov), data=x, family="binomial")
  invlogit(predict(m,newdata=data.frame("claimed.cov"=claim0)))
}

final <- lapply(d5,pred.func) %>% unlist()

pp.lim <- mean(final)
pp.lim2 <- sum(final*truepp)

results <- data.frame("true.cov"=true.cov.est,
                      "claimed.cov"=claim0,
                      "corrected.cov"=corr.cov,
                      "corrected.cov.norm"=corr.cov.norm,
                      "corr.cov.cv"=corr.cov.cv,
                      "corr.cov.cusum"=corr.cov.cusum,
                      "corr.cov.cusum2"=corr.cov.cusum2,
                      "pp.corr.cov"=pp.lim,
                      "pp.corr.cov.norm"=pp.lim2)
saveRDS(results,"/Users/anna/results/results2.RDS")