1. Have identified problem (over/under-coverage)
Empirical and claimed coverage for each block of 100 simulations (nrep in simdata
) was calculated to produce new graphs.
Firstly, for N0=N1=5000 (800,000 simulations for ordered and unordered). The second graphs are for transforming the x axis to the sqrt scale but I can’t figure out how to get a correct \(y=x\) line?
- And for N0=N1=50000 (400,000 simulations for ordered and unordered). Why are these showing consistent undercoverage?
These plots look a bit weird.. The following are using the same data as that to make the bar charts in the sup meeting 8 files.
Including the threshold
Why did we have to include information on the threshold?
Doing the GAM method on all the data and then predicting on all the held out data gave an excellent prediction.
But when predicting on just a threshold set of held out data (i.e. all those cs formed with threshold 0.6), the prediction was terrible.
When learning the model using just a thresholded set and doing the predicitons it was great.
This shows that the shape of the calibration curves depends on the threshold in some way.
Slight issue is lack of efficiency. If it was that there was one calibration curve for all the data, then one simulation would give us lots of points, but now one simulation will only give us one point.
2. We can fix the problem if we know the CV and its true effect
- Simulations were ran implementing the GAM method for scenarioes where the CV and it’s true effect are known. The simulations are for different LD blocks and different true effect sizes.
Method:
Simulate 100 zstars from \(MVN(E(Z_m),\Sigma)\), where \(E(Z_m)\) is a vector of the true marginal effects.
Convert each of the simulated z vectors to posterior probabilities.
Form a credible set using the standard method.
Use the LOO GAM method to get a corrected coverage value (should this be the LOO method??)
f <- function(dd,thr) {
wh <- which(dd$x>thr)[1]
dd[wh,] # size of cred set
}
library(DescTools)
test1 <- function(d, thr){
a1 <- lapply(d,f,thr) %>% rbindlist() # a1 is dataframe of size and covered for cred sets obtained from each nrep of cali.func
invlogit <- function(x) exp(x)/(1+exp(x))
pred <- numeric(nrow(a1)) # empty vector to fill with predictions
for(i in 1:nrow(a1)) {
m1 <- gam(y ~ s(x), data=a1[-i,], family="binomial") # GAM model
pred[i] <- invlogit(predict(m1,newdata=a1[i,]))
}
error <- BinomCI(sum(a1$y==1), nrow(a1),
conf.level=0.95,
method = "jeffreys")
data.frame(Thr=thr,True.cov=mean(a1$y),Corr.cov=mean(pred), Claim.cov=mean(a1$x), upp=error[1,3], low=error[1,2])
}
# d is x and y co-ords for stepping along x axis until we hit the CV, when we continue stepping along y=1.
test1(d,0.6)
1. Generate data using datagen.R in /gam/CVknown/data/
2. Implement the GAM method using plot.R in /gam/CVknown/
3. Rbind the results and plot (code below).
OR
1. Use allin1.R rcode to go straight to results. Just submit as array job and will get lots of results files out that I can rbind to plot.
FALSE `geom_smooth()` using method = 'loess' and formula 'y ~ x'
FALSE `geom_smooth()` using method = 'loess' and formula 'y ~ x'
3. Can we still fix the problem if we don’t know the CV?
Method
Generate some reference data which will be kept constant. This includes LD, freq, maf, CV, snps.
From this generate a system which we want to correct. I.e. one z0 simulation and a credible set, with a claimed coverage (claim0).
(3. Generate some more simulations and credible sets and average the covered column to provide an estimate of the true coverage (this is done for 5000 simulations).)
Assume there are 100 SNPs in our system which we are considering. We want to obtain the marginal given the joint for each SNP being causal. The joint effect for SNP \(i\) being causal is a vector of 100 0’s except entry \(i\) which is the observed effect of that SNP. Use z0 for this.
We obtain 100 \(E(Z_m)\) vectors by multiplying each \(Z_J\) by \(\Sigma\).
From each of these we can simulate \(zstar^*\sim MVN(E(X_J),\Sigma)\). Each row of \(Z_m^*\) is a simulation whereby the columns represent the marginal effects of the SNPs. At this stage, we have 100 lists (one for each SNP being causal) of \(500*100\) matrices (rows=500 simulations, columns=100 SNPs). These zstars have been normalised.
For each simulation in each list, obtain the posterior probabiltiy system.
HOW?
- Convert z scores to Bayes factors where, \[BF=\frac{P(z|z\sim N(0,\omega'))}{P(z|z\sim N(0,1))}\]
- But what is \(\omega'\)? We know that \(\hat{\beta}\sim N(\beta,\hat{\sigma}^2)\) \[H_0: \beta=0\] \[H_1: \beta\sim N(0,\omega)\] So using \(\beta\sim N(0,\omega)\) and \(\hat{\beta}\sim N(\beta,\hat{\sigma}^2)\) we find, \[\hat{\beta}-\beta \sim N(0,\hat{\sigma}^2),\] \[\hat{\beta}-\beta+\beta \sim N(0,\hat{\sigma}^2+\omega).\]
I.e. \(var(\hat{\beta}|H_A)=\hat{\sigma}^2+\omega\).
Now, \(z=\frac{\hat{\beta}}{\hat{\sigma}}\) so, \[var(z|H_A)=\frac{\hat{\sigma}^2+\omega}{\hat{\sigma}^2},\] \[\omega'=\frac{\hat{\sigma}^2+\omega}{\hat{\sigma}^2}.\] So the BF is just the difference of two normal distributions with mean 0.
\[pp_i=\frac{BF_i}{\sum_{j=1}BF_j}\]
Nb, \(\hat{\sigma}\) is obtained using the variance function in coloc. It is a vector with length equal to nsnps.
For the chosen threshold, obtain the credible set using the normal method and report the size (claimed coverage) and whether the CV is present. At this stage, we have 100 lists of \(500*2\) dataframes, the claimed.cov column is the size of the credible set obtained from the simulation and the covered column is whether the CV is in the credible set. There are 500 rows for each \(Z_M^*\) sample.
Use these claimed coverage values as predictors in the GAM model to predict true coverage. One gam model per consideration of each SNP being causal (so 100).
Use each of these to predict the corrected coverage of the original claimed coverage (claim0).
The final estimate of corrected coverage is the weighted sum of these with the true pp.
Practical method (all in gam folder)
(Note that for all this GAM stuff, nsnps is fixed at 100).
Use the data0code.R file to generate lots of original data (submit as an array job and get dataaXX.RDS out). Each of these files contain the CV, iCV, maf, LD matrix, snps, freq (reference genotype matrix) and threshold.
Use the dataatoresults.R file to implement the method on these dataa files (submit as an array job to take in each of the dataXX.RDS files). The output is resultsXX.RDS files saved in results directory. Each of these is a 1*3 dataframe with the true coverage, the claimed coverage and the corrected coverage.
Use the resultsXXtoresults.R file to get summary of results from these (rbind each resultsXX file).
OR
- Use allin1.R rcode to go straight to resultsXX.RDS files.
FALSE `geom_smooth()` using method = 'loess' and formula 'y ~ x'
FALSE `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Discussion
If we know the CV, then we can simulate approximately equivalent realisations and get a very accurate corrected coverage. We need to find a way to simulate more accurate realisations when the CV is not known.
We take the observed z at each SNP as the best estimate of the true z if that SNP were causal. We then simulate zstar about this.
Using all the simulated z scores generated from \(MVN(E(Z_m),\Sigma)\) for each SNP being causal presents problems. If that SNP which is being considered as causal has low posterior probability, then it has a z score close to 0. This means that when we simulate around this, we get vectors of z all distributed about 0. When we convert these to posterior probabilities, the system produced will be very flat and not really reflect that posterior probability system we are trying to correct at all.
Low z score –> Simulate zstars about 0 –> Give very flat posterior probability systems –> Larger credible sets.
This means that if we use too small an effect size, the simulations are not equivalent. Instead, we tried scaling by maximum effect size zstar <- zstar * max(abs(zsim))/max(abs(zstar))
, but found these to not be equivalent either.
We therefore need to find a way to simulate better z or just select appropriate z from those that we simulate.
Filter the z that we simulate
We only want to keep the z simulations that are representative of the true z. Could filter out those which are not representative using Euclidean or Mahalanobis distance from the ‘true z’/ the mean of the simulations/the z scores from the system we are correcting?
The Euclidean distance is defined as, \[D_E=\sqrt{\sum_{i=1}^n(q_i-p_i)^2}.\]
It is basically the distance between two points, but is skewed when the points are measured in different units.
The Mahalanobis distance is defined as, \[D_M(\mathbf{x})=\sqrt((\mathbf{x}-\mathbf{u})^T\Sigma^{-1}(\mathbf{x}-\mathbf{u})).\]
Where the \(\mathbf{x}\) are the observations and the \(\mathbf{u}\) are the means.
This method will therefore transform z scores into a zero mean vector and then quantify how representative our simulations are of the ‘true z’/the mean(?). We can then filter out those which are ‘too far’ from the true z.
Mahalanobis distance seems more appropriate as it takes into account the correlations between variables, although it does require inverting the covariance matrix (LD) which is computationally difficult.
Questions and Ideas
Correct y=x line after transforming x axis in first simulations? Are these correct?
Can we filter the z values based on the similarity of the entropy of the true system and the entropy of the posterior probability system they reflect.
For the ‘we know the CV method’, should I be using a LOO GAM method or the method used for when we don’t know the CV of predicting from the claimed coverage of the credible set we are trying to correct? At the moment, the empirical coverage is the mean of the covered column, the corrected coverage is the single value obtained when using the LOO GAM method and the claimed coverage is the mean of the size of the credible sets. Also not sure my error bars are correct?
---
title: "Meeting 9"
output: html_notebook
---

---

### 1. Have identified problem (over/under-coverage)

---

* Empirical and claimed coverage for each block of 100 simulations (nrep in ```simdata```) was calculated to produce new graphs. 

* Firstly, for N0=N1=5000 (800,000 simulations for ordered and unordered). The second graphs are for transforming the x axis to the sqrt scale but I can't figure out how to get a correct $y=x$ line?

<p float="left">
  <img src="/Users/anna/Google Drive/PhD/barcharts/5000ver2.png" width="1000" />
</p>

<p float="left">
  <img src="/Users/anna/Google Drive/PhD/barcharts/5000transformed.jpeg" width="800" />
</p>

---

* And for N0=N1=50000 (400,000 simulations for ordered and unordered). Why are these showing consistent undercoverage?

<p float="left">
  <img src="/Users/anna/Google Drive/PhD/barcharts/50k.jpeg" width="800" />
</p>

These plots look a bit weird.. The following are using the same data as that to make the bar charts in the sup meeting 8 files. 
<p float="left">
  <img src="/Users/anna/Google Drive/PhD/barcharts/samedata.png" width="1000" />
</p>


---

#### Including the threshold

---

* Why did we have to include information on the threshold?

* Doing the GAM method on all the data and then predicting on all the held out data gave an excellent prediction.

* But when predicting on just a threshold set of held out data (i.e. all those cs formed with threshold 0.6), the prediction was terrible. 

* When learning the model using just a thresholded set and doing the predicitons it was great. 

* This shows that the shape of the calibration curves depends on the threshold in some way. 

* Slight issue is lack of efficiency. If it was that there was one calibration curve for all the data, then one simulation would give us lots of points, but now one simulation will only give us one point.

---

### 2. We can fix the problem if we know the CV and its true effect

---

* Simulations were ran implementing the GAM method for scenarioes where the CV and it's true effect are known. The simulations are for different LD blocks and different true effect sizes. 

Method:

1. Simulate 100 zstars from $MVN(E(Z_m),\Sigma)$, where $E(Z_m)$ is a vector of the true marginal effects. 

2. Convert each of the simulated z vectors to posterior probabilities.

3. Form a credible set using the standard method. 

4. Use the LOO GAM method to get a corrected coverage value (should this be the LOO method??)

```{r eval=FALSE}
f <- function(dd,thr) {
  wh <- which(dd$x>thr)[1]
  dd[wh,] # size of cred set
}


library(DescTools)

test1 <- function(d, thr){
  a1 <- lapply(d,f,thr)  %>% rbindlist() # a1 is dataframe of size and covered for cred sets obtained from each nrep of cali.func
  
  invlogit <- function(x) exp(x)/(1+exp(x))
  pred <- numeric(nrow(a1)) # empty vector to fill with predictions
  for(i in 1:nrow(a1)) {
    m1 <- gam(y ~ s(x), data=a1[-i,], family="binomial") # GAM model
    pred[i] <- invlogit(predict(m1,newdata=a1[i,]))
  }
  error <- BinomCI(sum(a1$y==1), nrow(a1), 
                   conf.level=0.95,
                   method = "jeffreys")
  data.frame(Thr=thr,True.cov=mean(a1$y),Corr.cov=mean(pred), Claim.cov=mean(a1$x), upp=error[1,3], low=error[1,2])
}

# d is x and y co-ords for stepping along x axis until we hit the CV, when we continue stepping along y=1.
test1(d,0.6)

```

```
1. Generate data using datagen.R in /gam/CVknown/data/
2. Implement the GAM method using plot.R in /gam/CVknown/
3. Rbind the results and plot (code below).

OR 
1. Use allin1.R rcode to go straight to results. Just submit as array job and will get lots of results files out that I can rbind to plot.
```

```{r echo=FALSE, warning=FALSE, comment=FALSE, error=FALSE}
library(gridExtra)
library(ggplot2)
df <- readRDS("/Users/anna/CVknowndata.RDS")

plot <- ggplot(data=df,aes(x=Corr.cov,y=True.cov))+geom_point()+geom_errorbar(aes(ymin=low, ymax=upp), size=0.2,alpha=0.5, width=.005)+geom_abline(a=0,b=1)
plot <- plot+geom_smooth()+ylab("Empirical Coverage")+xlab("Corrected Coverage")+ggtitle("Empirical and Corrected Coverage for \n GAM Method with Known CV")+ylim(0.55,1)+xlim(0.55,1)

plot1 <- ggplot(data=df, aes(x=Claim.cov,y=True.cov))+geom_point()+geom_errorbar(aes(ymin=low, ymax=upp), size=0.2,alpha=0.5, width=.005)+geom_abline(a=0,b=1)
plot1 <- plot1+geom_smooth()+ylab("Empirical Coverage")+xlab("Claimed Coverage")+ggtitle("Empirical and Claimed Coverage for \n GAM Method with Known CV")+ylim(0.6,1)+xlim(0.6,1)

library(cowplot)
plot_grid(plot+ theme(plot.title = element_text(size=12)), plot1+ theme(plot.title = element_text(size=12)), labels = "AUTO",align = 'v', scale=0.8)

###
#df1 <- data.frame("True.cov"=c(rep(df$True.cov,2)),"Coverage"=c(df$Corr.cov,df$Claim.cov),"upp"=df$upp,"low"=df$low,"type"=c(rep("Corrected",2*nrow(df)),rep("Claimed",2*nrow(df))))

#plot2 <-ggplot(data=df1, aes(x=True.cov,y=Coverage,colour=type))+geom_point()
#plot2+geom_smooth()
```

---

### 3. Can we still fix the problem if we don't know the CV?

---

### Method

---

1. Generate some reference data which will be kept constant. This includes LD, freq, maf, CV, snps.

2. From this generate a system which we want to correct. I.e. one z0 simulation and a credible set, with a claimed coverage (claim0).

(3. Generate some more simulations and credible sets and average the covered column to provide an estimate of the true coverage (this is done for 5000 simulations).)

4. Assume there are 100 SNPs in our system which we are considering. We want to obtain the marginal given the joint for each SNP being causal. The joint effect for SNP $i$ being causal is a vector of 100 0's except entry $i$ which is the observed effect of that SNP. Use z0 for this. 

5. We obtain 100 $E(Z_m)$ vectors by multiplying each $Z_J$ by $\Sigma$. 

6. From each of these we can simulate $zstar^*\sim MVN(E(X_J),\Sigma)$. Each row of $Z_m^*$ is a simulation whereby the columns represent the marginal effects of the SNPs. At this stage, we have 100 lists (one for each SNP being causal) of $500*100$ matrices (rows=500 simulations, columns=100 SNPs). These zstars have been normalised.

7. For each simulation in each list, obtain the posterior probabiltiy system.

---

HOW?

1. Convert z scores to Bayes factors where, 
$$BF=\frac{P(z|z\sim N(0,\omega'))}{P(z|z\sim N(0,1))}$$

* But what is $\omega'$?
We know that $\hat{\beta}\sim N(\beta,\hat{\sigma}^2)$
$$H_0: \beta=0$$
$$H_1: \beta\sim N(0,\omega)$$
So using $\beta\sim N(0,\omega)$ and $\hat{\beta}\sim N(\beta,\hat{\sigma}^2)$ we find,
$$\hat{\beta}-\beta \sim N(0,\hat{\sigma}^2),$$
$$\hat{\beta}-\beta+\beta \sim N(0,\hat{\sigma}^2+\omega).$$

I.e. $var(\hat{\beta}|H_A)=\hat{\sigma}^2+\omega$.

Now, $z=\frac{\hat{\beta}}{\hat{\sigma}}$ so,
$$var(z|H_A)=\frac{\hat{\sigma}^2+\omega}{\hat{\sigma}^2},$$
$$\omega'=\frac{\hat{\sigma}^2+\omega}{\hat{\sigma}^2}.$$
So the BF is just the difference of two normal distributions with mean 0. 

$$pp_i=\frac{BF_i}{\sum_{j=1}BF_j}$$

Nb, $\hat{\sigma}$ is obtained using the variance function in coloc. It is a vector with length equal to nsnps.

---

8. For the chosen threshold, obtain the credible set using the normal method and report the size (claimed coverage) and whether the CV is present. At this stage, we have 100 lists of $500*2$ dataframes, the claimed.cov column is the size of the credible set obtained from the simulation and the covered column is whether the CV is in the credible set. There are 500 rows for each $Z_M^*$ sample.

9. Use these claimed coverage values as predictors in the GAM model to predict true coverage. One gam model per consideration of each SNP being causal (so 100).

10. Use each of these to predict the corrected coverage of the original claimed coverage (claim0).

11. The final estimate of corrected coverage is the weighted sum of these with the true pp.

----

### Practical method (all in gam folder)

(Note that for all this GAM stuff, nsnps is fixed at 100).

1. Use the data0code.R file to generate lots of original data (submit as an array job and get dataaXX.RDS out). Each of these files contain the CV, iCV, maf, LD matrix, snps, freq (reference genotype matrix) and threshold.

2. Use the dataatoresults.R file to implement the method on these dataa files (submit as an array job to take in each of the dataXX.RDS files). The output is resultsXX.RDS files saved in results directory. Each of these is a 1*3 dataframe with the true coverage, the claimed coverage and the corrected coverage. 

3. Use the resultsXXtoresults.R file to get summary of results from these (rbind each resultsXX file). 

OR 

1. Use allin1.R rcode to go straight to resultsXX.RDS files.

```{r echo=FALSE}
results <- readRDS("/Users/anna/results.RDS")
results
```


```{r echo=FALSE, warning=FALSE, comment=FALSE}
plot <- ggplot(data=results,aes(x=Corrected.cov,y=True.cov))+geom_point()+geom_abline(a=0,b=1,)
plot <- plot+geom_smooth()+ylab("Empirical Coverage")+xlab("Corrected Coverage")+ggtitle("Empirical and Corrected Coverage for \n GAM Method with Unknown CV")+ylim(0.55,1)+xlim(0.65,1)

plot1 <- ggplot(data=results, aes(x=Claimed.cov,y=True.cov))+geom_point()+geom_abline(a=0,b=1,)
plot1 <- plot1+geom_smooth()+ylab("Empirical Coverage")+xlab("Claimed Coverage")+ggtitle("Empirical and Claimed Coverage for \n GAM Method with Unknown CV")+ylim(0.6,1)+xlim(0.6,1)

library(cowplot)
plot_grid(plot+ theme(plot.title = element_text(size=12)), plot1+ theme(plot.title = element_text(size=12)), labels = "AUTO",align = 'v',scale=0.8)
```

---

### Discussion

* If we know the CV, then we can simulate approximately equivalent realisations and get a very accurate corrected coverage. We need to find a way to simulate more accurate realisations when the CV is not known. 

* We take the observed z at each SNP as the best estimate of the true z if that SNP were causal. We then simulate zstar about this. 

* Using all the simulated z scores generated from $MVN(E(Z_m),\Sigma)$ for each SNP being causal presents problems. If that SNP which is being considered as causal has low posterior probability, then it has a z score close to 0. This means that when we simulate around this, we get vectors of z all distributed about 0. When we convert these to posterior probabilities, the system produced will be very flat and not really reflect that posterior probability system we are trying to correct at all. 

    **Low z score --> Simulate zstars about 0 --> Give very flat posterior probability systems --> Larger credible sets.**

* This means that if we use too small an effect size, the simulations are not equivalent. Instead, we tried scaling by maximum effect size ```zstar <- zstar * max(abs(zsim))/max(abs(zstar))```, but found these to not be equivalent either. 

* We therefore need to find a way to simulate better z or just select appropriate z from those that we simulate. 

---

### Filter the z that we simulate

* We only want to keep the z simulations that are representative of the true z. Could filter out those which are not representative using Euclidean or Mahalanobis distance from the 'true z'/ the mean of the simulations/the z scores from the system we are correcting?

* The Euclidean distance is defined as,
$$D_E=\sqrt{\sum_{i=1}^n(q_i-p_i)^2}.$$

* It is basically the distance between two points, but is skewed when the points are measured in different units. 

* The Mahalanobis distance is defined as, $$D_M(\mathbf{x})=\sqrt((\mathbf{x}-\mathbf{u})^T\Sigma^{-1}(\mathbf{x}-\mathbf{u})).$$

* Where the $\mathbf{x}$ are the observations and the $\mathbf{u}$ are the means.

* This method will therefore transform z scores into a zero mean vector and then quantify how representative our simulations are of the 'true z'/the mean(?). We can then filter out those which are 'too far' from the true z. 

* Mahalanobis distance seems more appropriate as it takes into account the correlations between variables, although it does require inverting the covariance matrix (LD) which is computationally difficult.

---

### Questions and Ideas

* Correct y=x line after transforming x axis in first simulations? Are these correct?

* Can we filter the z values based on the similarity of the entropy of the true system and the entropy of the posterior probability system they reflect.

* For the 'we know the CV method', should I be using a LOO GAM method or the method used for when we don't know the CV of predicting from the claimed coverage of the credible set we are trying to correct? At the moment, the empirical coverage is the mean of the covered column, the corrected coverage is the single value obtained when using the LOO GAM method and the claimed coverage is the mean of the size of the credible sets. Also not sure my error bars are correct?

---

