Our estimate for \(\mu\) is unbiased but has a fairly large variance.
I have preliminary results showing that our method of averaging results over all SNPs works when the true effect at the CV (\(\mu\)) is known and used in \(Z_j\).
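To make "the true \(\mu\) used in \(Z_j\)" concrete, here is a minimal sketch assuming the standard single-CV construction, in which the expected Z score at SNP \(j\) is \(\mu\) times its (signed) correlation with the CV - so `ze[iCV]` recovers \(\mu\) itself. The haplotype matrix and variable names here are made up for illustration:

```r
# Toy genotype-like matrix: 200 samples x 5 SNPs (made-up data)
set.seed(1)
nsnps <- 5
X <- matrix(rnorm(200 * nsnps), ncol = nsnps)
rho <- cor(X)   # SNP correlation (signed LD) matrix

iCV <- 2        # index of the causal variant
mu  <- 5        # true effect (expected Z score) at the CV

# expected Z score at every SNP, given the effect at the CV
ze <- mu * rho[, iCV]
```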
I check these results using UK10K data, and separately using low, medium and high LD regions from the 1000 Genomes data.
These results are in the 24july/knownmu/ directory.
Number of simulations making up each bin:
##
## (0,1] (1,2] (2,2.5] (2.5,3] (3,4] (4,4.75] (4.75,6] (6,8]
## 262 453 264 310 665 436 665 666
## (8,13] (13,16] (16,Inf]
## 680 254 270
Now for the high/medium/low LD regions using 1000 Genomes data:
##
## (0,1] (1,2] (2,2.5] (2.5,3] (3,4] (4,4.75] (4.75,6] (6,8] (8,13]
## low 35 299 126 92 215 168 200 158 223
## high 23 265 171 68 205 153 187 176 259
## medium 28 257 181 94 189 189 189 169 250
##
## (13,16] (16,Inf]
## low 44 94
## high 33 104
## medium 32 115
Findings: Our correction METHOD does not work very well on medium and high LD regions using 1000 Genomes data. It is the method itself that is dodgy: even when the true \(\mu\) value (`ze[iCV]`, the expected Z score at the CV) is used in the correction method, the variability is large. I.e. the problem is not the noisy estimate of \(\mu\).
“Is the downward bias more obvious when the true CV is in high LD with other SNPs, or in lower LD?”
To investigate this I look at the error of the corrected coverage estimate (corrcov - true.cov) against some measure of whether the CV is in high/low LD with other SNPs. The proxies I use for this measure are:
These look unbiased; instead, let's look only at the cases where our correction is biased in high LD but not low LD (which by eye is when `z0[iCV] > 4`).
These results show that our corrected coverage estimate is too low when the CV is highly correlated with other SNPs in the region. –> slight correlation between the CV being correlated to lots of SNPs in the region and the performance of the correction, but not too much.
These plots show that there is a correlation between the error of our estimate and whether the CV is in a high LD sub-region (i.e. the error doesn't just depend on the background LD of the region). This allows us to separate out region effects from CV effects.
Problem: Our correction method is not as accurate in regions of high LD.
Ideas:
As LD increases, so does the difficulty in separating out signals.
Higher values of \(\mu\) help to separate out the true effect, as the separability between the CV and the other SNPs is increased.
Factors affecting separability: LD, power (sample size and MAFs).
## mu_obs muhat true claimed corrected
## 1 18.66619 18.65872 0.9812 0.9574063 0.9140482
Let's now consider the effect of LD and effect size on the separability (i.e. how well the true signal can be picked out).
Possible measures of LD:

- `maxr2`
- `meanr2`
- `medianr2`
- `r20.2` or `r20.1` etc.

Possible measures of separability:

- `nvar_cs`
- `PP_iCV`

To visualise this, I plot \(\mu\) against the LD for many simulations, with contours added reflecting the separability.
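The LD measures listed above reduce to simple summaries of the r² values between the CV and the other SNPs in the region. A minimal sketch on made-up r² values, under my reading of the names (e.g. I take `r20.2` to be the count of SNPs with r² > 0.2 with the CV):

```r
# made-up r^2 values between the CV and the other 99 SNPs in the region
set.seed(1)
r2 <- runif(99)^2

maxr2    <- max(r2)       # strongest proxy for the CV
meanr2   <- mean(r2)      # average LD with the CV
medianr2 <- median(r2)
r20.2    <- sum(r2 > 0.2) # number of SNPs in reasonably high LD with the CV
r20.1    <- sum(r2 > 0.1)
```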
The plot below shows that the PP at the CV is much higher in low LD than in high LD for the same effect. It also does not look linear in \(\mu\)…
I then take vertical slices of the plots on the right hand side (for low, medium and high LD) to investigate the relationship between the value of \(\mu\) and the separability of the signals. We are looking to see if there is a way we can model this relationship.
It is perhaps better if we look at the logit of the seperability.
For this I focus on the “max \(r^2\) with the CV” LD measure.
Method:

1. Add a `logit(PP_iCV)` column to the original data (removing any `Inf` rows).
2. Fit `loess(logit_PP_iCV ~ maxr2 * true.mu, data = data)`, a smooth surface in `maxr2` and `true.mu`.
3. Take vertical slices of the fit (e.g. `0.3 < maxr2 < 0.35`).

This looks promising as the relationship now looks almost linear (note the differing axis values - this is not the \(y=x\) line).
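A minimal R sketch of this fitting step, on made-up data (the column names `maxr2`, `true.mu` and `PP_iCV` follow the method above; the data-generating model is purely illustrative):

```r
logit <- function(x) log(x / (1 - x))

# made-up simulation results: separability (PP at the CV), LD and true effect
set.seed(1)
n <- 500
data <- data.frame(maxr2 = runif(n), true.mu = runif(n, 2, 8))
data$PP_iCV <- plogis(-2 + 1.2 * data$true.mu - 3 * data$maxr2 + rnorm(n, 0, 0.5))

# step 1: add the logit column, removing any Inf rows
data$logit_PP_iCV <- logit(data$PP_iCV)
data <- data[is.finite(data$logit_PP_iCV), ]

# step 2: smooth surface in maxr2 and true.mu
fit <- loess(logit_PP_iCV ~ maxr2 * true.mu, data = data)

# step 3: vertical slice at roughly fixed LD, to see logit(separability) vs true effect
slice <- subset(data, maxr2 > 0.3 & maxr2 < 0.35)
```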
I think this shows that the penalty for an estimate being too high is now the same as the penalty for an estimate being too low…
In high LD it is harder to separate out the signals, and a larger effect size at the CV may help with this. This is why, in our proportion-covered plots, lots of these proportions are lower than the threshold (the PP isn't allocated to the CV correctly - it may be spread amongst other variants in high LD).
To investigate this, let's plot the proportion of simulated credible sets (where the CV is assumed causal) that contain the CV (`propcov_cv`) against our different measures of LD discussed above.
Indeed, my results show that there is a negative relationship between the amount of LD and the empirical coverage proportions where the true CV is assumed causal –> thus our corrected coverage estimate is consistently too low in high LD regions.
Explanation: In our correction, we are averaging something that has an upper bound of 1 (the empirical coverage proportions), where lots of the data points lie close to the upper bound. This means that it is “easier” for the empirical coverage proportions to be too low than too high, and lower estimates are more detrimental than higher estimates. This may explain why our corrected coverage estimate is consistently too low in high powered scenarios.
A potential fix would be to take the logit of the `prop_cov` quantities used in the final corrected coverage estimate weighting, so that they are no longer bounded.
Our final corrected coverage estimate is then given by:

```r
logit <- function(x) log(x/(1 - x))
invlogit <- function(x) exp(x)/(1 + exp(x))

# weight the logit-scale coverage proportions by the posterior probabilities,
# then transform back to the probability scale
corrcov_tmp <- sum(pp0 * logit(prop_cov))
corrcov <- invlogit(corrcov_tmp)
```
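As a toy numeric check of this logit-scale weighting (the `pp0` weights and coverage proportions below are made up), the logit-averaged estimate sits slightly above the plain weighted average; since the logit is convex on (0.5, 1), Jensen's inequality guarantees this whenever all the proportions are above 0.5:

```r
logit <- function(x) log(x / (1 - x))
invlogit <- function(x) exp(x) / (1 + exp(x))

# made-up posterior weights (summing to 1) and empirical coverage proportions
pp0 <- c(0.5, 0.3, 0.2)
prop_cov <- c(0.95, 0.90, 0.85)

plain   <- sum(pp0 * prop_cov)                   # standard weighted average
corrcov <- invlogit(sum(pp0 * logit(prop_cov)))  # logit-scale version
```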
UK10K haplotypes (7562 of these) for Z scores and MAFs, 1000 Genomes haplotypes (1005 of these) for LD.
For each simulation, I calculate the corrected coverage using the “true UK10K” haplotypes and also using 1000 Genomes reference haplotypes.
Method:

- `simGWAS` to generate the system to be corrected

The number of simulations in each bin:
##
## (0,1] (1,2] (2,2.5] (2.5,3] (3,4] (4,4.75] (4.75,6] (6,8]
## 237 390 259 277 565 461 615 575
## (8,13] (13,16] (16,Inf]
## 665 238 258
Side note: Why does the standard corrected coverage estimate not look as good as usual? Perhaps because these simulations are smaller - or because we are trimming SNPs not in the 1000 Genomes data?
Correcting with known \(\mu\)
Results are approximately as accurate as when using our estimate for \(\mu\).
Our corrected coverage estimate struggles in high LD due to the correction METHOD (not the estimate for \(\mu\)).
Investigating the poor performance of our correction in high LD
In higher powered scenarios where the correction is biased in high LD but not in low LD (`z0[iCV] > 4`), there is a slight correlation between the error and whether the CV is in high LD with the other SNPs –> the method struggles more when the CV is in a high LD sub-region.
As the LD of the region increases, so does the difficulty in separating out the signals, and higher power is needed to separate them. Taking vertical slices (separability vs. true effect) of the contour plots showed that the relationship is not too far off logistic –> plotting logit(separability) against true effect gives an almost linear relationship –> SO WHAT? Is this why we should try the `logit(prop_cov)` correction method?
I have found that there is a negative relationship between the proportion of simulated credible sets that contain the CV (assuming the true CV is the CV) and the LD of the region –> supporting our results that the corrected coverage estimate is consistently too low in high LD regions, and potentially honing in on where and why our correction struggles in high LD. This links to the above point that as the LD increases, so does the difficulty of separating out the signals.
We are taking the average of something bounded (the empirical coverage proportions) which may be problematic –> convert these proportions to the logit scale (unbounded) and remember to un-logit for the final correction –> results are basically the same as our standard correction.
Using reference panel for LD
Results are similar to using the “true LD”, but slightly worse in higher powered scenarios.
Will revisit this when we have the correction working for high LD/1000 Genomes data (e.g. using `logit(prop_cov)` in the correction).

Surely an 'MAF' cannot be greater than 0.5 –> why in our simulations do we select a CV with `0.05 < MAF < 0.95`? We've been lazy with terminology; we're actually looking at the allele frequencies.
\(\mu\) is the Z score at the CV; Paul said that this is not equivalent to the true effect. Specifically, the Z score is the true effect only if both the outcome and the predictor are normalised? He's talking about the effect being \(\beta\), but we are using Z as a measure of effect (which is obviously not the same as \(\beta\)) - need to be careful with terminology in the paper!
Paul and Jenn thought it was confusing to introduce PP’ (including the null) to derive \(\hat\mu\) but then go back to normal PPs for the correction - can we really not just use max(abs(Z)) for \(\mu\)?
Does our method work with linear effects? Paul said we should show this off.
Empirical PP calibration plot - too good to be true?