Contents

  1. Correction with Known \(\mu\)
  2. Separating Signals
  3. Empirical Coverage Proportions
  4. Possible Correction: Logit Proportions
  5. Using Reference Panel for Correction
  6. Summary of Findings
  7. Next Steps and Questions

1. Correction with Known \(\mu\)



UK10K Data

Number of simulations making up each bin:

## 
##    (0,1]    (1,2]  (2,2.5]  (2.5,3]    (3,4] (4,4.75] (4.75,6]    (6,8] 
##      262      453      264      310      665      436      665      666 
##   (8,13]  (13,16] (16,Inf] 
##      680      254      270
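
These counts come from binning the simulations on the \(\mu\) value and tabulating. A minimal sketch, assuming a results data frame res with a true.mu column (names hypothetical), using the bin edges visible in the output above:

breaks <- c(0, 1, 2, 2.5, 3, 4, 4.75, 6, 8, 13, 16, Inf)  # bin edges as in the output above
res$mu_bin <- cut(res$true.mu, breaks = breaks)           # assign each simulation to a bin
table(res$mu_bin)                                         # number of simulations per bin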

1000 Genomes Data

Now for the high/medium/low LD regions using 1000 Genomes data:

  • The number of simulations making up each bin:
##         
##          (0,1] (1,2] (2,2.5] (2.5,3] (3,4] (4,4.75] (4.75,6] (6,8] (8,13]
##   low       35   299     126      92   215      168      200   158    223
##   high      23   265     171      68   205      153      187   176    259
##   medium    28   257     181      94   189      189      189   169    250
##         
##          (13,16] (16,Inf]
##   low         44       94
##   high        33      104
##   medium      32      115

Findings: Our correction method does not work very well on medium and high LD regions using 1000 Genomes data. It is the method itself that is dodgy: even when using the true \(\mu\) value (ze[iCV], the expected Z score at the CV) in our correction method, the variability is large, i.e. the problem is not the noisy estimate of \(\mu\).


Why is it bad in high LD?

  • Notice that our usual correction tends to be larger than the correction obtained when using the true \(\mu\) value.


  • “Is the bias down more obvious when the true CV is in high LD with other SNPs or lower?”

  • To investigate this I look at the error of the corrected coverage estimate (corrcov - true.cov) against some measure of whether the CV is in high/low LD with other SNPs (a sketch of how these proxies can be computed is given after this list). The proxies I use for this measure are:

  1. The mean \(r^2\) of the CV with all the other SNPs
  2. The number of SNPs with an \(r^2\) value greater than 0.2 with the CV
  • As these values increase on the \(x\) axis, we expect to see some correlation with the error. For example, if the CV is in a very high LD region then the estimate is worse.

  • These look unbiased; instead, let’s look only at the cases where our correction is biased in high LD but not in low LD (which, by eye, is when z0[iCV]>4).

  • These results show that our corrected coverage estimate is too low when the CV is highly correlated with other SNPs in the region: there is a slight correlation between the number of SNPs the CV is correlated with in the region and the performance of the correction, but it is not strong.

  • These plots show that there is a correlation between the error of our estimate and whether the CV is in a high LD sub-region (i.e. the error does not just depend on the background LD of the region). This allows us to separate out region effects from CV effects.
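
A minimal sketch of how the two LD proxies above could be computed for one simulated region, assuming a squared-correlation (\(r^2\)) matrix LD for the region and the index iCV of the CV (both names hypothetical):

# r^2 of the CV with every other SNP in the region
r2_cv <- LD[-iCV, iCV]

meanr2 <- mean(r2_cv)        # proxy 1: mean r^2 of the CV with all other SNPs
r20.2  <- sum(r2_cv > 0.2)   # proxy 2: number of SNPs with r^2 > 0.2 with the CV

# across simulations, these proxies are then plotted against the error of the
# corrected coverage estimate, e.g. plot(meanr2_values, corrcov - true.cov)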


2. Separating Signals


Problem: Our correction method is not as accurate in regions of high LD.

Ideas:


##     mu_obs    muhat   true   claimed corrected
## 1 18.66619 18.65872 0.9812 0.9574063 0.9140482


Effect of LD and effect size on separability


Let’s now consider the effect of LD and effect size on the separability (i.e. how well the true signal can be picked out).

Possible measures of LD:

  • Max \(r^2\) between the CV and any other SNP (maxr2)
  • Average \(r^2\) value of CV with all other SNPs (meanr2)
  • Median \(r^2\) value of CV with all other SNPs (medianr2)
  • Number of SNPs in LD with the CV, i.e. with \(r^2\) above some threshold (r20.2 or r20.1 etc.)

Possible measures of separability:

  • The number of variants in the credible set (nvar_cs)
  • The PP at the CV (will be larger if the CV has been well separated) (PP_iCV)

To visualise this, I plot \(\mu\) against the LD for many simulations, with contours added reflecting the separability.
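
As a rough sketch of such a plot, assuming a data frame res with one row per simulation and columns true.mu, maxr2 and PP_iCV (names hypothetical; binned means are shown here in place of true contour lines):

library(ggplot2)

ggplot(res, aes(x = maxr2, y = true.mu)) +
  stat_summary_2d(aes(z = PP_iCV), bins = 30) +   # mean separability (PP at CV) in each cell
  labs(x = "max r^2 between the CV and any other SNP",
       y = expression(mu),
       fill = "mean PP at CV")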

The plot below shows that the PP at the CV is much higher in low LD than in high LD for the same effect size. The relationship with \(\mu\) also does not look linear.


I then take vertical slices of the plots on the right hand side (for low, medium and high LD) to investigate the relationship between the value of \(\mu\) and the seperability of the signals. We are looking to see if there is a way we can model this relationship.



  • It is perhaps better if we look at the logit of the separability.

  • For this I focus on the “max \(r^2\) with the CV” LD measure.

Method (a code sketch of these steps is given below):

  1. Add a logit(PP_iCV) column to the original data (removing any Inf rows)
  2. Fit a loess model: loess(logit_PP_iCV ~ maxr2 * true.mu, data = data)
  3. Generate a sequence of predictor values (for maxr2 and true.mu)
  4. Predict the corresponding values of logit_PP_iCV
  5. Convert the results to long data table format (melt function)
  6. Select the rows of the results corresponding to an LD bin (e.g. 0.3<maxr2<0.35)
  7. Plot logit_PP_iCV against true.mu
  • This looks promising as the relationship now looks almost linear (note the differing axis values; this is not the \(y=x\) line).

  • I think this shows that the penalty for an estimate being too high is now the same as the penalty for an estimate being too low…
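
A simplified sketch of the method above, assuming a data frame data with columns PP_iCV, maxr2 and true.mu (grid ranges and the LD bin are illustrative; the grid here is built in long format directly, so no melt step is needed):

# 1. add logit(PP_iCV), removing rows where the logit is +/-Inf (PP_iCV of exactly 0 or 1)
data$logit_PP_iCV <- log(data$PP_iCV / (1 - data$PP_iCV))
data <- data[is.finite(data$logit_PP_iCV), ]

# 2. fit a loess surface over the LD measure and the true effect
fit <- loess(logit_PP_iCV ~ maxr2 * true.mu, data = data)

# 3./4. generate a grid of predictor values and predict logit_PP_iCV on it
grid <- expand.grid(maxr2 = seq(0.05, 0.95, by = 0.05),
                    true.mu = seq(1, 16, by = 0.5))
grid$logit_PP_iCV <- predict(fit, newdata = grid)

# 5./6. keep one LD bin, e.g. 0.3 < maxr2 < 0.35
slice <- grid[grid$maxr2 > 0.3 & grid$maxr2 < 0.35, ]

# 7. plot the predicted logit_PP_iCV against true.mu for that slice
plot(slice$true.mu, slice$logit_PP_iCV, type = "l",
     xlab = "true.mu", ylab = "logit(PP at CV)")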



3. Empirical Coverage Proportions




4. Possible Correction: Logit Proportions


Explanation: In our correction, we are averaging something that has an upper bound of 1 (the empirical coverage proportions), where lots of the data points lie close to the upper bound. This means that it is “easier” for the empirical coverage proportions to be too low than too high, and lower estimates are more detrimental than higher estimates. This may explain why our corrected coverage estimate is consistently too low in high powered scenarios.

A potential fix would be to take the logit of the prop_cov quantities used in the final corrected coverage estimate weighting, so that they are no longer bounded.

Our final corrected coverage estimate is then given by:

logit <- function(x) log(x/(1-x))            # map (0,1) proportions to the real line
invlogit <- function(x) exp(x)/(1 + exp(x))  # map back from the real line to (0,1)

# average the logit-transformed coverage proportions, weighted by the posterior probabilities pp0
# (prop_cov values of exactly 0 or 1 give -Inf/Inf and would need removing or capping)
corrcov_tmp <- sum(pp0 * logit(prop_cov))
# back-transform to the probability scale to give the corrected coverage estimate
corrcov <- invlogit(corrcov_tmp)


5. Using Reference Panel for Correction



Method:

  1. Select 200 SNPs from UK10K data
  2. Find the corresponding SNPs (by their genomic position) in the 1000 Genomes data
  3. Remove SNPs from the analysis if they cannot be found or if they map to multiple rows in the 1000 Genomes data (removes ~2-5 SNPs on average)
  4. Find the haplotypes for the remaining SNPs in the UK10K data and use simGWAS to generate the system to be corrected
  5. Find the corrected coverage using the “true LD” from the UK10K haplotypes
  6. Find the corrected coverage using the LD estimated from the 1000 Genomes reference panel
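
A rough sketch of the matching and LD steps (2, 3, 5 and 6), assuming position vectors uk10k_pos / kg_pos and haplotype matrices uk10k_haps / kg_haps whose columns are ordered to match those positions (all names hypothetical):

# 2./3. keep only UK10K SNPs whose position appears exactly once in the 1000 Genomes data
dup_pos <- kg_pos[duplicated(kg_pos)]
keep    <- uk10k_pos %in% kg_pos & !(uk10k_pos %in% dup_pos)
hits    <- match(uk10k_pos[keep], kg_pos)

# 5./6. SNP correlation (LD) matrices from the two haplotype sets
ld_true <- cor(uk10k_haps[, keep])   # "true" LD from the UK10K haplotypes
ld_ref  <- cor(kg_haps[, hits])      # LD estimated from the 1000 Genomes reference panel

# the corrected coverage is then computed twice, once with ld_true and once with ld_ref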


The number of simulations in each bin:

## 
##    (0,1]    (1,2]  (2,2.5]  (2.5,3]    (3,4] (4,4.75] (4.75,6]    (6,8] 
##      237      390      259      277      565      461      615      575 
##   (8,13]  (13,16] (16,Inf] 
##      665      238      258

Side note: Why does the standard corrected coverage estimate not look as good as usual? Perhaps because these simulations are smaller, or because we are trimming SNPs that are not in the 1000 Genomes data?


6. Summary of Findings



7. Next Steps and Questions


From Upgrading

  • Surely an ‘MAF’ cannot be greater than 0.5, so why in our simulations do we select a CV with 0.05<MAF<0.95? We’ve been lazy with terminology; we’re actually looking at the allele frequencies.

  • \(\mu\) is the Z score at the CV; Paul said that this is not equivalent to the true effect. Specifically, is the Z score the true effect only if both the outcome and the predictor are normalised? He’s talking about the effect being \(\beta\), but we are using Z as a measure of effect (which is obviously not the same as \(\beta\)) - we need to be careful with terminology in the paper!

  • Paul and Jenn thought it was confusing to introduce PP’ (including the null) to derive \(\hat\mu\) but then go back to normal PPs for the correction - can we really not just use max(abs(Z)) for \(\mu\)?

  • Does our method work with linear effects? Paul said we should show this off.

  • Empirical PP calibration plot - too good to be true?

    • They suggested that I try to break this, to understand why the posterior probabilities can be such a perfect predictor of the probability of being the CV.

  • It was suggested that the bias could be due to the use of ABFs. To investigate this, I could use model-based BFs (which are more accurate than ABFs), such as the ones used in JAM and CAVIAR. I can talk to Paul about using these (although they would need to be transformed for use on logistic data).