Contents

  1. Correction with Known \(\mu\)
  2. Separating Signals
  3. Empirical Coverage Proportions
  4. Possible Correction: Logit Proportions
  5. Using Reference Panel for Correction
  6. Summary of Findings
  7. Next Steps and Questions

1. Correction with Known \(\mu\)



UK10K Data

Number of simulations making up each bin:

## 
##    (0,1]    (1,2]  (2,2.5]  (2.5,3]    (3,4] (4,4.75] (4.75,6]    (6,8] 
##      262      453      264      310      665      436      665      666 
##   (8,13]  (13,16] (16,Inf] 
##      680      254      270
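
These counts come from binning the simulations on the \(\mu\) value and tabulating. A minimal sketch, assuming a results data frame res with a true.mu column (names hypothetical), using the bin edges visible in the output above:

breaks <- c(0, 1, 2, 2.5, 3, 4, 4.75, 6, 8, 13, 16, Inf)  # bin edges as in the output above
res$mu_bin <- cut(res$true.mu, breaks = breaks)           # assign each simulation to a bin
table(res$mu_bin)                                         # number of simulations per bin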

1000 Genomes Data

Now for the high/medium/low LD regions using 1000 Genomes data:

  • The number of simulations making up each bin:
##         
##          (0,1] (1,2] (2,2.5] (2.5,3] (3,4] (4,4.75] (4.75,6] (6,8] (8,13]
##   low       35   299     126      92   215      168      200   158    223
##   high      23   265     171      68   205      153      187   176    259
##   medium    28   257     181      94   189      189      189   169    250
##         
##          (13,16] (16,Inf]
##   low         44       94
##   high        33      104
##   medium      32      115

Findings: Our correction method does not work very well on medium and high LD regions using 1000 Genomes data. It is the method itself that is dodgy: even when using the true \(\mu\) value (ze[iCV], the expected Z score at the CV) in our correction method, the variability is large, i.e. the problem is not the noisy estimate of \(\mu\).


Why is it bad in high LD?

  • Notice that our usual correction tends to be larger than the correction obtained when using the true \(\mu\) value.


  • “Is the bias down more obvious when the true CV is in high LD with other SNPs or lower?”

  • To investigate this I look at the error of the corrected coverage estimate (corrcov - true.cov) against some measure of whether the CV is in high/low LD with other SNPs (a sketch of how these proxies can be computed is given after this list). The proxies I use for this measure are:

  1. The mean \(r^2\) of the CV with all the other SNPs
  2. The number of SNPs with an \(r^2\) value greater than 0.2 with the CV
  • As these values increase on the \(x\) axis, we expect to see some correlation with the error. For example, if the CV is in a very high LD region then the estimate is worse.

  • These look unbiased; instead, let’s look only at the cases where our correction is biased in high LD but not in low LD (which, by eye, is when z0[iCV]>4).

  • These results show that our corrected coverage estimate is too low when the CV is highly correlated with other SNPs in the region: there is a slight correlation between the number of SNPs the CV is correlated with in the region and the performance of the correction, but it is not strong.

  • These plots show that there is a correlation between the error of our estimate and whether the CV is in a high LD sub-region (i.e. the error does not just depend on the background LD of the region). This allows us to separate out region effects from CV effects.
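
A minimal sketch of how the two LD proxies above could be computed for one simulated region, assuming a squared-correlation (\(r^2\)) matrix LD for the region and the index iCV of the CV (both names hypothetical):

# r^2 of the CV with every other SNP in the region
r2_cv <- LD[-iCV, iCV]

meanr2 <- mean(r2_cv)        # proxy 1: mean r^2 of the CV with all other SNPs
r20.2  <- sum(r2_cv > 0.2)   # proxy 2: number of SNPs with r^2 > 0.2 with the CV

# across simulations, these proxies are then plotted against the error of the
# corrected coverage estimate, e.g. plot(meanr2_values, corrcov - true.cov)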


2. Separating Signals


Problem: Our correction method is not as accurate in regions of high LD.

Ideas:


##     mu_obs    muhat   true   claimed corrected
## 1 18.66619 18.65872 0.9812 0.9574063 0.9140482


Effect of LD and effect size on separability


Let’s now consider the effect of LD and effect size on the separability (i.e. how well the true signal can be picked out).

Possible measures of LD:

  • Max \(r^2\) between the CV and any other SNP (maxr2)
  • Average \(r^2\) value of CV with all other SNPs (meanr2)
  • Median \(r^2\) value of CV with all other SNPs (medianr2)
  • Number of SNPs in LD with the CV, i.e. with \(r^2\) above some threshold (r20.2 or r20.1 etc.)

Possible measures of separability:

  • The number of variants in the credible set (nvar_cs)
  • The PP at the CV (will be larger if the CV has been well separated) (PP_iCV)

To visualise this, I plot \(\mu\) against the LD for many simulations, with contours added reflecting the separability.
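
As a rough sketch of such a plot, assuming a data frame res with one row per simulation and columns true.mu, maxr2 and PP_iCV (names hypothetical; binned means are shown here in place of true contour lines):

library(ggplot2)

ggplot(res, aes(x = maxr2, y = true.mu)) +
  stat_summary_2d(aes(z = PP_iCV), bins = 30) +   # mean separability (PP at CV) in each cell
  labs(x = "max r^2 between the CV and any other SNP",
       y = expression(mu),
       fill = "mean PP at CV")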

The plot below shows that the PP at the CV is much higher in low LD than in high LD for the same effect size. The relationship with \(\mu\) also does not look linear.


I then take vertical slices of the plots on the right hand side (for low, medium and high LD) to investigate the relationship between the value of \(\mu\) and the seperability of the signals. We are looking to see if there is a way we can model this relationship.



  • It is perhaps better if we look at the logit of the separability.

  • For this I focus on the “max \(r^2\) with the CV” LD measure.

Method (a code sketch of these steps is given below):

  1. Add a logit(PP_iCV) column to the original data (removing any Inf rows)
  2. Fit a loess model: loess(logit_PP_iCV ~ maxr2 * true.mu, data = data)
  3. Generate a sequence of predictor values (for maxr2 and true.mu)
  4. Predict the corresponding values of logit_PP_iCV
  5. Convert the results to long data table format (melt function)
  6. Select the rows of the results corresponding to an LD bin (e.g. 0.3<maxr2<0.35)
  7. Plot logit_PP_iCV against true.mu
  • This looks promising as the relationship now looks almost linear (note the differing axis values; this is not the \(y=x\) line).

  • I think this shows that the penalty for an estimate being too high is now the same as the penalty for an estimate being too low…
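
A simplified sketch of the method above, assuming a data frame data with columns PP_iCV, maxr2 and true.mu (grid ranges and the LD bin are illustrative; the grid here is built in long format directly, so no melt step is needed):

# 1. add logit(PP_iCV), removing rows where the logit is +/-Inf (PP_iCV of exactly 0 or 1)
data$logit_PP_iCV <- log(data$PP_iCV / (1 - data$PP_iCV))
data <- data[is.finite(data$logit_PP_iCV), ]

# 2. fit a loess surface over the LD measure and the true effect
fit <- loess(logit_PP_iCV ~ maxr2 * true.mu, data = data)

# 3./4. generate a grid of predictor values and predict logit_PP_iCV on it
grid <- expand.grid(maxr2 = seq(0.05, 0.95, by = 0.05),
                    true.mu = seq(1, 16, by = 0.5))
grid$logit_PP_iCV <- predict(fit, newdata = grid)

# 5./6. keep one LD bin, e.g. 0.3 < maxr2 < 0.35
slice <- grid[grid$maxr2 > 0.3 & grid$maxr2 < 0.35, ]

# 7. plot the predicted logit_PP_iCV against true.mu for that slice
plot(slice$true.mu, slice$logit_PP_iCV, type = "l",
     xlab = "true.mu", ylab = "logit(PP at CV)")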



3. Empirical Coverage Proportions




4. Possible Correction: Logit Proportions


Explanation: In our correction, we are averaging something that has an upper bound of 1 (the empirical coverage proportions), where lots of the data points lie close to the upper bound. This means that it is “easier” for the empirical coverage proportions to be too low than too high, and lower estimates are more detrimental than higher estimates. This may explain why our corrected coverage estimate is consistently too low in high powered scenarios.

A potential fix would be to take the logit of the prop_cov quantities used in the final corrected coverage estimate weighting, so that they are no longer bounded.

Our final corrected coverage estimate is then given by:

logit <- function(x) log(x/(1-x))            # map (0,1) proportions to the real line
invlogit <- function(x) exp(x)/(1 + exp(x))  # map back from the real line to (0,1)

# average the logit-transformed coverage proportions, weighted by the posterior probabilities pp0
# (prop_cov values of exactly 0 or 1 give -Inf/Inf and would need removing or capping)
corrcov_tmp <- sum(pp0 * logit(prop_cov))
# back-transform to the probability scale to give the corrected coverage estimate
corrcov <- invlogit(corrcov_tmp)


5. Using Reference Panel for Correction



Method:

  1. Select 200 SNPs from UK10K data
  2. Find the corresponding SNPs (by their genomic position) in the 1000 Genomes data
  3. Remove SNPs from the analysis if they cannot be found or if they map to multiple rows in the 1000 Genomes data (removes ~2-5 SNPs on average)
  4. Find the haplotypes for the remaining SNPs in the UK10K data and use simGWAS to generate the system to be corrected
  5. Find the corrected coverage using the “true LD” from the UK10K haplotypes
  6. Find the corrected coverage using the LD estimated from the 1000 Genomes reference panel
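
A rough sketch of the matching and LD steps (2, 3, 5 and 6), assuming position vectors uk10k_pos / kg_pos and haplotype matrices uk10k_haps / kg_haps whose columns are ordered to match those positions (all names hypothetical):

# 2./3. keep only UK10K SNPs whose position appears exactly once in the 1000 Genomes data
dup_pos <- kg_pos[duplicated(kg_pos)]
keep    <- uk10k_pos %in% kg_pos & !(uk10k_pos %in% dup_pos)
hits    <- match(uk10k_pos[keep], kg_pos)

# 5./6. SNP correlation (LD) matrices from the two haplotype sets
ld_true <- cor(uk10k_haps[, keep])   # "true" LD from the UK10K haplotypes
ld_ref  <- cor(kg_haps[, hits])      # LD estimated from the 1000 Genomes reference panel

# the corrected coverage is then computed twice, once with ld_true and once with ld_ref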


The number of simulations in each bin:

## 
##    (0,1]    (1,2]  (2,2.5]  (2.5,3]    (3,4] (4,4.75] (4.75,6]    (6,8] 
##      237      390      259      277      565      461      615      575 
##   (8,13]  (13,16] (16,Inf] 
##      665      238      258

Side note: Why does the standard corrected coverage estimate not look as good as usual? Perhaps because these simulations are smaller, or because we are trimming SNPs that are not in the 1000 Genomes data?


6. Summary of Findings



7. Next Steps and Questions


From Upgrading

  • Surely an ‘MAF’ cannot be greater than 0.5, so why in our simulations do we select a CV with 0.05<MAF<0.95? We’ve been lazy with terminology; we’re actually looking at the allele frequencies.

  • \(\mu\) is the Z score at the CV; Paul said that this is not equivalent to the true effect. Specifically, is the Z score the true effect only if both the outcome and the predictor are normalised? He’s talking about the effect being \(\beta\), but we are using Z as a measure of effect (which is obviously not the same as \(\beta\)) - we need to be careful with terminology in the paper!

  • Paul and Jenn thought it was confusing to introduce PP’ (including the null) to derive \(\hat\mu\) but then go back to normal PPs for the correction - can we really not just use max(abs(Z)) for \(\mu\)?

  • Does our method work with linear effects? Paul said we should show this off.

  • Empirical PP calibration plot - too good to be true?

    • They suggested that I try to break this, to understand why the posterior probabilities can be such a perfect predictor of the probability of being the CV.

  • It was suggested that the bias could be due to the use of ABFs. To investigate this, I could use model-based BFs (which are more accurate than ABFs), such as the ones used in JAM and CAVIAR. I can talk to Paul about using these (although they would need to be transformed for use on logistic data).