15th July

I have found that our correction method struggles in high-LD regions, which may be due to our estimate of \(\mu\).
Interestingly, in the T1D results there seems to be a negative correlation between the corrected coverage value and \(\mu\). This reflects our simulation results: in high-powered scenarios, our correction consistently estimates the corrected coverage to be too low.
I re-ran the simulations, saving the 4 estimates of \(\mu\) that we considered.
These simulations are in PLOS/mus/
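
For my own reference, a minimal sketch of how two candidate estimates of \(\mu\) could be computed from Z scores. These notes do not list the four estimates, so the two below (the maximum absolute Z score, and the absolute Z scores weighted by posterior probabilities) are illustrative assumptions and may not match what was saved in the simulations. The posterior probabilities assume a single causal variant, equal priors and Wakefield's approximate Bayes factors; the helper name `pp_from_z`, the prior variance `W = 0.2^2` and the toy inputs are all hypothetical.

```r
# Posterior probabilities of causality under a single-causal-variant assumption,
# using Wakefield's approximate Bayes factors with equal prior weight per SNP.
# z: marginal Z scores; V: variances of the effect estimates;
# W: prior variance of the effect size (0.2^2 is an assumed default).
pp_from_z <- function(z, V, W = 0.2^2) {
  r <- W / (V + W)
  labf <- 0.5 * (log(1 - r) + r * z^2)   # log approximate Bayes factors
  w <- exp(labf - max(labf))             # subtract the max for numerical stability
  w / sum(w)
}

z <- c(1.2, 4.8, 4.1, 0.3, -2.0)   # toy Z scores, not real data
V <- rep(0.01, length(z))          # toy variances, not real data

pp <- pp_from_z(z, V)

mu_max      <- max(abs(z))         # candidate 1: largest absolute Z score
mu_weighted <- sum(abs(z) * pp)    # candidate 2: posterior-weighted absolute Z score
```

Because the weights sum to one, the weighted estimate can never exceed the maximum absolute Z score, so comparing the two across power settings may show where they diverge.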

In the real world setting, researchers have Z scores, MAFs and sample sizes from their study. They may then use a reference panel (such as the 1000 Genomes project) to get an estimate for the LD patterns between their SNPs.
To see how our method performs in this setting, I use UK10K data to simulate the results of a typical study (Z scores, MAFs and sample sizes) and then use the 1000 Genomes data to get an estimate of the LD.
This analysis is on the HPC in PLOS/ref_1000g/use_refdata.R
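
A rough sketch of the idea, not the code in use_refdata.R: the haplotype matrices here are randomly generated stand-ins for the UK10K (study) and 1000 Genomes (reference) panels. Marginal Z scores are drawn from a multivariate normal with mean \(\mu \Sigma_{\cdot, cv}\) and covariance \(\Sigma\), where \(\Sigma\) is the study LD matrix. The single causal variant `cv`, the value of `mu` and the helpers `simulate_haps` and `rmvn1` are all assumptions for illustration.

```r
set.seed(1)

# Toy haplotypes (rows = haplotypes, columns = SNPs) with AR(1)-style correlation.
# Hypothetical stand-in for real phased reference data.
simulate_haps <- function(n_hap, maf, rho = 0.8) {
  n_snp <- length(maf)
  g <- matrix(rnorm(n_hap * n_snp), n_hap, n_snp)
  for (j in seq_len(n_snp)[-1]) {
    g[, j] <- rho * g[, j - 1] + sqrt(1 - rho^2) * g[, j]
  }
  # Threshold each column so the minor allele appears with frequency ~ maf[j]
  sweep(g, 2, qnorm(1 - maf), FUN = ">") * 1L
}

# One draw from N(mean, sigma) via an eigendecomposition (base R only)
rmvn1 <- function(mean, sigma) {
  e <- eigen(sigma, symmetric = TRUE)
  as.numeric(mean + e$vectors %*% (sqrt(pmax(e$values, 0)) * rnorm(length(mean))))
}

maf        <- runif(20, 0.05, 0.5)
study_haps <- simulate_haps(2000, maf)   # plays the role of UK10K
ref_haps   <- simulate_haps(500, maf)    # plays the role of 1000 Genomes

ld_study <- cor(study_haps)   # LD "derived" from the study data
ld_ref   <- cor(ref_haps)     # LD estimated from the reference panel

cv <- 10   # assumed index of the single causal variant
mu <- 5    # assumed true effect (expected Z score) at the causal variant
z  <- rmvn1(mu * ld_study[, cv], ld_study)

# What a researcher would hold: z, maf, sample sizes, and ld_ref in place of ld_study
max_ld_diff <- max(abs(ld_study - ld_ref))
```

The quantity `max_ld_diff` gives a crude sense of how far the reference LD is from the study LD before any coverage correction is attempted.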
The following plots are for the claimed and corrected coverage of UK10K based simulations firstly using the derived LD and secondly using the 1000 Genomes reference panel for the LD.

Note that these results are preliminary and the two sets of plots are not based on the same data. I will rerun this analysis so that the corrected coverage is calculated using both the real data and the reference data, in a way that lets the results be compared directly.

Chris has sent better code that I can use to simulate haplotypes in the function examples (and vignettes?) in my R package.

Why is the intercept not written in equation 1? Is the error term really normally distributed?
\(D_{-i} \perp \beta_i \mid D_i\) thing?
Any intuition as to why our estimate for \(\mu\) is pretty decent (but not for low \(\mu\))?
JAM, CAVIAR and FINEMAP seem to be concerned with writing the marginal summary statistics in a multivariate formulation, but state that individual-level data is required for the likelihood, so this needs to be approximated?
Rob's suggestion: “examine the effect of sampling from, and re-weighting the prior”. And “pinpointing the cause of the coverage bias and developing additional correction methods”.
Hi-C stuff, “I aim to specify W within a hierarchical framework that considers multiple baits simultaneously”.
Difference between CHi-C and HiChIP (something to do with a protein being used?)
HiChIP is a combination of Hi-C and ChIP-seq. It provides a protein-centric view of genome architecture.
Gantt chart - “Integrate information on T cell TF binding sites with peaky model”.