14th August

1. UK10K Reference Panel

Aim: Investigate how the UK10K data performs as a reference panel for the 1000 Genomes data.

NOTE: The “low” LD region comes from Africans, so do not attempt to use the UK10K reference panel to correct it; focus on the medium and high LD regions instead. This also explains why the MAFs do not match between the original data (1000G) and the reference data (UK10K) in the low LD region.

Many of the 1000G SNPs are not found in the UK10K data (matched by position).

Medium: 1000G haps = 1006, UK10K (ref) haps = 7562, 1000G nsnps = 706, UK10K (ref) nsnps = 578.
- Lose 128 SNPs
High: 1000G haps = 1006, UK10K (ref) haps = 7562, 1000G nsnps = 718, UK10K (ref) nsnps = 416.
- Lose 302 SNPs
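
The counts above come from matching SNPs by position. A minimal sketch of that bookkeeping, assuming two hypothetical integer vectors of base-pair positions (`pos_1000g` and `pos_uk10k`; the values are made up for illustration and are not the real data):

```r
# Hypothetical base-pair positions; not the real data.
pos_1000g <- c(1001L, 1050L, 1100L, 1175L, 1200L)
pos_uk10k <- c(1001L, 1100L, 1200L, 1300L)

# A 1000G SNP is kept if its position also appears in the reference panel.
keep <- pos_1000g %in% pos_uk10k

n_matched <- sum(keep)    # SNPs retained
n_lost    <- sum(!keep)   # SNPs dropped

# The same arithmetic applied to the counts reported above.
lost_medium_n <- 706 - 578
lost_high_n   <- 718 - 416
```

Matching on position alone does not check that the alleles agree between the two data sets, so this may be worth verifying separately.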

Why are so many SNPs dropped?

Perhaps the SNPs that we lose are absent from the UK10K data because they are prevalent in Africans (and thus appear in the 1000 Genomes data). If this were the case, we would expect the lost SNPs to have smaller allele frequencies in the 1000 Genomes data that we use.

# Each list holds the 1000G data for SNPs matched to UK10K ($matched) and SNPs lost ($unmatched)
lost_high <- readRDS("lost_high.RDS")
lost_medium <- readRDS("lost_medium.RDS")

# Per-SNP column means (allele frequencies if the data are coded 0/1), matched vs lost
boxplot(colMeans(lost_medium$matched), colMeans(lost_medium$unmatched), names = c("Matched", "Lost"), main = "Medium LD (578 SNPs matched, 128 lost)", ylab = "MAF")

boxplot(colMeans(lost_high$matched), colMeans(lost_high$unmatched), names = c("Matched", "Lost"), main = "High LD (416 SNPs matched, 302 lost)", ylab = "MAF")
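
The boxplots compare the two groups by eye. As a rough sketch of a more formal check, assuming `matched` and `unmatched` are 0/1 haplotype matrices with one column per SNP (an assumption about the RDS contents), the column means can be folded into minor allele frequencies and compared with a Wilcoxon rank-sum test. The helper names `maf_from_haps` and `sim_haps` are hypothetical, and the data below are simulated purely for illustration.

```r
set.seed(1)

# Fold allele frequencies so that values lie in [0, 0.5].
maf_from_haps <- function(haps) {
  freq <- colMeans(haps)
  pmin(freq, 1 - freq)
}

# Simulated stand-ins for lost_medium$matched / lost_medium$unmatched:
# one column of 0/1 draws per SNP, with the given allele frequency.
sim_haps <- function(n_haps, freqs) {
  sapply(freqs, function(f) rbinom(n_haps, size = 1, prob = f))
}
matched   <- sim_haps(1006, runif(50, 0.05, 0.95))
unmatched <- sim_haps(1006, runif(20, 0.01, 0.10))

maf_matched <- maf_from_haps(matched)
maf_lost    <- maf_from_haps(unmatched)

# One-sided test: are the lost SNPs shifted towards lower MAF?
test <- wilcox.test(maf_lost, maf_matched, alternative = "less")
test$p.value
```

With the real data, `matched` and `unmatched` would be replaced by the corresponding list elements; the folding step matters only if some SNPs have allele frequency above 0.5.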

But at least using a reference panel seems to work!

2. 2 Causal Variants

2 CVs in high LD (\(r^2>0.7\))

2 CVs in low LD (\(r^2<0.01\))
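
For reference, the \(r^2\) between two biallelic SNPs can be computed as the squared Pearson correlation of their columns in a haplotype matrix. A minimal sketch, assuming 0/1 haplotype coding; the toy matrix and the helper `ld_r2` are hypothetical, and the thresholds are those stated above.

```r
# Toy haplotypes (rows) by SNPs (columns); values are made up.
haps <- cbind(
  snp1 = rep(c(0, 1), each = 8),
  snp2 = c(rep(0, 8), rep(1, 7), 0),  # differs from snp1 on one haplotype
  snp3 = rep(c(0, 1), times = 8)      # balanced within each snp1 group
)

# r^2 between two SNPs = squared Pearson correlation of their columns.
ld_r2 <- function(haps, i, j) cor(haps[, i], haps[, j])^2

r2_high <- ld_r2(haps, "snp1", "snp2")
r2_low  <- ld_r2(haps, "snp1", "snp3")

r2_high > 0.7    # would qualify as a high-LD pair
r2_low  < 0.01   # would qualify as a low-LD pair
```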

The corrected coverage estimate is fine in general, because the SNP with the largest effect tends to be picked out. It struggles more when the 2 CVs both have a large effect, where corrcov over-estimates the coverage. WHY?

3. Distribution of P values in 2 CV plot

## Warning: Removed 58 rows containing non-finite values (stat_boxplot).