Aim: Investigate how the UK10K data performs as a reference panel for the 1000 Genomes data.
NOTE: The “low” LD region comes from African samples, so do not attempt to use the UK10K reference panel to correct it; focus on the medium and high LD regions instead. This also explains why the MAFs do not match up between the original data (1000G) and the reference data (UK10K) in the low LD region:
Many of the SNPs are not found in the UK10K data (matching by position).
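Matching by position can be sketched as follows. This is only an illustration on made-up positions and a toy haplotype matrix; the names `pos_1000g`, `pos_uk10k`, `haps` and `split_region` are hypothetical and are not taken from the original analysis.

```r
# Hypothetical base-pair positions for one region (made-up values).
pos_1000g <- c(101L, 250L, 377L, 512L, 640L)  # SNPs in the 1000 Genomes data
pos_uk10k <- c(101L, 377L, 640L, 900L)        # SNPs in the UK10K panel

# Toy 0/1 haplotype matrix: rows are haplotypes, columns are 1000G SNPs.
set.seed(1)
haps <- matrix(rbinom(4 * length(pos_1000g), size = 1, prob = 0.3),
               nrow = 4,
               dimnames = list(NULL, as.character(pos_1000g)))

# A SNP is "matched" if its position also appears in the reference panel.
is_matched <- pos_1000g %in% pos_uk10k

# Split the columns into matched and lost (unmatched) SNPs.
split_region <- list(
  matched   = haps[, is_matched, drop = FALSE],
  unmatched = haps[, !is_matched, drop = FALSE]
)

# Number of SNPs kept and lost in this toy region.
c(matched = ncol(split_region$matched), lost = ncol(split_region$unmatched))
```

Matching on position alone ignores allele labels, so a fuller check might also compare the reference and alternative alleles at each matched position.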
Perhaps the SNPs that we lose are missing from the UK10K data because they are prevalent in Africans (and are therefore present in the 1000 Genomes data). If this were the case, we would expect the lost SNPs to have smaller allele frequencies in the 1000 Genomes data that we use.
# Each list holds $matched and $unmatched matrices (columns are SNPs).
lost_high <- readRDS("lost_high.RDS")
lost_medium <- readRDS("lost_medium.RDS")
# colMeans() gives one value per SNP, used here as its allele frequency.
boxplot(colMeans(lost_medium$matched), colMeans(lost_medium$unmatched), names = c("Matched", "Lost"), main = "Medium LD (578 SNPs matched, 128 lost)", ylab = "MAF")
boxplot(colMeans(lost_high$matched), colMeans(lost_high$unmatched), names = c("Matched", "Lost"), main = "High LD (416 SNPs matched, 302 lost)", ylab = "MAF")
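The boxplots give a visual comparison; one possible way to put a number on it is a one-sided Wilcoxon rank-sum test of whether the lost SNPs tend to have lower frequencies. The sketch below runs on simulated frequencies only (`freq_matched` and `freq_lost` are hypothetical stand-ins for `colMeans(lost_medium$matched)` and `colMeans(lost_medium$unmatched)`), so its output says nothing about the real data.

```r
# Simulated per-SNP allele frequencies (NOT the real data), standing in for
# colMeans(lost_medium$matched) and colMeans(lost_medium$unmatched).
set.seed(42)
freq_matched <- runif(200, min = 0.05, max = 0.50)
freq_lost    <- runif(60,  min = 0.01, max = 0.30)

# One-sided Wilcoxon rank-sum test.
# Alternative hypothesis: frequencies of lost SNPs are shifted lower
# than those of matched SNPs.
res <- wilcox.test(freq_lost, freq_matched, alternative = "less")

# Compare the medians and inspect the p-value.
c(median_matched = median(freq_matched), median_lost = median(freq_lost))
res$p.value
```

A rank-based test is used here because allele frequencies are bounded and typically skewed, so a t-test's normality assumption may not be appropriate.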
But at least using a reference panel seems to work!
## Warning: Removed 58 rows containing non-finite values (stat_boxplot).