ldak-vignette.Rmd
The flexible_cfdr
function requires as input the indices of an independent subset of SNPs. These indices indicate the (p,q)
pairs considered independent observations for the purpose of the KDE fitting procedure.
In practice, we identify independent SNPs as those assigned a non-zero weighting by the LDAK package’s weighting calculation procedure. These weightings were originally developed as a means of adjusting for the unequal tagging of (causal) SNPs across the genome when estimating heritability. The calculation procedure requires haplotype information, which we obtain in the form of the 1000 Genomes (1000G) Phase 3 data.
The purpose of this vignette is to sketch out our own approach to generating LDAK weightings for use in fcfdr
. Whilst RMarkdown was used to generate this HTML document, the code snippets contained in this vignette are intended to serve as static illustrations and are not intended to be run without further modification.
This workflow was written with the use of LDAK v5.1
(download here) and PLINK v1.90b6.21 64-bit (19 Oct 2020)
(download here). We also use code from the R packages bigsnpr and data.table.
NB: In this section we discuss how to obtain and process haplotype data from the 1000G project. We have made available the end result, the quality-controlled, European-only data, as a ~280MB download at https://doi.org/10.5281/zenodo.4709547. If your principal p-values were taken from a GWAS conducted in a European population, you can skip the steps here and use the files in the download instead. You will still have to run LDAK. Note that the data use coordinates from the hg19 genome assembly.
We first download the 1000G Phase 3 data in the vcf
format. These data use the hg19
genome assembly for SNP coordinates.
for i in {1..22}; do
wget -O "chr"$i".vcf.gz" "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr"$i".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz"
done
wget -O "chrX.vcf.gz" "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chrX.phase3_shapeit2_mvncall_integrated_v1b.20130502.genotypes.vcf.gz"
wget -O "chrY.vcf.gz" "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chrY.phase3_integrated_v2a.20130502.genotypes.vcf.gz"
Subsequent processing depends on the data being in the PLINK-compatible bed
, bim
, and fam
file formats. We use PLINK to convert the vcf
files to these formats.
for i in {1..22}; do
plink --vcf "chr"$i".vcf.gz" --make-bed --out "chr"$i
done
plink --vcf chrX.vcf.gz --make-bed --out chrX
plink --vcf chrY.vcf.gz --make-bed --out chrY
The 1000G haplotype data were obtained by sequencing individuals from a variety of populations. In practice, we take a subset of the 1000G samples to ensure that the ancestry of the individuals from which we obtain haplotype data matches the ancestry of the individuals in the GWAS of interest (from which our principal p-values come). Sample ancestry information for the 1000G project can be found at the data portal of the International Genome Sample Resource.
We specify the desired sample IDs in a fam
file. The fam
files generated from the vcf
files by plink --make-bed
should be identical for all chromosomes, so we use a single custom fam
file, euro.fam
, to downsample all files. euro.fam
was obtained by subsetting chr1.fam
to retain only those entries with European sample IDs. We write the new files out to the directory euro_only
.
for i in {1..22}; do
plink --bfile "chr"$i --keep euro.fam --make-bed --silent --out "euro_only/chr"$i
done
plink --bfile chrX --keep euro.fam --make-bed --silent --out euro_only/chrX
plink --bfile chrY --keep euro.fam --make-bed --silent --out euro_only/chrY
The fam
files omit sex information. This is problematic only when we wish to carry out QC on the Y chromosome SNPs: PLINK drops heterozygous Y genotypes if we do not affirm the sex of the samples in chrY.fam
. We do this by changing the values in the sex code column from 0 (‘unknown’) to 1 (‘male’).
sed -i 's/0 0 0 -9/0 0 1 -9/' euro_only/chrY.fam
We then use a function from the R package bigsnpr
to carry out some basic QC on the 1000G data and write out the duly filtered data to another directory, euro_only_qc
.
for(i in 1:22) {
bigsnpr::snp_plinkQC(plink.path = '/bin/plink', prefix.in = paste0('euro_only/chr', i),
prefix.out = paste0('euro_only_qc/chr', i),
geno = 0, maf = 0.01, hwe = 1e-10)
}
bigsnpr::snp_plinkQC(plink.path = '/bin/plink', prefix.in = 'euro_only/chrX',
prefix.out = 'euro_only_qc/chrX',
geno = 0, maf = 0.01, hwe = 1e-10)
bigsnpr::snp_plinkQC(plink.path = '/bin/plink', prefix.in = 'euro_only/chrY',
prefix.out = 'euro_only_qc/chrY',
geno = 0, maf = 0.01, hwe = 1e-10)
NB: The following processing was carried out with a GWAS containing only autosomal SNPs, so we omit reference to the sex chromosomes henceforth.
If you skipped the previous section on processing the 1000G data then you can download our pre-processed files:
wget https://zenodo.org/record/4709547/files/euro_only_qc.tar.gz
tar -xvf euro_only_qc.tar
For the purpose of fitting the KDE in flexible_cfdr
we care only about the dependence structure of the SNPs at which our p
and q
values were obtained, so we filter the 1000G SNPs to retain those for which we have GWAS p-values. Ideally, the 1000G SNPs would form a superset of those in the GWAS, but this is typically not the case; with ~5.5 million SNPs in our GWAS we would expect to lose several tens of thousands of SNPs. The absence of this relatively paltry number of SNPs should have a neglible influence upon the KDE.
To filter the SNPs we use plink --extract
, which allows one to extract SNPs based on:
rsIDs are not always consistent between data sets and we have found that extracting SNPs using genomic coordinates typically yields a larger intersection of 1000G and GWAS SNPs. Thus we use plink --extract
with the --range
argument (documentation here), which requires that we pass a white space-separated text file specifying a list of genomic ‘ranges’ (hereafter the ‘range file’) in rows in the format chr bp bp ID
. In our use case, each SNP corresponds to a range, albeit one of length one: the first and second bp
columns which give the start and end coordinates of the range are duplicated. We use rsIDs as range identifiers, but these are essentially superfluous when using the --range
argument and will not be used by plink
to filter the SNPs.
The coordinates of the GWAS SNPs are usually readily available and a simple approach to generating the requisite range files for plink --extract
would copy the genomic coordinates of each SNP to the appropriate chromosome-specific range file (albeit in the plink
-compliant chr bp bp rsID
format). However, this approach overlooks the possibility that there exist multiple SNPs with different reference/alternative allele pairings at the same locus, that is, duplicates.
We can account for this when preparing the range files by first joining the 1000G SNPs in each chromosome’s bim
file data to our GWAS SNPs and checking that the allele pairings match. This approach also allows us to remove duplicated rows in the range files; as noted above, it is not essential for a good fit that we retain every SNP.
The following exemplifies the sort of R code one can use to write out the range files whilst carrying out the aforementioned checks.
library(data.table)
system("mkdir plinkRanges")
system("mkdir filtered")
system("mkdir ldak")
# We assume this contains columns SNPID, CHR19 and BP19 (so-called because of hg19), REF, and ALT
# note that gwas_dat needs to be a data.table
gwas_dat <- fread('gwas_sum_stats.tsv.gz', sep = '\t', header = T)
# Iterating over chromosomal bim files
for(i in 1:22) {
# bim files have no header
bim_dat <- fread(sprintf('euro_only_qc/chr%d.bim', i), sep = '\t', header = F, col.names = c('Chr', 'ID', 'Cm', 'BP19', 'A1', 'A2') )
bim_join <- merge(bim_dat, gwas_dat[CHR19 == i], by.x = 'BP19', by.y = 'BP19')
# Make sure alleles match, although for two-sided association p-values we don't care whether ref/alt is reversed
bim_join <- bim_join[(REF == A1 & ALT == A2) | (REF == A2 & ALT == A1)]
bim_join <- bim_join[, .(Chr, BP19, BP19, SNPID)]
if(any(duplicated(bim_join, by='BP19'))) {
warning(sprintf('%d duplicates removed from output', sum(duplicated(bim_join, by = 'BP19'))))
}
# Remove duplicates
bim_join <- unique(bim_join, by='BP19')
fwrite(bim_join, file = sprintf('plinkRanges/chr%d.tsv', i), row.names = F, sep = '\t', col.names = F, quote = F)
}
Note that we took care to use the hg19
assembly coordinates to match the assembly used in the 1000G data we downloaded above.
Using these range files we can now filter the 1000G data.
for i in {1..22}; do
plink --silent --bfile "euro_only_qc/chr"$i --extract "plinkRanges/chr"$i".tsv" --range --make-bed --out "filtered/chr"$i
done
The weighting calculation procedure entails a preprocessing step, ldak --cut-weights
, and a calculation step, ldak --calc-weights-all
. We refer the reader to the LDAK documentation for further guidance. An idiosyncracy of our approach is that we process the data in a set of chromosome-specific files. This is an artifact of the format of the 1000G data.
We create the directory ldak
and within it subdirectories labelled chrx
, where x ranges from 1 to 22, to hold the results of the procedure. (note that you may have to replace the executable ldak
with ldak5.1.linux
depending on how you downloaded the LDAK software).
for i in {1..22}; do
ldak --cut-weights "ldak/chr"$i --bfile "filtered/chr"$i
ldak --calc-weights-all "ldak/chr"$i --bfile "filtered/chr"$i
done
LDAK will write out the procedure’s results to files named ldak/chrx/weights.all
. We join these chromosome-specific files into one.
for i in {1..22}; do
# We omit the header in each file
sed "s/$/ $i/" <(tail -n +2 "ldak/chr$i/weights.all") >> "ldak/combined_weights.all"
done
# We add back in a single header for the combined file
sed -i '1 i\Predictor Weight Neighbours Tagging Info Check Chr' "ldak/combined_weights.all"
flexible_fcfdr
In practice, it is necessary to merge the weightings contained in combined_weights.all
back into the file containing the p
and q
values so that all three vectors can be made available to flexible_cfdr
. This poses a problem, however, as the rsIDs in the Predictor
column of combined_weights.all
are derived from the 1000G bim
files, not the GWAS file rsIDs specified in the range files (which are ignored by plink
as noted above). Merging 1000G and GWAS SNPs using genomic coordinates is to be preferred over the use of rsIDs as the latter are not always consistent between data sets.
combined_weights.all
as written out by LDAK does not contain SNP basepair coordinates. This means we must use the bim
files contained in the filtered/chrx
directories to recover the basepair coordinates and reference/alternative allele pairings. This can be accomplished by matching the SNPs in combined_weights.all
to those in the bim
files. As the rsIDs in combined_weights.all
are derived from these bim
files, it is appropriate in this case to merge these data with the use of rsIDs. We provide the following snippets as an example of how this can be accomplished.
To simplify matters, we first concatenate all chromosomal bim
files.
for i in {1..22}; do
cat "filtered/chr"$i".bim" >> "filtered/chr_all.bim"
done
We then add the genomic metadata we need to the combined_weights.all
LDAK output file.
library(data.table)
bim_dat <- fread('filtered/chr_all.bim', sep = '\t', header = F, col.names = c('Chr', 'ID', 'Cm', 'BP19', 'A1', 'A2'))
weights_dat <- fread('ldak/combined_weights.all', sep = ' ', header = T)
# Drop rows with a missing ID or weight value
weights_dat <- na.omit(weights_dat, cols = c('Predictor', 'Weight'))
join_dat <- merge(weights_dat, bim_dat[, .(ID, Chr, BP19, A1, A2)], all.x = T, by.x = c('Predictor', 'Chr'), by.y = c('ID', 'Chr'), sort = F)
fwrite(join_dat, file = 'ldak/combined_weights_meta.all', sep = ' ', col.names = T, row.names = F, quote = F)
The format of the file in which your p
and q
are stored will of course vary, but the snippet below illustrates how combined_weights_meta.all
can be merged with it.
weights_dat <- fread('ldak/combined_weights_meta.all', sep = ' ', header = T, select = c('Predictor', 'Weight', 'Chr', 'BP19', 'A1', 'A2'))
# We assume this contains columns SNPID, CHR19 and BP19 (so-called because of hg19), REF, and ALT
gwas_dat <- fread('gwas_sum_stats.tsv.gz', sep = '\t', header = T)
gwas_dat <- merge(gwas_dat, weights_dat, by.x = c('CHR19', 'BP19'), by.y = c('Chr', 'BP19'), sort = F)
# Drop rows where the ref/alt allele pairing differs from that already present
gwas_dat <- gwas_dat[((REF == A1 & ALT == A2) | (REF == A2 & ALT == A1))]
# Drop now-redundant allele columns
gwas_dat[, c('A1', 'A2') := NULL ]