26th November

Reading

\(\textit{INS}\) gene region

The \(\textit{INS}\) gene region confers about 10% of the genetic susceptibility to T1D. A very nice (but old) review of the role of the insulin gene in T1D is as available here.
\(\textit{INS}\) transcription is mediated by the binding of the transcription factor, Pur1 to the VNTR (approximately 0.5 kb upsream of \(\textit{INS}\)). There are three different classes of the VNTR, which may affect the binding of this TF and therefore \(\textit{INS}\) transcription. Class I VNTR has the fewest repeats and is therefore the shortest (30-60 range), whilst class III has the most repeats (120-170 range). The frequency of these VNTR classes vary across populations.

There are several “stories” for the relationship between the \(\textit{INS}\) gene region and T1D:

Negative-selection of auto-reactive T cells in the thymus: “The fact that negative selection of autoreactive thymocytes is dose-dependent suggested the hypothesis that different VNTR classes may modulate tolerance to insulin by affecting insulin expression levels in the thymus”. Indeed class I VNTRs correlate with low \(\textit{INS}\) mRNA levels in the thymus, whilst class III VNTRs correlate with high \(\textit{INS}\) mRNA levels and “high insulin levels in the thymus may more efficiently induce negative selection of insulin specific T-lymphocytes”.
Beta cell stress: Beta cell ER stress can contribute to the pathogenesis of T1D prior to the onset of hyperglycemia. For example, the unfolded protein response (UPR) in beta cells is activated under conditions of inflammation, oxidative stress and insulin production/folding imbalance. This results in rapid translational inhibition to alleviate the deleterious effects of accumulating misfolded proteins in the ER, and can lead to apoptosis. It’s very easy to measure beta-cell ER stress by comparing the ratio of circulating proinsulin relative to insulin or C-peptide (more proinsulin since the ER is struggling to convert it). A clinical trial (Sims et al., 2016) found that beta-cell ER dysfunction precedes type 1 diabetes onset, especially in younger children - suggesting that this ratio may have utility in predicting the onset of T1D in the presymptomatic phase. See Sims et al. (2018) for a nice review.

Fine-mapping the \(\textit{INS}\) gene region

Onengut-Gumuscu et al. (2015) found that the 11p15.5 region (for which \(\textit{INS}\) and \(\textit{INS-IGF2}\) are the candidate genes) showed evidence of multiple independent signals. Specifically, after conditioning on rs689 (which tags the VNTR alleles - what exactly does this mean?), rs72853903 still exhibited significant evidence for an independent association with T1D. [Although confusingly they state that rs689 (−23HphI), rs3842753 (c.1140A>C) and the 5′ variable-number tandem repeat (VNTR) polymorphism were the most likely causal candidates, but they didn’t have sufficient data to fully analyse them].
Robertson et al. (2020) fine-mapped the chr11:2157974-2177435 region using GUESSFM and found evidence of four independent signals. But rs689 has a very small PP?
The PPsum of the four regions identified by GUESSFM in Robertson et al. (2020) are shown below. What does it mean for SNPs to have PPsum>1?

x <- data.frame(read_excel("robertson_supp.xlsx", sheet = "ST11"))
region_11p15.5 <- x[which(x$Region=="chr11:2157974-2177435"),]
unique(region_11p15.5$PPsum)

## [1] 0.8117721 1.0000457 0.9908292 0.9875925

The PP plots for each signal found by GUESSFM are plotted below, with the SNPs highlighted by Onengut-Gumuscu et al. (2015) in red.

The 3 SNPs with highest PP in the first signal (index SNP rs3842761) are rs3842761 (intronic variant in \(\textit{INS-IGF2}\)), rs17879158;rs775181992 and rs3842763 (intronic variant in \(\textit{INS-IGF2}\)).
The 3 SNPs with highest PP in the second signal (index SNP rs3842754) are rs3842754, rs3842755 and rs3842752 (all intronic variant in \(\textit{INS-IGF2}\) but evidence on Open Targets Genetics to be linked with \(\textit{TH}\) or \(\textit{C11orf21}\)). The latter is a missense variant in \(\textit{INS}\).
The SNP in signal 3 is rs3842749, which is an intronic variant in \(\textit{INS-IGF2}\) and has not be significantly associated with any phenotype on Open Targets Genetics.
The SNP with \(PP\approx 0.2\) in signal 4 is rs10770140. This has been super strongly linked to T1D by Onengut et al. but has eQTL evidence for \(\textit{TH}\) or \(\textit{C11orf21}\) on Open Targets Genetics.

Ikaros family

Three members of the Ikaros family of transcription factors have been linked to T1D: Ikaros (encoded by \(\textit{IKZF1}\): 7p12.2), Aiolos (\(\textit{IKZF3}\): 17q12) and Eos (\(\textit{IKZF4}\): 12q13.2). I can’t find any evidence of \(\textit{IKZF2}\) or \(\textit{IKZF5}\) being linked to T1D (although \(\textit{IKZF2}\) has been linked to SLE and \(\textit{IKZF5}\) has been linked to UC).

Ikaros has been reported to be transcriptional repressor of \(\textit{IL2}\) (Thomas et al. (2007)). Specifically, “CD4+ T cells with reduced Ikaros DNA-binding activity no longer require signals from the TCR or CD28 for histone acetylation at the \(\textit{IL2}\) promoter, and no longer require CD28 costimulation for expression of the \(\textit{IL2}\) gene”.

\(\textit{IKZF1}\) resides in genomic region chr7:50,303,453-50,405,101 (GRCh38/hg38) which lies within the 7p12.2 region (chr7:50221511-51009338). In her paper, Mary states that “there are two ImmunoChip regions which overlap \(\textit{IKZF1}\)” for which the 5’ end has colocalizing associations with MS and T1D, whilst the 3’ end appears specific to T1D. This is shown in Supp Figure 7).

Aiolos is required for Th17 differentiation, by directly silencing \(\textit{IL2}\) expression (Quintana et al. (2012)). Interestingly, the \(\textit{IKZF3}\) gene regions has been shown to be differentially associated between T1D patients with different age-of-onsets (see Inshaw et al. (2020)). I.e. there was evidence that rs12453507 showed heterogeneity in effect size between the \(<7\) and \(>=13\) age groups of patients.

Eos is highly expressed in Tregs and maintains the phenotype of Treg cells together with Foxp3 (Pan et al. (2009)). An \(\textit{IKZF4}\) polymorphism (rs705704) is strongly associated with IAA positivity at the time of T1D diagnosis.. (see https://pubmed.ncbi.nlm.nih.gov/23721563/). Also associated with Treg (see Intracellular signaling molecules section - https://www.jimmunol.org/content/jimmunol/197/10/3762.full.pdf).

Further sources: https://www.nature.com/articles/s41541-018-0062-8 and https://www.pnas.org/content/pnas/111/12/E1111.full.pdf.

Fine-mapping Ikaros, Aiolos and Eos

IKZF1

\(\textit{IKZF1}\) resides in the 7p12.2 region (chr7:50221511-51009338).

I download the GUESSFM fine-mapping results for this region from Robertson et al. (2020). There is evidence of 2 independent signals in this region (PPsum>0.9), with index SNPs rs6944602 and rs10277986. The PPsum of all 6 signals are shown below.

x <- data.frame(read_excel("robertson_supp.xlsx", sheet = "ST11"))
region_7p12.2 <- x[which(x$Region=="chr7:50221511-51009338"),]
unique(region_7p12.2$PPsum)

## [1] 0.13930737 0.99545994 0.01149905 0.05444663 0.05729436 0.99930561

signal_1 <- region_7p12.2[which(region_7p12.2$Index_Rsid=="rs6944602"),]
signal_2 <- region_7p12.2[which(region_7p12.2$Index_Rsid=="rs10277986"),]

Onengut-Gumuscu et al. (2015) did not find evidence of a secondary signal in this region and reported rs62447205 as the index SNP (show in red in signal 1 below).

The variants responsible for signal 2 are not linked to \(\textit{IKZF1}\), but to \(\textit{GRB10}\) (insulin receptor) and \(\textit{COBL}\) (based on Ensembl/RefSeq annotations in Robertson et al.). So it seems that signal 1 is the one of interest here, for which rs6944602 has the highest PP. Using information from Open Targets Genetics (https://genetics.opentargets.org/variant/7_50406053_G_A), this SNP is an eQTL for \(\textit{IKZF1}\) (which is the closest protein coding gene to the SNP, 101985 bp away).

IKZF3

\(\textit{IKZF3}\) resides in the 17q12 region (chr17:39747478-40657572). I download the GUESSFM fine-mapping results for this region from Robertson et al. (2020). There’s evidence of two independent signals here, with index SNPs rs2941522 and rs74252458;rs11464691.

region_17q12 <- x[which(x$Region=="chr17:39747478-40657572"),]
unique(region_17q12$PPsum)

## [1] 0.97580612 0.01349632 0.96002236

signal_1 <- region_17q12[which(region_17q12$Index_Rsid=="rs2941522"),]
signal_2 <- region_17q12[which(region_17q12$Index_Rsid=="rs74252458;rs11464691"),]

Onengut-Gumuscu et al. (2015) did not find evidence of a secondary signal in this region and reported rs12453507 as the index SNP (show in red in signal 1 below).

Only half of the variants responsible for signal 1 (before dashed line) were mapped to \(\textit{IKZF3}\) (based on Ensembl/RefSeq annotations). The others were mapped to \(\textit{ZPBP2}\). The index SNP for signal 1 has been linked to both \(\textit{PGAP3}\) and \(\textit{IKZF3}\) on Open Targets Genetics, based on eQTL data. This SNP is also super strongly associated with neutrophill count (\(p<1e-100\)), white blood cell count and age of asthma diagnosis.

None of the variants responsible for signal 2 were mapped to \(\textit{IKZF3}\) (these mapped to \(\textit{SMARCE1}\)).

IKZF4

\(\textit{IKZF4}\) resides in the 12q13.2 region (chr12:55976127-56360038). I download the GUESSFM fine-mapping results for this region from Robertson et al. (2020). There’s evidence of thee independent signals here, with index SNPs rs71459332, rs77508451 and rs796916887 (although how can PPsum>1?)

region_12q13.2 <- x[which(x$Region=="chr12:55976127-56360038"),]
unique(region_12q13.2$PPsum)

## [1] 0.7298748 1.0381058 0.1035972 0.9395634

signal_1 <- region_12q13.2[which(region_12q13.2$Index_Rsid=="rs71459332"),]
signal_2 <- region_12q13.2[which(region_12q13.2$Index_Rsid=="rs77508451"),]
signal_3 <- region_12q13.2[which(region_12q13.2$Index_Rsid=="rs796916887"),]

Onengut-Gumuscu et al. (2015) did not find evidence of multiple signals in this region, and called the index SNP rs705705 (which has NA PP in the Robertson et al. dataset).

Only a few of the variants in signal 1 were mapped to \(\textit{IKZF4}\) (but the lead SNP is intronic in \(\textit{IKZF4}\) (based on RefSeq) and has been linked by eQTL data). Same for signal 2 (but the two top SNPs (rs77508451 and rs78865465) were intergeneic and intronic for \(\textit{IKZF4}\) respectively, but not linked to \(\textit{IKZF4}\) on Open Targets). Same for signal 3 (but the top SNPs do not map to \(\textit{IKZF4}\), but to RPS26/ERBB3).

Transcription factor binding sites

The JASPAR database can be used to search for TF binding site motifs. However, out of the 5 Ikaros TFs, only Ikaros (IKZF1) has a reported motif in the database, suggesting that the binding motifs for the other Ikaros TFs have not yet been identified. Do these definitely directly bind to the DNA and not just form a complex with themselves (e.g. all binding to IKZF1 which is the only one physically bound to the DNA)?

Whilst knowing the binding motifs of TFs is useful, it does not necessarily tell us where a particular TF is binding in a particular cell type (only \(\approx 1\%\) of motifs are occupied by a TF at any given time (Neph et al. 2012), meaning that we’d massively over-esimate the number of binding events if we assumed that the TF was binding at every instance of the motif in every cell type).

For this reason, CHiP-Seq datasets are still the gold standard as these can be used to identify specifically where the TF binds in certain cell types. However, we don’t have CHiP-Seq data for all transcription factors in all primary disease relevant human tissues.

Instead we may opt to use genomic footprinting, which comes in two flavors: (1) ignore motifs and observe DNase cleavage events along a sliding window of the genome or (2) begin with the TF binding motif and model the DNase cleavage patterns around it.

Predicting TF binding sites

Funk et al. (2020) have recently developed a framework to predict the binding sites of 1515 transcription factors in 27 human tissues. Brief method:

1. Use HINT and Wellington algorithms to identify genomic footprints from DNase-seq data (using 20 bp or 16 bp seed lengths for the first step of alignment, i.e. first find all regions matching the "seed" and then extend the alignment around that).
2. To predict which TF occupied each footprint, use Find Individual Motif Occurrences (FIMO).

Using this method off-the-bat gives massive numbers of TF binding sites throughout the genome, and when using CHiP-Seq datasets to evaluate the results, this approach gave huge false positive rates (\(FDR>0.9\)). For this reason, they considered thresholding, defining two scores at each genomic location:

1. The best footprint score, defined as the highest score at this location in any sample.
2. The footprint fraction, defined as the proportion of independent samples with a non-zero footprint score.

They then investigated the linear relationship between these scores and a true-positive binding site from CHiP-Seq, finding that the most accurate predictor was the best HINT20 score.

They also investigated whether incorporating addditional data (such as a score for the strength of the match to the sequence motif, TF class, GC content, and distance to a TSS) along with their scores (“comprehensive model”) was helpful. They found that “the maximum MCCs of the HINT20 footprint-only versus comprehensive models were identical (0.42), the footprint-only model had a relatively small threshold window within which both true-positive and false-negative error rates were well controlled. Therefore, incorporating information about genomic context does not dramatically improve prediction accuracy but could potentially improve the robustness of these predictions”.

Ultimately, they recommend that a HINT score \(>200\) and a Wellington score \(<-27\) are optimal filtering thresholds to control both false-positive and false-negative errors. For the rest of the paper they use HINT20 footprints with scores \(> 200\), so I use this threshold in my analysis below.

Side comments:

Note that in all their analyses, the true and false positives are “soft” because CHiP-Seq results are not necessarily the complete truth. Indeed, whilst the number of “false discoveries” may be high - at least some of these findings could actually be true, and just not identified by CHiP-Seq for various reasons.

Another consideration is that although a TF may be binding, it may not be active. The active versus inactive binding sites consideration is an interesting direction of research.

(See “Fingerprinting Ikaros” and Schjerven et al. (2013) for information about the Ikaros binding motif.)

My analysis

In Table S2, they supply the Motif-to-TF mappings. It says that IKZF1.p2 was the “motif identifier” for IKZF1 and IKZF2.p2 for IKZF2 (from “ISMARA: Integrated System for Motif Activity Response Analysis” from the Babraham institute - suggests that the binding site of IKZF2 has been identified?).

I’ve downloaded the HINT20 binding sites for lymphoblast tissues (from http://data.nemoarchive.org/other/grant/sament/sament/footprint_atlas/bed/) and extract rows for the Ikaros TFs using the search term “IKZF”. [Note that the extended version may be better: see http://data.nemoarchive.org/other/grant/sament/sament/footprint_atlas/extended/columnIDs_README].

They have results for Hsapiens-HOCOMOCOv10-IKZF1_HUMAN.H10MO.C, NA-SwissRegulon-IKZF2.SwissRegulon and Mmusculus-HOCOMOCOv10-IKZF1_MOUSE.H10MO.C.

For now I focus on Hsapiens-HOCOMOCOv10-IKZF1_HUMAN.H10MO.C. The name suggests that this motif is used: https://hocomoco10.autosome.ru/motif/IKZF1_HUMAN.H10MO.C, but this is different to the motif identifier in Supp Table S2 (https://www.bioinformatics.babraham.ac.uk/ismara_report_hg19/pages/IKZF1.p2.html)?

The table has 538,113 rows of binding sites, which are all 7 base pairs long (since the motif is 7 bp).

head(ikzf1_pred)

##     V1      V2      V3                                       V4 V5 V6
## 1 chr9 2177224 2177231 Hsapiens-HOCOMOCOv10-IKZF1_HUMAN.H10MO.C  5  -
## 2 chr9 2760399 2760406 Hsapiens-HOCOMOCOv10-IKZF1_HUMAN.H10MO.C  5  +
## 3 chr9 2844009 2844016 Hsapiens-HOCOMOCOv10-IKZF1_HUMAN.H10MO.C  8  -
## 4 chr9 3197360 3197367 Hsapiens-HOCOMOCOv10-IKZF1_HUMAN.H10MO.C  7  +
## 5 chr9 4446765 4446772 Hsapiens-HOCOMOCOv10-IKZF1_HUMAN.H10MO.C  5  -
## 6 chr9 4462513 4462520 Hsapiens-HOCOMOCOv10-IKZF1_HUMAN.H10MO.C  4  +

As per their recommendation, I limit analysis to those binding sites with a HINT20 score \(>200\), leaving 29,342 rows.

ikzf1_pred <- ikzf1_pred[which(ikzf1_pred$V5>200),]
dim(ikzf1_pred)

## [1] 29342     6

But some of the regions appear twice (with different HINT20 scores). For now, I limit the data set to just the unique regions (13,676 of these).

out <- paste0(ikzf1_pred$V1, ikzf1_pred$V2, ikzf1_pred$V3)
ikzf1_pred <- ikzf1_pred[-which(duplicated(out)),]
dim(ikzf1_pred)

## [1] 13676     6

(Note - I’m really not sure how there are regions appearing twice with different HINT scores - I need to think about this).

Next, I want to compare these predicted binding sites to those identifed by CHiP-Seq (also in lymphoblast cells).

For my CHiP-Seq data, I first remove peaks mapping to weirdly named chromosomes (“chr1_KI270706v1_random”, “chr1_KI270713v1_random”, “chr17_GL000205v2_random”, “chr17_KI270729v1_random”, “chr22_KI270733v1_random”, “chrUn_GL000195v1”, “chrUn_GL000216v2”, “chrUn_GL000218v1”, “chrUn_GL000219v1”, “chrUn_GL000220v1”,“chrUn_GL000224v1”; only loses 34 peaks).

As before, I also remove duplicated regions - leaving me with 60,776 regions identified by CHiP-Seq.

head(ikzf1_chip)

##     V1        V2        V3 V4   V5 V6       V7 V8      V9 V10
## 1 chr3  15667631  15668267  .  734  . 16.25951 -1 0.57814 318
## 2 chr6 156396171 156396807  .  548  . 16.25955 -1 0.57815 318
## 3 chr2  42975128  42975764  .  974  . 16.26102 -1 0.57819 318
## 4 chr7 121339144 121339780  .  750  . 16.27006 -1 0.57872 318
## 5 chr2  88691253  88691974  . 1000  . 16.27493 -1 0.57901 537
## 6 chr6  21208416  21209052  .  944  . 16.28223 -1 0.57955 318

dim(ikzf1_chip)

## [1] 60776    10

There are two differences which are immediately apparant between the predicted and “empirical” CHiP-Seq data:

There are more occupied binding sites identifed by CHiP-Seq than predicted by Funk et al. (60,776 versus 13,676). This is likely an artefact of the thresholding. I.e. the CHiP-Seq binding sites are called using “conservative_idr_threshold” (read up on this) whereas the predicted binding sites are called by a relatively arbitrary HINT score threshold.
The CHiP-Seq occupied binding sites are much larger than those predicted by Funk et al. This is because all of the predicted binding sites are 7 bp in length since motif information is used, whereas those from CHiP-Seq can be larger (mean 567 bp). Note that this may also account for fewer binding sites in the CHiP-Seq data than those predicted, perhaps the CHiP-Seq binding sites overlap multiple predicted binding sites? Check what proportion of predicted binding sites fall within the CHiP-Seq peaks.

As a sanity check, I see if the relative frequency of predicted and “empirical” (from CHiP-Seq) occupied binding sites across chromosomes look similar between the two data sets. The distribution across chromosomes looks pretty similar. The big difference is the sex chromosomes.

Removing the sex chromosomes, I investigate whether the proportion of binding sites across chromosomes are similar. Indeed they are.

I now investigate whether there is much overlap between the predicted and the empirical binding sites. For the 60,776 CHiP-Seq peaks, 54280 (89%) did not overlap a predicted binding site. For the remainder, most housed one predicted binding site and a few housed a couple.

ikzf1_pred_chip_match <- fread("ikzf1_pred_chip_match.txt")
no_overlap <- which(ikzf1_pred_chip_match$V6==-1)

ikzf1_pred_chip_match_overlap <- ikzf1_pred_chip_match[-no_overlap,]
plot(table(table(ikzf1_pred_chip_match_overlap$V2)), xlab = "Number of predicted binding sites in a CHiP-Seq peak", ylab = "Frequency")

(Note that there is an option to intersect multiple files, so I could intersect T1D SNPs here too. See https://bedtools.readthedocs.io/en/latest/content/tools/intersect.html)

Next, I match my T1D SNPs to both CHiP-Seq peaks and predicted peaks (note I need to download a newer version of bedtools to simultaneously intersect so I’ve done it seperately for now). Of the 715,023 T1D SNPs, only 50 mapped to predicted CHiP-Seq peaks (likely due to the tiny predicted binding sites).

I check whether there is enrichment for smaller \(p\)-values for the SNPs that fell in predicted and empirical peaks. Of the 715,023 T1D SNPs, only 50 mapped to predicted CHiP-Seq peaks (likely due to the tiny predicted binding sites). The QQ plot is not very informative due to a small number of data points.

Of the 715,023 T1D SNPs, 20,067 mapped to CHiP-Seq peaks. Note that some SNPs mapped to multiple peaks, showing that the peaks overlap. The QQ plot shows that SNPs in peaks have smaller \(p\)-values.

p_chip_peak <- ikzf1_chip_t1d_match$V5[which(ikzf1_chip_t1d_match$V8!=-1)]
p_chip_nopeak <- ikzf1_chip_t1d_match$V5[which(ikzf1_chip_t1d_match$V8==-1)]

qqplot(-log10(p_chip_nopeak), -log10(p_chip_peak), ylab = "SNPs in CHiP-Seq peak", xlab = "SNPs not in peak", pch = 16, main = "QQ plot for (-log10) p-vals\n(Chip-seq peaks)")
abline(0, 1, lty = "dashed")

Distance to gene confounder

It is possible that the distance to the nearest gene is acting as a confounder when investigating the relationship between T1D GWAS \(p\)-values and TF peaks.

To investigate this, I match each of the T1D GWAS SNPs to the nearest Gencode gene (release 35; both in GRCh38) and record the distance in bp.

I construct a heatmap for the correlations between \(p\) and our CHiP-Seq results (recall from last week: q_signalvalue is the signal value for the peak that the SNP resides in (0 if SNP doesn’t reside in peak) and q_binary is a binary indicator for whether the the SNP resides in an empirical CHiP-Seq peak or not). This looks good.

res <- readRDS("robertson_aux.RDS")

# remember to take abs(distance)
pheatmap(cor(cbind(res[,c(13,17,18)], abs(res$dist))), display_numbers = T)

To investigate this confounding further, I then regress T1D GWAS \(p\)-values against distance to nearest gene and binary indicator of peak. Both terms are significant, which suggests that both are providing some seperate information.

model <- lm(pphaseIandII ~ abs(dist) + q_binary, data = res)
summary(model)

## 
## Call:
## lm(formula = pphaseIandII ~ abs(dist) + q_binary, data = res)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.58618 -0.29141 -0.02784  0.26886  0.61038 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.152e-01  3.943e-04 1053.07   <2e-16 ***
## abs(dist)    4.338e-07  2.154e-08   20.14   <2e-16 ***
## q_binary    -2.562e-02  2.366e-03  -10.83   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3125 on 715020 degrees of freedom
## Multiple R-squared:  0.0007496,  Adjusted R-squared:  0.0007468 
## F-statistic: 268.2 on 2 and 715020 DF,  p-value: < 2.2e-16

Similarly, when using distance to nearest gene and signal value of peak as covariates, both are significant.

model <- lm(pphaseIandII ~ abs(dist) + q_signalvalue, data = res)
summary(model)

## 
## Call:
## lm(formula = pphaseIandII ~ abs(dist) + q_signalvalue, data = res)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.58763 -0.29132 -0.02772  0.26888  0.76721 
## 
## Coefficients:
##                 Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)    4.148e-01  3.912e-04 1060.435  < 2e-16 ***
## abs(dist)      4.378e-07  2.154e-08   20.326  < 2e-16 ***
## q_signalvalue -8.008e-05  1.153e-05   -6.948 3.71e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3125 on 715020 degrees of freedom
## Multiple R-squared:  0.0006531,  Adjusted R-squared:  0.0006503 
## F-statistic: 233.7 on 2 and 715020 DF,  p-value: < 2.2e-16

But when including all three terms, q_signalvalue is no longer significant as q_binary provides all the relevant information. The final model picked out by stepwise regression is p ~ abs(dist) + q_binary.

model <- lm(pphaseIandII ~ abs(dist) + q_binary + q_signalvalue, data = res)
summary(model)

## 
## Call:
## lm(formula = pphaseIandII ~ abs(dist) + q_binary + q_signalvalue, 
##     data = res)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.58617 -0.29141 -0.02782  0.26886  0.61016 
## 
## Coefficients:
##                 Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)    4.152e-01  3.943e-04 1053.065   <2e-16 ***
## abs(dist)      4.338e-07  2.154e-08   20.134   <2e-16 ***
## q_binary      -2.483e-02  2.985e-03   -8.318   <2e-16 ***
## q_signalvalue -6.341e-06  1.454e-05   -0.436    0.663    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3125 on 715019 degrees of freedom
## Multiple R-squared:  0.0007498,  Adjusted R-squared:  0.0007456 
## F-statistic: 178.8 on 3 and 715019 DF,  p-value: < 2.2e-16

stepAIC(model, direction = "both")

## Start:  AIC=-1663522
## pphaseIandII ~ abs(dist) + q_binary + q_signalvalue
## 
##                 Df Sum of Sq   RSS      AIC
## - q_signalvalue  1     0.019 69810 -1663524
## <none>                       69810 -1663522
## - q_binary       1     6.755 69817 -1663455
## - abs(dist)      1    39.577 69849 -1663119
## 
## Step:  AIC=-1663524
## pphaseIandII ~ abs(dist) + q_binary
## 
##                 Df Sum of Sq   RSS      AIC
## <none>                       69810 -1663524
## + q_signalvalue  1     0.019 69810 -1663522
## - q_binary       1    11.450 69821 -1663409
## - abs(dist)      1    39.581 69849 -1663121

## 
## Call:
## lm(formula = pphaseIandII ~ abs(dist) + q_binary, data = res)
## 
## Coefficients:
## (Intercept)    abs(dist)     q_binary  
##   4.152e-01    4.338e-07   -2.562e-02

Comments and notes

“Leveraging existing GWAS summary data of genetically correlated and uncorrelated traits to improve power for a new GWAS” (2020): Useful paper in Genetic Epidemiology that leverages GWAS test statistics with lots of other GWAS results using weighted hypothesis testing. Different to cFDR since the GWAS traits don’t need to be highly genetically correlated - “We emphasize that our approach does not necessarily require the use of other GWAS data with related traits; our approach leverages not only genetic pleiotropy (Andreassen, Djurovic et al., 2013; Chung et al., 2014), but also other shared genetic mechanisms of trait‐associated SNPs, for example, in transcriptional regulation as expression QTL”. Their method is based on a weighted analysis controlling the FWER, whereas cFDR is based on stratified analysis controlling the FDR.
- Pretty simple method. They first exclude GWAS’ with evidence of overlapping samples (by calculating the correlation between estimated null SNPs) to construct weights (to ensure that the constructed weights and the test statistics were independent). The weight is simply the normalised sum of the absolute Z-scores for the remaining GWAS’, each raised to the power of \(x\) - the power index. “The power index serves as a tuning parameter to data-adaptively combine information across multiple GWAS data sets”. In practise, they use the power index value that gives the most associations.
- They bin SNPs into 100 equally sized groups to get “more stable and smoother weights”.
- They even found that leveraging genetically uncorrelated traits was useful, and list several reasons why this is the case. (1) “due to the limitation of measuring only a linear relationship of the causal variant effect sizes between two traits by a genetic correlation, genetically uncorrelated traits may still share overlapping causal variants (Frei et al., 2019).” (2) “given the well hypothesized functional, especially regulatory, roles of GWAS loci, coupled with the emerging theory of the omnigenic model for GWAS (Boyle, Li, & Pritchard, 2017), it is reasonable to assume that many GWAS loci for some seemingly unrelated GWAS traits may turn out to be shared”..
- They’re heavy on the control of type 1 error rate that weighted hypothesis testing has, but address that for this to hold, independent data must be used to constuct the weights (i.e. we can’t even upweight a GWAS if it is very correlated with the target GWAS as this uses information on the target GWAS).
The DICE database is useful: https://dice-database.org/ (https://www.sciencedirect.com/science/article/pii/S009286741831331X?via%3Dihub)