[See explanation of some of the annotations in the “Full baseline model” section here].
I firstly remove the 17 non-functional annotations (MAF and LD-based) included in the baseline-LD model V2.2, leaving me with 79 functional annotations. I group these into larger groups:
## original_annots groups numeric_groups
## 67 BivFlnk bivalent 1
## 68 BivFlnk.flanking.500 bivalent 1
## 1 Coding_UCSC coding 2
## 2 Coding_UCSC.flanking.500 coding 2
## 59 synonymous coding 2
## 60 non_synonymous coding 2
## 3 Conserved_LindbladToh conserved 3
## 4 Conserved_LindbladToh.flanking.500 conserved 3
## 53 GERP.NS conserved 3
## 54 GERP.RSsup4 conserved 3
## 61 Conserved_Vertebrate_phastCons46way conserved 3
## 62 Conserved_Vertebrate_phastCons46way.flanking.500 conserved 3
## 63 Conserved_Mammal_phastCons46way conserved 3
## 64 Conserved_Mammal_phastCons46way.flanking.500 conserved 3
## 65 Conserved_Primate_phastCons46way conserved 3
## 66 Conserved_Primate_phastCons46way.flanking.500 conserved 3
## 5 CTCF_Hoffman CTCF 4
## 6 CTCF_Hoffman.flanking.500 CTCF 4
## 7 DGF_ENCODE DGF 5
## 8 DGF_ENCODE.flanking.500 DGF 5
## 9 DHS_peaks_Trynka DHS 6
## 10 DHS_Trynka DHS 6
## 11 DHS_Trynka.flanking.500 DHS 6
## 16 FetalDHS_Trynka DHS 6
## 17 FetalDHS_Trynka.flanking.500 DHS 6
## 58 BLUEPRINT_DNA_methylation_MaxCPP DNA_meth 7
## 12 Enhancer_Andersson enhancer 8
## 13 Enhancer_Andersson.flanking.500 enhancer 8
## 14 Enhancer_Hoffman enhancer 8
## 15 Enhancer_Hoffman.flanking.500 enhancer 8
## 39 SuperEnhancer_Hnisz enhancer 8
## 40 SuperEnhancer_Hnisz.flanking.500 enhancer 8
## 51 WeakEnhancer_Hoffman enhancer 8
## 52 WeakEnhancer_Hoffman.flanking.500 enhancer 8
## 71 Human_Enhancer_Villar enhancer 8
## 72 Human_Enhancer_Villar.flanking.500 enhancer 8
## 75 Ancient_Sequence_Age_Human_Enhancer enhancer 8
## 76 Ancient_Sequence_Age_Human_Enhancer.flanking.500 enhancer 8
## 77 Human_Enhancer_Villar_Species_Enhancer_Count enhancer 8
## 55 GTEx_eQTL_MaxCPP eQTL 9
## 18 H3K27ac_Hnisz H3K27ac 10
## 19 H3K27ac_Hnisz.flanking.500 H3K27ac 10
## 20 H3K27ac_PGC2 H3K27ac 10
## 21 H3K27ac_PGC2.flanking.500 H3K27ac 10
## 56 BLUEPRINT_H3K27acQTL_MaxCPP H3K27ac 10
## 22 H3K4me1_peaks_Trynka H3K4me1 11
## 23 H3K4me1_Trynka H3K4me1 11
## 24 H3K4me1_Trynka.flanking.500 H3K4me1 11
## 57 BLUEPRINT_H3K4me1QTL_MaxCPP H3K4me1 11
## 25 H3K4me3_peaks_Trynka H3K4me3 12
## 26 H3K4me3_Trynka H3K4me3 12
## 27 H3K4me3_Trynka.flanking.500 H3K4me3 12
## 28 H3K9ac_peaks_Trynka H3K9ac 13
## 29 H3K9ac_Trynka H3K9ac 13
## 30 H3K9ac_Trynka.flanking.500 H3K9ac 13
## 31 Intron_UCSC intron 14
## 32 Intron_UCSC.flanking.500 intron 14
## 33 PromoterFlanking_Hoffman promoter 15
## 34 PromoterFlanking_Hoffman.flanking.500 promoter 15
## 35 Promoter_UCSC promoter 15
## 36 Promoter_UCSC.flanking.500 promoter 15
## 69 Human_Promoter_Villar promoter 15
## 70 Human_Promoter_Villar.flanking.500 promoter 15
## 73 Ancient_Sequence_Age_Human_Promoter promoter 15
## 74 Ancient_Sequence_Age_Human_Promoter.flanking.500 promoter 15
## 78 Human_Promoter_Villar_ExAC promoter 15
## 79 Human_Promoter_Villar_ExAC.flanking.500 promoter 15
## 37 Repressed_Hoffman repressed 16
## 38 Repressed_Hoffman.flanking.500 repressed 16
## 41 TFBS_ENCODE TFBS 17
## 42 TFBS_ENCODE.flanking.500 TFBS 17
## 43 Transcr_Hoffman transcribed 18
## 44 Transcr_Hoffman.flanking.500 transcribed 18
## 45 TSS_Hoffman TSS 19
## 46 TSS_Hoffman.flanking.500 TSS 19
## 47 UTR_3_UCSC UTR 20
## 48 UTR_3_UCSC.flanking.500 UTR 20
## 49 UTR_5_UCSC UTR 20
## 50 UTR_5_UCSC.flanking.500 UTR 20
I then take the group averages and add noise to ensure that the auxiliary data is continuous (required to fit our KDE). For the noise term I use rnorm(n, mean = 0, sd = .1/#{in group}
(histograms are shown further down this document).
I then run a lasso on log(p)
against the averaged (+ noise) annotations to choose which to iterate over. Interestingly, lambda.1se
picks out no annotations whilst lambda.min
picks out 16, shown below.
However, the cross validation plot for lambda doesn’t look great..
I look at the results when using stepwise regression instead. Here, we need to check that predictors aren’t highly correlated:
In summary, Lasso picks out 16/20 annotations (dropping coding, conserved, CTCF, H3K4me3) whilst stepwise regression selects 13/20 (dropping those above and UTR, eQTL, TSS).
##
## Call:
## lm(formula = logp ~ DGF + DHS + enhancer + H3K27ac + H3K4me1 +
## H3K9ac + intron + promoter + repressed + TFBS + transcribed +
## DNA_meth + bivalent, data = reg_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -54.449 -0.372 0.327 0.734 1.184
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.004484 0.004637 -216.600 < 2e-16 ***
## DGF -0.030906 0.007187 -4.300 1.70e-05 ***
## DHS -0.022386 0.010962 -2.042 0.041128 *
## enhancer -0.031509 0.022044 -1.429 0.152904
## H3K27ac -0.029117 0.012600 -2.311 0.020839 *
## H3K4me1 -0.101524 0.025503 -3.981 6.87e-05 ***
## H3K9ac 0.020067 0.003934 5.101 3.39e-07 ***
## intron -0.050061 0.014410 -3.474 0.000513 ***
## promoter -0.038967 0.015582 -2.501 0.012394 *
## repressed -0.024064 0.014741 -1.632 0.102581
## TFBS -0.051527 0.013861 -3.717 0.000201 ***
## transcribed -0.045417 0.020112 -2.258 0.023935 *
## DNA_meth -0.013537 0.004823 -2.807 0.005007 **
## bivalent -0.063862 0.012174 -5.246 1.56e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.12 on 509702 degrees of freedom
## Multiple R-squared: 0.0006513, Adjusted R-squared: 0.0006258
## F-statistic: 25.55 on 13 and 509702 DF, p-value: < 2.2e-16
I begin iterating over the annotations selected by stepwise regression, in order of correlation with p (although the order shouldn’t matter) (takes 6.5 hours to iterate over 13 annotations) [see /rds/user/ah2011/hpc-work/asthma/stepwise_annots
].
cor <- c()
for(i in 1:ncol(stepwise_annots)){
cor[i] <- cor(p[indep_index], stepwise_annots[indep_index,i])
}
names(cor) <- names(stepwise_annots)
sort(abs(cor), decreasing = TRUE)
## DGF bivalent enhancer H3K27ac DHS DNA_meth
## 0.0080990567 0.0070612517 0.0068018127 0.0065218768 0.0062893199 0.0062539180
## H3K4me1 TFBS promoter repressed intron transcribed
## 0.0062409250 0.0050972464 0.0043306519 0.0041150544 0.0038560319 0.0015013272
## H3K9ac
## 0.0001683775
#asthma_qs <- stepwise_annots[,names(sort(abs(cor), decreasing = TRUE))]
First, lets look at the histogram of the \(q\) that we iterate over (horizontally scrollable):
Now, lets look at the relationship between p and q in each iteration:
Now, looking at p vrs v in each iteration:
Now looking at the -log10 plots:
Iteration 1:
Small q are not getting shrunk. I’ve looked at increasing ngridp
to try and fix this but this seems hacky. I think that it is ok to leave as it is (artefact of how the cFDR curves look, due to very few points in the lowest “hump”).
Iteration 6 (DNA_meth), 7 (H3K4me1), 8 (TFBS), …, (promoter) (repressed) (transcribed) etc. Are the values too variable within stripes (representing each mode i.e. each discrete average value)? Technically these all have the same q value (except noise).
E.g. for iteration 8 (TFBS) - the grouped annotation averages can take 3 values. We then add noise around these. The plot below shows that SNPs with the same TFBS value (+noise) can be given extremely different v-values (coloured by the three values the grouped annotation can take).
A possible solution is to decrease the SD for the noise term (currently .1/number of annotations in group).
Iteration 9 (promoter) and 10 (repressed) are things shrunk too much? These iterations mean that loads more things are called significant.
E.g. Iteration 10 (repressed) grouped-annotation (before noise) can take on 3 distinct values (0 - black, 0.5 - green, 1 - red) but they are shrunk massively.
Spline correction is causing the super low lines on the H3K4me1, promoter and repressed iterations.
E.g. without spline correction, iteration 10 (repressed) looks like this (but the -log10 plots are messed up):
The results obviously look better for “nicer” q (properly continuous). E.g. iteration 13 (H3K9ac). Is there anyway we can summarise the data so that each iteration looks more like this?
I have re-ran my simulations so that they include a column for the maximum \(r^2\) values with the CVs. This allows me to alter the \(r^2\) thresholds used to call associated and not-associated SNPs.
As I decrease the \(r^2\) threshold used to call not-associated SNPs, the specificity increases, showing that the drop in specificity is an LD problem.
FALSE No id variables; using all as measure variables
FALSE No id variables; using all as measure variables
FALSE No id variables; using all as measure variables
FALSE No id variables; using all as measure variables
FALSE No id variables; using all as measure variables
In my final simulation plots, “associated” is \(r^2>0.8\) with a CV (approx. 1500 SNPs) and “not associated” is \(r^2<0.05\) with all CVs (approx. 67,000 SNPs). See the seperate figure PDF for these plots.
For GenoWAP/FINDOR comparison should I be using GWS or FDR threshold to call significant things?
GenoWAP comparison figure?
FINDOR comparison figure?
Make R package
Binary cFDR - not needed for this manuscript. Should I write another short manuscript (e.g. for bioinformatics with the cFDR software - including binary implementation?)