10th September 2020

Asthma annotations

[See explanation of some of the annotations in the “Full baseline model” section here].

I firstly remove the 17 non-functional annotations (MAF and LD-based) included in the baseline-LD model V2.2, leaving me with 79 functional annotations. I group these into larger groups:

##                                     original_annots      groups numeric_groups
## 67                                          BivFlnk    bivalent              1
## 68                             BivFlnk.flanking.500    bivalent              1
## 1                                       Coding_UCSC      coding              2
## 2                          Coding_UCSC.flanking.500      coding              2
## 59                                       synonymous      coding              2
## 60                                   non_synonymous      coding              2
## 3                             Conserved_LindbladToh   conserved              3
## 4                Conserved_LindbladToh.flanking.500   conserved              3
## 53                                          GERP.NS   conserved              3
## 54                                      GERP.RSsup4   conserved              3
## 61              Conserved_Vertebrate_phastCons46way   conserved              3
## 62 Conserved_Vertebrate_phastCons46way.flanking.500   conserved              3
## 63                  Conserved_Mammal_phastCons46way   conserved              3
## 64     Conserved_Mammal_phastCons46way.flanking.500   conserved              3
## 65                 Conserved_Primate_phastCons46way   conserved              3
## 66    Conserved_Primate_phastCons46way.flanking.500   conserved              3
## 5                                      CTCF_Hoffman        CTCF              4
## 6                         CTCF_Hoffman.flanking.500        CTCF              4
## 7                                        DGF_ENCODE         DGF              5
## 8                           DGF_ENCODE.flanking.500         DGF              5
## 9                                  DHS_peaks_Trynka         DHS              6
## 10                                       DHS_Trynka         DHS              6
## 11                          DHS_Trynka.flanking.500         DHS              6
## 16                                  FetalDHS_Trynka         DHS              6
## 17                     FetalDHS_Trynka.flanking.500         DHS              6
## 58                 BLUEPRINT_DNA_methylation_MaxCPP    DNA_meth              7
## 12                               Enhancer_Andersson    enhancer              8
## 13                  Enhancer_Andersson.flanking.500    enhancer              8
## 14                                 Enhancer_Hoffman    enhancer              8
## 15                    Enhancer_Hoffman.flanking.500    enhancer              8
## 39                              SuperEnhancer_Hnisz    enhancer              8
## 40                 SuperEnhancer_Hnisz.flanking.500    enhancer              8
## 51                             WeakEnhancer_Hoffman    enhancer              8
## 52                WeakEnhancer_Hoffman.flanking.500    enhancer              8
## 71                            Human_Enhancer_Villar    enhancer              8
## 72               Human_Enhancer_Villar.flanking.500    enhancer              8
## 75              Ancient_Sequence_Age_Human_Enhancer    enhancer              8
## 76 Ancient_Sequence_Age_Human_Enhancer.flanking.500    enhancer              8
## 77     Human_Enhancer_Villar_Species_Enhancer_Count    enhancer              8
## 55                                 GTEx_eQTL_MaxCPP        eQTL              9
## 18                                    H3K27ac_Hnisz     H3K27ac             10
## 19                       H3K27ac_Hnisz.flanking.500     H3K27ac             10
## 20                                     H3K27ac_PGC2     H3K27ac             10
## 21                        H3K27ac_PGC2.flanking.500     H3K27ac             10
## 56                      BLUEPRINT_H3K27acQTL_MaxCPP     H3K27ac             10
## 22                             H3K4me1_peaks_Trynka     H3K4me1             11
## 23                                   H3K4me1_Trynka     H3K4me1             11
## 24                      H3K4me1_Trynka.flanking.500     H3K4me1             11
## 57                      BLUEPRINT_H3K4me1QTL_MaxCPP     H3K4me1             11
## 25                             H3K4me3_peaks_Trynka     H3K4me3             12
## 26                                   H3K4me3_Trynka     H3K4me3             12
## 27                      H3K4me3_Trynka.flanking.500     H3K4me3             12
## 28                              H3K9ac_peaks_Trynka      H3K9ac             13
## 29                                    H3K9ac_Trynka      H3K9ac             13
## 30                       H3K9ac_Trynka.flanking.500      H3K9ac             13
## 31                                      Intron_UCSC      intron             14
## 32                         Intron_UCSC.flanking.500      intron             14
## 33                         PromoterFlanking_Hoffman    promoter             15
## 34            PromoterFlanking_Hoffman.flanking.500    promoter             15
## 35                                    Promoter_UCSC    promoter             15
## 36                       Promoter_UCSC.flanking.500    promoter             15
## 69                            Human_Promoter_Villar    promoter             15
## 70               Human_Promoter_Villar.flanking.500    promoter             15
## 73              Ancient_Sequence_Age_Human_Promoter    promoter             15
## 74 Ancient_Sequence_Age_Human_Promoter.flanking.500    promoter             15
## 78                       Human_Promoter_Villar_ExAC    promoter             15
## 79          Human_Promoter_Villar_ExAC.flanking.500    promoter             15
## 37                                Repressed_Hoffman   repressed             16
## 38                   Repressed_Hoffman.flanking.500   repressed             16
## 41                                      TFBS_ENCODE        TFBS             17
## 42                         TFBS_ENCODE.flanking.500        TFBS             17
## 43                                  Transcr_Hoffman transcribed             18
## 44                     Transcr_Hoffman.flanking.500 transcribed             18
## 45                                      TSS_Hoffman         TSS             19
## 46                         TSS_Hoffman.flanking.500         TSS             19
## 47                                       UTR_3_UCSC         UTR             20
## 48                          UTR_3_UCSC.flanking.500         UTR             20
## 49                                       UTR_5_UCSC         UTR             20
## 50                          UTR_5_UCSC.flanking.500         UTR             20

I then take the group averages and add noise to ensure that the auxiliary data is continuous (required to fit our KDE). For the noise term I use rnorm(n, mean = 0, sd = .1/#{in group} (histograms are shown further down this document).

I then run a lasso on log(p) against the averaged (+ noise) annotations to choose which to iterate over. Interestingly, lambda.1se picks out no annotations whilst lambda.min picks out 16, shown below.

However, the cross validation plot for lambda doesn’t look great..

I look at the results when using stepwise regression instead. Here, we need to check that predictors aren’t highly correlated:

In summary, Lasso picks out 16/20 annotations (dropping coding, conserved, CTCF, H3K4me3) whilst stepwise regression selects 13/20 (dropping those above and UTR, eQTL, TSS).

## 
## Call:
## lm(formula = logp ~ DGF + DHS + enhancer + H3K27ac + H3K4me1 + 
##     H3K9ac + intron + promoter + repressed + TFBS + transcribed + 
##     DNA_meth + bivalent, data = reg_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -54.449  -0.372   0.327   0.734   1.184 
## 
## Coefficients:
##              Estimate Std. Error  t value Pr(>|t|)    
## (Intercept) -1.004484   0.004637 -216.600  < 2e-16 ***
## DGF         -0.030906   0.007187   -4.300 1.70e-05 ***
## DHS         -0.022386   0.010962   -2.042 0.041128 *  
## enhancer    -0.031509   0.022044   -1.429 0.152904    
## H3K27ac     -0.029117   0.012600   -2.311 0.020839 *  
## H3K4me1     -0.101524   0.025503   -3.981 6.87e-05 ***
## H3K9ac       0.020067   0.003934    5.101 3.39e-07 ***
## intron      -0.050061   0.014410   -3.474 0.000513 ***
## promoter    -0.038967   0.015582   -2.501 0.012394 *  
## repressed   -0.024064   0.014741   -1.632 0.102581    
## TFBS        -0.051527   0.013861   -3.717 0.000201 ***
## transcribed -0.045417   0.020112   -2.258 0.023935 *  
## DNA_meth    -0.013537   0.004823   -2.807 0.005007 ** 
## bivalent    -0.063862   0.012174   -5.246 1.56e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.12 on 509702 degrees of freedom
## Multiple R-squared:  0.0006513,  Adjusted R-squared:  0.0006258 
## F-statistic: 25.55 on 13 and 509702 DF,  p-value: < 2.2e-16

I begin iterating over the annotations selected by stepwise regression, in order of correlation with p (although the order shouldn’t matter) (takes 6.5 hours to iterate over 13 annotations) [see /rds/user/ah2011/hpc-work/asthma/stepwise_annots].

cor <- c()
for(i in 1:ncol(stepwise_annots)){
  cor[i] <- cor(p[indep_index], stepwise_annots[indep_index,i])
}

names(cor) <- names(stepwise_annots)

sort(abs(cor), decreasing = TRUE)

##          DGF     bivalent     enhancer      H3K27ac          DHS     DNA_meth 
## 0.0080990567 0.0070612517 0.0068018127 0.0065218768 0.0062893199 0.0062539180 
##      H3K4me1         TFBS     promoter    repressed       intron  transcribed 
## 0.0062409250 0.0050972464 0.0043306519 0.0041150544 0.0038560319 0.0015013272 
##       H3K9ac 
## 0.0001683775

#asthma_qs <- stepwise_annots[,names(sort(abs(cor), decreasing = TRUE))]

First, lets look at the histogram of the \(q\) that we iterate over (horizontally scrollable):

Now, lets look at the relationship between p and q in each iteration:

Now, looking at p vrs v in each iteration:

Now looking at the -log10 plots:

Problems:

Iteration 1:

Small q are not getting shrunk. I’ve looked at increasing ngridp to try and fix this but this seems hacky. I think that it is ok to leave as it is (artefact of how the cFDR curves look, due to very few points in the lowest “hump”).

Iteration 6 (DNA_meth), 7 (H3K4me1), 8 (TFBS), …, (promoter) (repressed) (transcribed) etc. Are the values too variable within stripes (representing each mode i.e. each discrete average value)? Technically these all have the same q value (except noise).

E.g. for iteration 8 (TFBS) - the grouped annotation averages can take 3 values. We then add noise around these. The plot below shows that SNPs with the same TFBS value (+noise) can be given extremely different v-values (coloured by the three values the grouped annotation can take).

A possible solution is to decrease the SD for the noise term (currently .1/number of annotations in group).

Iteration 9 (promoter) and 10 (repressed) are things shrunk too much? These iterations mean that loads more things are called significant.

E.g. Iteration 10 (repressed) grouped-annotation (before noise) can take on 3 distinct values (0 - black, 0.5 - green, 1 - red) but they are shrunk massively.

Spline correction is causing the super low lines on the H3K4me1, promoter and repressed iterations.

E.g. without spline correction, iteration 10 (repressed) looks like this (but the -log10 plots are messed up):

The results obviously look better for “nicer” q (properly continuous). E.g. iteration 13 (H3K9ac). Is there anyway we can summarise the data so that each iteration looks more like this?

Simulations

I have re-ran my simulations so that they include a column for the maximum \(r^2\) values with the CVs. This allows me to alter the \(r^2\) thresholds used to call associated and not-associated SNPs.

As I decrease the \(r^2\) threshold used to call not-associated SNPs, the specificity increases, showing that the drop in specificity is an LD problem.

FALSE No id variables; using all as measure variables
FALSE No id variables; using all as measure variables
FALSE No id variables; using all as measure variables
FALSE No id variables; using all as measure variables
FALSE No id variables; using all as measure variables

In my final simulation plots, “associated” is \(r^2>0.8\) with a CV (approx. 1500 SNPs) and “not associated” is \(r^2<0.05\) with all CVs (approx. 67,000 SNPs). See the seperate figure PDF for these plots.

To do

For GenoWAP/FINDOR comparison should I be using GWS or FDR threshold to call significant things?
GenoWAP comparison figure?
FINDOR comparison figure?
Make R package
Binary cFDR - not needed for this manuscript. Should I write another short manuscript (e.g. for bioinformatics with the cFDR software - including binary implementation?)

10th September 2020

Anna Hutchinson

Asthma annotations

Problems:

Simulations

To do