I have made a `functional_cfdr_v2` function, which has the following amendments:
No leave-one-chromosome-out (LOCO) method
Use earlier density estimate of \(Q|H0\) rather than estimating it again in the final integration step
Remove the contour segment in the problematic low-\(q\) region.
The cFDR method struggles to find contours at very low \(q\), where there are very few data points. This results in segmented contour lines from the `contourLines` function (with the first segment lying in the problematic region). Interestingly, these segmented contour lines occur only for large \(p\).
I remove the first segment of each segmented contour (the one in the problematic low-\(q\) region) and let the contour drop straight down. James does something similar in his `vly` function, whereby he sets the coordinate equal to the cFDR value at which we are evaluating the contour, or the value evaluated before this point.
This function takes ~15 minutes to run for the 121,000 SNPs.
To test the robustness of the iterations, I begin by simulating some independent \(q\) vectors that are:
1. Sampled from the standard uniform
Looking at the results over each iteration (horizontally scrollable figure), things look ok.
2. Sampled from various mixture normals
The results when iterating look mostly ok, except for some \(v\) values for big \(p\) getting shrunk to \(\sim 0\), even after the first iteration.
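The two kinds of simulated \(q\) vectors above can be generated along these lines; this is an illustrative Python sketch, and the mixture weights and component parameters are placeholders, not the ones used in the actual simulations.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 121_000                                       # matching the SNP count above

# case 1: sampled from the standard uniform
q_unif = rng.uniform(0.0, 1.0, n)

# case 2: a two-component normal mixture (illustrative parameters)
comp = rng.random(n) < 0.8                        # 80/20 mixture weights
q_mix = np.where(comp,
                 rng.normal(0.0, 1.0, n),         # bulk component
                 rng.normal(-4.0, 2.0, n))        # component producing very low q
```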
I further investigate the problematic \(p\) values that are shrunk to approximately 0 (the bottom row of points). The problematic points whose \(v\) values drop to \(\sim 0\) for big \(p\) are exactly those indices with the smallest \(q\) values.
problematic <- c(512,8114,9945,22150,26221,28257,36334,47651,48171,50027,65198,68725,69214,83775,86758,97396,98061,111930,117896,117954)
df[problematic,c("p","v1","q1")]
## p v1 q1
## 512 0.9902070 3.559145e-04 -7.960117
## 8114 0.6573259 6.757383e-05 -8.619876
## 9945 0.7612904 3.220159e-04 -8.139427
## 22150 0.3233042 9.780110e-05 -8.573460
## 26221 0.8690088 3.586122e-04 -8.029130
## 28257 0.2148284 4.269887e-05 -8.723939
## 36334 0.9264683 1.251823e-05 -9.711855
## 47651 0.9996029 1.531063e-04 -8.391671
## 48171 0.9530344 2.607506e-04 -8.153745
## 50027 0.8703830 2.233774e-04 -8.206734
## 65198 0.7480449 1.851715e-04 -8.312892
## 68725 0.9473514 3.721990e-04 -7.955326
## 69214 0.6426269 1.521694e-04 -8.405320
## 83775 0.6636454 2.954917e-04 -8.199608
## 86758 0.4817340 1.076933e-04 -8.453504
## 97396 0.8404744 1.870133e-04 -8.303718
## 98061 0.5647271 1.494818e-04 -8.432636
## 111930 0.9932387 6.757383e-05 -8.646424
## 117896 0.8053611 1.943965e-04 -8.278441
## 117954 0.4469641 3.465357e-04 -8.256962
I step through the code.
Firstly, it seems that the KDE of \(Q|H0\) (red) fits the data fairly well in these regions.
`kgrid` estimates \(P(P<=p, Q<=q)/P(Q<=q|H0)\); we see that for low \(q\) and big \(p\) (low ZP), \(P(P<=p, Q<=q)<P(Q<=q|H0)\). This impacts the cFDR curves (`cgrid` estimates cFDR ~ p/`kgrid`) for low \(q\).
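The `kgrid`/`cgrid` relationship can be made concrete with a tiny numeric sketch: when the joint probability falls below \(P(Q<=q|H0)\), `kgrid` drops below 1 and the cFDR estimate exceeds \(p\). The probabilities below are invented for illustration.

```python
def cfdr_estimate(p, joint_prob, q_null_prob):
    """cgrid-style estimate: cFDR ~ p / kgrid."""
    kgrid = joint_prob / q_null_prob    # P(P<=p, Q<=q) / P(Q<=q|H0)
    return p / kgrid

# big p, low q: joint probability smaller than the null q probability
est = cfdr_estimate(0.9, joint_prob=0.02, q_null_prob=0.05)  # 2.25, i.e. > p
```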
The default bandwidth used for the KDE estimation (using `kde2d`) is:
bandwidth.nrd(c(q, q))
## [1] 0.9563462
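For reference, `bandwidth.nrd` (from MASS) implements a normal-reference rule of thumb. A Python transcription of the documented formula, assuming NumPy's default quantile interpolation matches R's default (type 7), which it does:

```python
import numpy as np

def bandwidth_nrd(x):
    """Transcription of MASS::bandwidth.nrd:
    4 * 1.06 * min(sd, IQR/1.34) * n^(-1/5)."""
    x = np.asarray(x, dtype=float)
    q25, q75 = np.quantile(x, [0.25, 0.75])
    h = (q75 - q25) / 1.34                  # IQR-based spread estimate
    sd = np.sqrt(np.var(x, ddof=1))         # sample standard deviation
    return 4 * 1.06 * min(sd, h) * len(x) ** (-0.2)
```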
I had a quick look at using a smaller bandwidth (0.3) and a larger bandwidth (2) and the results are:
Possible solution: use a variable-bandwidth KDE, where the bandwidth adapts to the local density of data points, so that sparser data gets a larger bandwidth (see https://rdrr.io/cran/ks/man/vkde.html).
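A rough sketch of the adaptive idea (Abramson-style): start from a fixed-bandwidth pilot estimate, then scale each point's bandwidth by the inverse square root of the local pilot density, so sparse regions get larger bandwidths. This is illustrative only; the `ks` functions linked above implement this properly.

```python
import numpy as np

def adaptive_bandwidths(x, h_pilot):
    """Per-point bandwidths: h_i = h_pilot * sqrt(g / f_pilot(x_i)),
    where g is the geometric mean of the pilot density."""
    x = np.asarray(x, dtype=float)
    # fixed-bandwidth Gaussian pilot density at each data point
    diffs = (x[:, None] - x[None, :]) / h_pilot
    pilot = np.exp(-0.5 * diffs**2).sum(axis=1) / (len(x) * h_pilot * np.sqrt(2 * np.pi))
    g = np.exp(np.mean(np.log(pilot)))      # geometric mean of pilot density
    return h_pilot * np.sqrt(g / pilot)     # sparser data -> larger bandwidth

x = np.concatenate([np.zeros(50), np.array([10.0])])  # dense cluster + one isolated point
h = adaptive_bandwidths(x, h_pilot=1.0)
# the isolated point gets a much larger bandwidth than the cluster points
```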
Sources:
(The following text is adapted from the FINDOR manuscript.)
In S-LDSC, the association statistic at SNP \(j\) measured or imputed in \(N_j\) individuals is expressed in terms of its tagging or studied annotations. Specifically,
\[\begin{equation} E(\chi_j^2)=N_j \sum_C \tau_C \ell(j,C) + N_j\alpha +1 \end{equation}\]
where \(\alpha\) represents confounding biases, \(\tau_C\) is the effect size on per-SNP heritability of annotation C and \(\ell(j,C)\) is the LD score that indicates the degree to which SNP \(j\) tags annotation C:
\[\begin{equation} \ell(j,C)=\sum_k C(k)r^2_{k,j} \end{equation}\]
where, \(C(k)\) is the value of annotation C at SNP \(k\) and \(r^2_{k,j}\) is the squared Pearson correlation coefficient between SNPs \(k\) and \(j\).
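The LD-score definition above can be sketched directly: given a SNP-by-SNP squared-correlation matrix and an annotation vector, the annotation-specific LD score of SNP \(j\) is \(\sum_k C(k) r^2_{k,j}\). A small Python illustration with toy numbers:

```python
import numpy as np

def ld_scores(r2, C):
    """l(j, C) = sum_k C(k) * r2[k, j], computed for every SNP j at once."""
    return r2.T @ C

# toy 3-SNP squared-correlation matrix (symmetric, 1 on the diagonal)
r2 = np.array([[1.0, 0.5, 0.0],
               [0.5, 1.0, 0.2],
               [0.0, 0.2, 1.0]])
C = np.array([1.0, 0.0, 1.0])    # SNPs 1 and 3 carry the annotation
scores = ld_scores(r2, C)        # l(j, C) for each of the three SNPs
```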
\(\tau_C\) can be interpreted as the strength of enrichment (or depletion) of heritability within annotation \(C\) and are obtained through a multivariate (weighted) regression of the observed \(\chi^2\) statistics against the corresponding \(\ell(j,C)\) value (also see “regression weights” section here).
Perhaps we could use \(\tau_C\) as our monotonic auxiliary functional data. Note that it is not as straightforward as regressing the \(\chi^2\) statistics on the LD scores to get \(\tau_C\). There are two problems that need to be dealt with:
1. \(\chi^2\) statistics at SNPs in LD are correlated.
Conventionally, GLS would be used to deal with this problem, using the variance-covariance matrix of \(\chi^2\) statistics, but this matrix is intractable.
Instead, variant \(j\) is weighted by the reciprocal of the LD score of variant \(j\), counting LD only with other SNPs included in the regression. I.e. we weight by \(1/max(\ell_j(S),1)\) (where the max ensures positive regression weights):
\[\begin{equation} \ell_j(S)=1+\sum_{k \in S} r^2_{jk}. \end{equation}\]
2. Heteroscedasticity: the variance of the \(\chi^2\) statistics increases as LD scores increase.
To counter this, variant \(j\) is additionally weighted by
\[\begin{equation} 1/(1+Nh^2_g\ell_j/M)^2 \end{equation}\]
which is the reciprocal of the conditional variance function \(var(\chi^2_j|\ell_j)\) (up to a constant) under our model, if we make the additional assumption that the effect sizes for each normalised genotype are normally distributed.
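The two weights can be combined multiplicatively in a weighted least-squares fit of \(\chi^2\) on LD scores. The sketch below is a hand-rolled Python stand-in for what the ldsc software does, with a single annotation and made-up values of \(N\), \(h^2_g\) and \(M\); it is not ldsc's actual implementation.

```python
import numpy as np

def wls_tau(chi2, ell, ell_S, N, h2, M):
    """Estimate tau from E[chi2] = N*tau*ell + intercept via weighted LS."""
    w_ld = 1.0 / np.maximum(ell_S, 1.0)           # weight 1: LD among regression SNPs
    w_het = 1.0 / (1.0 + N * h2 * ell / M) ** 2   # weight 2: heteroscedasticity
    w = w_ld * w_het                              # applied multiplicatively
    X = np.column_stack([np.ones_like(ell), ell])
    W = np.diag(w)
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ chi2)
    return beta[1] / N                            # slope is N*tau under the model

# exact recovery check on noiseless synthetic data: chi2 = N*tau*ell + 1
ell = np.linspace(1.0, 10.0, 20)
tau_hat = wls_tau(chi2=1000 * 0.001 * ell + 1.0, ell=ell, ell_S=ell,
                  N=1000, h2=0.5, M=10_000)
```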
So are there two regression weights applied consecutively?
Note: “Regression weights don’t affect the expectation of the parameter estimates, only the standard error” - are they really needed here?
At the moment I am using the baseline-LD model v2.2 which includes 97 non-cell-type-specific annotations (see here).
The FINDOR paper finds that estimating \(\tau_C\) from the entire GWAS dataset does not introduce false positives (but if we’re worried we can use the leave-one-chromosome-out method).
I visualise the results using a forest plot. I do a quick sanity check that the results make sense:
Coding and Enhancer regions decrease \(\chi^2\) values (?).
GERP (conserved region) decreases \(\chi^2\) values.
Enhancers with ancient sequence age increases \(\chi^2\) values (?)
Promoters with ancient sequence age decreases \(\chi^2\) values (?)
CpG content increases \(\chi^2\) values (?).
Synonymous and non-synonymous both increase \(\chi^2\) values (synonymous more so (?))
Rather than coding up the weighted regression myself, I think it would be better to use the ldsc software.
Method:
Generate asthma summary statistics file (only keeping HapMap3 SNPs as recommended in the vignette - this reduces the number of SNPs from 2,001,256 to 1,045,340).
Download v2.2 baseline model LD scores (1000G Phase 3), regression weights and allele frequencies.
Run ldsc (1,032,395 SNPs remain after merging with baseline files). [lambda_GC=1.0255, mean Chi^2=1.0717].
The first few rows of the results file look like this:
## Category Prop._SNPs Prop._h2
## 1: baseL2_0 1.00000000 1.00000000
## 2: Coding_UCSCL2_0 0.01425914 0.11594246
## 3: Coding_UCSC.flanking.500L2_0 0.04936288 0.14009295
## 4: Conserved_LindbladTohL2_0 0.02467054 0.10974047
## 5: Conserved_LindbladToh.flanking.500L2_0 0.30553136 0.39245781
## 6: CTCF_HoffmanL2_0 0.02381466 0.05795815
## Prop._h2_std_error Enrichment Enrichment_std_error Enrichment_p
## 1: 0.00000000 1.000000 0.0000000 NA
## 2: 0.05063785 8.131098 3.5512556 0.04093102
## 3: 0.10124352 2.838022 2.0510050 0.37507980
## 4: 0.09706498 4.448240 3.9344493 0.36020882
## 5: 0.18035825 1.284509 0.5903101 0.62953300
## 6: 0.09742828 2.433717 4.0911045 0.72282693
## Coefficient Coefficient_std_error Coefficient_z-score
## 1: -3.127618e-09 9.470911e-09 -0.3302341
## 2: 1.363337e-07 9.436131e-08 1.4448047
## 3: 6.104273e-09 2.313859e-08 0.2638135
## 4: -4.470751e-08 6.923433e-08 -0.6457419
## 5: 2.212767e-09 1.290613e-08 0.1714509
## 6: 7.786252e-09 4.013776e-08 0.1939882
Note that I’ve removed some anomalous annotations (see https://www.nature.com/articles/s41588-018-0177-x, https://xiangzhu.github.io/mvp-cad/mvpcad_ldsc_baselineLD.html, etc.).
Plotting the enrichment values (\(p\) values given above) (horizontally scrollable figure):
Note this FAQ:
Q. Why is one of my partitioned h2 estimates negative?
A. If your partitioned h2 estimate is non-significantly negative, then this may just be sampling noise. If your partitioned h2 estimate is significantly negative, this may indicate model misspecification. Fitting well-specified models of partitioned heritability is hard. I hate to link to a paper in the middle of an FAQ, but this was the main technical challenge in the paper we wrote about partitioned heritability, so I don’t know of a better resource. See Finucane, Bulik-Sullivan et al, 2015.
Plotting the coefficient values (\(p\) values given above):
The idea was to pick out the significant annotations and use these to generate our \(q\) values (which should be monotonic in \(p\)). E.g. if I take the annotation `BLUEPRINT_H3K27acQTL_MaxCPP` (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6030458/), which has the most significant coefficient (\(\tau_C=1.969628e-07\); \(p=0.0015\)), then Chris suggested that \(T_c=\tau_c \ell(j,C)\) would be monotonic in \(p\) (where \(\ell(j,C)\) is the LD score).
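If Chris's suggestion holds, the per-SNP auxiliary covariate would simply be the coefficient times that SNP's annotation LD score. A one-line Python sketch using the \(\tau_C\) value quoted above (the per-SNP LD scores here are placeholders):

```python
import numpy as np

tau_c = 1.969628e-07                  # BLUEPRINT_H3K27acQTL_MaxCPP coefficient
ell_jC = np.array([12.3, 54.1, 3.7])  # illustrative per-SNP annotation LD scores
T_c = tau_c * ell_jC                  # candidate q values, monotone in the LD score
```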
Comments
LDSC requires large GWAS sample sizes, and if using precomputed LD scores, these are currently limited to European and East Asian populations. It also shouldn’t be used for custom genotyping arrays.
When I do want to include the cell-type-specific annotations, I need to include each of these in addition to the baseline model separately to control for overlap (i.e. for 220 cell-type-specific annotations, I will have 220 separate models).
“When generating these 220 cell type–specific annotations, we wanted to control for overlap with the functional categories in the full baseline model but not for overlap with the 219 other cell type–specific annotations. Thus, we added these annotations individually to the baseline model, creating 220 separate models, each with 54 annotations. Then, for a given phenotype, we ran LD score regression once each on the 220 models and ranked the cell type–specific annotations by the P value of the coefficient \(\tau_C\) of the annotation in the corresponding analysis. This P value tests whether the annotation contributes significantly to SNP heritability after controlling for the effects of the annotations in the full baseline model. We also divided the 220 cell type–specific annotations into 10 groups: adrenal and pancreas, CNS, cardiovascular, connective and bone, gastrointestinal, immune and hematopoietic, kidney, liver, skeletal muscle and other. We took a union of the cell type–specific annotations within each group, resulting in ten new cell type group annotations (for example, SNPs with any of the four histone modifications in any CNS cell type). We then repeated the cell type–specific analysis described above with these ten cell type groups instead of the 220 cell type–specific annotations.”
Per-standardised-annotation effect sizes \(\tau_C^*\) are defined as the additive change in per-SNP heritability associated with a 1 s.d increase in the value of the annotation, divided by the average per-SNP heritability over all SNPs for the trait:
\[\begin{equation} \tau_C^*=\dfrac{M_{h_g^2}sd_C}{h_g^2} \hat\tau_C \end{equation}\]
where \(h_g^2\) is the estimated SNP heritability of the trait, computed as:
\[\begin{equation} h_g^2=\sum_j var(\beta_j)=\sum_j\sum_C a_C(j)\tau_C \end{equation}\]
and \(M_{h_g^2}\) is the number of SNPs used to compute \(h_g^2\), and \(sd_C\) is the standard deviation of the annotation \(a_C\). The standard error of \(\hat\tau_C\) was computed using a block jackknife over SNPs, with 200 equally sized blocks of adjacent SNPs, and was used to compute a z-score to test for the significance of \(\tau_C^*\).
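The standardisation above is just a rescaling, which a quick numeric sketch makes concrete; all inputs below are invented for illustration.

```python
def tau_star(tau_hat, sd_C, M_h2, h2_g):
    """tau*_C = (M_{h2} * sd_C / h2_g) * tau_hat, the change in per-SNP
    heritability per s.d. of the annotation, relative to the average
    per-SNP heritability."""
    return (M_h2 * sd_C / h2_g) * tau_hat

example = tau_star(tau_hat=2e-8, sd_C=0.5, M_h2=1_000_000, h2_g=0.4)  # 0.025
```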