SuSiE

Introduction
Simple Example
Realistic Example
Comparison with Bayesian Approach
Notes and Queries

Introduction

The SUm of SIngle Effects (SuSiE) model is an extension of single effect regression (SER), where the vector of coefficients has one non-zero entry representing the single effect variable. The SER method is extended to allow for multiple effect variables by writing the vector of coefficients, \(\bf{b}\), as the sum of multiple single-effect vectors, \(\bf{b_1},...,\bf{b_L}\). So instead of just a single \(\bf{b}\) vector, there are now \(L\) of these. I.e.

\[\pmb{y}=\pmb{Xb}+\pmb{e}\] \[\pmb{e} \sim N(0, \sigma^2I_n)\]

So that \(\sigma^2\) is the residual variance.

\[\pmb{b}=\sum_{l=1}^L \pmb{b_l}\] So that there are now \(L\) vectors with one non-zero element.

\[\pmb{b_l}=\pmb{\gamma_l} b_l\]

Non-bold \(b_l\) is a scalar for the effect size of the non-zero element, with

\[b_l \sim N(0, \sigma^2_{0l})\] and \(\pmb\gamma_l\) is a binary vector with exactly one non-zero element.

\[\pmb{\gamma_l} \sim Mult(1, \pmb{\pi})\]

Notice, that \(\sigma^2_{0l}\) can vary among components, and if \(L=1\) then this is just the standard SER model. For now, \(\sigma^2\) (residual variance) and \(\bf{\sigma^2_{0}}\) (prior variance of the non-zero effect) are assumed known, but can be estimated as hyper-parameters using an empirical Bayes approach.

Some of the \(\bf{b_l}\) vectors may be the same (having the same non-zero entry) and thus, at most \(L\) variables have non-zero coefficients in the model.

Two key advantages:

Simple method for computing approximate posterior distributions.
Simple way to calculate credible sets.
- Each \(\bf{b_l}\) captures only 1 effect, and therefore the posterior distribution on each \(\gamma_l\) can be used to compute credible sets that have high probability of containing an effect variable.

Fitting SuSiE

The authors have developed “Iterative Bayesian Stepwise Selection” (IBSS) to fit the model. This was developed using the fact that if \(\bf{b_1},...,\bf{b_{L-1}}\) are known, then estimating \(\bf{b_L}\) involves fitting the simpler SER model (i.e. subtract the known \(\bf{b}\)s out of \(\bf{y}\) and do SER (single SNP BFs etc)), and thus an iterative approach should be adopted.

Basic idea: Go through \(L\)’s and at each point calculate the residuals (removing all the effects except 1), use the SER model to fit the one we’ve left out and iterate (pretending we know the others).

This is similar to forward selection, but instead of choosing the single best variable at each step, it computes a distribution on which variable to select, thus capturing uncertainty in the selected variable (which can be used to form credible sets). I.e. for each SNP \(j\) calculate it’s BF and PP, which gives a ‘weight’ on how good that SNP is (some quantification rather than forward selection which just includes the best one). When calculating the residuals, a weighted average of the SNPs is removed. It does this “backwards” as well.

It is shown in the paper that overestimating the value for \(L\) is better than underestimating, so it should be treated as an upper bound on the number of causal SNPs. This doesn’t turn out to be much of a problem, because if there are say 2 effects and we’ve chosen \(L=10\), then the latter 7 posterior probabilities will be spread over a very large number of SNPs as it “doesn’t know where to put it” –> get a very big credible set of not very correlated variants, can just remove these.

Results

It is shown in the paper that the PIPs obtained from CAVIAR, FINEMAP and DAP-G are largely comparable with those from SuSiE, despite differing modelling assumptions (see image below). This implies that the PIPs are robust to details of modelling assumptions. Notice that there are more cases where PIPs from SuSiE for the true effect variables are higher than the other methods (the points in red), indicating that SuSiE is better at distinguishing true effect variables.

It is shown that there are some small coverage issues with their resultant credible sets. Below, blue points are for DAP-G and red points are for SuSiE 95% credible sets.

Note: SuSiE needs to be developed for use on case-control data. To do this, the IBSS method needs to be modified by replacing the SERs with logistic or glm equivalents.

Simple Example

SuSiE is basically a method for variable selection in large scale regression. Here, I walk through the example given in the paper (https://stephenslab.github.io/susie-paper/manuscript_results/motivating_example.html).

Consider fitting the regression model,

\[y=\sum_{j=1}^p x_j\beta_j+\epsilon,\]

\[\epsilon \sim N(0,\sigma^2I_n).\]

where \(x_1=x_2\), \(x_3=x_4\) and \(\beta_1 \neq 0\), \(\beta_4 \neq 0\), \(\beta_2=0\) and \(\beta_3=0\).

We wish to make the inference that \(\beta_1 \neq 0\) or \(\beta_2 \neq 0\) and \(\beta_3 \neq 0\) or \(\beta_4 \neq 0\).

This is useful in genetic fine-mapping where variants are in high linkage disequilibrium. In this simplistic analogy, we consider 4 variants in the GWAS region - variant 1 and variant 4 are causal but variant 1 is completely correlated with variant 2, and variant 3 is completely correlated with variant 4, making inference of the true causal variants difficult.

The inference we wish to make is that variables \(x_1\) and \(x_2\) (and \(x_3\) and \(x_4\)) are too similar to distinguish.

Simulate a dataset with \(p=1000\) variants and \(n=500\) individuals.

set.seed(1)
n = 500
p = 1000
b = rep(0,p) # coefficient vector
b[200] = 1 # variant 200 (1) causal
b[800] = 1 # variant 800 (4) causal
X = matrix(rnorm(n*p), nrow=n, ncol=p) # model matrix 
X[,200] = X[,400] # variant 200 (1) and 400 (2) indistinguishable
X[,600] = X[,800] # variant 600 (3) and 800 (4) indistinguishable
y = X %*% b + rnorm(n) # regression model

Plot the true effect size.

plot(b, col="black", pch=16, main = 'True effect size')
pos = 1:length(b)
points(pos[b!=0], b[b!=0], col=2, pch=16)

LASSO inference

LASSO will choose one of:

\(x_1,x_3\)
\(x_1,x_4\) (TRUTH)
\(x_2,x_3\)
\(x_2,x_4\)

set.seed(1)
alpha = 1
y.fit = glmnet::glmnet(X, y, alpha = alpha, intercept = FALSE)
y.cv  = glmnet::cv.glmnet(X, y, alpha = alpha, intercept = FALSE,
                         lambda = y.fit$lambda)
bhat  = glmnet::predict.glmnet(y.fit, type ="coefficients",
                               s = y.cv$lambda.min)[-1,1]
plot(bhat, col="black", pch=16, main = 'Lasso')
pos = 1:length(bhat)
points(pos[b!=0], bhat[b!=0], col=2, pch=16)

Here, LASSO has chosen variants 200 (1) and 600 (3), recall that variants 200 (1) and 800 (4) are causal but 200 (1) is correlated with 400 (2) and 800 (4) with 600 (3). We see that option 1 has been picked by LASSO.

SuSiE inference

We now use SuSiE for the inference, setting \(L=5\) representing at most 5 true effects.

fitted = susieR::susie(X, y, L=5,
               estimate_residual_variance=TRUE, 
               scaled_prior_variance=0.2^2,
               tol=1e-3, track_fit=TRUE, min_abs_corr=0.1)

# X: n by p matrix of covariates
# y: vector of length n (phenotypes)
# L: no. of non-zero elements
# estimated_residual_variance: if TRUE, the variance of the residual
#                              is estimated separately for each L component
# scaled_prior_variance: provides initial estimates of the prior variances
#                        can be a scalar or a vector
# tol: convergence tolerance for IBSS fitting procedure
# min_abs_corr: minimum correlation (r) allowed in a credible set (default is 0.5)

susieR::susie_plot(fitted, y='PIP', b=b, max_cs=0.4, main = paste('SuSiE, ', length(fitted$sets$cs), 'CS identified'))

SuSiE has reported 2 credible sets, the first (green circles) containing variables \(x_1,x_2\), the second (blue circles) with \(x_3,x_4\). This is exactly the inference we wanted.

Realistic example

(Follows https://stephenslab.github.io/susieR/articles/finemapping.html)

Use simulated data containing expression level of a gene (\(y\)) in \(N=600\) individuals. Consider \(P=1000\) variants. The data is simulated to have exactly 3 non-zero effects.

# load data
data(N3finemapping)
attach(N3finemapping)
names(data)

## [1] "X"                 "chrom"             "pos"              
## [4] "true_coef"         "residual_variance" "Y"                
## [7] "allele_freq"       "V"

The data includes the regression data-set \(X\) and \(y\) and the true regression coefficient the data is simulated from.

We only use the first column of \(y\) for inference, the second is just a second replication.

b <- data$true_coef[,1] # vector of 1000 0s except at CVs where the value is the true effect
which(b!=0)

## [1] 403 653 773

plot(b, pch=16, ylab='effect size', main='True effect size')

The causal variants are 403, 653 and 773. The above plot shows the ‘true’ signals.

To perform fine-mapping using susieR, we consider that there are at most 10 causal variables (\(L=10\)). We set SuSiE prior variance (percentage of variance explained) to \(0.1\) (this seems robust to other choices - eQTL data!). The variational algorithm that fits SuSiE models updates the residual variance. The default is to report a 95% credible set.

fitted <- susie(data$X, data$Y[,1],
                L = 10,
                estimate_residual_variance = TRUE, 
                scaled_prior_variance = 0.1,
                verbose = TRUE)

## [1] "objective:-1392.42460921237"
## [1] "objective:-1388.98662572569"
## [1] "objective:-1387.87893344487"
## [1] "objective:-1387.87158194815"
## [1] "objective:-1387.73100740437"
## [1] "objective:-1387.73099852765"
## [1] "objective:-1387.67616609042"
## [1] "objective:-1387.67615783405"
## [1] "objective:-1386.37718431774"
## [1] "objective:-1386.3640270645"
## [1] "objective:-1381.99570801801"
## [1] "objective:-1381.86807518114"
## [1] "objective:-1381.18590761065"
## [1] "objective:-1381.18342754333"
## [1] "objective:-1381.13344267982"
## [1] "objective:-1381.13343270768"
## [1] "objective:-1381.13113705137"
## [1] "objective:-1381.1311369139"
## [1] "objective:-1381.13106293455"
## [1] "objective:-1381.13106291271"

print(fitted$sets)

## $cs
## $cs$L3
## [1] 653
## 
## $cs$L1
## [1] 773 777
## 
## $cs$L2
##  [1] 362 372 373 374 379 381 383 384 386 387 388 389 391 392 396 397 398
## [18] 399 400 401 403 404 405 407 408 415
## 
## 
## $purity
##    min.abs.corr mean.abs.corr median.abs.corr
## L3    1.0000000     1.0000000       1.0000000
## L1    0.9815726     0.9907863       0.9907863
## L2    0.8686309     0.9673674       0.9756251
## 
## $cs_index
## [1] 3 1 2
## 
## $coverage
## [1] 0.95

SuSiE reports three 95% credible sets (containing 1, 2 and 27 variants respectively), each containing one of the causal variants. The minimum absolute correlation is 0.86 - this should be high to ensure the credible sets contain highly correlated variables.

What happens if we change the threshold from 95% to 80%?

sets <- susie_get_cs(fitted,
                     X = data$X,
                     coverage = 0.8,
                     min_abs_corr = 0.1)
print(sets)

## $cs
## $cs$L3
## [1] 653
## 
## $cs$L1
## [1] 773 777
## 
## $cs$L2
##  [1] 374 379 381 383 384 386 387 388 389 391 396 399 403 404 405 408
## 
## 
## $purity
##    min.abs.corr mean.abs.corr median.abs.corr
## L3    1.0000000     1.0000000       1.0000000
## L1    0.9815726     0.9907863       0.9907863
## L2    0.9119572     0.9750637       0.9807260
## 
## $cs_index
## [1] 3 1 2
## 
## $coverage
## [1] 0.8

The only difference is that the size of the third credible set has decreased (less variants as we are “less sure” the CV is contained) and the ‘purity’ has increased from 0.86 to 0.91.

Posterior inclusion probabilities

susie_plot(fitted, y="PIP", b=b)

We find that in the third credible set, there are some variants with higher PIPs than the true causal variant in the set (variant 403).

Fine-mapping with summary statistics

If only summary statistics are available (i.e. we don’t have the output from each single SNP regression), susieR can only output coefficients in standardised \(X\), \(y\) scale.

Confusion: In the paper, it states that the full genotype level data is required but in the online vignette it states “When the aforementioned sufficient summary statistics are provided, SuSiE can produce the exact same outcome as if individual level data were used” with the “aforementioned sufficient summary statistics” being the OR, p-values, MAFs, sample sizes, variance of phenotype and LD.

The correlation matrix can be computed directly from the data.

R <- cor(data$X) # LD matrix - must be r

If we do not know the variance of \(y\) (which we don’t - since we don’t have individual level data - i.e. the \(y\) vector), then susieR gives the coefficients in standardized \(X\), \(y\) scale.

fitted_bhat_standardize <- susie_bhat(bhat = sumstats[1,,1], 
                                      shat = sumstats[2,,1], 
                                      R = R, n = nrow(data$X), 
                                      L = 10, 
                                      scaled_prior_variance = 0.1, 
                                      estimate_residual_variance = TRUE,
                                      estimate_prior_variance = FALSE)

summary(fitted_bhat_standardize)$cs

##   cs cs_log10bf cs_avg_r2 cs_min_r2
## 1  3   9.754560 1.0000000 1.0000000
## 2  1  15.973140 0.9816575 0.9634847
## 3  2   8.159787 0.9357997 0.7545197
##                                                                                                  variable
## 1                                                                                                     653
## 2                                                                                                 773,777
## 3 362,372,373,374,379,381,383,384,386,387,388,389,391,392,396,397,398,399,400,401,403,404,405,407,408,415

The results show that three credible sets have been found, each containing one of the causal variants (exactly the same inference as using the full data above).

susie_plot(fitted_bhat_standardize, y="PIP", b=b)

Using \(z\) scores

We have values for \(\hat{\beta}\) and \(SE(\hat{\beta})\) which can be used to obtain the z-scores. \[z=\frac{\hat{\beta}}{SE(\hat{\beta})}\]

z_scores <- sumstats[1,,] / sumstats[2,,]
z_scores <- z_scores[,1]
susie_plot(z_scores, y = "z", b=b) # error in function (ylab should be Z)

fitted_z <- susie_z(z_scores, R = R, L = 10)
summary(fitted_z)$cs

##   cs cs_log10bf cs_avg_r2 cs_min_r2
## 1  2   7.168100 1.0000000 1.0000000
## 2  1  13.319105 0.9816575 0.9634847
## 3  3   5.965511 0.9188853 0.6907024
##                                                                                                              variable
## 1                                                                                                                 653
## 2                                                                                                             773,777
## 3 360,362,365,368,372,373,374,379,381,383,384,386,387,388,389,391,392,396,397,398,399,400,401,403,404,405,407,408,415

susie_plot(fitted_z, y="PIP", b=b)

The number of variants in the third credible set has increased.

Comparison with Bayesian Approach

I now simulate some data and see how SuSiE performs. Note that this data is simulated from a case-control GWAS, for which SuSiE supposedly does not work.

1 CV

Data has been simulated using our standard code, with \(max(|Z|)=5.223\).

data1 <- readRDS("/Users/anna/Google Drive/PhD/SuSiE/data1.RDS")
names(data1)

## [1] "z-scores"    "LD"          "OR"          "N0"          "iCV"        
## [6] "allele-freq" "pp"

iCV <- data1$iCV

b <- rep(0,length(data1$`z-scores`))
b[iCV] <- 1
susie_plot(data1$`z-scores`, y = "z", b = b) # again, error in code - ylab should be Z

Let’s see how SuSiE performs using the z-scores and LD matrix. The CV is coloured in red.

fitted_z <- susie_z(data1$"z-scores", R = data1$"LD", L = 10, coverage=0.95)
summary(fitted_z)$cs

##   cs cs_log10bf cs_avg_r2 cs_min_r2                        variable
## 1  1   8.012054 0.8207658 0.5996833 106,109,111,112,129,143,148,149

susie_plot(fitted_z, y="PIP", b=b)

We see that variant 106 (iCV) is contained in the credible set. Let’s compare this with the Bayesian approach and use our correction to estimate the true coverage of the resultant credible set.

library(corrcoverage)

pp0 <- data1$pp
plot(pp0, pch = 16)
points(iCV, pp0[iCV], col = 2, pch = 16)

credset(pp0,iCV,thr=0.95)

## $credset
## [1] 112 111 109 106 149 148 129 143
## 
## $claimed.cov
## pos43439296 
##   0.9621392 
## 
## $covered
## [1] 1
## 
## $nvar
## pos43439296 
##           8

corrcov(z = data1$`z-scores`, f = data1$`allele-freq`, N0 = data1$N0, N1 = data1$N0, Sigma = data1$LD)

## [1] 0.9748017

	Thr	nvar	covered
SuSiE	0.95	8	TRUE
Bayesian	0.95	8	TRUE

The standard Bayesian method and SuSiE pick the same credible set - this is expected as SuSiE’s novelty is made apparent when there are multiple effect variables.

2 CV

The standard Bayesian fine-mapping method (and therefore our correction method) are not guaranteed to be accurate when there is not exactly one CV in the region, however SuSiE is intended exactly in these scenarios.
I have now used our simulation method to simulate GWAS results for 2 CVs in the region.

data1 <- readRDS("/Users/anna/Google Drive/PhD/SuSiE/2cv_data.RDS")
names(data1)

## [1] "z-scores"    "LD"          "OR"          "N"           "iCV"        
## [6] "allele-freq" "pp"

iCV <- data1$iCV
iCV

## [1]  88 152

data1$`z-scores`[iCV]

## [1] 5.556533 4.573261

max(abs(data1$`z-scores`))

## [1] 5.556533

b <- rep(0,length(data1$`z-scores`))
b[iCV] <- 1
susie_plot(data1$`z-scores`, y = "z", b = b) # again, error in code - ylab should be Z

In our simulated data, the two CVs are variants 88 and 152 which have Z-scores of 5.557 and 4.574, respectively.

fitted_z <- susie_z(data1$"z-scores", R = data1$"LD", L = 10, coverage=0.95)
summary(fitted_z)$cs

##   cs cs_log10bf cs_avg_r2 cs_min_r2
## 1  1   8.814761 0.9630191 0.9224495
## 2  2   5.719842 0.7619470 0.4767884
##                                                      variable
## 1                                              31,32,34,88,92
## 2 138,143,146,148,149,151,152,153,154,161,163,164,169,172,176

b <- rep(0,length(data1$`z-scores`))
b[iCV] <- 1
susie_plot(fitted_z, y="PIP", b=b)

Indeed, the SuSiE method finds 2 credible sets of variants - each containing an effect variant.
Lets compare this with the standard Bayesian method. This finds a credible set with 5 variants (in fact this is the same as CS1 from SuSiE) with claimed coverage of 0.9587.

pp0 <- data1$pp

credset(pp0, iCV, thr=0.95)

## $credset
## [1] 88 92 31 32 34
## 
## $claimed.cov
## pos47437058 
##   0.9586951 
## 
## $covered
## [1] 1 0
## 
## $nvar
## pos47437058 
##           5

corrcov(z = data1$`z-scores`, f = data1$`allele-freq`, N0 = data1$N, N1 = data1$N, Sigma = data1$LD)

## [1] 0.9868798

Our corrected coverage method finds that true coverage of this credible set is nearer to 0.98. Can we use our method to improve the resolution of the credible set?

corrected_cs(z = data1$`z-scores`, f = data1$`allele-freq`, N0 = data1$N, N1 = data1$N, Sigma = data1$LD, lower = 0, upper = 1, desired.cov = 0.95)

## [1] "thr:  0.5 , cov:  0.679796831473973"
## [1] "thr:  0.75 , cov:  0.853521498132342"
## [1] "thr:  0.875 , cov:  0.938086051668765"
## [1] "thr:  0.9375 , cov:  0.980752281102671"
## [1] "thr:  0.90625 , cov:  0.960882132610726"
## [1] "thr:  0.890625 , cov:  0.950411576970438"

## $credset
## [1] "pos47443731" "pos47444180" "pos47436702" "pos47436940" "pos47437058"
## 
## $req.thr
## [1] 0.890625
## 
## $corr.cov
## [1] 0.9504116
## 
## $size
## pos47437058 
##   0.9586951

No, the same 5 variants are required for the corrected 95% credible set.

This example shows how SuSiE can be applied to make meaningful inferences when the single causal variant assumption is dropped.

Notes and Queries

Stated that “the minimum SuSiE requires are the \(Z\) scores, LD matrix and the sample size” but in the examples above, we’ve only supplied the \(Z\) scores and the LD matrix.
Note that LD matrix needs to be a correlation matrix \(r\), not \(r^2\) or \(abs(r)\).
The research paper explicitly states that the full individual-level data is required for the SuSiE method - however the vignettes show that identical results are obtained using the summary-level data? There is no documentation for the susie_z function used here.
The research paper also states that it has not yet been extended for use on binary traits (i.e. case control data), but I have used it on our standard case-control simulation and found it to be accurate? Originally thought it may struggle due to differing LD structures between cases and controls - for example when we sample from the cases, we are sampling more often the causal allele at that SNP.