The previous method of using samples to estimate coverage did indeed provide a better estimate of true coverage, but it failed to incorporate the additional information gained through ordering.

Estimating size from the samples does give a better indicator of coverage, but only because it yields a better estimate of the true pp than the pp we have chosen.

https://www.nature.com/articles/nrg3706

Statistical power is the probability of rejecting the null hypothesis when it should be rejected, i.e. the probability of detecting an effect when there is one to detect (a “true positive”).

Low statistical power increases the probability of missing genuine associations and reduces the chance that a statistically significant result reflects a true effect (so a larger proportion of detected associations are false positives).

Statistical power is affected by allele frequency, effect size and sample size.

Let \(\hat{\beta}\) be the estimated \(\beta\) coefficient (effect size) from the GWAS.

\[Z=\frac{\hat{\beta}}{SE(\hat{\beta})}.\]

The power of the study is,

\[P(|Z|>Z_\alpha).\]
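This can be sketched directly from the normal approximation: under the alternative, \(Z\) is approximately normal with mean \(\beta/SE(\hat{\beta})\) and unit variance, so power is the probability that \(|Z|\) exceeds the two-sided critical value. The threshold `alpha = 5e-8` below is an illustrative genome-wide significance level, not a value from these notes.

```python
from scipy.stats import norm

def gwas_power(beta, se, alpha=5e-8):
    """Approximate power of a two-sided Wald test.

    Under the alternative, Z = beta_hat / SE(beta_hat) is roughly
    N(beta / se, 1), so power = P(|Z| > z_crit).
    alpha = 5e-8 is an illustrative genome-wide threshold.
    """
    z_crit = norm.ppf(1 - alpha / 2)   # two-sided critical value
    ncp = beta / se                    # non-centrality: mean of Z under H1
    return norm.cdf(-z_crit - ncp) + norm.cdf(-z_crit + ncp)
```

With `beta = 0` the function returns \(\alpha\) itself, as it should, and power grows with the effect size relative to its standard error (which is where allele frequency and sample size enter, via \(SE(\hat{\beta})\)).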

A calibration curve could be derived to map size to coverage.

A calibration curve uses known information (such as true coverage which we know as we are using simulations) to enable better estimates of future data.

An example is finding out the true temperature given a dodgy thermometer. If we plot the readings from the dodgy thermometer on the \(y\) axis and the true temperature on the \(x\) axis, then we can fit a curve such that we can read across from the value we obtain using the dodgy thermometer and down to find the true temperature.

In our case, we can build a calibration curve to map the size of the credible set to the true coverage using our simulation studies. This curve could then be used in future experiments such that researchers can find the predicted true coverage using the size of their credible set.

It must be re-calibrated for different experimental conditions (in our case, sorted/unsorted and power of the experiment), as we know that the relationship between size and coverage is different under certain conditions.

For example, using our simulations, we can plot the relationship between size and coverage in different conditions (sorting/power). Then, given the size of a credible set (which is known to experimenters), we can predict the true coverage using the appropriate calibration curve.
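A minimal sketch of building such a calibration curve, using toy simulated data in place of our actual simulation results (the exponential size-coverage relationship and all numbers below are purely illustrative assumptions):

```python
import numpy as np

# Toy calibration data: in practice (size, true coverage) pairs come from
# simulations where the true coverage of each credible set is known.
rng = np.random.default_rng(1)
size = np.sort(rng.uniform(1, 100, 200))        # credible set sizes
true_coverage = 1 - np.exp(-size / 30)          # assumed, illustrative, relationship
observed = np.clip(true_coverage + rng.normal(0, 0.02, 200), 0, 1)

# Fit a simple polynomial calibration curve mapping size -> coverage.
# A separate curve would be fitted per condition (sorted/unsorted, power),
# since the size-coverage relationship differs between conditions.
coeffs = np.polyfit(size, observed, deg=3)
calibrate = np.poly1d(coeffs)

# A researcher with a credible set of, say, 50 variants reads off
# the predicted true coverage from the fitted curve.
predicted = calibrate(50)
```

A monotone fit (e.g. isotonic regression) might be preferable in practice, since coverage should not decrease as the credible set grows; the polynomial here just keeps the sketch dependency-free beyond NumPy.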

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3593158/

Marginal effect: Individual effects of each SNP on a trait.

Joint effect: The effect of each SNP on a trait when all SNPs are fitted together in a single model, accounting for the correlation (LD) between them.

The standard least-squares estimator can be used to estimate the joint effect of multiple SNPs on a single trait. The vector \(\hat{\mathbf{b}}\) is the estimate of the joint effects.

This is simple to do, given a joint model. But in GWAS, each SNP is often tested individually for association. Hence, for SNP \(j\) we obtain \(\hat{\boldsymbol{\beta_j}}\), which is an estimate of that SNP's marginal effect.

The marginal effect of multiple SNPs on a trait can be found, but these do not take into account LD information (which the joint estimate does).

There are two main issues with this single SNP approach:

For two SNPs close together, if the risk alleles are negatively correlated then each SNP's marginal estimate absorbs part of the other's opposing signal, so “the effects of both SNPs will be attenuated”. This means that the single-SNP approach is underpowered and there is a risk that both SNPs go undetected.

After two SNPs have both been identified as significant, it is difficult to determine from LD information alone whether they represent independent signals or the same underlying signal.

Marginal effect methods are more common in GWAS, but have major limitations (sometimes they won't identify associations!). It is therefore useful to convert marginal effects to joint effects.

Marginal effects can be converted to joint effects using the summary statistics from the single-SNP analyses and individual-level genotype data from the discovery sample (which may not be accessible). I.e. we don't need the phenotype data.

The paper shows how the joint effect estimate \(\mathbf{\hat{b}}\) can be written with respect to the marginal effect estimate, \(\hat{\boldsymbol{\beta}}\).

However, \((X^TX)\) may not be known in practice (it requires the individual-level genotype data from the whole discovery sample), but it is essentially a variance-covariance matrix and can be estimated using LD information and allele frequencies. It is approximately \(\mathbf{B}\), which is defined in the paper. If there is no correlation between SNP \(j\) and SNP \(k\), then the corresponding entry in \(\mathbf{B}\) will be 0.

Equation (12) shows that we can approximate a joint analysis of multiple SNPs using the marginal effects and without needing individual genotype data.
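A minimal two-SNP sketch of the identity underlying this (using simulated standardised genotypes; the correlation \(r\), effect sizes, and sample size are illustrative assumptions, not values from the paper). Since the marginal estimate is \(\hat{\beta}_j = \mathbf{x}_j^T\mathbf{y} / \mathbf{x}_j^T\mathbf{x}_j\), the joint estimate can be recovered as \(\hat{\mathbf{b}} = (X^TX)^{-1} D \hat{\boldsymbol{\beta}}\) with \(D = \mathrm{diag}(\mathbf{x}_j^T\mathbf{x}_j)\), without ever touching the phenotype vector directly:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000                                   # individuative sample size (illustrative)

# Two correlated SNPs; standardised genotypes for simplicity, with the
# correlation r playing the role of (negative) LD.
r = -0.6
L = np.linalg.cholesky(np.array([[1.0, r], [r, 1.0]]))
X = rng.standard_normal((n, 2)) @ L.T

b_true = np.array([0.3, 0.3])              # true joint effects (illustrative)
y = X @ b_true + rng.standard_normal(n)

# Marginal (single-SNP) estimates: beta_j = x_j'y / x_j'x_j.
xtx_diag = np.einsum("ij,ij->j", X, X)     # D = diag(x_j' x_j)
beta_marginal = (X.T @ y) / xtx_diag       # attenuated because r < 0

# Recover the joint estimates from the marginal ones using only X'X
# (which could itself be approximated from LD + allele frequencies):
# b = (X'X)^{-1} D beta_marginal.
XtX = X.T @ X
b_joint = np.linalg.solve(XtX, xtx_diag * beta_marginal)
```

Here `beta_marginal` sits well below the true 0.3 (the attenuation described above), while `b_joint` recovers it; the solve step is algebraically identical to the full least-squares fit \((X^TX)^{-1}X^T\mathbf{y}\), which is the point of the conversion.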

These values can easily be amended if \(n\) is no longer constant across different SNPs (\(n\) is the sample size, \(N\) is the number of SNPs).

**They show how we can use summary data from single-SNP analyses and individual-level genotype data from the sample for multi-SNP joint effect analyses. If individual-level data is not available, then this can be estimated using LD correlations in a reference sample.**

- Conditional analysis is used as a tool to identify secondary associations at a locus.

**This method is also used in conditional analysis. “We can perform a multi-SNP conditional analysis using summary data from single-SNP analyses and individual-level genotype data of the sample without accessing the phenotype data. As in a joint analysis, if the individual-level genotype data of the discovery sample are unavailable, we can estimate the LD correlations from the reference sample and approximate a conditional analysis.”**