Original paper: https://projecteuclid.org/download/pdfview_1/euclid.aos/1438606853

Extension to model-X knockoffs and application to genetics: https://rss.onlinelibrary.wiley.com/doi/full/10.1111/rssb.12265?casa_token=1kQDVm8FyYYAAAAA%3A89_fSTVfFyt1imMi4LDfCSUpTwdjHhPGU7_KeRVsmOC6liBQzFvI0t_UZvjgiuDQHGvM5ZjmLRxhqYg

Gene hunting with hidden Markov model knockoffs (2019): https://academic.oup.com/biomet/article/106/1/1/5066539

Bottolo and Richardson’s discussion of the above paper: https://academic.oup.com/biomet/article/106/1/19/5318360

Fine-mapping application (knockoffZoom, 2020): https://www.nature.com/articles/s41467-020-14791-2.pdf

Matthew Stephens’s review of knockoffZoom: https://www.biorxiv.org/content/10.1101/631390v2#disqus_thread

The aim of many modern statistical analyses is to pick out the interesting (true) signals from noise in a data-driven manner, in contrast to conventional statistical analyses, which tend to be hypothesis-driven. For example, in genetics our aim may be to find the genes or SNPs that influence a trait of interest using a GWAS, yet because of the vast number of statistical tests conducted in parallel, many false positive results may be found. The aim is to control the proportion of false positives so that a follow-up researcher looking at the candidate genes or SNPs that our method prioritises does not waste more than (e.g.) 10% of her time on false positive results.

Let \(Y\) be a vector of length \(n\) representing the phenotype (e.g. disease status) of \(n\) individuals and \(X\) be a genotype matrix of dimension \(n \times p\) representing the genotypes of the \(n\) individuals at \(p\) SNPs. We can formulate our aim as finding the minimal set of genes or SNPs, \(\hat{s}\), such that the distribution of \(Y|X\) depends on \(X\) only through the SNPs in \(\hat{s}\). In other words, a SNP is not interesting if it does not help to predict the outcome \(Y\) given that we know the information at all the other SNPs. Formally, \(Y \mathrel{{\perp\mkern-10mu\perp}}X_{-\hat{s}}\text{ | }X_{\hat{s}}\), where \(X_{-\hat{s}}\) denotes the SNPs outside \(\hat{s}\).

Knockoffs is a variable selection procedure that controls the FDR, where the FDR is defined as the expected fraction of false positives amongst all the selections. For ease of understanding, we discuss how the knockoffs method is used to generate input for a black box procedure (e.g. logistic regression in a case-control GWAS) to obtain outputs for which the FDR can be controlled at a specified level.
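To make the definition concrete, here is a minimal sketch of the false discovery proportion (FDP) for a single analysis; the FDR is the expectation of this quantity over repeated experiments. The SNP identifiers and the set of true signals are hypothetical and only serve as an illustration (in a real analysis the true signals are of course unknown).

```python
def false_discovery_proportion(selected, true_signals):
    """Fraction of the selected variables that are not true signals."""
    selected = set(selected)
    if not selected:
        return 0.0  # convention: no selections means no false discoveries
    false_positives = selected - set(true_signals)
    return len(false_positives) / len(selected)


# Hypothetical example: five SNPs are reported, one of them ("rs3") is null.
selected = ["rs1", "rs2", "rs3", "rs4", "rs5"]
true_signals = {"rs1", "rs2", "rs4", "rs5", "rs9"}
fdp = false_discovery_proportion(selected, true_signals)
print(fdp)
```

Here one of the five reported SNPs is a false positive, so the follow-up researcher would waste a fifth of her effort.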

Briefly, fake data (“knockoffs”) are generated that act as negative controls. The values and distributions of the true data and the negative controls are then compared to help identify true positives. This is comparable to capturing the statistical variation in many realizations of the same experimental procedure using the same data to help identify the true positives. For example, if the number of knockoffs found to be significant (which are false positives by construction) were 24% of the number of true variables found to be significant, then we would estimate the FDR among the true variables to be about 24%.

Suppose that:

- We have i.i.d. samples from \(P_{X,Y}\).
- The distribution of \(X\) is approximately known (e.g. because we have genotypes from millions of people).
- The distribution of \(Y|X\) (likelihood) is completely unknown.

**Generate knockoffs**

Given the original genotypes, \(X_1,...,X_p\), construct “knockoffs”, \(\tilde{X_1},...,\tilde{X_p}\), without looking at \(Y\), so that \(\tilde{X}\) is conditionally independent of \(Y\) given \(X\). The knockoffs should mimic the correlation structure found within the original variables (this is why \(P_X\) is assumed to be known). They must also satisfy the exchangeability property: for each \(j\),

\[\begin{equation} (X_1,...,X_j,...;\tilde{X_1},...,\tilde{X_j},...)\stackrel{d}{=}(X_1,...,\tilde{X_j},...;\tilde{X_1},...,X_j,...) \end{equation}\]

where \(\stackrel{d}{=}\) denotes equality in distribution, i.e. a true variable and its knockoff can be swapped without changing the joint distribution. Note that we know the knockoffs are null as they are generated without looking at \(Y\).
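As an illustration of this step, the sketch below samples knockoffs for the simplest case in which \(P_X\) is known exactly: a multivariate Gaussian with known correlation matrix. This is an assumption made purely for illustration (genotypes are discrete and are modelled with HMMs in the papers above). The function name `gaussian_knockoffs` and the "equicorrelated" choice \(D = sI\) are ours; the conditional mean and covariance follow the standard Gaussian model-X construction.

```python
import numpy as np


def gaussian_knockoffs(X, Sigma, rng):
    """Sample knockoffs for rows of X, assumed drawn from N(0, Sigma).

    Sigma is assumed to be a correlation matrix (unit diagonal).
    Returns the knockoff matrix and the scalar s used in D = s * I.
    """
    p = Sigma.shape[0]
    # Equicorrelated choice: s must keep 2 * Sigma - D positive definite,
    # so we stay just below 2 * (smallest eigenvalue of Sigma).
    s = 0.999 * min(1.0, 2.0 * np.linalg.eigvalsh(Sigma)[0])
    D = s * np.eye(p)
    Sigma_inv_D = np.linalg.solve(Sigma, D)
    cond_mean = X - X @ Sigma_inv_D       # E[X_knock | X]
    cond_cov = 2.0 * D - D @ Sigma_inv_D  # Cov[X_knock | X]
    cond_cov = (cond_cov + cond_cov.T) / 2.0  # remove numerical asymmetry
    L = np.linalg.cholesky(cond_cov)
    return cond_mean + rng.standard_normal(X.shape) @ L.T, s


rng = np.random.default_rng(0)
Sigma = np.array([[1.0, 0.5, 0.25],
                  [0.5, 1.0, 0.5],
                  [0.25, 0.5, 1.0]])
X = rng.multivariate_normal(np.zeros(3), Sigma, size=100_000)
X_knock, s = gaussian_knockoffs(X, Sigma, rng)
```

Note that \(Y\) never appears in the function, which is what guarantees that the knockoffs are null. Exchangeability shows up in the second moments: the knockoffs have the same correlation matrix as \(X\), and the cross-covariance between \(X\) and its knockoffs equals \(\Sigma - D\).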

**Run black box method to obtain outcomes for true values and knockoffs**

Run the black box method to generate the outcome values for the true data and the knockoffs (e.g. run a GWAS to obtain \(Z\) scores). Ensure that this method treats the true variables and the knockoffs identically, so that it does not use information on which is a true variable and which is a knockoff. This means that, for each null \(j\),

\[\begin{equation} (Z_j,\tilde{Z_j})\stackrel{d}{=}(\tilde{Z_j},Z_j). \end{equation}\]
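The requirement on the black box can be checked mechanically: if we swap a column with its knockoff before scoring, the two scores should simply swap. The sketch below uses a marginal correlation-based \(Z\) score as a stand-in for the black box; the scoring function, the toy genotypes and the toy phenotype are all hypothetical.

```python
import math


def marginal_z(x, y):
    """sqrt(n) times the sample correlation between one column and y."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return math.sqrt(n) * sxy / math.sqrt(sxx * syy)


def score_all(columns, y):
    """Score every column the same way, ignoring which ones are knockoffs."""
    return [marginal_z(col, y) for col in columns]


y = [0, 1, 0, 1, 1, 0, 1, 0]
originals = [[0, 2, 1, 2, 1, 0, 2, 0], [1, 0, 1, 1, 0, 2, 1, 0]]
knockoffs = [[1, 1, 0, 2, 0, 1, 1, 0], [0, 1, 1, 0, 2, 1, 0, 1]]

augmented = originals + knockoffs  # columns: X_1, X_2, knockoff_1, knockoff_2
scores = score_all(augmented, y)

# Swap SNP 1 with its knockoff and score again.
swapped = [augmented[2], augmented[1], augmented[0], augmented[3]]
scores_swapped = score_all(swapped, y)
```

Because the scoring never looks at a column's label, `scores_swapped` is `scores` with the first and third entries exchanged. A method that, say, penalised knockoff columns differently would break this property.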

**Calculate statistic for each true value and knockoff pair**

For each SNP \(j\), combine the \(Z\) score from the real data, \(Z_j\), with the \(Z\) score from the knockoff data, \(\tilde{Z_j}\), to get a single score, \(W_j\). This score must satisfy the anti-symmetry property,

\[\begin{equation} W_j=w_j(Z_j,\tilde{Z_j}) \text{ s.t. } w_j(Z_j,\tilde{Z_j})=-w_j(\tilde{Z_j},Z_j). \end{equation}\]

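For example, \(W_j=Z_j-\tilde{Z_j}\) would be suitable (or \(W_j=|Z_j|-|\tilde{Z_j}|\) if the \(Z\) scores are signed, so that large scores of either sign count as evidence).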

This means that the signs of the null \(W_j\) are i.i.d. coin flips, whereas the non-nulls are more likely to be positive, because the \(Z\) score for the real data tends to be bigger than the \(Z\) score for the fake data. For example, to call a SNP significant, we require the \(Z\) score for the true variable to be bigger than the \(Z\) score for the corresponding knockoff, which we know is null.
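A minimal sketch of two anti-symmetric ways of combining a pair of scores is given below. The function names and the example \(Z\) scores are hypothetical; absolute values are used on the assumption that the \(Z\) scores are signed.

```python
def w_difference(z, z_knock):
    """Difference of absolute scores: positive when the real SNP wins."""
    return abs(z) - abs(z_knock)


def w_signed_max(z, z_knock):
    """Larger absolute score, signed by which member of the pair wins."""
    a, b = abs(z), abs(z_knock)
    if a == b:
        return 0.0
    return max(a, b) if a > b else -max(a, b)


# Hypothetical Z scores for four SNPs and their knockoffs.
z = [4.2, 0.3, -3.1, 1.0]
z_knock = [0.5, -1.1, 0.4, 1.2]

W_diff = [w_difference(a, b) for a, b in zip(z, z_knock)]
W_max = [w_signed_max(a, b) for a, b in zip(z, z_knock)]
```

Swapping the two arguments flips the sign of either statistic, which is exactly the anti-symmetry property. In this toy example the first and third SNPs obtain positive \(W\) values, while the other two lose to their knockoffs.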

**Generate a set of SNPs for which the FDR is controlled**

Sort the \(W\) values into descending order of absolute value and take the largest group of top-ranked SNPs for which the estimated FDR (the number of negative \(W\) values divided by the number of positive \(W\) values in the group) is at or below the chosen FDR threshold. Report the SNPs in the group with positive \(W\) values (a negative \(W\) value means that the knockoff looked more significant than the true SNP).

In the example below, the group of 6 SNPs has an estimated FDR of \(1/5\) (one negative value, five positive values). Note that this is a conservative estimate, so this approach controls the FDR at the specified level.
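The selection step can be sketched as follows. At a threshold \(t\) the sketch estimates the FDR as (offset + number of \(W_j \le -t\)) divided by the number of \(W_j \ge t\), and picks the smallest \(t\) for which this estimate is at most the target level \(q\). Setting `offset=1` gives the more conservative "knockoff+" variant from the original paper; the function names and the \(W\) values are hypothetical.

```python
def knockoff_threshold(W, q, offset=1):
    """Smallest threshold t whose estimated FDR is at most q."""
    candidates = sorted({abs(w) for w in W if w != 0})
    for t in candidates:
        n_neg = sum(1 for w in W if w <= -t)  # knockoff wins: proxy for false positives
        n_pos = sum(1 for w in W if w >= t)   # real SNP wins: candidate discoveries
        if (offset + n_neg) / max(1, n_pos) <= q:
            return t
    return float("inf")  # no threshold achieves the target level


def knockoff_select(W, q, offset=1):
    """Indices of SNPs whose W value is at or above the threshold."""
    t = knockoff_threshold(W, q, offset)
    return [j for j, w in enumerate(W) if w >= t]


# Hypothetical W values: the six largest in absolute value contain one negative.
W = [5.1, 4.3, -3.9, 3.5, 3.0, 2.8, -1.2, 0.9, -0.7, 0.4]

selected_plain = knockoff_select(W, q=0.2, offset=0)  # [0, 1, 3, 4, 5]
selected_plus = knockoff_select(W, q=0.2, offset=1)   # []
```

With `offset=0` the top six values (one negative, five positive) give an estimate of \(1/5\), so the five positive SNPs are reported. With `offset=1` nothing can be reported at \(q = 0.2\) here, because at least five positive values would be needed before the first negative one; this illustrates why the method has little power when only a handful of signals exist.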

Based on: https://academic.oup.com/biomet/article/106/1/1/5066539

GWASs have two specific challenges:

- The mechanisms through which the phenotypes depend on the genetic variants are unknown and may involve interactions (\(P_{Y|X}\) unknown).
- LD patterns are complex.

This paper develops exact and computationally efficient procedures for generating knockoffs when the distribution of \(X\), \(F_X\), corresponds to a Markov chain or hidden Markov model (which is how LD is often modeled).
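To give a flavour of the Markov chain case, here is a minimal sketch of sequential knockoff sampling for a discrete Markov chain: each knockoff \(\tilde{X_j}\) is drawn from the conditional distribution of \(X_j\) given all other variables and the knockoffs already sampled, which for a Markov chain only involves the neighbouring positions and a running normalisation function. The function names, the two-state chain and its transition probabilities are hypothetical, and all transition probabilities are assumed to be strictly positive. This is our own simplified rendering, not the authors' code, and it does not cover the HMM extension.

```python
import random


def knockoff_step(j, x, xk_prefix, q1, Q, N_prev):
    """Distribution of the knockoff at position j.

    x: observed chain with states 0..K-1; xk_prefix: knockoffs already
    drawn for positions 0..j-1; q1: initial distribution;
    Q[i][a][b]: probability of state b at position i+1 given state a at i;
    N_prev: normalisation function returned by the previous step.
    Returns (probabilities over the K states, normalisation function).
    """
    p, K = len(x), len(q1)
    if j == 0:
        w = list(q1)
    else:
        w = [Q[j - 1][x[j - 1]][k] * Q[j - 1][xk_prefix[j - 1]][k] / N_prev[k]
             for k in range(K)]
    if j < p - 1:
        # N[b]: normalising constant if the next observed state were b.
        N = [sum(w[k] * Q[j][k][b] for k in range(K)) for b in range(K)]
        probs = [w[k] * Q[j][k][x[j + 1]] / N[x[j + 1]] for k in range(K)]
    else:
        N = None  # last position: no outgoing transition
        total = sum(w)
        probs = [wk / total for wk in w]
    return probs, N


def sample_knockoff_chain(x, q1, Q, rng):
    """Draw one knockoff copy of the chain x, position by position."""
    xk, N_prev = [], None
    for j in range(len(x)):
        probs, N_prev = knockoff_step(j, x, xk, q1, Q, N_prev)
        xk.append(rng.choices(range(len(q1)), weights=probs)[0])
    return xk


# Hypothetical chain over 3 positions with two states (e.g. two alleles).
q1 = [0.3, 0.7]
Q = [[[0.8, 0.2], [0.4, 0.6]],
     [[0.5, 0.5], [0.1, 0.9]]]
x_obs = [0, 0, 1]
x_knock = sample_knockoff_chain(x_obs, q1, Q, random.Random(0))
```

For a chain this small, exchangeability can be verified exactly by enumerating all pairs of chains and checking that the joint probability is unchanged when any position is swapped with its knockoff. The cost of sampling grows linearly in the number of positions, which is what makes the approach practical genome-wide.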

As sample sizes increase and LD patterns are unraveled, we are finding that many genetic variants are correlated with a phenotype of interest, although only a fraction of these associations may be important.

In fine-mapping, the genome is often clumped into “genomic loci” and then these loci are fine-mapped independently to try and identify the causal variants residing in each locus. *KnockoffZoom* is a new method for fine-mapping that eliminates the pre-clumping step by accounting for LD genome-wide, and controls the FDR. The method assumes that LD is adequately described by HMMs and, under that assumption, provably controls the FDR. It makes no assumptions about the relationship between genetic variants and the phenotype (this is what we’re trying to discover anyway…) and can therefore be applied to both quantitative and qualitative traits. The output of knockoffZoom is groups of SNPs that distinctly influence the trait, accounting for the effects of all others.

In conventional marginal analyses, a variant is null if its allele distribution is independent of the phenotype. KnockoffZoom employs stricter conditional hypotheses whereby a variant is null if it is independent of the trait conditional on all other variants. Note that LD makes it challenging to reject this conditional null hypothesis. Matthew Stephens states: “This conditional test is in many ways more informative than conventional marginal tests because it ensures that a significant group cannot be explained by linkage disequilibrium (LD) with other measured SNPs outside the group. Thus the conditional test comes closer to identifying groups of potentially-causal SNPs than do conventional marginal tests”.

They use the fastPHASE HMM to approximate the distribution of genotypes to generate knockoffs and then fit a multivariate predictive model to the trait and compute feature importance measures for the genotypes and knockoffs. The feature importance measures for each true and knockoff pair are combined into a test statistic. The output from their method is sets of distinct discoveries for which the FDR is controlled at each resolution. A disadvantage is that the groups of tested markers must be both contiguous and pre-specified, which may hamper fine-mapping utility where risk variants are non-contiguous.
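The idea of testing groups at several resolutions can be sketched as below, assuming (our assumption, for illustration) that the importance of a variant is the absolute value of its fitted coefficient in the predictive model and that a group's statistic is the summed importance of its SNPs minus the summed importance of their knockoffs. The coefficients and the group partitions are hypothetical.

```python
def group_w(beta, beta_knock, groups):
    """One W statistic per group of SNP indices."""
    return [
        sum(abs(beta[j]) for j in g) - sum(abs(beta_knock[j]) for j in g)
        for g in groups
    ]


# Hypothetical fitted coefficients for 6 SNPs and their knockoffs.
beta = [0.0, 0.8, 0.3, 0.0, 0.0, 0.1]
beta_knock = [0.1, 0.0, 0.0, 0.0, 0.2, 0.0]

# Two pre-specified resolutions, each a partition into contiguous groups.
coarse = [[0, 1, 2], [3, 4, 5]]
fine = [[0], [1, 2], [3, 4], [5]]

W_coarse = group_w(beta, beta_knock, coarse)
W_fine = group_w(beta, beta_knock, fine)
```

The group-level \(W\) values at each resolution can then be passed to the same FDR-controlling selection step as before. Coarse groups are easier to discover because within-group LD no longer competes for the signal, whereas fine groups localise the signal more precisely at the price of power.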

They state in the discussion that prior information, such as summary statistics for a second trait or genomic annotations, can be incorporated and that this is an interesting area for future research.