J. Taylor
September 3-6, 2017
\[ \newcommand{\Ee}{\mathbb{E}} \newcommand{\Pp}{\mathbb{P}} \newcommand{\real}{\mathbb{R}} \newcommand{\hauss}{{\cal H}} \newcommand{\lips}{{\cal L}} \]
A model for random sets.
Some old integral geometry.
Curvature measures and the normal bundle.
Record data (PET, fMRI) for subjects at rest: \[R_{i,t}, \quad 1 \leq i \leq n, t \in M.\]
Record data (PET, fMRI) for same subjects during task: \[A_{i,t}, \quad 1 \leq i \leq n, t \in M.\]
Compute the differences \[ E_{i,t} = A_{i,t} - R_{i,t}, \qquad t \in M \] and form \[ T_t = \frac{\bar{E}_t}{SE(\bar{E}_t)}, \qquad t \in M. \]
Which \(t \in M\) are “active”: \(\Ee(\bar{E}_t) \overset{D}{=} \Delta_t \neq 0\)?
Could use simultaneous inference for the hypotheses \[ H_{0,t}: \Ee(\bar{E}_t)=0. \]
Union-Intersection test based on \[ \Pp\left(\sup_{t \in M_0} |T_t| \geq u\right) \leq \Pp_0\left(\sup_{t \in M} |T_t| \geq u\right) \] with \[ \begin{aligned} M_0 &= \left\{t \in M: \Delta_t=0 \right\} \\ \Pp_0 &= \text{Global (intersection) null}. \end{aligned} \]
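A minimal sketch with simulated null data, just to fix notation for \(E_{i,t}\), \(T_t\) and the union-intersection statistic; the dimensions and all names here are illustrative:
# Toy version: E_{i,t} = A_{i,t} - R_{i,t} on a discretized M, then the t-field.
set.seed(1)
n <- 20; m <- 1000                      # subjects, voxels in the discretized M
R <- matrix(rnorm(n * m), n, m)         # rest scans (simulated)
A <- matrix(rnorm(n * m), n, m)         # task scans (simulated, global null)
E <- A - R
Ebar <- colMeans(E)
SE   <- apply(E, 2, sd) / sqrt(n)
T_t  <- Ebar / SE                       # one-sample t-field with n - 1 df
max(abs(T_t))                           # union-intersection (sup) statistic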
Twice-differentiable Gaussian process \((f_t)_{t \in M}\) on a manifold \(M\).
Centered \(\Ee \{f_t\} = 0\);
Marginally stationary \(\Ee \{f_t^2\} = 1\)
Gaussian processes are specified by mean and covariance function.
Finite-dimensional distributions are all simple – multivariate Gaussian.
All of the quantities we will talk about can be discussed in geometric terms.
Doing specific calculations requires a lot of work, but the general formula is compact and interpretable.
One of the blessings and curses of geometry…
We will allow boundaries and corners.
Our platonic \(M\) will be \([-1,1]^3\), the cube…
The random sets we will consider are of the form \[ f^{-1}D = \left\{t \in M: f_t \in D \right\}\] for \(D \subset \mathbb{R}\).
In particular, we are interested in the typical geometry of these sets, and how it is determined by the correlation function of \(f\) and the set \(D\).
Our \(T\) random field is built from multiple i.i.d. copies of \(f\): \(\epsilon_i, 1 \leq i \leq n\).
For \(T\) fields, we can think of \[ D = {\tt t}^{-1}[u,\infty) \subset \real^n \] where \({\tt t}: \real^n \rightarrow \real\) is the usual one-sample \(t\)-statistic.
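A sketch of such an excursion set on a discretized \(M\); the squared-exponential covariance, grid size, and level are illustrative choices, and the field is simulated through a Cholesky factor:
# Simulate a smooth stationary Gaussian field on a 2-d grid and threshold it.
set.seed(2)
side <- 40
grid <- expand.grid(x = seq(-1, 1, length.out = side),
                    y = seq(-1, 1, length.out = side))
D2 <- as.matrix(dist(grid))^2
C  <- exp(-D2 / (2 * 0.2^2))                       # squared-exponential covariance
f  <- as.vector(t(chol(C + 1e-8 * diag(nrow(C)))) %*% rnorm(nrow(C)))
u  <- 1.5
excursion <- matrix(f >= u, side, side)            # f^{-1}[u, Inf) on the grid
mean(excursion)                                    # fraction of M above the level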
If \(D=[u,\infty)\) for \(u\) large, then \(f^{-1}D\) satisfies a clumping heuristic.
Determining whether \(f^{-1}D=\emptyset\) can be approximated by counting the number of clumps.
Note: \(f^{-1}[u,\infty) \neq \emptyset\) iff \(f\) has a critical point with value \(\geq u\).
If we can compute the expected number of such critical points, we get an upper bound that, for large \(u\), is also an approximation: \[ \Pp_0\left(f^{-1}[u,\infty) \neq \emptyset \right) \leq \Ee_0\left( \text{# critical points of value $\geq u$}\right) \]
As we increase the threshold above 2, say, the clumps become “simple”.
Typically there is one critical point (a local maximum) per “clump”.
Maybe the number of local maxima is Poisson-like.
For \(N \sim \text{Poisson}(\lambda)\) with a small parameter \[ \Pp(N > 0) \approx \Ee(N). \]
We just need to get our hands on \[ \Ee \left\{\# \text{local maxima of $f$ above the level $u$} \right\}. \]
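A sketch comparing the two sides of this approximation for a smooth 1-d Gaussian process; the covariance, grid, level, and number of replications are illustrative choices:
# Compare P(sup f >= u) with E(# local maxima above u) by Monte Carlo.
set.seed(3)
tgrid <- seq(0, 10, length.out = 400)
C <- exp(-outer(tgrid, tgrid, "-")^2 / 2)          # smooth stationary covariance
L <- t(chol(C + 1e-8 * diag(length(tgrid))))
u <- 2.5
count_max <- function(f, u) {                      # interior local maxima above u
  k <- 2:(length(f) - 1)
  sum(f[k] >= u & f[k] > f[k - 1] & f[k] > f[k + 1])
}
stats <- replicate(2000, {
  f <- as.vector(L %*% rnorm(length(tgrid)))
  c(sup_exceeds = max(f) >= u, n_max = count_max(f, u))
})
mean(stats["sup_exceeds", ])                       # P(sup_t f_t >= u), Monte Carlo
mean(stats["n_max", ])                             # E(# local maxima above u)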
We’re interested in integral geometric properties.
Specifically, the (expected) Euler characteristic \[ \Ee \left\{\chi\left(f^{-1}[u,+\infty)\right)\right\} = \Ee \left\{\chi\left(M \cap f^{-1}[u,+\infty)\right)\right\}\]
Let \(M\) be a 2-manifold without boundary, then \[ \Ee \left\{\chi\left(M \cap f^{-1}[0,+\infty)\right)\right\} = \frac{1}{2} \cdot \chi(M)\]
A topological invariant: in \(\real^3\) \[ \text{# connected components - # holes + # voids} \]
Key property: when \(A\) is simply connected \[ \chi(A) = 1 \]
Key property: \(\chi(\emptyset)=0\).
\(\implies\) if \(A\) is a disjoint union of simply connected sets \[ \chi(A) = \text{# connected components}. \]
Key property: Morse theory tells us that we can compute \(\chi\) based on critical points!
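A sketch of computing \(\chi\) for a binary mask (each active pixel viewed as a closed unit square), via the cell count \(\chi = \#\text{vertices} - \#\text{edges} + \#\text{faces}\); the test masks are illustrative:
euler_char <- function(mask) {
  idx <- which(mask, arr.ind = TRUE)               # active pixels (i, j)
  if (nrow(idx) == 0) return(0)                    # chi(empty set) = 0
  i <- idx[, 1]; j <- idx[, 2]
  verts <- unique(rbind(cbind(i, j), cbind(i + 1, j),
                        cbind(i, j + 1), cbind(i + 1, j + 1)))
  edges <- unique(rbind(cbind(2 * i + 1, 2 * j),      # bottom edge of each square
                        cbind(2 * i + 1, 2 * j + 2),  # top
                        cbind(2 * i, 2 * j + 1),      # left
                        cbind(2 * i + 2, 2 * j + 1))) # right
  nrow(verts) - nrow(edges) + nrow(idx)
}
blob <- matrix(FALSE, 10, 10); blob[3:8, 3:8] <- TRUE
ring <- blob; ring[5:6, 5:6] <- FALSE
two  <- matrix(FALSE, 10, 10); two[2:3, 2:3] <- TRUE; two[7:8, 7:8] <- TRUE
c(euler_char(blob), euler_char(ring), euler_char(two))   # 1, 0, 2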
For nice enough \(M\) \[ \Ee \left\{\chi\left(M \cap f^{-1}[u,+\infty)\right)\right\} \overset{u \rightarrow \infty}{\simeq} \Pp \left\{\sup_{t \in M} f_t \geq u \right\}\]
Of all quantities in the study of Gaussian processes, the EC stands out as being explicitly computable in wide generality \[ \Ee \left\{\chi\left(M \cap f^{-1}[u,+\infty)\right)\right\} = \sum_{j=0}^{\text{dim}(M)} \lips_j(M) \rho_j(u).\]
Note that the expectation decouples into a sum of products: one factor depends only on \(M\), the other only on the level \(u\).
The quantities \(\lips_j\) and \(\rho_j\) have geometric interpretations.
The \(\lips_j\)'s measure the size / complexity of \(M\), while the \(\rho_j\)'s measure how extreme the level \(u\) is.
Suppose \(f\) is (the restriction to \(M\)) of an isotropic process \(\tilde{f}\), with \(\text{Var}\left\{d\tilde{f}/dx_i \right\} = 1\).
For small \(r\), the functionals \(\lips_j(M)\) are implicitly defined by the Steiner-Weyl formula \[ \hauss_k \left(\left\{x \in \real^k: d(x, M) \leq r \right\}\right) = \sum_{j=0}^k \omega_{k-j} r^{k-j} \lips_j(M) \]
Weyl pointed out that the quantities \(\lips_j(M)\) depend only on \(M\) and a Riemannian structure \(g\): \[ \lips_j(M) = \lips_j((M,g)). \]
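A numerical sketch of the EC approximation, assuming the isotropic, unit-variance-derivative setup above so that the \(\lips_j\) of \([-1,1]^3\) reduce to its intrinsic volumes \(1, 6, 12, 8\), and taking \(\rho_j\) to be the standard Gaussian EC densities:
# E chi(excursion set) of [-1,1]^3 at level u, as an approximation to P(sup f >= u).
# rho_0(u) = P(Z >= u); for j >= 1, rho_j involves the Hermite polynomial H_{j-1}.
rho <- function(j, u) {
  if (j == 0) return(pnorm(u, lower.tail = FALSE))
  H <- switch(j, u^0, u, u^2 - 1)                  # H_{j-1} for j = 1, 2, 3
  (2 * pi)^(-(j + 1) / 2) * H * exp(-u^2 / 2)
}
L <- c(1, 6, 12, 8)                                # L_0, ..., L_3 for [-1,1]^3
u <- 3
sum(sapply(0:3, function(j) L[j + 1] * rho(j, u)))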
What is a Riemannian metric? Is there one associated to \(f\)?
A Riemannian metric on \(M\) is a way to measure area and inner products locally.
Specifically, given two tangent vectors \(X_t, Y_t\) at \(t \in M\), a Riemannian metric determines the inner product of these two vectors.
Uniquely determines a measure on \(M\).
A natural candidate for a metric determined by \(f\) is \[ g_t(X_t, Y_t) = \Ee_0[X_tf \cdot Y_tf]. \]
In local coordinates, this is \[ g_t \left(\frac{\partial }{\partial t_i} \biggl|_t, \frac{\partial }{\partial t_j} \biggl|_t \right) = \Ee_0 \left(\frac{\partial f}{\partial t_i} \biggl|_t \cdot \frac{\partial f}{\partial t_j} \biggl|_t \right) \]
\[ \hauss_3\left( \text{Tube}([0,a] \times [0,b] \times [0,c],r)\right) = abc + 2r \cdot ( ab+bc+ac) + (\pi r^2) \cdot (a+b+c) + \frac{4\pi r^3}{3} \]
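A quick Monte Carlo sanity check of this tube volume; the box sides, radius, and sample size are arbitrary choices:
# Sample uniformly in an enclosing box and compare the hit fraction with the formula.
set.seed(6)
dims <- c(1, 2, 3); r <- 0.5                       # box sides a, b, c and tube radius
n <- 2e5
pts  <- sapply(1:3, function(k) runif(n, -r, dims[k] + r))
over <- sapply(1:3, function(k) pmax(pmax(-pts[, k], pts[, k] - dims[k]), 0))
vol_mc <- mean(sqrt(rowSums(over^2)) <= r) * prod(dims + 2 * r)
vol_formula <- prod(dims) +
  2 * r * (dims[1] * dims[2] + dims[2] * dims[3] + dims[1] * dims[3]) +
  pi * r^2 * sum(dims) + 4 * pi * r^3 / 3
c(vol_mc, vol_formula)                             # agree up to Monte Carlo error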
A singularity means the implicit definition is invalid, BUT the \(\lips_j(\cdot)\)'s are still well defined.
They are defined for a large class of sets…
\[ \hauss_{k-1} \left(\exp_{\nu} \left(A \cap \partial D, r\nu \right) \right) = \sum_{j=1}^k r^{j-1} \cdot \omega_j \int_A \; \lips_{k-j}(D; dp) \]
The derivation of the tube formula is essentially just calculus on the (unit) normal bundle \[ \begin{aligned} N_1(D) &= \left\{(t,\eta_t): \eta_t \in N_tD, \|\eta_t\|_2 = 1 \right\} \\ &= \left\{(t,\eta_t): \text{$\eta_t$ is a unit (exterior) normal to $D$ at $t$} \right\} \end{aligned} \]
Note: we use \(N(D)\) for all normal vectors, not just unit normals.
Define \[ t^*(y) = \text{argmin}_{z \in D} \frac{1}{2} \|y-z\|^2_2 \] and the maps \[ \begin{aligned} y &\overset{\pi_D}{\mapsto} \left(t^*(y), \frac{y - t^*(y)}{\|y-t^*(y)\|_2}, \|y-t^*(y)\|_2 \right) \\ (t,\eta_t,r) &\overset{\exp_D}{\mapsto} t + r \cdot \eta_t \end{aligned} \]
Then, \[ \pi_D = \exp_D^{-1} \]
Parametrization of \(\text{Tube}(D)\) is inverse to projection onto \(D\).
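A tiny numerical illustration for the box \(D = [0,1]^3\); the point \(y\) is arbitrary:
# Metric projection onto D = [0,1]^3 is a coordinatewise clamp; check exp_D o pi_D = id.
proj_box <- function(y) pmin(pmax(y, 0), 1)        # t*(y)
y   <- c(1.7, -0.4, 0.5)                           # a point outside D
ts  <- proj_box(y)
r   <- sqrt(sum((y - ts)^2))
eta <- (y - ts) / r                                # unit exterior normal at t*(y)
all.equal(ts + r * eta, y)                         # exp_D(pi_D(y)) = y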
And now for something completely different…
Exact post-selection inference with applications to the LASSO (arxiv.org/1311.6238 )
In vitro measurement of resistance of a sample of HIV viruses to the NRTI drug 3TC.
633 cases, and 91 different mutations occurring more than 10 times in the sample.
Source: HIVDB
Goal: to build an interpretable predictive model of resistance based on mutation pattern.
Model: \(y|X\) (fixed design, Gaussian errors)
dim(X)
## [1] 633 91
colnames(X)[1:5]
## [1] "P6D" "P20R" "P21I" "P35I" "P35M"
\[ \hat{\beta}_{\lambda} = \text{argmin}_{\beta} \frac{1}{2} \|Y-X\beta\|^2_2 + \lambda \|\beta\|_1 \]
\[ \lambda = \kappa \cdot \mathbb{E}( \|X^T\epsilon\|_{\infty}), \qquad \epsilon \sim N(0, \sigma^2 I). \]
with \(\sigma^2\) the usual estimate from the full model.
This leads to \(\lambda \approx 43\).
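A sketch of how such a \(\lambda\) can be computed by Monte Carlo; the design matrix below is a random stand-in for the mutation design, and \(\sigma\), \(\kappa\), and the number of simulations are illustrative:
# lambda = kappa * E || X^T epsilon ||_inf, epsilon ~ N(0, sigma^2 I), by simulation.
set.seed(8)
X_toy <- matrix(rbinom(633 * 91, 1, 0.1), 633, 91) # placeholder 0/1 design
sigma <- 0.63; kappa <- 1
sup_corr <- replicate(500, max(abs(t(X_toy) %*% rnorm(nrow(X_toy), sd = sigma))))
kappa * mean(sup_corr)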
selected_vars
## [1] "P62V" "P65R" "P67N" "P69i" "P75I" "P77L" "P83K" "P90I"
## [9] "P115F" "P151M" "P181C" "P184V" "P190A" "P215F" "P215Y" "P219R"
Xselect = X[,selected_vars]
lm.selected = lm(Y ~ ., data=data.frame(Y, Xselect))
summary(lm.selected)
##
## Call:
## lm(formula = Y ~ ., data = data.frame(Y, Xselect))
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.9654 -0.3327 -0.0651 0.2653 4.9333
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.252e-12 2.497e-02 0.000 1.000000
## P62V 4.465e-02 2.999e-02 1.489 0.136994
## P65R 2.837e-01 2.612e-02 10.860 < 2e-16 ***
## P67N 1.423e-01 2.884e-02 4.935 1.03e-06 ***
## P69i 1.649e-01 2.691e-02 6.126 1.61e-09 ***
## P75I 2.770e-02 3.955e-02 0.700 0.483889
## P77L 4.774e-02 4.327e-02 1.103 0.270371
## P83K -7.475e-02 2.551e-02 -2.930 0.003513 **
## P90I 1.038e-01 2.536e-02 4.094 4.81e-05 ***
## P115F 6.754e-02 2.729e-02 2.475 0.013593 *
## P151M 9.424e-02 3.541e-02 2.661 0.007992 **
## P181C 9.691e-02 2.623e-02 3.694 0.000240 ***
## P184V 2.218e+00 2.610e-02 84.973 < 2e-16 ***
## P190A 4.797e-02 2.582e-02 1.858 0.063696 .
## P215F 1.152e-01 2.896e-02 3.976 7.83e-05 ***
## P215Y 1.748e-01 2.992e-02 5.845 8.24e-09 ***
## P219R 8.512e-02 2.569e-02 3.313 0.000976 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6281 on 616 degrees of freedom
## Multiple R-squared: 0.9313, Adjusted R-squared: 0.9296
## F-statistic: 522.2 on 16 and 616 DF, p-value: < 2.2e-16
Allows testing hypotheses suggested by the data.
No free lunch: users must declare the algorithm.
Simplest example of selective inference: data splitting
X1 = X[selection_half,]; Y1 = Y[selection_half]
library(glmnet)
G1 = glmnet(X1, Y1)
beta.hat1 = coef(G1, 43/(633 * sqrt(2)), exact=TRUE)[2:92]
selected1 = which(beta.hat1 != 0)
selected1
## [1] 1 8 16 17 19 23 29 31 32 42 50 65 67 68 78 79 81 82 84 86 87 91
summary(lm(Y ~ ., data=data.frame(X[,selected1], Y), subset=inference_half))
##
## Call:
## lm(formula = Y ~ ., data = data.frame(X[, selected1], Y), subset = inference_half)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.0572 -0.3701 -0.0860 0.3240 4.9064
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.037984 0.039323 0.966 0.334871
## P6D 0.040582 0.044685 0.908 0.364533
## P41L -0.121952 0.075873 -1.607 0.109056
## P62V 0.061613 0.049640 1.241 0.215522
## P65R 0.313896 0.039398 7.967 3.60e-14 ***
## P67N 0.211301 0.061126 3.457 0.000627 ***
## P69i 0.170076 0.049531 3.434 0.000681 ***
## P75T 0.104660 0.041891 2.498 0.013023 *
## P83K -0.083117 0.041709 -1.993 0.047206 *
## P90I 0.094192 0.037612 2.504 0.012811 *
## P116Y 0.151594 0.037910 3.999 8.06e-05 ***
## P135M 0.017384 0.041423 0.420 0.675030
## P178M -0.046817 0.040655 -1.152 0.250434
## P181C 0.136794 0.042864 3.191 0.001570 **
## P184V 2.203686 0.041395 53.236 < 2e-16 ***
## P210W -0.011475 0.069596 -0.165 0.869152
## P211K -0.005614 0.041842 -0.134 0.893348
## P215F 0.189774 0.051694 3.671 0.000287 ***
## P215Y 0.288219 0.076418 3.772 0.000196 ***
## P219E -0.041281 0.043536 -0.948 0.343806
## P219Q -0.033336 0.059221 -0.563 0.573921
## P219R 0.081919 0.038202 2.144 0.032822 *
## P228R -0.044568 0.044285 -1.006 0.315055
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6924 on 294 degrees of freedom
## Multiple R-squared: 0.9201, Adjusted R-squared: 0.9141
## F-statistic: 153.8 on 22 and 294 DF, p-value: < 2.2e-16
Formal justification: \[ {\cal L}_{F \times \mathbb{Q}}(X_{2,E}^Ty_2) = {\cal L}_F(X_{2,E}^Ty_2 | \hat{E}, \omega, X_1, y_1) \] with \(\omega\) the random split, \((X_1, y_1)\) the selection_half, and \((X_2, y_2)\) the inference_half.
Model used for inference (assumption!) \[ F \in \left\{N(X_E\beta_E, \sigma_E^2 I): \beta_E \in \mathbb{R}^E \right\} = {\cal M}_E. \]
Note \[ {\cal L}_{F \times \mathbb{Q}} \left(X_{2,E}^Ty_2 \bigl\vert \hat{E}(y, \omega)=E, \omega, y_1 \right) \equiv {\cal L}_{F\times \mathbb{Q}} \left(X_{E}^Ty \bigl\vert \hat{E}(y, \omega)=E, \omega, X_1, y_1\right) \]
By conditioning on the selection_half, inference is possible (and simple) using the inference_half.
Conditioning on \((X_1, y_1)\) a fortiori conditions on the variables selected using \((X_1, y_1)\).
Hypotheses generated by the selection_half are functions of the data!
Similarly, if hypotheses are generated by pilot data, at the confirmatory stage they are still functions of all of the data.
We could use all of the data to choose a model \[ {\cal L}_{F }\left(X^T_Ey \bigl| \hat{E}\right) \] and still condition on the variables selected.
Somewhat easier to condition on \(\text{sign}(\hat{\beta}) \equiv (\hat{E}, z_{\hat{E}})\).
selected_vars
## [1] "P62V" "P65R" "P67N" "P69i" "P75I" "P77L" "P83K" "P90I"
## [9] "P115F" "P151M" "P181C" "P184V" "P190A" "P215F" "P215Y" "P219R"
library(selectiveInference)
fixedLassoInf(X, Y, beta.hat, 43, intercept=FALSE)
##
## Call:
## fixedLassoInf(x = X, y = Y, beta = beta.hat, lambda = 43, intercept = FALSE)
##
## Standard deviation of noise (specified or estimated) sigma = 0.634
##
## Testing results at lambda = 43.000, with alpha = 0.100
##
## Var Coef Z-score P-value LowConfPt UpConfPt LowTailArea UpTailArea
## 16 0.045 1.476 0.184 -0.045 0.145 0.050 0.050
## 17 0.284 10.761 0.000 0.240 0.336 0.049 0.049
## 19 0.142 4.890 0.000 0.095 0.225 0.049 0.050
## 23 0.165 6.070 0.000 0.114 0.210 0.048 0.049
## 27 0.028 0.694 0.723 -0.434 0.094 0.050 0.049
## 30 0.048 1.093 0.234 -0.108 0.324 0.050 0.050
## 31 -0.075 -2.904 0.025 -0.117 -0.013 0.050 0.049
## 32 0.104 4.057 0.007 0.041 0.146 0.050 0.049
## 41 0.068 2.453 0.083 -0.015 0.119 0.050 0.050
## 54 0.094 2.637 0.040 0.006 0.168 0.049 0.050
## 67 0.097 3.661 0.001 0.056 0.282 0.049 0.050
## 68 2.218 84.205 0.000 2.162 2.262 0.049 0.049
## 69 0.048 1.841 0.845 -0.973 0.053 0.050 0.049
## 81 0.115 3.940 0.008 0.044 0.214 0.050 0.049
## 82 0.175 5.792 0.000 0.124 0.225 0.048 0.050
## 87 0.085 3.283 0.049 0.000 0.139 0.050 0.049
##
## Note: coefficients shown are partial regression coefficients
Conditioning on the signs, inference is based on \[ {\cal L}_F\left(X^T_Ey \big\vert (\hat{E}, z_{\hat{E}}) = (E, z_E) \right) \]
rather than on \[ {\cal L}_F\left(X^T_Ey \big\vert \hat{E}=E \right), \]
under the selected model \[ F \in \left\{N(X_E\beta_E, \sigma^2_E I): \beta_E \in \mathbb{R}^E \right\} ={\cal M}_E. \]
Introducing variables \(\mu\) along with the linear equality constraint \(\mu=X\beta\) the LASSO can be stated as \[ \text{minimize}_{\mu,\beta} \frac{1}{2} \|y-\mu\|^2_2 + \lambda \|\beta\|_1 \] subject to \(\mu=X\beta\).
Minimizing the corresponding Lagrangian yields the dual problem \[ \text{minimize}_{r} \frac{1}{2} \|y-r\|^2_2 \] subject to \(\|X^Tr\|_{\infty} \leq \lambda\).
We see that the LASSO is actually a metric projection.
Having found \(\hat{r}(y)\) we can recover \(X\hat{\beta}\) by \[ X\hat{\beta} = y - \hat{r}. \]
The vector \(\hat{\beta}\) can be found given its active set \[ \hat{E} = \left\{j:|X_j^T\hat{r}|=\lambda \right\} \]
Finally \[ \hat{\beta}_E = X_E^{\dagger}X\hat{\beta}. \]
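A sketch of these identities on simulated data with glmnet; \(\lambda\) is passed on glmnet's \(1/n\) scale, and standardize=FALSE, intercept=FALSE are set so the optimality conditions match the displays above:
library(glmnet)
set.seed(9)
n <- 200; p <- 30
X <- matrix(rnorm(n * p), n, p)
y <- as.numeric(X[, 1:3] %*% c(2, -1, 1) + rnorm(n))
lam <- 40
G <- glmnet(X, y, standardize = FALSE, intercept = FALSE)
beta.hat <- as.numeric(coef(G, s = lam / n, exact = TRUE, x = X, y = y,
                            standardize = FALSE, intercept = FALSE))[-1]
r.hat <- as.numeric(y - X %*% beta.hat)            # dual solution: the residual
E <- which(beta.hat != 0)                          # = {j : |X_j^T r.hat| = lam}, up to tolerance
range(abs(t(X[, E]) %*% r.hat))                    # ~ lam on the active set
max(abs(t(X[, -E]) %*% r.hat))                     # < lam off the active set
XE <- X[, E, drop = FALSE]
betaE <- solve(crossprod(XE), t(XE) %*% (X %*% beta.hat))  # X_E^dagger X beta.hat
cbind(betaE, beta.hat[E])                          # the two agree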
For the LASSO, the KKT conditions are \[ X^T(y-X\hat{\beta}) = \lambda \hat{u} \] where \(\hat{u} \in \partial (\|\cdot\|_1)(\hat{\beta})\).
A computer (e.g. glmnet) gives us \(\hat{\beta}\).
Rewriting slightly \[ X^Ty = X^TX\hat{\beta} + \lambda \hat{u} \]
Solving the LASSO is a map \[ X^Ty \mapsto (\hat{\beta}, \hat{u}) \]
Given \((\hat{\beta}, \hat{u})\) we can reconstruct \(X^Ty\)!
We could even invert the LASSO!
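A sketch of the KKT check and the reconstruction of \(X^Ty\), repeating the simulated glmnet setup so the block runs on its own:
library(glmnet)
set.seed(9)
n <- 200; p <- 30
X <- matrix(rnorm(n * p), n, p)
y <- as.numeric(X[, 1:3] %*% c(2, -1, 1) + rnorm(n))
lam <- 40
G <- glmnet(X, y, standardize = FALSE, intercept = FALSE)
beta.hat <- as.numeric(coef(G, s = lam / n, exact = TRUE, x = X, y = y,
                            standardize = FALSE, intercept = FALSE))[-1]
u.hat <- as.numeric(t(X) %*% (y - X %*% beta.hat)) / lam   # candidate subgradient
max(abs(u.hat))                                    # <= 1, up to solver tolerance
range(u.hat[beta.hat != 0] - sign(beta.hat[beta.hat != 0]))  # ~ 0: u.hat = sign on E
# reconstruct X^T y from (beta.hat, u.hat); exact by rearranging the KKT conditions
max(abs(t(X) %*% y - (t(X) %*% X %*% beta.hat + lam * u.hat)))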
Event of interest \[ \left\{(\hat{E}, z_{\hat{E}}) = (E,z_E) \right\} \]
KKT conditions on this event \[ \begin{aligned} X_E^T(y-X_E\hat{\beta}_E) &= \lambda z_E \\ \max_{j \not \in E}|X_{j}^T(y - X_E\hat{\beta}_E)| \leq \lambda. \end{aligned} \]
Equivalent to polyhedral constraints \[ y \in \left\{w \in \real^n: A_{(E,z_E)}w \leq b_{(E,z_E)} \right\} \]
The active ones are equivalent to \[ -\text{diag}(z_E)\left(X_E^{\dagger}y - \lambda (X_E^TX_E)^{-1}z_E\right) \leq 0 \]
The inactive ones are equivalent to \[ \begin{aligned} X_{-E}^T((I-P_E)y + (X_E^T)^{\dagger} (\lambda z_E)) \leq \lambda \\ -X_{-E}^T((I-P_E)y + (X_E^T)^{\dagger} (\lambda z_E)) \leq \lambda \end{aligned} \]
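A sketch of assembling \((A_{(E,z_E)}, b_{(E,z_E)})\) from these inequalities, again repeating the simulated setup so the block is self-contained:
library(glmnet)
set.seed(9)
n <- 200; p <- 30
X <- matrix(rnorm(n * p), n, p)
y <- as.numeric(X[, 1:3] %*% c(2, -1, 1) + rnorm(n))
lam <- 40
G <- glmnet(X, y, standardize = FALSE, intercept = FALSE)
beta.hat <- as.numeric(coef(G, s = lam / n, exact = TRUE, x = X, y = y,
                            standardize = FALSE, intercept = FALSE))[-1]
E <- which(beta.hat != 0); zE <- sign(beta.hat[E])
XE <- X[, E, drop = FALSE]; XmE <- X[, -E, drop = FALSE]
XEplus <- solve(crossprod(XE), t(XE))              # X_E^dagger
PE <- XE %*% XEplus                                # projection onto col(X_E)
# active (sign) constraints
A1 <- -diag(zE, length(E)) %*% XEplus
b1 <- -lam * diag(zE, length(E)) %*% solve(crossprod(XE), zE)
# inactive constraints: |X_{-E}^T (y - X_E beta.hat_E)| <= lam
A2 <- rbind(t(XmE) %*% (diag(n) - PE), -t(XmE) %*% (diag(n) - PE))
b2 <- rep(lam, 2 * (p - length(E))) -
  rbind(t(XmE), -t(XmE)) %*% (t(XEplus) %*% (lam * zE))
A <- rbind(A1, A2); b <- c(b1, b2)
max(A %*% y - b)                                   # <= 0 (up to solver tolerance): y is in the polyhedron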
\[ H_{0,(j|E)} : \beta_{j|E} = 0. \]
Tests satisfy the selective type I error guarantee \[ \Ee_F\left[\phi_{j|E} \,\big\vert\, (\hat{E}, z_{\hat{E}}) = (E, z_E)\right] \leq \alpha, \qquad \forall F \in H_{0,(j|E)} \]
The same guarantee as data splitting!
We report the results \[\phi_{(j| \hat{E})}(y), \qquad j \in \hat{E}.\]
We talked about two different problems.
Both problems relate to critical points!
The LASSO problem throws parametric inference into the mix.
Tomorrow we will look at some tools to compute statistical quantities related to critical points.