J. Taylor
September 3-6, 2017
\[ \newcommand{\sqbinom}[2]{\begin{bmatrix} #1 \\ #2 \end{bmatrix}} \newcommand{\Ee}{\mathbb{E}} \newcommand{\Pp}{\mathbb{P}} \newcommand{\real}{\mathbb{R}} \newcommand{\hauss}{{\cal H}} \newcommand{\lips}{{\cal L}} \newcommand{\mink}{{\cal M}} \newcommand{\data}{{\cal D}} \]
We have discussed Kac-Rice and Slepian tools for working with critical points, in particular conditioning arguments.
Tools were presented for doing computations under one distribution \(\Pp\).
Statistical inference requires understanding statistical models \({\cal M}\), i.e. collections of distributions.
Event of interest \[ \left\{(\hat{E}, z_{\hat{E}}) = (E,z_E) \right\} \]
KKT conditions \[ \begin{aligned} X_E^T(y-X_E\hat{\beta}_E) &= \lambda z_E \\ \max_{j \not \in E}|X_{j}^T(y - X_E\hat{\beta}_E)| \leq \lambda. \end{aligned} \]
Equivalent to polyhedral constraints \[ y \in \left\{w \in \real^n: A_{(E,z_E)}w \leq b_{(E,z_E)} \right\} \]
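As a concrete illustration, here is a minimal sketch in R of one standard construction of \((A_{(E,z_E)}, b_{(E,z_E)})\) (cf. Lee et al., 2016); it assumes \(X\), \(\lambda\), the active set E and its signs zE are in hand, and the helper name lasso_polyhedron is illustrative.

lasso_polyhedron <- function(X, E, zE, lambda) {
  XE <- X[, E, drop = FALSE]
  Xm <- X[, -E, drop = FALSE]
  XEinv <- solve(crossprod(XE))                 # (X_E^T X_E)^{-1}
  PE <- XE %*% XEinv %*% t(XE)                  # projection onto col(X_E)
  # inactive constraints: |X_{-E}^T (y - X_E beta_hat_E)| <= lambda
  U  <- t(Xm) %*% (diag(nrow(X)) - PE)
  c0 <- as.vector(t(Xm) %*% XE %*% XEinv %*% zE)
  A0 <- rbind(U, -U)
  b0 <- c(lambda * (1 - c0), lambda * (1 + c0))
  # active constraints: diag(zE) beta_hat_E >= 0
  A1 <- -diag(zE, nrow = length(E)) %*% XEinv %*% t(XE)
  b1 <- as.vector(-lambda * diag(zE, nrow = length(E)) %*% XEinv %*% zE)
  list(A = rbind(A0, A1), b = c(b0, b1))
}

Up to boundary cases where the inequalities hold with equality, \(\{w: A w \leq b\}\) with these choices is exactly the event \(\left\{(\hat{E}, z_{\hat{E}}) = (E,z_E) \right\}\).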
Having conditioned on \((E,z_E)\) we can use this information to form statistical hypotheses.
Statistical hypotheses are subsets of statistical models. What model are we talking about here?
A candidate model: saturated model (known variance, unstructured mean) \[ {\cal M}_S = \left\{N(\mu, \sigma^2 I): \mu \in \real^n \right\} \]
Is this the most natural model?
##
## Call:
## lm(formula = Y ~ ., data = data.frame(Y, Xselect))
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.9654 -0.3327 -0.0651 0.2653 4.9333
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.252e-12 2.497e-02 0.000 1.000000
## P62V 4.465e-02 2.999e-02 1.489 0.136994
## P65R 2.837e-01 2.612e-02 10.860 < 2e-16 ***
## P67N 1.423e-01 2.884e-02 4.935 1.03e-06 ***
## P69i 1.649e-01 2.691e-02 6.126 1.61e-09 ***
## P75I 2.770e-02 3.955e-02 0.700 0.483889
## P77L 4.774e-02 4.327e-02 1.103 0.270371
## P83K -7.475e-02 2.551e-02 -2.930 0.003513 **
## P90I 1.038e-01 2.536e-02 4.094 4.81e-05 ***
## P115F 6.754e-02 2.729e-02 2.475 0.013593 *
## P151M 9.424e-02 3.541e-02 2.661 0.007992 **
## P181C 9.691e-02 2.623e-02 3.694 0.000240 ***
## P184V 2.218e+00 2.610e-02 84.973 < 2e-16 ***
## P190A 4.797e-02 2.582e-02 1.858 0.063696 .
## P215F 1.152e-01 2.896e-02 3.976 7.83e-05 ***
## P215Y 1.748e-01 2.992e-02 5.845 8.24e-09 ***
## P219R 8.512e-02 2.569e-02 3.313 0.000976 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6281 on 616 degrees of freedom
## Multiple R-squared: 0.9313, Adjusted R-squared: 0.9296
## F-statistic: 522.2 on 16 and 616 DF, p-value: < 2.2e-16
The model in the R output above is \[ {\cal M}_E = \left\{N(X_E\beta_E, \sigma^2 I): \beta_E \in \real^E, \sigma^2 > 0 \right\}. \]
We might call this a data-selected model.
The R output is invalid because it has not adjusted for selection. Conditioning each law in \({\cal M}_E\) on the selection event would correct this.
Which model should we use?
Conditioning on \((\hat{E},\hat{z}_E)\), both \({\cal M}_S\) and \({\cal M}_E\) yield selective models.
For example, the data selected model has selective model \[ {\cal M}^*_E = \left\{F^*: F \in {\cal M}_E, \frac{dF^*}{dF}(X,y) \propto 1_{\{(\hat{E},\hat{z}_E)=(E,z_E)\}} \right\} \]
The saturated model yields selective model \[ {\cal M}^*_S = \left\{F^*: F \in {\cal M}_S, \frac{dF^*}{dF}(X,y) \propto 1_{\{(\hat{E},\hat{z}_E)=(E,z_E)\}} \right\} \]
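As a toy illustration of this tilting: with \(n=1\), \(\sigma^2=1\) and selection event \(\{y > 0\}\), each \(F = N(\mu,1)\) has selective counterpart \(F^*\) with density \[ \frac{dF^*}{dy}(y) = \frac{\phi(y-\mu)1_{\{y>0\}}}{\Phi(\mu)}, \] i.e. the Gaussian truncated to the selection event.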
Nonparametric models are also fine, though one then needs a selective CLT.
Let’s work with \({\cal M}^*_S\) for now.
Having fixed \((E,z_E)\), we know that we are considering only responses that yield the same active set and signs.
We can now test \[ H_0: \eta^T\mu=0 \] with \(\eta\) allowed to depend on \((E,z_E)\).
A common example would be testing whether the \(j\)-th coefficient is 0 \[ \eta = (X_E^T)^{\dagger}e_j \]
Such targets were discussed in the PoSI framework of Berk et al. (2013).
The parameter \(P_{\eta}^{\perp}\mu = (I - \eta\eta^T/\|\eta\|^2_2)\mu\) is a nuisance parameter for this test.
How do we get rid of it?
The standard method is plug-in. This fails! (cf. the impossibility results of Leeb and Pötscher)
Bootstrap also fails (if naively applied).
Plugging in estimates of variance does not always fail.
As \({\cal M}^*_S\) is an exponential family, we can in fact condition on the corresponding sufficient statistic.
Define \(N=P_{\eta}^{\perp}y\). Then, we can write \[ y = \eta^Ty/\|\eta\|^2_2 \cdot \eta + N \]
Conditioning on \(N\), the only randomness in the selection event is \(\eta^Ty\): \[ A( \eta^Ty/\|\eta\|^2_2 \cdot \eta + N) \leq b \iff \eta^Ty \in [{\cal V}^-(N), {\cal V}^+(N)], \dots \]
Above, \[ \begin{aligned} {\cal V}^-(N) &= \max_{j: A[j]^T\eta < 0} \frac{b_j-A[j]^TN}{A[j]^T\eta/\|\eta\|^2_2} \\ {\cal V}^+(N) &= \min_{j: A[j]^T\eta > 0} \frac{b_j-A[j]^TN}{A[j]^T\eta/\|\eta\|^2_2} \\ \end{aligned} \]
Let \(\theta=\eta^T\mu\) and define \[TG(\eta^Ty, \theta; N) = \frac{\Phi({\cal V}^+(N) - \theta) - \Phi(\eta^Ty - \theta)} {\Phi({\cal V}^+(N) - \theta) - \Phi({\cal V}^-(N) - \theta)}. \]
Note that this is just the CDF transform from the above univariate truncated Gaussian.
Then, for any \(F \in {\cal M}_S\) such that \(\eta^T\mu(F)=\theta\) \[ TG(\eta^Ty,\theta;N) \overset{F^*}{\sim} \text{Unif}(0,1). \]
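A minimal sketch of this computation in R, assuming \(A\), \(b\), \(\eta\), \(y\) are available and \(\eta\) has been scaled so that \(\text{Var}(\eta^Ty)=1\), as in the display above (the name tg_pivot is illustrative):

tg_pivot <- function(y, A, b, eta, theta = 0) {
  v    <- eta / sum(eta^2)                       # eta / ||eta||_2^2
  etay <- sum(eta * y)
  N    <- y - v * etay                           # P_eta^perp y
  Av   <- as.vector(A %*% v)
  AN   <- as.vector(A %*% N)
  Vminus <- if (any(Av < 0)) max((b - AN)[Av < 0] / Av[Av < 0]) else -Inf
  Vplus  <- if (any(Av > 0)) min((b - AN)[Av > 0] / Av[Av > 0]) else  Inf
  # CDF transform of N(theta, 1) truncated to [Vminus, Vplus]
  (pnorm(Vplus - theta) - pnorm(etay - theta)) /
    (pnorm(Vplus - theta) - pnorm(Vminus - theta))
}

For the \(j\)-th selected coefficient one would take eta <- (X[, E] %*% solve(crossprod(X[, E])))[, j], matching \(\eta = (X_E^T)^{\dagger}e_j\) above. The fixedLassoInf call below carries out this computation (with \(\sigma\) specified or estimated) for every selected variable.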
library(selectiveInference)
fixedLassoInf(X, Y, beta.hat, 43, intercept=FALSE)
##
## Call:
## fixedLassoInf(x = X, y = Y, beta = beta.hat, lambda = 43, intercept = FALSE)
##
## Standard deviation of noise (specified or estimated) sigma = 0.634
##
## Testing results at lambda = 43.000, with alpha = 0.100
##
## Var Coef Z-score P-value LowConfPt UpConfPt LowTailArea UpTailArea
## 16 0.045 1.476 0.184 -0.045 0.145 0.050 0.050
## 17 0.284 10.761 0.000 0.240 0.336 0.049 0.049
## 19 0.142 4.890 0.000 0.095 0.225 0.049 0.050
## 23 0.165 6.070 0.000 0.114 0.210 0.048 0.049
## 27 0.028 0.694 0.723 -0.434 0.094 0.050 0.049
## 30 0.048 1.093 0.234 -0.108 0.324 0.050 0.050
## 31 -0.075 -2.904 0.025 -0.117 -0.013 0.050 0.049
## 32 0.104 4.057 0.007 0.041 0.146 0.050 0.049
## 41 0.068 2.453 0.083 -0.015 0.119 0.050 0.050
## 54 0.094 2.637 0.040 0.006 0.168 0.049 0.050
## 67 0.097 3.661 0.001 0.056 0.282 0.049 0.050
## 68 2.218 84.205 0.000 2.162 2.262 0.049 0.049
## 69 0.048 1.841 0.845 -0.973 0.053 0.050 0.049
## 81 0.115 3.940 0.008 0.044 0.214 0.050 0.049
## 82 0.175 5.792 0.000 0.124 0.225 0.048 0.050
## 87 0.085 3.283 0.049 0.000 0.139 0.050 0.049
##
## Note: coefficients shown are partial regression coefficients
Plug-in uses \[ \hat{F} = N \left(\begin{pmatrix} \theta \\ Y_2 \end{pmatrix}, I_{2 \times 2} \right) \] as reference in \({\cal M}_2\) and considers the law of \(Y_1\) under \(\hat{F}^*\) (i.e. after conditioning on the winner).
If it had access to \(\mu_2\), it might use \[ F_0 = N \left(\begin{pmatrix} \theta \\ \mu_2 \end{pmatrix}, I_{2 \times 2} \right) \] and its selective counterpart \(F_0^*\).
Plug-in fails because \(\hat{F} \neq F_0\) as \(Y_2-\mu_2=O(1)\) and hence \(\hat{F}^* \neq F_0^*\).
In asymptotic settings, \(Y\) would be something like \(n^{1/2}\bar{Y}\) and under local alternatives \(\mu = n^{1/2} \delta\) so \[ Y_2 - \mu_2 = n^{1/2}(\bar{Y}_2 - \delta_2) = O(1). \]
This is at the heart of the impossibility results.
Conditioning on \(Y_2\) fixes this failure.
Suppose that the scientist is willing to assume \(\mu_2=0\), so \(H_0\) is really \[ \mu = \begin{pmatrix}\theta \\ 0 \end{pmatrix}. \]
Then, we no longer need to condition on \(Y_2\) – it gets integrated out.
We can use the law of \(Y_1\) under \(F_0^*\) for a valid test.
For the lasso, if we were to use the data-selected model then we do not need to condition on \[ R_Ey = (I - P_E)y \] with \(P_E=X[,E]X[,E]^{\dagger}\) the projection onto the column span of \(X[,E]\).
We condition instead on \(N=P_Ey - \eta^Ty / \|\eta\|^2_2 \cdot \eta\) and write \[ y = \eta^Ty / \|\eta\|^2_2 \eta + R_Ey + N \]
In this case we must integrate out \(R_Ey\).
Due to the special structure of the LASSO, it turns out we get exactly the same \(p\)-values and confidence intervals.
For algorithms like forward stepwise there is a difference!
How to make asymptotic statements?
Classical asymptotics are results along sequences of models \({\cal M}_n\).
For each \(n\), we will now have a particular selection event and \({\cal M}_n^*\).
What converges here?
Suppose CLT holds uniformly along \({\cal M}_n\) for some estimator \(\hat{\theta}_n\) for parameter \(\theta_n\).
What is the analog of CLT along \({\cal M}_n^*\)?
The model \({\cal M}^*\) is determined by some selection event \[ {\cal M}^* = \left\{F^*: \frac{dF^*}{dF} = L_F, F \in {\cal M}\right\}. \]
Properties of \(L_F\) over \({\cal M}\) can be used to answer asymptotic questions.
Suppose selection can be expressed in terms of some statistical functional \(T_n\) that satisfies a uniform CLT.
We might be interested in target statistical functional \(\hat{\theta}_n\) that satisfies uniform CLT jointly with \(T_n\).
We take our models to be the laws of \((T_n, \hat{\theta}_n)\)
The nuisance parameter is \[ N_n = T_n - \text{Cov}(T_n, \hat{\theta}_n) \text{Cov}(\hat{\theta}_n)^{-1} \hat{\theta}_n = T_n - A_n \hat{\theta}_n \]
The appropriate pre-selection Gaussian reference would be \[ n^{1/2}(\hat{\theta}_n - \theta(F) ) \sim N\left(0, n \cdot \text{Cov}(\hat{\theta}_n) \right) \]
Adjust by the restricted likelihood ratio \[ \hat{\theta}_n \mapsto L_F(N_n + A_n \hat{\theta}_n). \]
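Concretely, one way to read the last two displays (a sketch; recall \(N_n\) and \(\hat{\theta}_n\) are uncorrelated by construction, hence asymptotically independent): holding \(N_n\) fixed, the reference density used for inference on \(\theta(F)\) is proportional to \[ \hat{\theta}_n \mapsto \exp\left(-\tfrac{1}{2}(\hat{\theta}_n - \theta(F))^T \text{Cov}(\hat{\theta}_n)^{-1} (\hat{\theta}_n - \theta(F))\right) \cdot L_F(N_n + A_n \hat{\theta}_n), \] normalized over \(\hat{\theta}_n\).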
Up to now, our selective likelihood ratio was proportional to an indicator.
This led to truncated models, which can perform poorly.
If our selection is done on a randomized version of the data, laws are not truncated.
Randomization also makes it possible to establish consistency over larger models, as well as a selective CLT over larger classes of models.
See arxiv.org/1507.06739.
Suppose \(\real^p \ni \omega \sim G\) with density \(g\).
We might solve \[ \text{minimize}_{\beta} \frac{1}{2} \|Y-X\beta\|^2_2 + \lambda \|\beta\|_1 - \omega^T\beta + \frac{\epsilon}{2} \|\beta\|^2_2 \]
Very similar to data splitting.
This problem will yield untruncated selective models.
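A minimal sketch of solving the randomized problem above, assuming \(\epsilon > 0\); the randomization and ridge terms can be absorbed by augmenting the design, so an off-the-shelf lasso solver such as glmnet can be reused (the helper name randomized_lasso and the argument names are illustrative).

library(glmnet)
randomized_lasso <- function(X, y, lam, eps, omega = rnorm(ncol(X))) {
  n <- nrow(X); p <- ncol(X)
  # -omega^T beta + (eps/2)||beta||^2 = (1/2)||omega/sqrt(eps) - sqrt(eps)*beta||^2 + const
  Xa <- rbind(X, sqrt(eps) * diag(p))
  ya <- c(y, omega / sqrt(eps))
  # glmnet minimizes (1/(2*nobs)) * RSS + lambda * ||beta||_1, so rescale lambda by nobs
  fit <- glmnet(Xa, ya, standardize = FALSE, intercept = FALSE, lambda = lam / (n + p))
  as.numeric(as.matrix(coef(fit)))[-1]           # drop the (zero) intercept
}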
What does the selective likelihood ratio look like?
Suppose you have your favorite structure-inducing convex function \({\cal P}(\beta)\).
Consider the randomized problem \[ \text{minimize}_{\beta} \ell(D;\beta) + {\cal P}(\beta) -\omega^T\beta + \frac{\epsilon}{2} \|\beta\|^2_2 \]
KKT conditions are \[ \omega = \nabla \ell(D;\hat{\beta}) + \epsilon \cdot \hat{\beta} + \hat{u} \] with \[ (\hat{\beta}, \hat{u}) \in \left\{(\beta,u): u \in \partial {\cal P}(\beta) \right\}. \]
We reconstruct the pair as \[ (D,\beta,u) \overset{\phi}{\mapsto} (D, \nabla \ell(D;\beta) + \epsilon \cdot \beta + u) \]
Many queries of interest can be expressed as \[ (\hat{\beta}, \hat{u}) \in {\cal B} \subset \left\{(\beta,u): u \in \partial {\cal P}(\beta) \right\}. \]
The selective likelihood ratio is proportional to \[ L_F(D) \propto \int_{{\cal B}} g\left(\nabla \ell(D;\beta) + \epsilon \cdot \beta + u \right) J_{\phi}(D;\beta,u) \; d\beta \; du. \]
The Jacobian \(J_{\phi}\) is taken with respect to \((\beta, u)\).
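For instance, specializing to the randomized LASSO above (squared-error loss, penalty \(\lambda\|\beta\|_1\)), \(\nabla \ell(D;\beta) = X^TX\beta - X^Ty\), so the reconstruction map becomes \[ \omega = (X^TX + \epsilon I)\hat{\beta} - X^Ty + \hat{u}, \qquad \hat{u} \in \lambda \, \partial \|\hat{\beta}\|_1, \] which is the form that reappears below.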
This is (almost) the same calculation we did when considering the volume of tubes formula!
This construction is not entirely new…
Suppose we have an exponential family \[ {\cal M} = \left\{f_{\theta}: \frac{df_{\theta}}{dm}(t) = \exp\left(\theta^T t - \Lambda(\theta)\right)\right\} \]
MLE satisfies the score equation \[ \nabla \Lambda(\hat{\theta}(T)) = T \]
Density of \(\hat{\theta}\) under \(f_{\theta^*}\) (Barndorff-Nielsen; Efron and Hinkley; Fisher) \[ \theta \mapsto \exp\left(\nabla \Lambda(\theta)^T\theta^* - \Lambda(\theta^*) \right) \det(\nabla^2 \Lambda(\theta))^{1/2} \]
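A quick check in a familiar case: for the \(N(\theta, 1)\) family with standard Gaussian base measure \(m\), \(\Lambda(\theta)=\theta^2/2\), so \(\nabla\Lambda(\theta)=\theta\), \(\nabla^2\Lambda(\theta)=1\) and the display reduces to \[ \theta \mapsto \exp\left(\theta \theta^* - (\theta^*)^2/2 \right), \] the density of \(\hat{\theta}=T \sim N(\theta^*,1)\) with respect to \(m\).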
See also “estimator augmentation” (Zhou, 2014; Zhou, 2016).
The estimator augmentation / classical MLE reconstructs only the sufficient statistic \(T\)
With randomization, we reconstruct arbitrary data \((X,Y)\) and our randomization \(\omega\).
We are usually interested in conditional law, found by restricting to some \({\cal B}\).
Let \(D\) be the score of some model. \[ \text{minimize}_{\beta: \|\beta\|_1 \leq 1} -\beta^TD \]
In this case \(\epsilon=0\) and we have set \[ {\cal P}(\beta) = \begin{cases} 0 & \|\beta\|_1 \leq 1 \\ \infty & \text{otherwise.} \end{cases} \]
Interested in conditioning on the winning index, \(j^*\). KKT conditions are \[ \omega + D \in {\cal K}_{j^*} \] where \({\cal K}_j\) is the cone \[ \left\{z \in \real^p: |z_j| \geq \|z\|_{\infty} \right\}, \] i.e. \(j^*\) achieves \(\max_j |D_j + \omega_j|\).
Given a partition \(G\) of \(\{1, \dots ,p\}\) the group LASSO penalty is \[ {\cal P}(\beta) = \sum_{g \in G} \lambda_g \|\beta_g\|_2 \]
The corresponding set of possible subgradients is \[ {\cal U} = \prod_{g \in G} B_2(0, \lambda_g). \]
Conditioning on the active groups restricts \(u\) to \[ {\cal U}_E = \left(\prod_{g \in E} S(\lambda_g) \right) \times \left(\prod_{g \notin E} B_2(0,\lambda_g) \right), \] where \(S(\lambda_g)\) is the sphere of radius \(\lambda_g\), while \(\beta \in N_u({\cal U})\).
The Jacobian of the reconstruction has a form very similar to that of the metric projection onto \({\cal U}\) in the Steiner-Weyl formula…
Our first lecture considered two seemingly separate problems.
Both problems considered properties of critical points on different domains \(K\).
Geometry of \(K\) plays important role in studying statistical properties of critical points.
The normal bundle of \(K\) is perhaps the most fundamental object in describing these properties.
A data scientist might ask a second question…
A data scientist might collect some fresh data…
Consider high-dimensional setting without making many additional assumptions. (Markovic, Xia, and T., 2017)
Allow arbitrary queries.
Allow many, many queries as in reusable holdout (Dwork et al., 2016)
Query 2 is allowed to depend on the outcome of query 1.
The outcomes of the queries determine \({\cal B}_i\).
The choice of the loss \(\ell_2\) for the second query is allowed to depend on the result of query 1.
After observing that the LASSO chose \((E,z_E)\), we might be interested in testing \(H_0: \theta^*(F)=0\) for some functional \(\theta^*\).
Often, we will have an estimator \(\hat{\theta}\) of \(\theta^*\) satisfying a joint CLT with \(\data=X^Ty\).
We decompose \(\data\) into \({\cal N}\) (for nuisance) and its \(\hat{\theta}\) component \[ \begin{aligned} \data &= \data - \text{Cov}_F(\data, \hat{\theta}) \text{Cov}_F(\hat{\theta})^{-1} \hat{\theta} + \text{Cov}_F(\data, \hat{\theta}) \text{Cov}_F(\hat{\theta})^{-1} \hat{\theta} \\ &= {\cal N} + A \hat{\theta} \end{aligned} \] with \(A = \text{Cov}_F(\data, \hat{\theta}) \text{Cov}_F(\hat{\theta})^{-1}\).
Plugging in \(N(\theta, \text{Cov}_F(\hat{\theta}))\) yields the selective likelihood ratio \[ \theta \mapsto \int_{{\cal B}_{E,z_E}} g \left((X^TX + \epsilon I) \beta - {\cal N} - A \theta + u\right) \; d\beta \; du \] since the gradient of the squared-error loss is \(X^TX\beta - \data\).