**Workshop on 30-31 March 2023**

**Speakers:**

Alexander Aue

Peter Bühlmann

Aurore Delaigle

Mathias Drton

Shao Gao

Yuta Koike

Victor Panaretos

Kolyan Ray

Stefan Richter

Ingo Steinwart

**Alexander Aue: Testing high-dimensional general linear hypotheses under a multivariate regression model with spiked noise covariance**

This talk considers the problem of testing linear hypotheses under a multivariate regression model with a high-dimensional response and spiked noise covariance. The proposed family of tests consists of test statistics based on a weighted sum of projections of the data onto the estimated latent factor directions, with the weights acting as the regularization parameters. We establish asymptotic normality of the test statistics under the null hypothesis. We also establish the power characteristics of the tests and propose a data-driven choice of the regularization parameters under a family of local alternatives. The performance of the proposed tests is evaluated through a simulation study. Finally, the proposed tests are applied to the Human Connectome Project data to test for the presence of associations between volumetric measurements of the human brain and behavioral variables.

**Peter Bühlmann: Deconfounding and Well-Specification**

Hidden confounding is a severe problem when interpreting regression or causal parameters, and it may also lead to poor generalization performance for prediction. Adjusting for unobserved confounding is important but challenging when based on observational data only. We propose spectral deconfounding, a class of linear data transformations, followed by standard sparse estimation methods such as the Lasso, or the Debiased Lasso when confidence guarantees are required. The proposed methodology has provable (optimality) properties when assuming dense confounding. We conclude with a discussion of the case where such a dense confounding assumption fails to hold and of how one can still address partial well-specification.

The talk is based on several joint works with Domagoj Cevid, Zijian Guo, Nicolai Meinshausen, Christoph Schultheiss and Ming Yuan.
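As a rough illustration of what a spectral transformation of this kind can look like, the sketch below caps the singular values of the design matrix at their median before any sparse regression is run. The specific choice of cap (the median) and the function name `trim_transform` are assumptions made for this example; the abstract itself only speaks of a class of linear transformations.

```python
import numpy as np

def trim_transform(X):
    """Return a matrix F such that the singular values of F @ X are capped at their median.

    Assumption for this sketch: the cap is the median singular value of X.
    """
    U, d, _ = np.linalg.svd(X, full_matrices=False)
    tau = np.median(d)
    shrink = np.ones_like(d)
    positive = d > 0
    shrink[positive] = np.minimum(d[positive], tau) / d[positive]
    # F leaves directions orthogonal to the columns of U unchanged.
    return np.eye(X.shape[0]) - U @ np.diag(1.0 - shrink) @ U.T

# Toy data: a few hidden confounders H affect all columns of X (dense confounding).
rng = np.random.default_rng(0)
n, p, q = 50, 200, 3
H = rng.normal(size=(n, q))
X = H @ rng.normal(size=(q, p)) + rng.normal(size=(n, p))

F = trim_transform(X)
X_tilde = F @ X  # a sparse method such as the Lasso would then be fit on (F @ X, F @ y)
```

The large singular values of `X`, which carry most of the dense confounding signal in this toy setup, are the ones that get shrunk; the remaining directions are left untouched.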

**Aurore Delaigle: Estimating a prevalence in group testing problems with missing values**

Estimating the prevalence of an infectious disease in a large population typically requires testing individuals for the disease using a specimen test. When a new disease spreads quickly, testing each individual is often not possible because of time constraints and limited resources. The group testing procedure was introduced in the 1940s to handle such situations and has been used extensively during the COVID-19 pandemic. Instead of testing all individuals for a disease, it tests the pooled specimens of groups of individuals. This approach permits fast and accurate estimation of a prevalence that is not too high. Often it is the prevalence conditional on important variables that is of interest, and techniques have been developed in the literature for estimating it from group testing data. However, these fail if covariates and/or specimens are missing for some of the individuals, a situation that is often encountered in practice. We construct consistent estimators of conditional prevalence for group testing data, designed for such cases.
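To make the basic group testing idea concrete, here is a minimal sketch of the classical unconditional estimator, not the conditional, missing-data estimators of the talk. It assumes a perfectly accurate test, equal pool sizes and independent individuals: a pool of size k is then negative with probability (1 - p)^k, which can be inverted for p. The function name `pooled_prevalence` is made up for this example.

```python
def pooled_prevalence(positive_pools, num_pools, pool_size):
    """Estimate prevalence p from the number of positive pools.

    Assumes a perfect test, equal pool sizes and independent individuals,
    so that P(pool negative) = (1 - p) ** pool_size.
    """
    if num_pools <= 0 or pool_size <= 0:
        raise ValueError("num_pools and pool_size must be positive")
    if not 0 <= positive_pools <= num_pools:
        raise ValueError("positive_pools must lie between 0 and num_pools")
    frac_negative = 1.0 - positive_pools / num_pools
    return 1.0 - frac_negative ** (1.0 / pool_size)

# 19 positive pools out of 100 pools of size 2 correspond to a prevalence of about 10%.
estimate = pooled_prevalence(positive_pools=19, num_pools=100, pool_size=2)
```

Only 100 tests are used here for 200 individuals, which is the saving that makes pooling attractive when prevalence is low.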

**Mathias Drton: Testing many possibly irregular polynomial constraints**

In a number of applications, a hypothesis of interest can be characterized algebraically by polynomial equality and inequality constraints on an easily estimable statistical parameter. However, using the constraints in statistical tests can be challenging because the number of relevant constraints may be of the same order as, or even larger than, the number of observed samples. Moreover, standard distributional approximations may be invalid due to singularities of the constraints. To mitigate these issues, we propose to design tests by estimating the relevant polynomials via incomplete U-statistics and to leverage recent advances in bootstrap approximation to derive critical values. Specifically, we form the incomplete U-statistics with a computational budget parameter on the order of the sample size and show that this allows one to accommodate settings where the individual U-statistic kernels may be mixed non-degenerate or degenerate.
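For readers unfamiliar with incomplete U-statistics, the following sketch contrasts a complete order-2 U-statistic, which averages a kernel over all pairs, with an incomplete one that averages over only `budget` randomly drawn pairs. The sampling scheme (uniform pairs drawn with replacement), the variance kernel and all function names are assumptions chosen for illustration; the construction in the talk may differ.

```python
import itertools
import numpy as np

def complete_u_statistic(x, kernel):
    """Average the kernel over all n * (n - 1) / 2 unordered pairs."""
    pairs = itertools.combinations(range(len(x)), 2)
    return float(np.mean([kernel(x[i], x[j]) for i, j in pairs]))

def incomplete_u_statistic(x, kernel, budget, rng):
    """Average a symmetric kernel over `budget` random pairs of distinct indices."""
    n = len(x)
    i = rng.integers(0, n, size=budget)
    # Shift i by a random non-zero offset modulo n so that j != i.
    j = (i + rng.integers(1, n, size=budget)) % n
    return float(np.mean([kernel(x[a], x[b]) for a, b in zip(i, j)]))

def variance_kernel(a, b):
    """Symmetric kernel whose complete U-statistic is the unbiased sample variance."""
    return 0.5 * (a - b) ** 2

rng = np.random.default_rng(0)
x = rng.normal(size=200)
full = complete_u_statistic(x, variance_kernel)  # uses 19,900 pairs
approx = incomplete_u_statistic(x, variance_kernel, budget=len(x), rng=rng)  # uses 200 pairs
```

With the budget set to the sample size, the cost grows linearly in n rather than quadratically, at the price of some extra sampling variability.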

**Shao Gao: Detection and Recovery of Sparse Signal Under Correlation**

We study a p-dimensional Gaussian sequence model with equicorrelated noise. In the first part of the talk, we consider detection of a signal that has at most s nonzero coordinates. Our result fully characterizes the nonasymptotic minimax separation rate as a function of the dimension p, the sparsity s and the correlation level. Surprisingly, not only does the order of the minimax separation rate depend on s, it also varies with p-s. This new phenomenon only occurs when correlation is present. In the second part of the talk, we consider the problem of signal recovery. Unlike the detection rate, the order of the minimax estimation rate depends on p-2s, which is also a new phenomenon that only occurs with correlation. We also consider detection and recovery procedures that are adaptive to the sparsity level. While the optimal detection rate can be achieved adaptively without any cost, the optimal recovery rate can only be achieved in expectation with some additional cost.

**Yuta Koike: High-dimensional bootstrap and asymptotic expansion: A first attempt**

The recent seminal work of Chernozhukov, Chetverikov and Kato has shown that bootstrap approximation for the maximum of a sum of independent random vectors can be justified under mild moment assumptions even when the dimension is much larger than the sample size. In this context, numerical experiments suggest that third-order matching bootstrap approximations would outperform the Gaussian approximation in finite samples, but the existing theoretical results cannot explain this phenomenon. In this talk, we present an attempt to fill this gap using the Edgeworth expansion.

**Victor Panaretos: The Extrapolation of Correlation**

We discuss the problem of positive-semidefinite extension: extending a partially specified covariance kernel from a subdomain Ω of a rectangular domain I × I to a covariance kernel on the entire domain I × I. For a broad class of domains Ω called serrated domains, we can obtain a complete picture. Namely, we demonstrate that a canonical completion always exists and can be explicitly constructed. We characterise all possible completions as suitable perturbations of the canonical completion, and determine necessary and sufficient conditions for a unique completion to exist. We interpret the canonical completion via the graphical model structure it induces on the associated Gaussian process. Furthermore, we show how the determination of the canonical completion reduces to the solution of a system of linear inverse problems in the space of Hilbert-Schmidt operators, and derive rates of convergence when the kernel is to be empirically estimated. We conclude by providing extensions of our theory to more general forms of domains, and by demonstrating how our results can be used in statistical inverse problems associated with stochastic processes. Based on joint work with K.G. Waghmare (EPFL).

**Kolyan Ray: Bayesian estimation in a multidimensional diffusion model with high frequency data**

We consider nonparametric Bayesian inference in a multidimensional diffusion model with coupled drift function and diffusion coefficient arising from physical considerations. We show that posteriors (and posterior means) based on discrete high-frequency observations and suitably calibrated Gaussian priors can recover these model parameters at the minimax optimal rate over Hölder smoothness classes in any dimension. As a by-product of our proof, we also show that certain frequentist penalized least squares estimators are minimax optimal for estimating the diffusion coefficient in a multidimensional setting.

**Stefan Richter: Empirical process theory and oracle inequalities for (non-)stationary processes**

We provide an empirical process theory for locally stationary processes based on two measures of dependence: the functional dependence measure and absolute regularity. Key results are maximal inequalities, functional central limit theorems and Bernstein-type inequalities. Oracle inequalities for minimum empirical risk estimators in terms of approximation and estimation error are provided. The results are applied to quantify the forecasting error of neural network estimators.

**Ingo Steinwart: Density-Based Cluster Analysis**

A central, initial task in data science is cluster analysis, where the goal is to find clusters in unlabeled data. One widely accepted definition of clusters has its roots in a paper by Carmichael et al., where clusters are described as densely populated areas in the input space that are separated by less populated areas. The mathematical translation of this idea usually assumes that the data is generated by some unknown probability measure that has a density with respect to the Lebesgue measure. Given a threshold level, the clusters are then defined to be the connected components of the density level set. However, choosing this threshold and possible width parameters of a density estimator, which is left to the user, is a notoriously difficult problem, typically only addressed by heuristics. In the first part of this talk, I show how a simple algorithm based on a density estimator can find the smallest level for which there is more than one connected component in the level set. Unlike other cluster algorithms, this approach is fully adaptive in the sense that it does not require the user to guess crucial hyper-parameters. In addition, I will explain how recursively applying this algorithm makes it possible to estimate the entire split tree of clusters. In the second part of the talk, I will discuss practical aspects of the algorithm, including an efficient implementation. Finally, I present some numerical illustrations.
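The level-set definition of clusters can be illustrated with a deliberately simplified one-dimensional sketch. It takes density values on an ordered grid as given, scans a user-supplied list of candidate levels from low to high, and reports the first level at which the set {density >= level} splits into more than one connected component. This is not the adaptive algorithm of the talk, which works with an estimated density and avoids hand-picked parameters; the function names and the toy density are assumptions for this example.

```python
import numpy as np

def count_components(mask):
    """Count maximal runs of True in a 1-D boolean array (connected components on a grid)."""
    mask = np.asarray(mask, dtype=bool)
    if mask.size == 0:
        return 0
    previous = np.concatenate(([False], mask[:-1]))
    # A component starts wherever a True entry is not preceded by a True entry.
    return int(np.sum(mask & ~previous))

def smallest_splitting_level(density, levels):
    """Return the smallest candidate level whose level set has more than one component."""
    density = np.asarray(density, dtype=float)
    for level in sorted(levels):
        if count_components(density >= level) > 1:
            return level
    return None  # no candidate level splits the level set

# Toy bimodal density on a grid of five points, with a valley of height 0.2 in the middle.
density = [0.1, 0.5, 0.2, 0.6, 0.1]
level = smallest_splitting_level(density, levels=[0.05, 0.15, 0.25, 0.55])
```

Applying the same search recursively inside each component found at the splitting level would yield a split tree in this toy setting.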