Project

Robust techniques for functional data and generalized linear models

Robust estimators are indispensable tools in statistics. Frequently, a (small) part of the data sample follows a different pattern than the majority of the data, or even no pattern at all. Such atypical observations are called outliers. They may be simple gross errors such as measurement errors or copying mistakes. However, they may also be observations governed by different laws, or they may indicate subgroups or structures in the data sample.

Two different frameworks with different areas of application are studied. Firstly, robust techniques for functional data are investigated. This type of data, popularized by advances in data gathering, has led to a new field of study in statistics. New techniques for the detection of outliers are proposed, such as the Centrality-Stability Plot, the Functional Outlier Map, and heatmaps. These are based on statistical depth functions and distance measures derived from them. The techniques are illustrated on both univariate and multivariate functional data. Moreover, functional data defined on a multivariate domain, such as fluorescence excitation-emission spectra or video data, are studied.
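
To give a feel for how a robust distance measure can flag an outlying curve, the following minimal sketch computes a pointwise median/MAD outlyingness and averages it over the grid. This is a simplified stand-in chosen for illustration, not the depth functions or plots proposed in the thesis; the function name `functional_outlyingness` and the toy curves are assumptions.

```python
import numpy as np

def functional_outlyingness(curves):
    """Average pointwise robust outlyingness of each curve.

    ASSUMPTION: illustrative stand-in, not the thesis's depth-based measures.
    `curves` has shape (n_curves, n_gridpoints); the MAD is assumed to be
    strictly positive at every grid point.
    """
    X = np.asarray(curves, dtype=float)
    med = np.median(X, axis=0)                          # pointwise median curve
    mad = 1.4826 * np.median(np.abs(X - med), axis=0)   # pointwise scaled MAD
    pointwise = np.abs(X - med) / mad                   # robust z-scores per grid point
    return pointwise.mean(axis=1)                       # average over the domain

# Toy data (hypothetical): five similar sine curves and one vertically shifted curve.
t = np.linspace(0.0, 1.0, 50)
offsets = [-0.2, -0.1, 0.0, 0.1, 0.2, 3.0]
curves = np.stack([np.sin(2 * np.pi * t) + o for o in offsets])
scores = functional_outlyingness(curves)
```

In this toy example the shifted curve (index 5) receives by far the largest score, because the median and MAD are barely affected by a single atypical curve.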

Robust supervised classification of functional data is discussed in the third chapter. A new classification procedure based on the DistSpace transform is proposed, which maps each data point to the vector of its distances to all classes, followed by k-nearest neighbor (kNN) classification of the transformed data points. This combines affine invariance and robustness with the simplicity and wide applicability of kNN. The proposal is compared with other methods in experiments with both real and simulated data.
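
The two-step idea (map each observation to its vector of distances to all classes, then apply kNN in that space) can be sketched as follows. This is a rough illustration under stated assumptions: the distance to a class is taken here to be the Euclidean distance to the class's pointwise median curve, which is a simple substitute for the depth-based distances used in the thesis, and all function names are hypothetical.

```python
import numpy as np

def distspace_transform(X, X_train, y_train, classes):
    """Map each row of X to its vector of distances to every class.

    ASSUMPTION: distance to a class = Euclidean distance to that class's
    pointwise median curve (a stand-in for a depth-based distance).
    """
    centers = np.stack([np.median(X_train[y_train == c], axis=0) for c in classes])
    return np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)

def knn_predict(D_train, y_train, D_test, k=3):
    """Plain kNN majority vote in the transformed (distance) space."""
    dists = np.linalg.norm(D_test[:, None, :] - D_train[None, :, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]
    preds = []
    for row in nearest:
        labels, counts = np.unique(y_train[row], return_counts=True)
        preds.append(labels[np.argmax(counts)])
    return np.array(preds)

def distspace_knn(X_train, y_train, X_test, k=3):
    """Transform training and test data, then classify with kNN."""
    classes = np.unique(y_train)
    D_train = distspace_transform(X_train, X_train, y_train, classes)
    D_test = distspace_transform(X_test, X_train, y_train, classes)
    return knn_predict(D_train, y_train, D_test, k)

# Toy data (hypothetical): two groups of curves; class 0 contains one gross outlier.
t = np.linspace(0.0, 1.0, 20)
base = np.sin(2 * np.pi * t)
X_train = np.stack([base + o for o in (-0.1, 0.0, 0.1, 0.05, 100.0)]
                   + [base + 5 + o for o in (-0.1, 0.0, 0.1, 0.05)])
y_train = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1])
X_test = np.stack([base + 0.02, base + 5.03])
predictions = distspace_knn(X_train, y_train, X_test, k=3)
```

Whatever the dimension of the original curves, the transformed points live in a space whose dimension equals the number of classes, which keeps the kNN step cheap. The use of the median as class center is what makes this toy version insensitive to the outlying training curve.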

A second part of the thesis concerns Generalized Linear Models or GLMs, a unified regression framework for response variables belonging to the exponential family. This family encompasses a broad class of popular distributions such as the normal, Poisson, binomial and gamma distributions. Moreover, GLMs only assume a linear relation, up to transformation, between the predictors and the mean of the response variable.
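
In the usual textbook notation (the symbols below are assumed here and are not taken from the thesis), a GLM combines an exponential-family density for the response with a link function g that relates its mean to a linear predictor:

```latex
f(y_i;\theta_i,\phi) = \exp\!\left\{ \frac{y_i\theta_i - b(\theta_i)}{a(\phi)} + c(y_i,\phi) \right\},
\qquad \mu_i = E[Y_i] = b'(\theta_i),
\qquad g(\mu_i) = x_i^{\top}\beta .
```

For example, a Poisson response with the log link gives log(mu_i) = x_i^T beta, and a binomial response with the logit link gives logistic regression.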

Chapter 4 discusses a problem in actuarial science. More specifically, we consider the challenge insurers face when estimating the future reserves needed to handle historic and outstanding claims that are not fully settled. A well-known and widely used technique in this context is the chain-ladder method, which is a deterministic algorithm. To include a stochastic component, one may apply GLMs to the run-off triangles based on past claims data. Analytical expressions for the standard deviation of the resulting reserve estimates are typically difficult to derive. A popular alternative approach to obtain inference is to use the bootstrap technique. However, the standard procedures are sensitive to the possible presence of outliers. These atypical observations may either inflate or deflate traditional reserve estimates and the corresponding inference, such as their standard errors. Several robust bootstrap procedures are investigated in the claims reserving framework, and their performance is compared on both simulated and real data.
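
For readers unfamiliar with the deterministic chain-ladder method mentioned above, a minimal sketch of the classical (non-robust) algorithm on a cumulative run-off triangle is given below. The function name, the NaN convention for unobserved cells, and the toy triangle are assumptions made for illustration; the robust bootstrap procedures studied in the thesis are not reproduced here.

```python
import numpy as np

def chain_ladder(triangle):
    """Classical chain-ladder on a cumulative run-off triangle.

    ASSUMPTION: rows are accident years, columns are development years,
    unobserved future cells are NaN, and the first column is fully observed.
    Returns (development factors, completed triangle, reserve per accident year).
    """
    C = np.array(triangle, dtype=float)
    n_dev = C.shape[1]

    # Volume-weighted development factor for each step j -> j + 1,
    # using only the rows where the later cell has been observed.
    factors = np.empty(n_dev - 1)
    for j in range(n_dev - 1):
        observed = ~np.isnan(C[:, j + 1])
        factors[j] = C[observed, j + 1].sum() / C[observed, j].sum()

    # Latest observed cumulative claim per accident year (the diagonal).
    latest = np.array([row[~np.isnan(row)][-1] for row in C])

    # Fill the lower triangle from left to right.
    full = C.copy()
    for j in range(n_dev - 1):
        missing = np.isnan(full[:, j + 1])
        full[missing, j + 1] = full[missing, j] * factors[j]

    reserves = full[:, -1] - latest
    return factors, full, reserves

# Toy cumulative triangle (hypothetical numbers).
nan = np.nan
triangle = [[100.0, 150.0, 165.0],
            [110.0, 165.0, nan],
            [120.0, nan, nan]]
factors, full, reserves = chain_ladder(triangle)
# factors are approximately [1.5, 1.1]; reserves are approximately [0.0, 16.5, 78.0]
```

Because every development factor is a ratio of column sums, a single grossly wrong cell propagates into all projected cells to its right and below, which is one way to see why outliers can distort the reserve estimates.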

Chapter 5 deals with a phenomenon frequently occurring when analyzing data with GLMs. Real data often display a larger or smaller variability than expected under the prescribed GLM. This extra deviation around the mean, or lack thereof in the case of underdispersion, may be constant across observations but may equally well depend on a set of predictors. Accounting for this varying dispersion is critical for several reasons. Firstly, correct confidence intervals for the regression coefficients governing the mean depend on the dispersion effects. Secondly, ignoring dispersion may result in a loss of efficiency in the estimation of the mean coefficients. Lastly, the dispersion model itself may be the main focus of interest. We propose a robust estimator for the joint modeling of mean and dispersion effects in the context of GLMs. Our methodology does not assume constant dispersion but models both mean and dispersion behavior, each based on a possibly different set of predictors. As such, the proposed methodology is highly flexible. We derive theoretical properties of the estimator and discuss the problem of robust inference. The good performance of the estimator is demonstrated on both simulated and real data.
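
A common way to write such a joint model, given here as an assumed generic formulation rather than the exact specification used in the thesis, lets the mean and the dispersion each have their own linear predictor, with predictor vectors x_i and z_i, link functions g and h (h is often the logarithm), and variance function V:

```latex
g(\mu_i) = x_i^{\top}\beta, \qquad
h(\phi_i) = z_i^{\top}\gamma, \qquad
\operatorname{Var}(Y_i) = \phi_i\, V(\mu_i).
```

For a Poisson-type model with V(mu) = mu, for instance, phi_i > 1 corresponds to overdispersion and phi_i < 1 to underdispersion for observation i.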

Date: 1 Oct 2013 → 30 May 2017
Keywords: robustness, functional data, generalized linear models
Disciplines: Applied mathematics in specific fields, Computer architecture and networks, Distributed computing, Information sciences, Information systems, Programming languages, Scientific computing, Theoretical computer science, Visual computing, Other information and computing sciences, Statistics and numerical methods
Project type: PhD project