Robust inference techniques based on resampling
Linear regression is the most widely used type of regression analysis in statistics. A statistical analysis of a linear regression model usually begins with estimation of the regression coefficients and continues with measuring the accuracy of the estimators. Unfortunately, it is well known that a traditional statistical analysis based on the least squares principle is very sensitive to outliers in the data. Although many robust estimators have been proposed to control the effect of outliers, robust inference techniques have remained scarce. The main goal of this thesis is therefore to investigate robust inference techniques. The key concept in this development is the fast and robust resampling methodology. Instead of applying standard resampling techniques such as bootstrapping and subsampling directly, a resampling distribution is generated by calculating a fast and robust resampling estimator for a large number of resamples. The resulting resampling distribution is robust against outliers and, in contrast to the original resampling algorithm, can be computed extremely fast. Inference based on fast and robust resampling is considered for seemingly unrelated regression models and for generalized linear models.
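The core idea, replacing a full iterative re-fit on every resample with a cheap update anchored at the full-sample estimate, can be sketched for a simple Huber M-estimator of location. This is a deliberately minimal stand-in for the estimators treated in the thesis: the function names and the one-step Newton update below are illustrative assumptions, not the thesis's exact fast and robust resampling algorithm.

```python
import numpy as np

def huber_psi(r, c=1.345):
    """Huber psi function: linear near zero, clipped at +/- c."""
    return np.clip(r, -c, c)

def huber_dpsi(r, c=1.345):
    """Derivative of the Huber psi function."""
    return (np.abs(r) <= c).astype(float)

def m_location(x, c=1.345, tol=1e-8, max_iter=100):
    """Full iteratively reweighted solve on the original sample."""
    mu = np.median(x)
    for _ in range(max_iter):
        r = x - mu
        # Weight psi(r)/r, with the limit value 1 at r = 0.
        w = np.where(r == 0, 1.0, huber_psi(r, c) / np.where(r == 0, 1.0, r))
        mu_new = np.sum(w * x) / np.sum(w)
        if abs(mu_new - mu) < tol:
            break
        mu = mu_new
    return mu

def fast_robust_bootstrap(x, n_boot=1000, c=1.345, seed=0):
    """Resampling distribution via cheap one-step updates (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    mu_hat = m_location(x, c)  # the expensive fit is done only once
    n = len(x)
    boot = np.empty(n_boot)
    for b in range(n_boot):
        xb = rng.choice(x, size=n, replace=True)
        r = xb - mu_hat
        # One Newton step from the full-sample fit instead of a full re-solve;
        # outliers enter only through the bounded psi, so the update stays robust.
        boot[b] = mu_hat + huber_psi(r, c).sum() / max(huber_dpsi(r, c).sum(), 1.0)
    return mu_hat, boot
```

Because each resample costs only one pass over the data, the resampling distribution is orders of magnitude cheaper than re-running the iterative robust fit on every bootstrap sample.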
Seemingly unrelated regression models generalize linear regression models with normally distributed errors by considering multiple regression equations that are linked by contemporaneously correlated disturbances. MM-estimators are introduced, which combine a high breakdown point with a high efficiency at the normal model. A fast and robust bootstrap procedure is then developed to obtain robust inference for these estimators. Confidence intervals for the model parameters as well as hypothesis tests for linear restrictions on the regression coefficients in seemingly unrelated regression models are constructed. Moreover, in order to evaluate the need for a seemingly unrelated regression model, a robust procedure is proposed to test for the presence of correlation among the disturbances. The performance of the fast and robust bootstrap inference is evaluated empirically in simulation studies and illustrated on real data.
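To fix ideas about how a resampling distribution yields confidence intervals, the following sketch builds a percentile interval for a single-equation regression slope from a plain case bootstrap. This is not the SUR MM-estimator or the fast and robust bootstrap of the thesis, only a hedged illustration of the percentile construction; all names are assumptions.

```python
import numpy as np

def bootstrap_slope_ci(x, y, n_boot=2000, alpha=0.05, seed=0):
    """Percentile confidence interval for an OLS slope via case bootstrap."""
    rng = np.random.default_rng(seed)
    n = len(y)
    slopes = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, n)   # resample cases with replacement
        xb, yb = x[idx], y[idx]
        xc = xb - xb.mean()
        # OLS slope on the resample (y-centering cancels against centered x).
        slopes[b] = np.dot(xc, yb) / np.dot(xc, xc)
    # Percentile interval: empirical quantiles of the resampling distribution.
    lo, hi = np.quantile(slopes, [alpha / 2, 1 - alpha / 2])
    return lo, hi
```

In the thesis the resampling distribution comes from the fast and robust bootstrap of the SUR MM-estimators instead, but the interval construction from the resampled estimates follows the same quantile logic.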
MM-estimators for seemingly unrelated regression models are applied in the framework of stochastic loss reserving for general insurance, as a robust alternative to the general multivariate chain ladder method. The chain ladder method is a widely used technique to forecast the reserves that an insurance company needs to hold to settle claims that have been incurred but not yet fully paid. To make predictions for multiple run-off triangles simultaneously, a general multivariate chain ladder method has been proposed that takes into account contemporaneous correlations and structural connections between different run-off triangles. With the robust methodology it is possible to detect which claims have an abnormally large influence on the reserve estimates. A simulation design is introduced to generate artificial multivariate run-off triangles, and the importance of taking into account contemporaneous correlations and structural connections between the run-off triangles is illustrated. By generating contaminated data, the sensitivity of the traditional chain ladder method and the good performance of the robust method are shown. The analysis of a portfolio from practice makes clear that the robust method can provide better insight into the structure of the data.
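For readers unfamiliar with loss reserving, the classical univariate chain ladder computation can be sketched as follows: volume-weighted development factors are estimated from the observed upper part of a cumulative run-off triangle, the lower part is projected with those factors, and the reserve per accident year is the projected ultimate minus the latest observed diagonal. This is a minimal non-robust sketch, not the multivariate or robust method of the thesis; the function name is an assumption.

```python
import numpy as np

def chain_ladder_reserves(triangle):
    """Classical chain ladder on an (n x n) cumulative run-off triangle.

    Rows are accident years, columns development years; entries below the
    anti-diagonal (the future) are NaN.
    """
    tri = triangle.astype(float)
    n = tri.shape[0]
    # Volume-weighted development factors from the observed part.
    f = np.empty(n - 1)
    for j in range(n - 1):
        rows = n - 1 - j  # accident years with both columns j and j+1 observed
        f[j] = tri[:rows, j + 1].sum() / tri[:rows, j].sum()
    # Project the lower (unobserved) triangle with the factors.
    for i in range(1, n):
        for j in range(n - i, n):
            tri[i, j] = tri[i, j - 1] * f[j - 1]
    # Reserve = projected ultimate minus latest observed cumulative claim.
    latest = np.array([triangle[i, n - 1 - i] for i in range(n)])
    return f, tri[:, -1] - latest
```

Because every cell enters the sums that define the factors, a single abnormally large claim can distort all projected reserves, which is the sensitivity that motivates the robust alternative developed in the thesis.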
Finally, robust model selection inspired by the fast and robust resampling methodology is introduced for generalized linear models. Selecting the optimal model from a set of competing models is an essential task in statistics. Particular attention is paid to a robust model selection criterion that combines goodness of fit and a measure of prediction. The prediction loss is estimated by using resampling techniques. In addition to case bootstrapping, error bootstrapping and subsampling algorithms are also considered. To reduce the computational burden, a modified fast and robust resampling method is proposed. It is shown that this modification still yields a consistent model selection criterion, in the sense that the optimal model is identified with probability tending to one as the sample size grows to infinity. The performance of the proposed methodology is evaluated empirically by a simulation study and illustrated on real data examples.
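The general recipe, estimating each candidate model's prediction loss by resampling and selecting the minimizer, can be sketched with a case bootstrap and ordinary least squares. This is a non-robust simplification for illustration only: the thesis works with generalized linear models, robust loss measures, and the fast and robust modification, none of which appear here, and all function names are assumptions.

```python
import numpy as np

def bootstrap_prediction_loss(X, y, cols, n_boot=200, rng=None):
    """Estimate squared prediction loss of the model using columns `cols`."""
    if rng is None:
        rng = np.random.default_rng(0)
    n = len(y)
    Xs = X[:, list(cols)]
    losses = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)              # case bootstrap resample
        oob = np.setdiff1d(np.arange(n), idx)    # cases left out of the resample
        if oob.size == 0:
            continue
        beta, *_ = np.linalg.lstsq(Xs[idx], y[idx], rcond=None)
        resid = y[oob] - Xs[oob] @ beta
        losses.append(np.mean(resid ** 2))       # out-of-resample squared loss
    return float(np.mean(losses))

def select_model(X, y, candidates, n_boot=200, seed=0):
    """Pick the candidate column set with the smallest estimated loss."""
    rng = np.random.default_rng(seed)
    scores = {tuple(c): bootstrap_prediction_loss(X, y, c, n_boot, rng)
              for c in candidates}
    return min(scores, key=scores.get), scores
```

The fast and robust modification studied in the thesis replaces the full re-fit inside the resampling loop by a cheap approximation anchored at the full-sample robust fit, which is what makes the criterion computationally feasible while preserving its consistency.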