Project

Development and implementation of real-time, robust statistical methods with novel applications in food sorting

In industrial food sorting, fast sensor-based technologies are used for automated food inspection. These sensors typically produce multivariate data that serve as input for classification algorithms, which detect common defects among the regular material. Because huge amounts of product are scanned in an automated fashion, food inspection machines generate gigabytes of multivariate data in milliseconds, frequently pushing the boundaries of available computing power.

Outliers can dramatically reduce the prediction performance of traditional classifiers. Robust algorithms are therefore essential, since industrial datasets are typically corrupted by outliers in the form of label and measurement noise. However, none of the well-known high-breakdown methods can handle the sheer volume of data from these machines. This thesis addresses this problem by introducing new robust statistical procedures that are fast to compute and specifically designed for robust outlier detection and multiclass classification problems.

This doctoral thesis contains four chapters. The first chapter discusses the relation between the different outlier detection techniques.

The second chapter focuses on speeding up the deterministic minimum covariance determinant method (DetMCD), which detects outliers by fitting a robust covariance matrix. We construct a much faster version of DetMCD by replacing its initial estimators with two new methods and by incorporating update-based concentration steps. The computation time is reduced further by parallel computing, which requires the development of a novel robust aggregation method to combine the results from the individual threads.
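To make the notion of a concentration step concrete, the following is a minimal sketch of the classical concentration step that MCD-type methods iterate. It is not the update-based variant developed in the thesis: the fit is recomputed from scratch in every iteration, and the function names `c_step` and `concentrate` are hypothetical. Starting from an initial subset of h observations, each step refits the center and covariance on the subset and keeps the h observations with the smallest Mahalanobis distances. A step of this kind cannot increase the determinant of the subset covariance.

```python
import numpy as np

def c_step(X, subset, h):
    """One plain concentration step (illustrative, not the thesis's update-based version)."""
    center = X[subset].mean(axis=0)
    cov = np.cov(X[subset], rowvar=False)
    diff = X - center
    # Squared Mahalanobis distance of every observation to the current fit.
    d2 = np.einsum("ij,ij->i", diff @ np.linalg.inv(cov), diff)
    # Keep the h observations closest to the current fit.
    return np.sort(np.argsort(d2)[:h])

def concentrate(X, initial_subset, h, max_iter=100):
    """Iterate concentration steps until the subset no longer changes."""
    subset = np.sort(np.asarray(initial_subset))
    for _ in range(max_iter):
        new_subset = c_step(X, subset, h)
        if np.array_equal(new_subset, subset):
            break
        subset = new_subset
    center = X[subset].mean(axis=0)
    cov = np.cov(X[subset], rowvar=False)
    return center, cov, subset
```

In this sketch the initial subset has to be supplied by the caller, for example the h points closest to the coordinatewise median. DetMCD instead derives its starting subsets from several deterministic robust initial estimators, which are not reproduced here.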

In the third chapter, we integrate the real-time DetMCD method into quadratic discriminant analysis (QDA), which is a widely used classification technique. This allows us to solve classification problems with multiple classes. Based on a training dataset, each class in the data is characterized by an estimate of its center and shape, which can then be used to assign unseen observations to one of the classes. We present a novel, robust QDA method where we additionally integrate an anomaly detection step to classify the most suspicious observations into a separate class of outliers. We also introduce the label bias plot, a graphical display to identify label and measurement noise in the training data.
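The classification rule described above can be sketched as follows. This is an illustration under stated assumptions rather than the method of the thesis: the per-class centers, covariance matrices and priors are taken as given (in a robust setting they would come from an MCD-type fit per class), the function name `qda_predict` is hypothetical, and the outlier rule used here, flagging an observation whose squared Mahalanobis distance to its assigned class exceeds a fixed cutoff, is one simple choice among many.

```python
import numpy as np

def qda_predict(X, centers, covs, priors, cutoff):
    """Assign each row of X to a class, or to -1 if it looks like an outlier.

    centers, covs, priors: per-class estimates, assumed to be computed elsewhere.
    cutoff: threshold on the squared Mahalanobis distance (an assumed rule).
    """
    X = np.asarray(X, dtype=float)
    n, k = X.shape[0], len(centers)
    scores = np.empty((n, k))
    d2 = np.empty((n, k))
    for j in range(k):
        diff = X - centers[j]
        d2[:, j] = np.einsum("ij,ij->i", diff @ np.linalg.inv(covs[j]), diff)
        _, logdet = np.linalg.slogdet(covs[j])
        # Quadratic discriminant score of class j (up to an additive constant).
        scores[:, j] = -0.5 * d2[:, j] - 0.5 * logdet + np.log(priors[j])
    labels = scores.argmax(axis=1)
    # Move observations that are far from their best class to the outlier class.
    far = d2[np.arange(n), labels] > cutoff
    labels[far] = -1
    return labels
```

A common choice for `cutoff` is a high quantile of the chi-square distribution with as many degrees of freedom as there are variables; for two variables the 97.5% quantile is approximately 7.38.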

Most outlier detection techniques, however, assume that the non-outlying observations are roughly elliptically distributed, whereas many datasets are not of that form. Moreover, their computation time increases substantially with the number of variables. In the fourth chapter, we therefore propose the Kernel Minimum Regularized Covariance Determinant (KMRCD) estimator, which addresses both issues. It is not restricted to elliptical data because it implicitly computes robust covariances in a kernel-induced feature space. A fast algorithm is constructed that starts from kernel-based initial estimates and exploits the kernel trick to speed up the subsequent computations.
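The idea of computing covariance-based distances implicitly in a feature space can be illustrated with a short sketch. It assumes a regularized covariance of the form (1 - rho) * S_H + rho * I in feature space, where S_H is the covariance (with 1/h normalization) of a subset H of h observations and rho is a fixed value in (0, 1). These choices, and the function names `rbf_kernel` and `kernel_mahalanobis`, are assumptions made for illustration; the sketch is not the KMRCD algorithm itself. Using the Woodbury identity, the squared Mahalanobis distance of any observation can be written purely in terms of kernel values, so the feature map is never evaluated.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq = (A**2).sum(axis=1)[:, None] + (B**2).sum(axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def kernel_mahalanobis(K_xh, K_hh, k_xx, rho):
    """Squared regularized Mahalanobis distances in feature space.

    K_xh: (n, h) kernel values between all observations and the subset H.
    K_hh: (h, h) kernel matrix of the subset H.
    k_xx: (n,) kernel values k(x, x) for all observations.
    rho:  regularization weight in (0, 1), assumed fixed here.
    """
    h = K_hh.shape[0]
    row_mean = K_xh.mean(axis=1, keepdims=True)
    col_mean = K_hh.mean(axis=0, keepdims=True)
    total = K_hh.mean()
    # Center all kernel quantities around the feature-space mean of H.
    Kc_xh = K_xh - row_mean - col_mean + total
    Kc_hh = K_hh - K_hh.mean(axis=1, keepdims=True) - col_mean + total
    kc_xx = k_xx - 2 * row_mean[:, 0] + total
    # Woodbury identity: only an h-by-h system has to be solved.
    M = Kc_hh + (rho * h / (1 - rho)) * np.eye(h)
    sol = np.linalg.solve(M, Kc_xh.T)
    return (kc_xx - np.einsum("ij,ji->i", Kc_xh, sol)) / rho
```

With a linear kernel the result coincides with the ordinary regularized Mahalanobis distance in the input space, which gives a convenient sanity check. With a nonlinear kernel such as `rbf_kernel`, the same code yields distances that are not tied to an elliptical shape in the original variables, and the cost of the linear solve depends on h rather than on the number of variables.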

 

Date: 1 Oct 2016 → 17 Dec 2020
Keywords: robust statistics, big data, outlier detection
Disciplines: Applied mathematics in specific fields, Computer architecture and networks, Distributed computing, Information sciences, Information systems, Programming languages, Scientific computing, Theoretical computer science, Visual computing, Other information and computing sciences, Statistics and numerical methods
Project type: PhD project