Title Promoter Affiliations Abstract "Development and application of statistical methodology for analysis of the phenomenon of multi-drug resistance in the EU: demonstration of analytical approaches using antimicrobial resistance isolate-based data." "Marc AERTS" "Centre for Statistics" "In accordance with Decision 2013/652/EU, the harmonised monitoring and reporting of antimicrobial resistance (AMR) in animals and food will be further enhanced in the EU, and reporting of AMR data at isolate level by Member States (MSs) to EFSA will become mandatory from calendar year 2015 onwards (i.e. for 2014 data onwards). Based on AMR isolate-based data reported on a voluntary basis, the 2012 EU Summary Report on AMR summarises important information on multi-drug resistance (MDR) and already includes 'summary indicators' of MDR and a breakdown of the multi-/co-resistance patterns recorded. The isolate-based dataset allows the following to be reported: the source of the sample (animal species, animal populations or food categories), the date of sampling, the country of origin, the bacterial species and subtype of the isolate tested, and the susceptibility test results for a harmonised set of antimicrobial substances. MDR is considered to be a major public health issue. It is important that EFSA can provide an evidence-based evaluation of the role of food production in the emergence and spread of multidrug-resistant micro-organisms. Further analytical and methodological preparatory work is performed on the available 2010-2013 isolate-based data in order to allow a more in-depth analysis of MDR, notably to investigate associations between resistance traits and to carry out tracing analyses of the geographical and temporal diffusion of MDR. This project provides suitable analysis methods to address these questions and may identify areas for improvement in monitoring systems." "Statistical Tools for Anomaly Detection and Fraud Analytics" "Tim Verdonck" "Statistics and Data Science, Information Systems Engineering Research Group (LIRIS) (main work address Leuven)" "Data is one of the most valuable resources businesses have today. Companies and institutions increasingly invest in tools and platforms to collect and store data about every event that impacts their business, such as their customers, transactions, products, and the market in which they operate. Although the costs of maintaining this huge, expanding volume of data are often considerable, companies are willing to make the investment as it serves their ambition of extracting valuable information from these large quantities of data. As a result, companies increasingly rely on data-driven techniques for developing powerful predictive models to aid them in their decision processes. These models, however, are often not well aligned with the core business objective of maximizing profit or minimizing financial losses, in the sense that the models fail to take into account the costs and benefits associated with their predictions. In this thesis, we propose new methods for developing models that incorporate costs and gains directly into the construction process of the model. The first method, called ProfTree (Höppner et al., 2018), builds a profit-driven decision tree for predicting customer churn. The expected maximum profit measure for customer churn (EMPC) was recently proposed in order to select the most profitable churn model (Verbraken et al., 2013). 
ProfTree integrates the EMPC metric directly into the model construction and uses an evolutionary algorithm for learning profit-driven decision trees. The second and third methods, called cslogit and csboost, are approaches for learning a model when the costs due to misclassification vary between instances. An instance-dependent threshold is derived, based on the instance-dependent cost matrix for transfer fraud detection, that allows for making the optimal cost-based decision for each transaction. The two novel classifiers, cslogit and csboost, are based on lasso-regularized logistic regression and gradient tree boosting, respectively, which directly minimize the proposed instance-dependent cost measure when learning a classification model. A major challenge when trying to detect fraud is that the fraudulent activities form a minority class that makes up a very small proportion of the data set, often less than 0.5%. Detecting fraud in such a highly imbalanced data set typically leads to predictions that favor the majority group, causing fraud to remain undetected. The third contribution in this thesis is an oversampling technique, called robROSE, that addresses the problem of imbalanced data by creating synthetic samples that mimic the minority class while ignoring anomalies that could distort the detection algorithm and spoil the resulting analysis. Besides using methods for making data-driven decisions, businesses often take advantage of statistical techniques to detect anomalies in their data with the goal of discovering new insights. However, the mere detection of an anomalous case does not always answer all questions associated with that data point. In particular, once an outlier is detected, the scientific question of why the case has been flagged as an outlier becomes of interest. In this thesis, we propose a fast and efficient method, called SPADIMO (Debruyne et al., 2019), to detect the variables that contribute most to an outlier’s abnormal behavior. Thereby, the method helps to understand in which way an outlier is outlying. The SPADIMO algorithm allows us to introduce the cellwise robust M regression estimator (Filzmoser et al., 2020) as the first linear regression estimator of its kind that intrinsically yields both a map of cellwise outliers consistent with the linear model, and a vector of regression coefficients that is robust against outliers. As a by-product, the method yields a weighted and imputed data set that contains estimates of what the values in cellwise outliers would need to amount to if they had fit the model. All introduced algorithms are implemented in R and are included in their respective R packages, together with supporting functions and accompanying documentation on the usage of the algorithms. These R packages are publicly available on CRAN and at github.com/SebastiaanHoppner." "Data Analytics and Statistical Modelling" "Tim Verdonck" "Statistics and Data Science" "The project – in cooperation with the Joint Research Centre (JRC) of the European Commission – aims to develop and apply robust statistical and machine learning techniques for outlier/anomaly detection. Various approaches will be considered, such as distance/density or tree-based approaches, (generalised) Benford’s law, robust regression techniques, time series analysis and graphical models. Such methods will be applied to high-dimensional and heterogeneous data provided by the JRC coming from one or more fields of application, such as fraud detection, misinformation detection and anti-money laundering." 
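As a purely illustrative aside to the fraud analytics thesis summarised above, the following R sketch shows one standard way in which an instance-dependent cost matrix translates into a per-transaction decision threshold. The assumed cost structure (a fixed investigation fee for every flagged transaction, the transferred amount lost for every missed fraud, no cost otherwise) and all variable names are choices made for this example only; the sketch does not reproduce the cslogit or csboost packages' actual interfaces or the thesis' exact cost definitions.

# Hedged sketch: instance-dependent cost-based decision rule for transfer fraud.
# Assumed costs: investigating a flagged transaction costs a fixed fee c_f;
# a missed fraud costs the transferred amount; passing a legitimate transaction costs nothing.
set.seed(1)
n        <- 1000
amount   <- rlnorm(n, meanlog = 5, sdlog = 1)      # transferred amounts
is_fraud <- rbinom(n, size = 1, prob = 0.005)      # ~0.5% fraud, as in the abstract
score    <- plogis(-6 + 3 * is_fraud + rnorm(n))   # stand-in for a fitted fraud probability

c_f <- 10                                          # fixed investigation fee

# Flag transaction i when its expected loss if ignored (score_i * amount_i)
# exceeds the cost of investigating it, i.e. when score_i > c_f / amount_i.
threshold <- pmin(1, c_f / amount)
flag      <- score > threshold

# Average realised cost of these instance-dependent decisions (lower is better)
cost <- ifelse(flag, c_f, ifelse(is_fraud == 1, amount, 0))
mean(cost)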
"PHD POSITION IN DATA ANALYTICS AND STATISTICAL MODELING" "Stefan Van Aelst" "Statistics and Data Science" "In many research fields data dimension reduction techniques are widely used. Fields such as chemometrics, signal processing, and video compression, try to deal with these issues with tools that transform high-dimensional data to lower dimensions where the meaningful properties of the data are retained. Principal Component Analysis (PCA) is a widely used tool for dimension reduction. However, it is known that PCA is not robust against outliers. Most robust PCA methods are developed to deal with rowwise outliers, which are observations that deviate from the majority. The MacroPCA method can additionally deal with outlying cells and missing values. These methods however can only be applied to two-dimensional data matrices. For multiway data, models such as parallel factor analysis (PARAFAC) have been developed to reduce their dimension. Available robust PARAFAC methods can only deal with rowwise outliers and missing values. The goal of this research is to develop and study methods that can simultaneously deal with rowwise outliers, cellwise outliers, and missing values in multiway data. Application to real-life datasets will also be considered, especially in chemometrics." "Dose-efficient fusion of imaging and analytical techniques in scanning transmission electron microscopy." "Sandra Van Aert" "Electron microscopy for materials research (EMAT)" "The aim of this project is to realize a major breakthrough in the quantitative analysis of imaging and analytical techniques in scanning transmission electron microscopy (STEM). Therefore, we will exploit the physics-based description of the fundamental processes of electron scattering and combine this with a thorough multivariate statistical analysis of the recorded signals. In this manner, we will be able to identify the chemical nature of all individual atoms in three dimensions (3D). So far, imaging and analytical signals have been analyzed separately in STEM. Although analytical techniques are in principle well suited because of their elemental specificity, they have a much lower signal to noise ratio as compared to imaging techniques. We foresee that our multivariate method, in which new physics-based models are incorporated to describe the electron-object interaction, enables us to achieve element-specific atom counting at a local scale and to determine even the ordering of the atoms along the viewing direction. Furthermore, our approach will be optimized to reach high elemental measurement precision for a minimum incoming electron dose. This novel dose-efficient quantitative methodology will clearly usher electron microscopy in a new era of 3D element-specific metrology at the atomic scale. This will exactly provide the input needed to understand the unique link between a material's structure and its properties in both materials and in life sciences." 
"Development of sample introduction methods and analytical protocols for high-precision and accurate isotopic analyses via multi collector ICP Mass Spectrometry for innovative and interdisciplinary applications" "No name available, Philippe Claeys" "Ghent University, KU Leuven, Analytical, Environmental & Geo-Chemistry, Geology" "Although as a first approximation, it can be stated that the isotopic composition of materials is constant in nature, variations do occur - to a relatively pronounced extent as a result of the decay of naturally occurring long-lived radionuclides and to a smaller extent (generally) as a result of isotope fractionation effects, accompanying physical and biological processes as well as chemical reactions. The high-resolution multi-collector ICP - mass spectrometer available at UGent and used in a consortium between UGent, KULeuven end VUB allows isotope ratios to be measured with extremely high precision. Novel measurement protocols and sample preparation methods - quantitative isolation of the target elements as a pure fraction is required - will be developed and the strategies thus obtained will be used in interdisciplinary projects aiming at, e.g., provenance determination of archaeological objects, development of a new diagnostic tool for tracing down specific diseases or their evolution via blood analysis or documenting seawater composition over geological times in the context of palaeoclimatic reconstructions. Efforts will also be made to understand a more recently discovered phenomenon - mass-independent isotope fractionation - with the final aim of also incorporating it in a later stage in real-life applications." "Insight and analytical problem solving in early aging" "Eva Van den Bussche" "Brain and Cognition" "We live in an aging society, which is accompanied by increasing challenges at the individual, interpersonal, clinical, and societal level. The aging challenge also consists of a rise in the number of older adults suffering from neurological disorders such as stroke. At the same time, our world is also becoming increasingly more complex: it requires efficient and flexible cognitive skills, such as cognitive control. We are continuously bombarded with sensory input. Our cognitive system needs to selectively process relevant input, maintain this input and inhibit irrelevant input, to achieve our goals. However, precisely these cognitive control skills gradually deteriorate with age and can suddenly be affected after stroke. As pharmacological treatments for cognitive decline have limited efficacy, there is an urgent need for non-pharmacological methods to address cognitive impairment. This project specifically aims to unravel the behavioral and neural mechanisms underlying impairment in cognitive control in healthy older adults and stroke survivors. To achieve this, four objectives are formulated. First, we will pinpoint which, when and how cognitive control functions decline in healthy aging. Second, we will expose neural markers and networks of cognitive control decline in healthy aging. Third, we will unravel patterns of stroke-induced cognitive control deficits in stroke survivors. Fourth, we will map the lesion neuroanatomy of post-stroke cognitive control deficits at the network level. 
To reach these objectives, we propose a multi-method approach, combining cross-sectional, longitudinal and patient studies, behavioral and neuroimaging (EEG and fMRI) techniques, and advanced statistical tools (structural equation modelling, network and connectivity analyses). Ultimately, this project will provide the basis for developing new, non-pharmacological intervention programs to delay, decelerate or decrease cognitive control impairment." "Dynizer Query Language - DQL: The development of a method for operational and analytical querying of NoSQL data based on semantic connections" "Guy De Tré" "Department of Telecommunications and information processing" "The aim of this project is to develop a high-performance method for expressive and intuitive querying of data within the meta-structure of the in-house developed NoSQL database management system, Dynizer. Whereas querying currently requires technical expertise for custom searches of complex data structures using three basic queries, the query facility will be extended with a generic query language that allows searches based on the semantic connections between data elements, as well as aggregation functions on groups of row-based data elements for simple data analysis applications. Furthermore, a method will be developed to enrich unstructured data with a (queryable) Dynizer meta-structure and to process it efficiently in a distributed manner." "Innovative Pricing and reserving in non-life insurance." "Gerda Claeskens" "Insurance Research Group, Operations Research and Statistics Research Group (ORSTAT) (main work address Leuven)" "Today's society generates data more rapidly than ever before, creating many opportunities as well as challenges for statisticians. Many industries are becoming increasingly dependent on high-quality data, and the demand for sound statistical analysis of these data is rising accordingly. In the insurance sector, data have always played a major role. When selling a contract to a client, the insurance company is liable for the claims arising from this contract and will hold capital aside to meet these future liabilities. As such, the insurance premium has to be paid before the real costs are known. This is referred to as the inversion of the production cycle. It implies that the activities of pricing and reserving are strongly interconnected in actuarial practice. On the one hand, pricing actuaries have to determine a fair price for the insurance products they want to sell. Setting the premium levels charged to the insureds is done in a data-driven way in which statistical models are essential. Risk-based pricing is crucial in a competitive and well-functioning insurance market. On the other hand, an insurance company must safeguard its solvency and reserve capital to fulfill outstanding liabilities. Reserving actuaries thus must predict, with maximum accuracy, the total amount needed to pay the claims that the insurer has legally committed itself to cover. These reserves form the main item on the liability side of the balance sheet of the insurance company and therefore have an important economic impact. The ambition of this research is the development of new, accurate predictive models for the insurance work field. The overall objective is to improve actuarial practices for pricing and reserving by using sound and flexible statistical methods shaped for the actuarial data at hand. 
This thesis focusses on three related research avenues in the domain of non-life insurance: (1) flexible univariate and multivariate loss modeling in the presence of censoring and truncation, (2) car insurance pricing using telematics data, and (3) claims reserving using micro-level data. After an introductory chapter, we study mixtures of Erlang distributions with a common scale parameter in Chapter 2. These distributions form a very versatile, yet analytically tractable, class of distributions, making them suitable for loss modeling purposes. We develop a parameter estimation procedure using the EM algorithm that is able to fit a mixture of Erlang distributions under censoring and truncation, which are omnipresent in an actuarial context. Chapter 3 extends the estimation procedure to multivariate mixtures of Erlang distributions. This multivariate distribution generalizes the univariate mixture of Erlang distributions while preserving its flexibility and analytical tractability. When modeling multivariate insurance losses or dependent risks from different portfolios or lines of business, the inherent shape versatility of multivariate mixtures of Erlangs allows one to adequately capture both the marginals and the dependence structure. Moreover, its desirable analytical properties are particularly convenient in a wide variety of insurance-related modelling situations. In Chapter 4 we explore the vast potential of telematics insurance from a statistical point of view. We analyze a unique Belgian portfolio of young drivers who signed up for a telematics product. Through telematics technology, driving behavior data were collected between 2010 and 2014 on when, where and for how long the insured car is being used. The aim of our contribution is to develop the statistical methodology to incorporate this telematics information in statistical rating models, where we focus on predicting the number of claims, in order to adequately set premium levels based on individual policyholders' driving habits. Chapter 5 presents a new technique to predict the number of incurred but not reported claims. Due to time delays between the occurrence of the insured event and the notification of the claim to the insurer, not all of the claims that occurred in the past have been observed when the reserve needs to be calculated. We propose a flexible regression framework to model and jointly estimate the occurrence and reporting of claims on a daily basis. The last chapter concludes our work by presenting several suggestions for future research related to the topics covered." "ANUBIS: Aligned oNline and multilevel User and entity Behavior" "Wouter Verbeke" "Information Systems Engineering Research Group (LIRIS) (main work address Leuven), Statistics and Data Science" "Fraud is a fierce threat to digital business. A typical organization is estimated to lose 5% of its revenues due to fraud, which is hard to eradicate since it is dynamic, system-dependent and organization-specific. Powerful and intelligent fraud detection systems are therefore of crucial importance to block, prevent and contain fraud in a timely manner and to mitigate losses. 
User and entity behavior analytics essentially profile the activity of users, peer groups and other entities such as devices, applications and networks, with the aim of detecting anomalous patterns that are indicative of security threats such as fraud. In this research project, we will improve the adaptiveness and detection power of user and entity behavior analytics by aligning the objective of these approaches when learning from data with the business objective of minimizing fraud losses, instead of maximizing performance from a statistical perspective. For this purpose, we will leverage and advance upon profit-driven analytics and cost-sensitive ensemble learning approaches. Additionally, we will extend these approaches to accommodate online and multilevel learning from streaming data from across systems and applications. The developed approaches will be empirically evaluated on available data sets and benchmarked against state-of-the-art approaches."
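As a purely illustrative sketch of the kind of online per-user behaviour profiling described above, and not the project's actual method, the R code below maintains for each user an exponentially weighted running mean and variance of a single activity feature and flags events that deviate strongly from that user's own history; all names, the smoothing constant and the flagging threshold are assumptions made for this example.

# Hedged sketch: online per-user behaviour profile with exponentially weighted updates.
update_profile <- function(profile, x, lambda = 0.05) {
  if (is.null(profile)) {
    return(list(mean = x, var = x^2))   # cold start for a new user (assumed weak prior spread)
  }
  d <- x - profile$mean
  list(mean = profile$mean + lambda * d,
       var  = (1 - lambda) * (profile$var + lambda * d^2))
}

anomaly_score <- function(profile, x) {
  if (is.null(profile)) return(0)
  abs(x - profile$mean) / sqrt(profile$var + 1e-8)
}

# Simulated event stream: transaction amounts per user, with one injected anomaly.
set.seed(42)
stream <- data.frame(user   = sample(c("u1", "u2", "u3"), 300, replace = TRUE),
                     amount = rlnorm(300, meanlog = 3, sdlog = 0.4))
stream$amount[250] <- 500               # unusually large transfer for any of these users

profiles <- list()
scores   <- numeric(nrow(stream))
for (i in seq_len(nrow(stream))) {
  u <- as.character(stream$user[i])
  scores[i]     <- anomaly_score(profiles[[u]], stream$amount[i])
  profiles[[u]] <- update_profile(profiles[[u]], stream$amount[i])
}

which(scores > 5)                       # events flagged as unusual for their own user's profile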