
Project

Statistical Tools for Anomaly Detection and Fraud Analytics

Data is one of the most valuable resources businesses have today. Companies and institutions increasingly invest in tools and platforms to collect and store data about every event that impacts their business, such as their customers, transactions, products, and the market in which they operate. Although the costs of maintaining this huge, expanding volume of data are often considerable, companies are willing to make the investment because it serves their ambition of extracting valuable information from their data. As a result, companies increasingly rely on data-driven techniques to develop powerful predictive models that aid their decision process. These models, however, are often not well aligned with the core business objective of maximizing profit or minimizing financial losses, in the sense that the models fail to take into account the costs and benefits associated with their predictions. In this thesis, we propose new methods for developing models that incorporate costs and gains directly into the model construction process.

The first method, called ProfTree (Höppner et al., 2018), builds a profit-driven decision tree for predicting customer churn. The expected maximum profit measure for customer churn (EMPC) was recently proposed as a means of selecting the most profitable churn model (Verbraken et al., 2013). ProfTree integrates the EMPC metric directly into the model construction and uses an evolutionary algorithm to learn profit-driven decision trees.

The second and third methods, called cslogit and csboost, are approaches for learning a model when the misclassification costs vary between instances. Based on the instance-dependent cost matrix for transfer fraud detection, we derive an instance-dependent threshold that allows the optimal cost-based decision to be made for each transaction. The two novel classifiers, cslogit and csboost, are based on lasso-regularized logistic regression and gradient tree boosting, respectively, and directly minimize the proposed instance-dependent cost measure when learning a classification model.
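The intuition behind an instance-dependent threshold can be illustrated with a small sketch. This is written in Python rather than the thesis's R packages, and the cost names (`c_fp` for the fixed cost of investigating a legitimate transfer, `c_fn` for the loss of missing a fraudulent one, e.g. the transferred amount) are illustrative assumptions, not the thesis's notation. With zero cost for correct decisions, flagging a transaction is cost-optimal exactly when its fraud probability exceeds c_fp / (c_fp + c_fn), which differs per transaction:

```python
def cost_based_decision(p_fraud, c_fp, c_fn):
    """Flag a transaction as fraud when the expected cost of flagging,
    (1 - p_fraud) * c_fp, is at most the expected cost of not flagging,
    p_fraud * c_fn. Rearranging gives the instance-dependent threshold
    t = c_fp / (c_fp + c_fn)."""
    threshold = c_fp / (c_fp + c_fn)
    return p_fraud >= threshold

# A large transfer is flagged at a much lower fraud probability
# than a small one, given the same fixed investigation cost:
print(cost_based_decision(p_fraud=0.10, c_fp=10.0, c_fn=5000.0))  # True
print(cost_based_decision(p_fraud=0.10, c_fp=10.0, c_fn=20.0))    # False
```

Because the threshold depends on the individual cost matrix, a classifier trained to minimize average misclassification error is generally suboptimal here, which motivates minimizing the instance-dependent cost measure directly as cslogit and csboost do.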

A major challenge when trying to detect fraud is that the fraudulent activities form a minority class which makes up a very small proportion of the data set, often less than 0.5%. Detecting fraud in such a highly imbalanced data set typically leads to predictions that favor the majority class, causing fraud to remain undetected. The third contribution in this thesis is an oversampling technique, called robROSE, that addresses the problem of imbalanced data by creating synthetic samples that mimic the minority class while ignoring anomalies that could distort the detection algorithm and spoil the resulting analysis.
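A greatly simplified sketch conveys the idea of outlier-aware oversampling. This is not the robROSE algorithm (which relies on robust covariance estimation and kernel smoothing) and it is in Python rather than R; it merely shows the two ingredients: first discard minority points that are themselves anomalous, then generate synthetic points around the remaining clean bulk:

```python
import numpy as np

def simple_robust_oversample(X_min, n_new, z_cut=3.5, noise_scale=0.1, rng=None):
    """Toy illustration of outlier-aware oversampling:
    1. discard minority points that are outlying per coordinate,
       using robust z-scores based on the median and the MAD,
    2. create synthetic points by jittering the remaining ones.
    The real robROSE method instead uses a robust covariance
    estimate and kernel smoothing; this is only a sketch."""
    rng = np.random.default_rng(rng)
    med = np.median(X_min, axis=0)
    mad = np.median(np.abs(X_min - med), axis=0) * 1.4826 + 1e-12
    z = np.abs(X_min - med) / mad
    clean = X_min[(z < z_cut).all(axis=1)]          # drop anomalous minority points
    base = clean[rng.integers(0, len(clean), size=n_new)]
    return base + rng.normal(scale=noise_scale * mad, size=base.shape)

# Minority class with one gross anomaly at (100, 100):
rng = np.random.default_rng(0)
X_min = np.vstack([rng.normal(size=(50, 2)), [[100.0, 100.0]]])
synth = simple_robust_oversample(X_min, n_new=200, rng=0)
# The synthetic samples stay near the bulk, not near the anomaly.
```

Without the filtering step, synthetic samples would also be generated around the anomaly, amplifying exactly the contamination that a robust method should resist.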

Besides using methods for making data-driven decisions, businesses often take advantage of statistical techniques to detect anomalies in their data with the goal of discovering new insights. However, the mere detection of an anomalous case does not always answer all questions associated with that data point. In particular, once an outlier is detected, the scientific question of why the case has been flagged as an outlier becomes of interest.

In this thesis, we propose a fast and efficient method, called SPADIMO (Debruyne et al., 2019), to detect the variables that contribute most to an outlier’s abnormal behavior. The method thereby helps to understand in which way an outlier lies out.
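The kind of question SPADIMO answers can be illustrated with a toy diagnostic in Python. This is not the SPADIMO algorithm, which searches for a sparse direction in which the case is maximally outlying; this univariate screen, assumed here purely for illustration, only ranks the variables of a flagged case by robust z-scores:

```python
import numpy as np

def variable_contributions(X, idx):
    """Toy diagnostic: rank the variables of case `idx` by how far
    its value lies from the bulk, in robust z-scores computed from
    the per-variable median and MAD. SPADIMO instead finds a sparse
    direction of maximal outlyingness, capturing multivariate
    effects this univariate screen cannot see."""
    med = np.median(X, axis=0)
    mad = np.median(np.abs(X - med), axis=0) * 1.4826 + 1e-12
    z = np.abs(X[idx] - med) / mad
    order = np.argsort(z)[::-1]          # most suspicious variable first
    return order, z

# Example: case 0 is unremarkable in variables 0 and 1
# but extreme in variable 2.
X = np.random.default_rng(1).normal(size=(100, 3))
X[0, 2] = 15.0
order, z = variable_contributions(X, idx=0)
print(order[0])  # 2
```

A multivariate outlier can have every coordinate look ordinary in isolation, which is why the actual method works with directions rather than single variables.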

The SPADIMO algorithm allows us to introduce the cellwise robust M regression estimator (Filzmoser et al., 2020) as the first linear regression estimator of its kind that intrinsically yields both a map of cellwise outliers consistent with the linear model, and a vector of regression coefficients that is robust against outliers. As a by-product, the method yields a weighted and imputed data set that contains estimates of what the values in the outlying cells would have been had they fit the model.

All introduced algorithms are implemented in R and are included in their respective R packages together with supporting functions and supplementary documentation on the usage of the algorithms. These R packages are publicly available on CRAN and at github.com/SebastiaanHoppner.

Date: 1 Sep 2016 → 7 Sep 2020
Keywords: Statistics
Disciplines: Applied mathematics in specific fields, Computer architecture and networks, Distributed computing, Information sciences, Information systems, Programming languages, Scientific computing, Theoretical computer science, Visual computing, Other information and computing sciences, Statistics and numerical methods
Project type: PhD project