
Publication

Designing Anomaly Detection Algorithms that Exploit Flexible Supervision

Book - Dissertation

Anomaly detection is the task of identifying observations in a dataset that do not conform to the expected behavior. It is a crucial data mining task because, in the real world, anomalous observations often correspond to real costs: for example, a machine that breaks, a fraudulent credit card transaction, or a patient experiencing irregular heart rhythms. With the advent of big data, manually sifting through millions of observations to detect the anomalies has become intractable: it is too time-consuming and costly. Instead, we need to design algorithms that automate this job for us. Most existing anomaly detection algorithms work in a purely data-driven manner, meaning that they only look at the raw data to identify the anomalies. This requires making upfront assumptions about what may constitute anomalous behavior, such as that anomalies are infrequent or substantially different from the normal observations. In many real-world applications, however, the anomalies do not conform to these assumptions. For example, some normal behaviors occur less frequently than the anomalous behaviors, such as a machine maintenance operation carried out only once in a while. The disconnect between the assumptions and reality results in a mismatch between what the detection algorithm predicts to be an anomaly and what is actually an anomaly. How can we design anomaly detection algorithms that can deal with such adverse effects? The hypothesis of this dissertation is that anomaly detection algorithms would derive considerable benefit from exploiting flexible expert supervision. Flexible supervision refers to all forms of knowledge a domain expert has about a real-world application. By integrating this knowledge into the detection algorithm, we could improve its performance. For example, the expert could correct the algorithm so that it stops flagging routine maintenance operations as anomalous.
Currently, a handful of anomaly detection algorithms exist that can exploit simple binary label information (an observation is anomalous or normal) obtained from the expert. However, flexible supervision goes substantially beyond this classic binary label format. This dissertation makes three scientific contributions, each related to exploiting flexible supervision in anomaly detection. The first contribution is an anomaly detection algorithm that exploits the knowledge the expert has about sporadically recurring patterns in the data, such as maintenance operations. If such a pattern does not occur when it is expected to, it gives rise to an absent pattern anomaly. In contrast to regular anomalies, absent pattern anomalies are identified by the absence of normal behavior, not by the presence of anomalous behavior. We introduce an algorithm that exploits a limited set of annotated occurrences of a pattern, provided by the expert, to detect its suspicious absences. The second contribution is an anomaly detection algorithm that exploits the knowledge contained in event logs. In an event log, the expert keeps track of all events that could be instrumental in identifying patterns in the data. We develop an algorithm that exploits the information contained in both event logs and continuous time series data (e.g., water consumption measurements in a retail store over time) to detect periods of anomalous behavior in the time series data (e.g., leaks or spills in the store). The final contribution of this dissertation focuses on optimally exploiting the binary label information obtained either from the expert or available for a related dataset. For many real-world problems, we have multiple, related datasets at our disposal, for example when we are collecting data for multiple machines. The expert provides label information for only a subset of these datasets.
We design a label propagation algorithm for augmenting the anomaly score of any unsupervised anomaly detector with binary label information obtained from the expert through an active learning strategy. The use of an unsupervised anomaly detector allows one to derive an initial anomaly score for each instance in a dataset, while the label propagation can correct this initial score in a model-agnostic manner to better reflect the expert knowledge about anomalous or normal behavior. Finally, we introduce several algorithms for transferring label information between two different, yet related datasets. These algorithms compare the data distributions of the two datasets to determine whether they are similar, which would justify transferring label information.
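The idea of correcting an unsupervised anomaly score with a few expert labels can be illustrated with a minimal sketch. This is not the algorithm from the dissertation: the function name `propagate_labels`, the Gaussian distance weighting, and the `bandwidth` parameter are assumptions chosen for illustration only. The initial scores are assumed to lie in [0, 1] and could come from any unsupervised detector.

```python
import math


def propagate_labels(points, scores, labels, bandwidth=1.0):
    """Blend initial anomaly scores with nearby expert labels (illustrative only).

    points:    list of feature tuples, one per instance
    scores:    initial anomaly scores in [0, 1] from any unsupervised detector
    labels:    dict mapping instance index -> 1 (anomalous) or 0 (normal)
    bandwidth: hypothetical parameter controlling how far a label spreads
    """
    corrected = []
    for i, x in enumerate(points):
        if i in labels:
            # A labeled instance simply takes the expert's label.
            corrected.append(float(labels[i]))
            continue
        weighted_sum = 0.0
        total_weight = 0.0
        for j, y in labels.items():
            dist_sq = sum((a - b) ** 2 for a, b in zip(x, points[j]))
            w = math.exp(-dist_sq / (2 * bandwidth ** 2))
            weighted_sum += w * y
            total_weight += w
        if total_weight == 0.0:
            # No label is close enough to matter: keep the initial score.
            corrected.append(scores[i])
            continue
        # The closer the labels, the more they override the initial score.
        influence = total_weight / (1.0 + total_weight)
        label_estimate = weighted_sum / total_weight
        corrected.append((1 - influence) * scores[i] + influence * label_estimate)
    return corrected


# Toy data: instance 0 is flagged by the detector, but the expert marks it
# as normal (for example, a routine maintenance operation).
points = [(0.0,), (0.1,), (5.0,), (5.1,)]
scores = [0.9, 0.8, 0.2, 0.3]
labels = {0: 0}
corrected = propagate_labels(points, scores, labels)
```

In this toy example the score of instance 1, which lies close to the labeled instance, is pulled down toward "normal", while the scores of the distant instances 2 and 3 stay essentially unchanged. The correction only uses the scores and the labels, so it does not depend on which detector produced the initial scores.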
Year of publication: 2020
Accessibility: Open