Publication

Mining cohesive patterns in sequences and extreme multi-label classification

Book - Dissertation

Finding patterns in long event sequences is an important data mining task. In the past, research focused on finding all frequent patterns, where the anti-monotonic property of frequency was used to design efficient algorithms. More recently, research has focused on producing a smaller output containing only the most interesting patterns. In this thesis, we discover patterns using cohesion and quantile-based cohesion. Cohesion measures how close the items making up the pattern are on average. Quantile-based cohesion measures the proportion of pattern occurrences that are cohesive. Neither measure is anti-monotonic, and we tackle this by developing an upper bound to prune the search space. Experiments show that our method efficiently discovers important patterns that existing state-of-the-art methods fail to discover.

In the second part of this thesis, we focus on multi-label classification, which is important in applications such as text categorisation, scene classification and bioinformatics. In machine learning, multi-label classification is the problem of identifying a set of labels for a new instance, based on a training database of labelled instances. Traditionally, methods learn a separate model for each label; however, this is not feasible for datasets with millions of labels. We propose a new algorithm that predicts labels using a linear ensemble of instance- and feature-based nearest neighbours. We tackle the problem of computing cosine similarity and similarity-weighted predictions on large datasets using an inverted index and sparse optimisation. In addition, we propose a new top-k query with pruning based on a partition of the training database. Experiments show that our method is more accurate and orders of magnitude faster than state-of-the-art methods, and requires less than 20 ms per instance to predict labels for extreme datasets consisting of hundreds of thousands of labels, without the need for expensive hardware.
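
As a rough illustration of the general idea behind computing cosine similarity with an inverted index, the following minimal Python sketch scores a sparse query against sparse training rows by visiting only the rows that share at least one feature with the query. It is not the algorithm from the thesis: the function names `build_inverted_index` and `top_k_cosine` and the dictionary-based row format are assumptions made for this example, and the partition-based pruning mentioned in the abstract is not implemented.

```python
import math
from collections import defaultdict


def build_inverted_index(rows):
    """Map each feature to the (row_id, L2-normalised weight) pairs that contain it.

    `rows` is a list of sparse vectors, each a dict {feature: weight}.
    (Hypothetical helper, for illustration only.)
    """
    index = defaultdict(list)
    for row_id, row in enumerate(rows):
        norm = math.sqrt(sum(w * w for w in row.values()))
        if norm == 0.0:
            continue  # an all-zero row has no defined cosine similarity
        for feature, w in row.items():
            index[feature].append((row_id, w / norm))
    return index


def top_k_cosine(query, index, k):
    """Return the k rows most similar to `query` as (row_id, cosine) pairs.

    Only rows sharing a feature with the query are touched, which is
    what makes the inverted index pay off on sparse data.
    """
    norm = math.sqrt(sum(w * w for w in query.values()))
    if norm == 0.0:
        return []
    scores = defaultdict(float)
    for feature, w in query.items():
        for row_id, w_norm in index.get(feature, ()):
            scores[row_id] += (w / norm) * w_norm
    # Sort by descending similarity, breaking ties by row id.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
```

For example, with training rows `[{"a": 1.0}, {"a": 1.0, "b": 1.0}, {"c": 2.0}]` and the query `{"a": 1.0}`, the third row is never visited because it shares no feature with the query.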
Number of pages: 153
Year of publication: 2020
Keywords: Doctoral thesis
Accessibility: Open