< Back to previous page

Publication

Interactive Semi-Supervised Clustering

Book - Dissertation

One of the key tools to gain knowledge from data is clustering: identifying groups of instances that are highly similar. In the traditional unsupervised clustering process, a practitioner needs to select a suitable distance measure, clustering algorithm, and corresponding hyperparameter settings to arrive at a satisfactory clustering. Making the right choices for these components is hard, and as a result a practitioner often performs many iterations in search of a clustering that matches their interests. The goal of this thesis is to develop methods that make it easier to find a good clustering. Most of our contributions are in the setting of semi-supervised clustering. Semi-supervised methods allow the user to guide the clustering process through direct feedback, rather than by tedious tweaking of the components of the clustering pipeline. We consider feedback in the form of pairwise constraints, which specify whether two instances should be in the same cluster or not. We present five main contributions. We first investigate the use of internal quality measures for unsupervised algorithm and hyperparameter selection in clustering. We identify several important limitations of these measures, and conclude that they are not suited for this purpose. In our second contribution, we examine the use of constraint set consistency, a measure that captures the utility of constraint sets, for selecting a semi-supervised algorithm. Our results show that constraint set consistency cannot be used within the scope of individual datasets, and consequently cannot be used for algorithm selection. In our third contribution, we use constraints to select and tune an unsupervised clustering algorithm. We find our simple strategy to be highly effective, outperforming existing semi-supervised clustering methods that operate within the bounds of a single algorithm. In our fourth contribution, we introduce the concept of super-instances, and two concrete methods that exploit them: COBRA and COBRAS. The latter is the first semi-supervised clustering method that allows for truly interactive and iterative clustering with pairwise constraints. Finally, in our fifth contribution, we show that COBRAS can easily be adapted to take label queries into account and that this can result in better clusterings, especially when the number of clusters is large.
Publication year:2018
Accessibility:Open