
Publication

COBRAS+: Reusing Previously Obtained Constraints in Active Semi-Supervised Clustering

Book contribution - Book chapter, conference contribution

Clustering is inherently subjective: different clusterings of the same dataset may be desired in different applications. Semi-supervised clustering relies on partial ground-truth information to obtain a clustering of interest. In the active setting, the clustering algorithm selects a query to maximize the return in each iteration. In particular, we consider pairwise queries that determine whether two given instances should be in the same cluster or in different clusters, which avoids labeling individual instances. The longer the algorithm runs, the more constraints it obtains; hence, a comprehensive set of previously obtained constraints accumulates in the long run. Unlike existing algorithms, our approach, named COBRAS+, exploits previously obtained constraints extensively to avoid asking redundant queries and to decrease the number of queries needed to obtain the same clustering quality. To avoid overusing previously obtained constraints, which may mislead the clustering, we define a dissimilarity measure between constraints so that we rely on existing constraints only when they are sufficiently similar to the ideal constraint we seek. We demonstrate that our approach provides a more accurate clustering with the same number of queries, or requires at least 15% fewer queries to achieve the same clustering quality. Moreover, we apply this concept to an incremental learning problem where the active clustering algorithm starts with constraints that were previously obtained from a subset of the available data. We observe the same type of improvement whether or not this subset is representative of the entire dataset, because our approach can utilize externally provided information in the form of pairwise queries.
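The constraint-reuse idea in the abstract can be illustrated with a minimal sketch. This is not the COBRAS+ implementation: the abstract does not define its dissimilarity measure, so `pair_dissimilarity` (a sum of Euclidean endpoint distances), the `threshold` parameter, and the `oracle` callback are all hypothetical names and choices made only for illustration.

```python
import math


def pair_dissimilarity(pair_a, pair_b):
    """Hypothetical dissimilarity between two instance pairs.

    Sums the Euclidean distances between matched endpoints and takes the
    cheaper of the two possible matchings, since a pair is unordered.
    """
    (a1, a2), (b1, b2) = pair_a, pair_b
    direct = math.dist(a1, b1) + math.dist(a2, b2)
    swapped = math.dist(a1, b2) + math.dist(a2, b1)
    return min(direct, swapped)


def answer_query(query, known, threshold, oracle):
    """Answer a pairwise query, reusing a stored constraint when possible.

    `known` is a list of (pair, must_link) tuples. If the closest stored
    constraint is within `threshold`, its answer is reused; otherwise the
    oracle (for example, a human annotator) is asked and the new
    constraint is stored. Returns (must_link, reused).
    """
    best = min(known, key=lambda c: pair_dissimilarity(query, c[0]), default=None)
    if best is not None and pair_dissimilarity(query, best[0]) <= threshold:
        return best[1], True
    label = oracle(query)
    known.append((query, label))
    return label, False
```

In this sketch a small `threshold` asks the oracle more often, while a large one reuses stored answers more aggressively, which mirrors the trade-off the abstract describes between saving queries and being misled by dissimilar constraints.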
Book: Proceedings of the 33rd Benelux Conference on Artificial Intelligence and 30th Belgian-Dutch Conference on Machine Learning (BNAIC/BeneLearn 2021)
Pages: 184 - 202
Number of pages: 19
Year of publication: 2021
Accessibility: Open