Publication

Clustering social sciences and humanities publications

Book Contribution - Book Abstract Conference Contribution

Subtitle: Can word and document embeddings improve cluster quality?
We study how different document representation techniques affect the outcomes of clustering based on textual information. Our dataset consists of titles and abstracts for 15,907 publications from the social sciences and humanities. We compare established document representation techniques such as TF-IDF and Latent Semantic Indexing with word and document embedding techniques (Word2Vec and Doc2Vec) in terms of the quality of the clustering outcomes. Quality is assessed by calculating silhouette scores and the average textual coherence of the clusters. The results show that word and document embeddings are promising feature engineering techniques in the context of clustering social sciences and humanities publications. The average of Word2Vec embeddings works best for identifying textually coherent clusters.
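To make the evaluation idea concrete, the following is a minimal sketch of how TF-IDF document vectors and a silhouette score can be computed. It is not the authors' implementation: the abstract does not name the clustering algorithm, the distance measure, or any preprocessing, so the cosine distance, the smoothed IDF formula, the toy documents, and the hand-assigned cluster labels below are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per tokenized document.

    Assumption: smoothed IDF, log((1 + N) / (1 + df)) + 1, so every weight is positive.
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine_distance(a, b):
    """Cosine distance between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def silhouette_score(vectors, labels, distance=cosine_distance):
    """Mean silhouette coefficient over all documents (requires >= 2 clusters)."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # a: mean distance to the other members of the document's own cluster
        a = mean(distance(vectors[i], vectors[j]) for j in own)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            mean(distance(vectors[i], vectors[j]) for j in members)
            for other, members in clusters.items()
            if other != label
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return mean(scores)


# Toy titles (invented for illustration, not taken from the study's dataset).
toy_docs = [
    "survey of voter turnout in elections".split(),
    "voter turnout and elections in europe".split(),
    "medieval manuscripts and latin poetry".split(),
    "latin poetry in medieval manuscripts".split(),
]
toy_vectors = tfidf_vectors(toy_docs)
print(silhouette_score(toy_vectors, [0, 0, 1, 1]))  # topically consistent grouping
print(silhouette_score(toy_vectors, [0, 1, 0, 1]))  # topically mixed grouping
```

A grouping that keeps topically similar texts together should score higher than one that mixes them, which is the sense in which silhouette scores can be used to compare document representations. The same scoring function could be applied to dense vectors (for example averaged word embeddings) by passing a different distance function.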
Book: 18th international conference on scientometrics & informetrics (ISSI2021), virtual event, 12-15 July, 2021, KU Leuven, Belgium / Glänzel, Wolfgang [edit.]; et al.
Pages: 369 - 374
Publication year: 2021
Keywords: P1 Proceeding
Accessibility: Closed