Representation Learning for Automated Document Classification
Book - Dissertation
Classification of text documents into a hierarchical taxonomy of categories is a challenging task in a multitude of subject domains. With increasing digitalization, the need for high-performing automatic classification systems is growing every day. In this dissertation, contributions are presented that take up these challenges in the fields of online news articles and electronic healthcare documents, respectively.

First, a method is introduced for obtaining better news image representations, or embeddings, by exploiting naturally co-occurring text. It is shown that the newly obtained text-enriched image representations improve the image classification process, especially in a setting with a limited amount of training data.

Second, we survey the literature on deep learning methods for classification of medical documents, and implement and compare the best-performing models. The survey shows that a combination of convolutional operations and a per-label attention mechanism yields the best results overall. Furthermore, in a setting with a more limited amount of training data, hierarchical variants of such models tend to improve the classification process.

Continuing the work of the literature survey, the third contribution investigates the use of post-processing heuristics. These heuristics re-evaluate the model's classification values based on the values of categories that are close in the target ICD taxonomy. This leads to improved classification results in a setting with a large target space compared to the size of the training dataset.

The last contribution applies continual learning approaches to the best-performing models for hierarchical classification. This way, the model can swiftly adapt to user feedback without having to be retrained in full. An additional focus is to guarantee a minimal loss of already acquired knowledge in this online learning setting.