
Project

Data-Efficient Methods for Natural Language Processing: Applications in Healthcare

Natural language processing is the study of processing language data to perform human language-related tasks. With the advance of machine learning models such as deep neural networks, natural language processing technologies have been applied to many use cases, including document classification, sentiment analysis, and information extraction. Deep neural networks learn to perform a target task from data, without human intervention at inference time. However, their power comes at the cost of large labelled training datasets, which require a great deal of human labour.

In this dissertation, we investigate and propose data-efficient algorithms for training neural network-based natural language processing models for healthcare applications, where data and labels are scarce. Our four main contributions demonstrate how data-efficient methods that maximise the utility of labelled and unlabelled data and exploit existing knowledge can be used to train neural network-based natural language processing models in data- and label-scarce settings.


Firstly, we present a data-efficient method that combines a data augmentation technique with a semi-supervised learning approach for a setting in which there is a small labelled dataset and a relatively large unlabelled dataset. The data augmentation method applies text-editing operations to input texts, and the semi-supervised learning method uses a trained model's predictions as pseudo-labels. We evaluate our method on a custom dataset containing user complaints about their sleep and analyse the effect of the proposed method.
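To make the two components concrete, the following is a minimal sketch in Python, assuming EDA-style random deletion and swap operations and a confidence-thresholded pseudo-labelling step; the operation set, the threshold, and the scikit-learn-style classifier interface (predict_proba) are illustrative assumptions rather than the dissertation's exact configuration.

    import random

    def augment(text, p_delete=0.1, n_swaps=1):
        # Simple text-editing operations: random word deletion and random word swaps.
        words = text.split()
        kept = [w for w in words if random.random() > p_delete] or words  # never delete everything
        for _ in range(n_swaps):
            if len(kept) > 1:
                i, j = random.sample(range(len(kept)), 2)
                kept[i], kept[j] = kept[j], kept[i]
        return " ".join(kept)

    def pseudo_label(model, unlabelled_texts, threshold=0.9):
        # Keep only the unlabelled examples the current model predicts confidently,
        # pairing an augmented copy of the text with the predicted (pseudo) label.
        pseudo = []
        for text in unlabelled_texts:
            probs = model.predict_proba([text])[0]  # assumed scikit-learn-style text classifier
            if probs.max() >= threshold:
                pseudo.append((augment(text), int(probs.argmax())))
        return pseudo

In a full pipeline, the pseudo-labelled, augmented examples would be added to the labelled set and the model retrained.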


Secondly, we focus on active learning methods, particularly pool-based active learning, in which there is a relatively large amount of unlabelled data and a small amount of labelled data at the start, and a fixed number of data points is iteratively labelled and added to the labelled set. We first analyse the limitations of existing active learning methods and propose a label-efficient training method that mitigates them. The proposed method combines the strengths of self-supervised learning, data augmentation, and active learning to fully utilise both unlabelled and labelled data. We evaluate our method on our custom dataset and a benchmark dataset and find that the proposed method outperforms existing state-of-the-art methods.
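As an illustration of the pool-based setting described above, the sketch below implements only the generic labelling loop with a least-confidence acquisition function; the self-supervised and augmentation components of the proposed method are omitted, and the oracle callable standing in for the human annotator is a hypothetical placeholder.

    import numpy as np

    def active_learning_loop(model, x_labelled, y_labelled, x_pool, oracle,
                             rounds=5, batch_size=50):
        # Generic pool-based active learning with least-confidence acquisition.
        # `oracle` stands in for the human annotator: it returns labels for raw examples.
        x_labelled, y_labelled, x_pool = list(x_labelled), list(y_labelled), list(x_pool)
        for _ in range(rounds):
            model.fit(x_labelled, y_labelled)            # retrain on the current labelled set
            probs = model.predict_proba(x_pool)          # class probabilities for the pool
            uncertainty = 1.0 - probs.max(axis=1)        # least-confidence score
            query_idx = set(np.argsort(-uncertainty)[:batch_size].tolist())
            queried = [x_pool[i] for i in query_idx]
            x_labelled += queried                        # add newly annotated points
            y_labelled += list(oracle(queried))
            x_pool = [x for i, x in enumerate(x_pool) if i not in query_idx]
            if not x_pool:
                break
        return model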


Thirdly, we study how to add numeracy skills to a language model by using synthetic data for a temporal information extraction task. We propose a rule-based synthetic data generation method that can increase the size of the training data, and a novel multi-task model architecture that can extract temporal expressions and normalise them into standard formats. We evaluate our methods on a custom dataset containing free-text sleep diaries. We find that multi-task learning with an auxiliary task related to the target task can improve performance on the target task when training on synthetic data.
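A minimal sketch of what rule-based synthetic data generation for temporal expressions can look like is given below; the sentence templates and the TIMEX-style normalised formats are invented for illustration and are not the rules used in the dissertation.

    import random

    # Templates pairing a surface temporal expression with a normalised value.
    TEMPLATES = [
        ("I went to bed at {h}:{m:02d} pm", "T{h24:02d}:{m:02d}"),
        ("I woke up around {h} am",         "T{h:02d}:00"),
        ("I slept for about {dur} hours",   "PT{dur}H"),
    ]

    def generate_example():
        # Sample one template and fill it with random, mutually consistent values.
        surface_tpl, norm_tpl = random.choice(TEMPLATES)
        h = random.randint(1, 11)            # 12 o'clock edge cases omitted for brevity
        m = random.randint(0, 59)
        dur = random.randint(1, 12)
        values = {"h": h, "m": m, "dur": dur, "h24": h + 12}
        return {"text": surface_tpl.format(**values),
                "normalised": norm_tpl.format(**values)}

    # e.g. {'text': 'I slept for about 7 hours', 'normalised': 'PT7H'}
    synthetic_training_set = [generate_example() for _ in range(1000)]

Each generated sentence comes paired with its normalised value, so the same synthetic example can supervise both the extraction task and the normalisation task in a multi-task setup.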


Lastly, we investigate the opportunities for applying the data-efficient methods to a clinical NLP application and discuss the important problem of bias. We first study the underlying bias in a public benchmark dataset and analyse the effect of this bias on the model's behaviour. We find that the benchmark-trained model performs differently across demographic groups because the benchmark dataset is imbalanced. We then propose novel approaches to mitigate this problem. We evaluate our methods on the clinical benchmark dataset and show that the proposed approach achieves better fairness scores in terms of equal performance across demographic groups.
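The equal-performance notion of fairness used here can be illustrated with a small sketch that breaks a model's accuracy down by demographic group and reports the largest gap between groups; the group labels and the choice of accuracy as the metric are illustrative assumptions, not the dissertation's evaluation protocol.

    from collections import defaultdict

    def per_group_accuracy(y_true, y_pred, groups):
        # Accuracy broken down by demographic group, plus the largest gap between groups.
        correct, total = defaultdict(int), defaultdict(int)
        for t, p, g in zip(y_true, y_pred, groups):
            total[g] += 1
            correct[g] += int(t == p)
        accuracy = {g: correct[g] / total[g] for g in total}
        gap = max(accuracy.values()) - min(accuracy.values())  # equal-performance gap
        return accuracy, gap

    # e.g. per_group_accuracy([1, 0, 1, 1], [1, 0, 0, 1], ["F", "F", "M", "M"])
    #      -> ({'F': 1.0, 'M': 0.5}, 0.5)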


The main conclusion of this dissertation is that the proposed data-efficient methods are most effective in low-resource settings, where only a small labelled dataset is available or labelled data are lacking altogether. The contributions in this dissertation are a starting point for future research into developing deep neural network-based natural language processing systems for low-resource application domains such as healthcare.

Date: 23 Sep 2018 → 27 Jan 2023
Keywords: machine learning, deep learning, heterogeneous data processing, activity recognition
Disciplines: Sensors, biosensors and smart sensors, Other electrical and electronic engineering
Project type: PhD project