Learning with biased data: invariant representations and target labels

Kehrenberg, Thomas Maximilian (2021) Learning with biased data: invariant representations and target labels. Doctoral thesis (PhD), University of Sussex.

PDF - Published Version
Download (5MB)

Abstract

Biased data poses a significant challenge to the proper functioning of machine learning models and undermines the trustworthiness of deployed systems. Such biases are usually introduced by the data generation process: data is collected from non-representative samples or is the result of biased processes. Fixing these deficiencies in the data itself can be very expensive or even impossible, which makes it desirable to address the problem on the algorithmic end. In this work, I consider two forms of data bias, labelling bias and sampling bias, both investigated under the framework of algorithmic fairness and evaluated with common fairness metrics. Labelling bias here refers to a systematic bias, correlated with a sensitive attribute, which causes the labels in the dataset to differ from the “true” labels; sampling bias indicates that samples are missing from the training set in a systematic way but are still present in the setting where the model is intended to be deployed. Either bias causes a naively trained model to fail to generalize. I present three approaches to tackling this problem, each relying on some form of additional knowledge about the data. The first approach, dealing with labelling bias, is based on implicit, probabilistic target labels which satisfy certain given statistics; these target labels can be used to train any likelihood-based model. The second approach deals with strong spurious correlations in the training data, which can be seen as a specific form of sampling bias. A bias-free, partially-labelled context set is used to learn an interpretable representation of the data which is invariant to the spurious correlation and can be assessed qualitatively. The third approach deals with less extreme cases of sampling bias but relaxes the assumption of having labels in the context set, by learning an invariant representation via distribution matching.
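To make the third approach concrete: "distribution matching" in general means penalising an encoder whenever the distribution of its representations differs across groups defined by the sensitive attribute. A standard way to measure such a difference is the squared Maximum Mean Discrepancy (MMD). The sketch below is a minimal, generic illustration of an MMD penalty in NumPy, not the architecture or training procedure used in the thesis; the group names and Gaussian toy data are invented for the example.

```python
import numpy as np

def rbf_kernel(x, y, gamma=0.5):
    # Pairwise RBF kernel matrix between rows of x and rows of y.
    sq_dists = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq_dists)

def mmd2(x, y, gamma=0.5):
    # Biased estimate of the squared Maximum Mean Discrepancy between
    # two samples; it approaches zero (up to sampling noise) when the
    # two underlying distributions match, for a characteristic kernel.
    return (rbf_kernel(x, x, gamma).mean()
            - 2.0 * rbf_kernel(x, y, gamma).mean()
            + rbf_kernel(y, y, gamma).mean())

# Toy "representations" of two demographic groups (assumed 2-D here).
rng = np.random.default_rng(0)
z_a = rng.normal(0.0, 1.0, size=(500, 2))  # group A
z_b = rng.normal(0.0, 1.0, size=(500, 2))  # group B: same distribution
z_c = rng.normal(3.0, 1.0, size=(500, 2))  # group C: shifted distribution

print(mmd2(z_a, z_b))  # near zero: representations already matched
print(mmd2(z_a, z_c))  # clearly positive: a penalty term like this would
                       # push the encoder to remove the group-dependent shift
```

In an invariant-representation method, a term like `mmd2` (or an adversarial discriminator playing the same role) is added to the training loss, so minimising it drives the encoder's output distribution to be the same for every group.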

Item Type: Thesis (Doctoral)
Schools and Departments: School of Engineering and Informatics > Engineering and Design
Subjects: Q Science > Q Science (General) > Q0300 Cybernetics > Q0325 Self-organizing systems. Conscious automata > Q0325.5 Machine learning
Depositing User: Library Cataloguing
Date Deposited: 08 Sep 2021 08:31
Last Modified: 08 Sep 2021 08:31
URI: http://sro.sussex.ac.uk/id/eprint/101574

