Robertson, Andrew David.pdf (5.64 MB)
Characterising semantically coherent classes of text through feature discovery
There is a growing need to support social scientists and humanities scholars in gathering and “engaging” with very large datasets of free text in order to perform bespoke analyses. method52 is a text analysis platform built for this purpose (Wibberley et al., 2014), and it forms the foundation that this thesis builds upon. A central part of method52 and its methodologies is a classifier training component based on dualist (Settles, 2011). The general process of data engagement with method52 is determined to constitute a continuous cycle of characterising semantically coherent sub-collections, or classes, of the text. Two broad methodologies exist for supporting this type of engagement process: (1) a top-down approach, wherein concepts and their relationships are explicitly modelled for reasoning, and (2) a more surface-level, bottom-up approach, which uses key terms (surface features) to characterise data. Following the second of these approaches, this thesis examines ways of supporting this type of data engagement more effectively, to better meet the needs of social scientists and humanities scholars working with text data. The classifier component provides an active learning training environment that emphasises the labelling of individual features. However, it can be difficult to interpret and incorporate prior knowledge of features, the process of feature discovery based on the current classifier model does not always produce useful results, and understanding the data well enough to produce successful classifiers is time-consuming. To resolve these issues, a new method for discovering features in a corpus is introduced and feature discovery methods are explored. When collecting social media data, documents are often obtained by querying an API with a set of key phrases. Therefore, the set of possible classes characterising the data is defined by these basic surface features.
It is difficult to know exactly which terms must be searched for, and the usefulness of terms can change over time as new discussions and vocabulary emerge. Building on the feature discovery techniques, this thesis presents a framework for streaming data with an automatically adapting query to deal with these issues.
History
File Version
- Published version
Pages
250
Department affiliated with
- Informatics Theses
Qualification level
- doctoral
Qualification name
- phd
Language
- eng
Institution
University of Sussex
Full text available
- Yes
Legacy Posted Date
2019-07-10