Characterising semantically coherent classes of text through feature discovery

Robertson, Andrew David (2019) Characterising semantically coherent classes of text through feature discovery. Doctoral thesis (PhD), University of Sussex.

Abstract

There is a growing need to support social scientists and humanities scholars in gathering and “engaging” with very large datasets of free text in order to perform bespoke analyses. method52 is a text analysis platform built for this purpose (Wibberley et al., 2014), and forms the foundation that this thesis builds upon. A central part of method52 and its methodologies is a classifier training component based on dualist (Settles, 2011), and the general process of data engagement with method52 is determined to constitute a continuous cycle of characterising semantically coherent sub-collections, or classes, of the text. Two broad methodologies exist for supporting this type of engagement process: (1) a top-down approach wherein concepts and their relationships are explicitly modelled for reasoning, and (2) a more surface-level, bottom-up approach, which entails the use of key terms (surface features) to characterise data. Following the second of these approaches, this thesis examines ways of supporting this type of data engagement more effectively, to better meet the needs of social scientists and humanities scholars working with text data. The classifier component provides an active learning training environment emphasising the labelling of individual features. However, it can be difficult to interpret and incorporate prior knowledge of features, the process of feature discovery based on the current classifier model does not always produce useful results, and understanding the data well enough to produce successful classifiers is time-consuming. To resolve these issues, a new method for discovering features in a corpus is introduced, and feature discovery methods are explored. When collecting social media data, documents are often obtained by querying an API with a set of key phrases. Therefore, the set of possible classes characterising the data is defined by these basic surface features.
It is difficult to know exactly which terms must be searched for, and the usefulness of terms can change over time as new discussions and vocabulary emerge. Building on the feature discovery techniques, a framework is presented in this thesis for streaming data with an automatically adapting query to deal with these issues.

Item Type: Thesis (Doctoral)
Schools and Departments: School of Engineering and Informatics > Informatics
Subjects: A General Works > AZ History of scholarship and learning. The humanities > AZ0195 Electronic information resources. Including computer network resources, the Internet, digital libraries, etc.
Q Science > Q Science (General) > Q0300 Cybernetics > Q0350 Information theory > Q0387 Knowledge representation > Q0387.5 Semantic networks
Q Science > QA Mathematics > QA0075 Electronic computers. Computer science > QA0076.9.A-Z Other topics, A-Z > QA076.9.D343 Data mining
Z Bibliography. Library Science. Information Resources > Z0665 Library Science. Information Science > Z0696 Classification and notation
Depositing User: Library Cataloguing
Date Deposited: 10 Jul 2019 11:24
Last Modified: 10 Jul 2019 11:24
URI: http://sro.sussex.ac.uk/id/eprint/84841
