Topical subcategory structure in text classification

Lyra, Risto Matti Juhani (2019) Topical subcategory structure in text classification. Doctoral thesis (PhD), University of Sussex.


Abstract

Data sets with rich topical structure are common in many real-world text classification tasks. A single data set often contains a wide variety of topics and, in a typical task, the documents belonging to each class are dispersed across many of those topics. The relationship between the topic a document discusses and its class label is often complex: positive or negative sentiment is expressed in documents from many different topics, but knowing the topic does not necessarily help in determining the sentiment label. We know from tasks such as domain adaptation that sentiment is expressed in different ways under different topics. Topical context can in some cases even reverse the sentiment polarity of words: being sharp is a good quality for knives but a bad one for singers. This property can be found in many different document classification tasks.

Standard document classification algorithms neither account for nor take advantage of topical diversity; instead, classifiers are usually trained with the tacit assumption that topical diversity does not play a role. This thesis focuses on the interplay between the topical structure of corpora and the way the target labels in a classification task are distributed over the topics, and on how the topical structure can be utilised in building ensemble models for text classification. We show empirically that a data set with rich topical structure can be problematic for single classifiers, and we develop two novel ensemble models to address the issues. We focus on two document classification tasks: document-level sentiment analysis of product reviews and hierarchical categorisation of news text. For each task we develop a novel ensemble method that utilises topic models to address the shortcomings of traditional text classification algorithms.

Our contribution is to show empirically that the class association of document features is topic dependent. We show that using the topical context of documents to build ensembles is beneficial for some tasks, and we present two new ensemble models for document classification. We also provide a fresh viewpoint for reasoning about the relationship between class labels, topical categories and document features.

Item Type: Thesis (Doctoral)
Schools and Departments: School of Engineering and Informatics > Informatics
Subjects: Q Science > Q Science (General) > Q0300 Cybernetics > Q0325 Self-organizing systems. Conscious automata > Q0334 Artificial intelligence > Q0337.5 Pattern recognition systems
Q Science > QA Mathematics > QA0075 Electronic computers. Computer science > QA0076.9.A-Z Other topics, A-Z > QA0076.9.D343 Data mining
Depositing User: Library Cataloguing
Date Deposited: 17 Jan 2019 15:33
Last Modified: 11 Mar 2020 12:02
URI: http://sro.sussex.ac.uk/id/eprint/81340
