University of Sussex

Topical subcategory structure in text classification

thesis
posted on 2023-06-09, 16:34 authored by Risto Matti Juhani Lyra
Data sets with rich topical structure are common in many real-world text classification tasks. A single data set often contains a wide variety of topics and, in a typical task, the documents belonging to each class are dispersed across many of those topics. The relationship between the topic a document discusses and its class label is often complex: positive or negative sentiment is expressed in documents from many different topics, but knowing the topic does not necessarily help in determining the sentiment label. We know from tasks such as domain adaptation that sentiment is expressed in different ways under different topics. Topical context can in some cases even reverse the sentiment polarity of words: being sharp is a good quality in a knife but a bad one in a singer. This property is found in many different document classification tasks. Standard document classification algorithms neither account for nor take advantage of topical diversity; instead, classifiers are usually trained under the tacit assumption that topical diversity does not play a role. This thesis focuses on the interplay between the topical structure of corpora, the way the target labels of a classification task are distributed over the topics, and the ways in which topical structure can be utilised in building ensemble models for text classification. We show empirically that a data set with rich topical structure can be problematic for single classifiers, and we develop two novel ensemble models to address the issues. We focus on two document classification tasks: document-level sentiment analysis of product reviews and hierarchical categorisation of news text. For each task we develop a novel ensemble method that utilises topic models to address the shortcomings of traditional text classification algorithms. Our contribution is in showing empirically that the class association of document features is topic dependent.
We show that using the topical context of documents for building ensembles is beneficial for some tasks, and present two new ensemble models for document classification. We also provide a fresh viewpoint for reasoning about the relationship between class labels, topical categories and document features.

History

File Version

  • Published version

Pages

195

Department affiliated with

  • Informatics Theses

Qualification level

  • doctoral

Qualification name

  • phd

Language

  • eng

Institution

University of Sussex

Full text available

  • Yes

Legacy Posted Date

2019-01-17
