University of Sussex
Browse
Pay, Jack Frederick.pdf (23.75 MB)

The Corpus Expansion Toolkit: finding what we want on the web

Download (23.75 MB)
thesis
posted on 2023-06-09, 21:26 authored by Jack PayJack Pay
This thesis presents the Corpus Expansion Toolkit (CET), a generally applicable toolkit that allows researchers to build domain-specific corpora from the web. The main purpose of the work presented in this thesis and the development of the CET is to provide a solution to discovering desired content on the web from possibly unknown locations or a poorly defined domain. Using an iterative process, the CET is able to solve the problem of discovering domain-specific online content and expand a corpus using only a very small number of example documents or characteristic phrases taken from the target domain. Using a human-in-the-loop strategy and a chain of discrete software components the CET also allows the concept of a domain to be iteratively defined using the very online resources used to expand the original corpus. The CET combines feature extraction, search, web crawling and machine learning methods to collected, store, filter and perform information extraction on collected documents. Using a small number of example ‘seed’ documents the CET is able to expand the original corpus by finding more relevant documents from the web and provide a number of tools to support their analysis. This thesis presents a case study-based methodology that introduces the various contributions and components of the CET through the discussion of five case studies covering a wide variety of domains and requirements that the CET has been applied. These case studies hope to illustrate three main use cases, listed below, where the CET is applicable: 1. Domain known – source known 2. Domain known – source unknown 3. Domain unknown – source unknown First, use cases where the sites for document collection are known and the topic of research is clearly defined. Second, instances where the topic of research is clearly defined but where to find relevant documents on the web is unknown. Third, the most extreme use case, where the domain is poorly defined or unknown to the researcher and the location of the information is also unknown. This thesis presents a solution that allows researchers to begin with very little information on a specific topic and iteratively build a clear conception of a domain and translate that to a computational system.

History

File Version

  • Published version

Pages

198.0

Department affiliated with

  • Informatics Theses

Qualification level

  • doctoral

Qualification name

  • phd

Language

  • eng

Institution

University of Sussex

Full text available

  • Yes

Legacy Posted Date

2020-08-19

Usage metrics

    University of Sussex (Theses)

    Categories

    No categories selected

    Exports

    RefWorks
    BibTeX
    Ref. manager
    Endnote
    DataCite
    NLM
    DC