The Corpus Expansion Toolkit: finding what we want on the web

Pay, Jack Frederick (2020) The Corpus Expansion Toolkit: finding what we want on the web. Doctoral thesis (PhD), University of Sussex.


This thesis presents the Corpus Expansion Toolkit (CET), a generally applicable toolkit that allows researchers to build domain-specific corpora from the web. The main purpose of this work, and of the CET, is to provide a way to discover desired content on the web when its location may be unknown or the domain is poorly defined. Using an iterative process, the CET discovers domain-specific online content and expands a corpus from only a very small number of example documents or characteristic phrases taken from the target domain. Using a human-in-the-loop strategy and a chain of discrete software components, the CET also allows the concept of a domain to be defined iteratively using the very online resources used to expand the original corpus. The CET combines feature extraction, search, web crawling and machine learning methods to collect, store, filter and perform information extraction on documents. Starting from a small number of example ‘seed’ documents, the CET expands the original corpus by finding more relevant documents on the web and provides a number of tools to support their analysis. This thesis follows a case study-based methodology that introduces the contributions and components of the CET through five case studies covering a wide variety of domains and requirements to which the CET has been applied. These case studies illustrate three main use cases, listed below, where the CET is applicable:

1. Domain known – source known
2. Domain known – source unknown
3. Domain unknown – source unknown

The first covers use cases where the sites for document collection are known and the research topic is clearly defined. The second covers instances where the research topic is clearly defined but the location of relevant documents on the web is unknown. The third, the most extreme, covers cases where the domain is poorly defined or unknown to the researcher and the location of the information is also unknown. This thesis presents a solution that allows researchers to begin with very little information on a specific topic, iteratively build a clear conception of a domain, and translate that conception into a computational system.
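
The iterative, seed-driven expansion described in the abstract can be illustrated with a minimal Python sketch. This is not the CET's actual implementation: the function names (`top_terms`, `expand_corpus`), the frequency-based term extraction, the in-memory `TOY_WEB` standing in for a search engine and crawler, and the `toy_accept` function standing in for the human-in-the-loop relevance judgement are all assumptions made for illustration.

```python
import re
from collections import Counter
from typing import Callable, Iterable, List

# Tiny stopword list, assumed for this sketch only.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "on", "is", "for", "with"}


def tokenize(text: str) -> List[str]:
    """Lowercase the text and keep alphabetic tokens that are not stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def top_terms(docs: Iterable[str], k: int = 3) -> List[str]:
    """Return the k most frequent terms (ties broken alphabetically)."""
    counts = Counter()
    for doc in docs:
        counts.update(tokenize(doc))
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


def expand_corpus(
    seeds: Iterable[str],
    search: Callable[[List[str]], Iterable[str]],
    accept: Callable[[str], bool],
    rounds: int = 2,
    k: int = 3,
) -> List[str]:
    """Grow a corpus: extract terms, search, filter, and repeat."""
    corpus = list(seeds)
    seen = set(corpus)
    for _ in range(rounds):
        query = top_terms(corpus, k)      # feature extraction
        for doc in search(query):         # stand-in for search and crawling
            if doc in seen:
                continue
            seen.add(doc)
            if accept(doc):               # stand-in for human or classifier filter
                corpus.append(doc)
    return corpus


# Hypothetical in-memory "web" used only to make the sketch runnable.
TOY_WEB = [
    "solar panel inverter installation",
    "solar panel inverter sizing",
    "solar eclipse viewing times",
    "inverter sizing for home battery storage",
]


def toy_search(terms: List[str]) -> List[str]:
    """Return toy documents that share at least one term with the query."""
    return [d for d in TOY_WEB if any(t in tokenize(d) for t in terms)]


def toy_accept(doc: str) -> bool:
    """Reject the off-topic document, as a reviewer might."""
    return "eclipse" not in doc


expanded = expand_corpus(["solar panel basics"], toy_search, toy_accept)
```

In this toy run the first round finds the two solar panel documents and rejects the eclipse page; the terms learned from the accepted documents then let the second round reach the battery storage document, which shares no term with the original seed.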

Item Type: Thesis (Doctoral)
Schools and Departments: School of Engineering and Informatics > Informatics
Subjects: Q Science > QA Mathematics > QA0075 Electronic computers. Computer science
T Technology > TK Electrical engineering. Electronics Nuclear engineering > TK7800 Electronics > TK7885 Computer engineering. Computer hardware
Z Bibliography. Library Science. Information Resources > ZA Information resources > ZA4050 Electronic information resources > ZA4150 Computer network resources
Depositing User: Library Cataloguing
Date Deposited: 19 Aug 2020 06:12
Last Modified: 19 Aug 2020 06:12
