Evaluating distributional models of compositional semantics

Batchkarov, Miroslav Manov (2016) Evaluating distributional models of compositional semantics. Doctoral thesis (PhD), University of Sussex.

[img] PDF - Published Version
Download (1MB)

Abstract

Distributional models (DMs) are a family of unsupervised algorithms that represent the meaning of words as vectors. They have been shown to capture interesting aspects of semantics. Recent work has sought to compose word vectors in order to model phrases and sentences. The most commonly used measure of a compositional DM’s performance to date has been the degree to which it agrees with human-provided phrase similarity scores.
The contributions of this thesis are three-fold. First, I argue that existing intrinsic evaluations are unreliable as they make use of small and subjective gold-standard data sets and assume a notion of similarity that is independent of a particular application. Therefore, they do not necessarily measure how well a model performs in practice. I study four commonly used intrinsic datasets and demonstrate that all of them exhibit undesirable properties.
Second, I propose a novel framework within which to compare word- or phrase-level DMs in terms of their ability to support document classification. My approach couples a classifier to a DM and provides a setting where classification performance is sensitive to the quality of the DM.
Third, I present an empirical evaluation of several methods for building word representations and composing them within my framework. I find that the determining factor in building word representations is data quality rather than quantity; in some cases only a small amount of unlabelled data is required to reach peak performance. Neural algorithms for building single-word representations perform better than counting-based ones regardless of what composition is used, but simple composition algorithms can outperform more sophisticated competitors. Finally, I introduce a new algorithm for improving the quality of distributional thesauri using information from repeated runs of the same non deterministic algorithm.

Item Type: Thesis (Doctoral)
Schools and Departments: School of Engineering and Informatics > Informatics
Subjects: P Language and Literature > P Philology. Linguistics > P0098 Computational linguistics. Natural language processing
Depositing User: Library Cataloguing
Date Deposited: 01 Jun 2016 14:31
Last Modified: 01 Jun 2016 14:31
URI: http://sro.sussex.ac.uk/id/eprint/61062

View download statistics for this item

📧 Request an update