Batchkarov, Miroslav Manov.pdf (1.3 MB)
Evaluating distributional models of compositional semantics
thesis
posted on 2023-06-09, 01:17 authored by Miroslav Manov BatchkarovDistributional models (DMs) are a family of unsupervised algorithms that represent the meaning of words as vectors. They have been shown to capture interesting aspects of semantics. Recent work has sought to compose word vectors in order to model phrases and sentences. The most commonly used measure of a compositional DM’s performance to date has been the degree to which it agrees with human-provided phrase similarity scores. The contributions of this thesis are three-fold. First, I argue that existing intrinsic evaluations are unreliable as they make use of small and subjective gold-standard data sets and assume a notion of similarity that is independent of a particular application. Therefore, they do not necessarily measure how well a model performs in practice. I study four commonly used intrinsic datasets and demonstrate that all of them exhibit undesirable properties. Second, I propose a novel framework within which to compare word- or phrase-level DMs in terms of their ability to support document classification. My approach couples a classifier to a DM and provides a setting where classification performance is sensitive to the quality of the DM. Third, I present an empirical evaluation of several methods for building word representations and composing them within my framework. I find that the determining factor in building word representations is data quality rather than quantity; in some cases only a small amount of unlabelled data is required to reach peak performance. Neural algorithms for building single-word representations perform better than counting-based ones regardless of what composition is used, but simple composition algorithms can outperform more sophisticated competitors. Finally, I introduce a new algorithm for improving the quality of distributional thesauri using information from repeated runs of the same non deterministic algorithm.
History
File Version
- Published version
Pages
139.0Department affiliated with
- Informatics Theses
Qualification level
- doctoral
Qualification name
- phd
Language
- eng
Institution
University of SussexFull text available
- Yes
Legacy Posted Date
2016-06-01Usage metrics
Categories
No categories selectedKeywords
Licence
Exports
RefWorks
BibTeX
Ref. manager
Endnote
DataCite
NLM
DC