University of Sussex

Good-Turing frequency estimation without tears.

Journal contribution
Posted on 2023-06-07, 23:10. Authored by William A Gale and Geoffrey Sampson.
Linguists and speech researchers who use statistical methods often need to estimate the frequency of some type of item in a population containing items of various types. A common approach is to divide the number of cases observed in a sample by the size of the sample; sometimes small positive quantities are added to divisor and dividend in order to avoid zero estimates for types missing from the sample. These approaches are obvious and simple, but they lack principled justification, and yield estimates that can be wildly inaccurate. I.J. Good and Alan Turing developed a family of theoretically well-founded techniques appropriate to this domain. Some versions of the Good-Turing approach are very demanding computationally, but we define a version, the Simple Good-Turing estimator, which is straightforward to use. Tested on a variety of natural-language-related data sets, the Simple Good-Turing estimator performs well, absolutely and relative both to the approaches just discussed and to other, more sophisticated techniques.
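
To make the contrast in the abstract concrete, the following is a minimal sketch of the two "obvious" estimators it mentions, alongside the basic Turing formulas that the Good-Turing family builds on. It is not the paper's Simple Good-Turing estimator, which also smooths the frequency-of-frequency counts. The function names, the toy sample, and the `delta` parameter are assumptions made for illustration only.

```python
from collections import Counter


def relative_frequency(counts):
    """Observed count divided by sample size (zero for unseen types)."""
    n = sum(counts.values())
    return {t: c / n for t, c in counts.items()}


def additive_estimate(counts, vocabulary, delta=0.5):
    """Add a small positive quantity to every count (assumed form:
    delta per type in the numerator, delta * |vocabulary| in the divisor)."""
    n = sum(counts.values())
    denom = n + delta * len(vocabulary)
    return {t: (counts.get(t, 0) + delta) / denom for t in vocabulary}


def turing_unseen_mass(counts):
    """Basic Turing estimate of the total probability of unseen types:
    N1 / N, where N1 is the number of types observed exactly once."""
    n = sum(counts.values())
    n1 = sum(1 for c in counts.values() if c == 1)
    return n1 / n


def turing_adjusted_count(counts, r):
    """Unsmoothed Turing adjusted count r* = (r + 1) * N_{r+1} / N_r.
    Returns None when N_r or N_{r+1} is zero, which is the gap that
    smoothed variants of the method are meant to handle."""
    freq_of_freq = Counter(counts.values())
    if freq_of_freq[r] == 0 or freq_of_freq[r + 1] == 0:
        return None
    return (r + 1) * freq_of_freq[r + 1] / freq_of_freq[r]


# Hypothetical toy sample: 10 tokens over 6 observed types.
sample = "a a a b b c c d e f".split()
counts = Counter(sample)
vocabulary = set(counts) | {"g"}  # "g" stands in for an unseen type

mle = relative_frequency(counts)
add_half = additive_estimate(counts, vocabulary)
unseen = turing_unseen_mass(counts)
r_star_1 = turing_adjusted_count(counts, 1)
r_star_3 = turing_adjusted_count(counts, 3)
```

In this toy sample the relative-frequency estimate gives the unseen type "g" no probability at all, the additive estimate gives it an amount that depends entirely on the arbitrary choice of `delta`, and the Turing estimate reserves probability for unseen types based on how many types were seen exactly once. The `None` returned for `r = 3` shows why the raw formula cannot be used unmodified when higher counts are sparse.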

History

Publication status

  • Published

Journal

Journal of Quantitative Linguistics

Issue

3

Volume

2

Page range

217-237

ISSN

0929-6174

Department affiliated with

  • Informatics Publications

Notes

Nominally 1995, but in fact published in 1996 (after the deadline for the last exercise).

Full text available

  • No

Peer reviewed?

  • Yes

Legacy Posted Date

2012-02-06
