File(s) not publicly available
Good-Turing frequency estimation without tears.
journal contribution
posted on 2023-06-07, 23:10 authored by William A Gale, Geoffrey Sampson
Linguists and speech researchers who use statistical methods often need to estimate the frequency of some type of item in a population containing items of various types. A common approach is to divide the number of cases observed in a sample by the size of the sample; sometimes small positive quantities are added to divisor and dividend in order to avoid zero estimates for types missing from the sample. These approaches are obvious and simple, but they lack principled justification, and yield estimates that can be wildly inaccurate. I.J. Good and Alan Turing developed a family of theoretically well-founded techniques appropriate to this domain. Some versions of the Good-Turing approach are very demanding computationally, but we define a version, the Simple Good-Turing estimator, which is straightforward to use. Tested on a variety of natural-language-related data sets, the Simple Good-Turing estimator performs well, absolutely and relative both to the approaches just discussed and to other, more sophisticated techniques.
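As an illustration of the two simple approaches the abstract describes, the following minimal Python sketch computes the relative-frequency estimate and the add-a-constant estimate, alongside the basic, unsmoothed Turing quantities. It is not the Simple Good-Turing estimator defined in the paper, whose details are not given in this record. The function names, the toy sample, the constant `k = 0.5`, and the assumed number of types are all hypothetical choices made for the example.

```python
from collections import Counter

def relative_frequency(counts):
    """Observed count divided by sample size; unseen types implicitly get zero."""
    total = sum(counts.values())
    return {item: c / total for item, c in counts.items()}

def additive_estimate(counts, item, num_types, k=0.5):
    """Add a small constant k to every count.

    num_types is the assumed total number of types, seen or unseen
    (an assumption the analyst must supply).
    """
    total = sum(counts.values())
    return (counts.get(item, 0) + k) / (total + k * num_types)

def turing_unseen_mass(counts):
    """Basic Turing estimate of the total probability of unseen types: N_1 / N."""
    total = sum(counts.values())
    n1 = sum(1 for c in counts.values() if c == 1)
    return n1 / total

def turing_adjusted_count(counts, r):
    """Unsmoothed adjusted count r* = (r + 1) * N_{r+1} / N_r.

    Returns None when no type occurs exactly r times.
    """
    freq_of_freq = Counter(counts.values())  # N_r: number of types seen r times
    if freq_of_freq[r] == 0:
        return None
    return (r + 1) * freq_of_freq[r + 1] / freq_of_freq[r]

# Toy sample (hypothetical): 7 tokens, 4 observed types.
sample = ["a", "a", "a", "b", "b", "c", "d"]
counts = Counter(sample)

mle = relative_frequency(counts)
unseen_additive = additive_estimate(counts, "e", num_types=5)
unseen_turing = turing_unseen_mass(counts)
adjusted = {r: turing_adjusted_count(counts, r) for r in (1, 2, 3)}
```

In this toy sample the adjusted count for the most frequent type comes out as zero, because no type occurs four times. The unsmoothed estimate breaks down wherever the frequency-of-frequency counts become sparse, which is the kind of difficulty that smoothed variants of Good-Turing are intended to address.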
History
Publication status
- Published
Journal
Journal of Quantitative Linguistics
Issue
3
Volume
2
Page range
217-237
ISSN
0929-6174
Department affiliated with
- Informatics Publications
Notes
Nominally 1995 but in fact 1996 (after the deadline for the last exercise).
Full text available
- No
Peer reviewed?
- Yes
Legacy Posted Date
2012-02-06