MuSeCLIR: a multiple senses and cross-lingual information retrieval dataset
conference contribution
posted on 2023-06-10, 04:51 authored by Wing Yan LI, Julie Weeds, David Weir
This paper addresses a deficiency in existing cross-lingual information retrieval (CLIR) datasets and provides a robust evaluation of CLIR systems’ disambiguation ability. CLIR is commonly tackled by combining translation and traditional IR. Because translation introduces its own ambiguity, the problem of ambiguity is worse in CLIR than in monolingual IR. However, existing auto-generated CLIR datasets are dominated by searches for named entity mentions, which do not provide a good measure of disambiguation performance, as named entity mentions can often be transliterated across languages and tend not to have multiple translations. Therefore, we introduce a new evaluation dataset (MuSeCLIR) to address this inadequacy. The dataset focusses on polysemous common nouns with multiple possible translations. MuSeCLIR is constructed from multilingual Wikipedia and supports searches on documents written in European (French, German, Italian) and Asian (Chinese, Japanese) languages. We provide baseline statistical and neural model results on MuSeCLIR, which show that MuSeCLIR places greater demands on the ability of systems to disambiguate query terms.
History
Publication status
- Published
File Version
- Published version
Journal
Proceedings of the 29th International Conference on Computational Linguistics
Publisher
International Committee on Computational Linguistics
Publisher URL
Page range
1128-1135
Event name
29th International Conference on Computational Linguistics
Event location
Korea
Event type
conference
Event date
October 12-17, 2022
Series
COLING'2022
Department affiliated with
- Informatics Publications
Full text available
- Yes
Peer reviewed?
- Yes
Legacy Posted Date
2022-09-27
First Open Access (FOA) Date
2022-10-19
First Compliant Deposit (FCD) Date
2022-09-27