University of Sussex

MuSeCLIR: a multiple senses and cross-lingual information retrieval dataset

conference contribution
posted on 2023-06-10, 04:51 authored by Wing Yan LI, Julie Weeds, David Weir
This paper addresses a deficiency in existing cross-lingual information retrieval (CLIR) datasets and provides a robust evaluation of CLIR systems’ disambiguation ability. CLIR is commonly tackled by combining translation and traditional IR. Because translation introduces further ambiguity, the problem of ambiguity is worse in CLIR than in monolingual IR. However, existing auto-generated CLIR datasets are dominated by searches for named entity mentions, which do not provide a good measure of disambiguation performance, as named entity mentions can often be transliterated across languages and tend not to have multiple translations. Therefore, we introduce a new evaluation dataset (MuSeCLIR) to address this inadequacy. The dataset focusses on polysemous common nouns with multiple possible translations. MuSeCLIR is constructed from multilingual Wikipedia and supports searches on documents written in European (French, German, Italian) and Asian (Chinese, Japanese) languages. We provide baseline statistical and neural model results on MuSeCLIR, which show that MuSeCLIR places a higher demand on the ability of systems to disambiguate query terms.

History

Publication status

  • Published

File Version

  • Published version

Journal

Proceedings of the 29th International Conference on Computational Linguistics

Publisher

International Committee on Computational Linguistics

Page range

1128-1135

Event name

29th International Conference on Computational Linguistics

Event location

Korea

Event type

conference

Event date

October 12-17, 2022

Series

COLING'2022

Department affiliated with

  • Informatics Publications

Full text available

  • Yes

Peer reviewed?

  • Yes

Legacy Posted Date

2022-09-27

First Open Access (FOA) Date

2022-10-19

First Compliant Deposit (FCD) Date

2022-09-27
