University of Sussex

Embed More Ignore Less (EMIL): enriched representations for Arabic NLP

conference contribution
posted on 2023-06-09, 22:20 authored by Ahmed Mostafa Mohamed Ahmed Younes, Julie Weeds
Our research focuses on the potential improvements gained when neural networks exploit language-specific characteristics in the form of embeddings. More specifically, we investigate the capability of neural techniques and embeddings to represent language-specific characteristics in two sequence labeling tasks: named entity recognition (NER) and part-of-speech (POS) tagging. In both tasks, our preprocessing is designed to use an enriched Arabic representation by adding diacritics to undiacritized text. In POS tagging, we test the ability of a neural model to capture syntactic characteristics encoded within these diacritics by incorporating an embedding layer for diacritics alongside embedding layers for words and characters. In NER, our architecture incorporates diacritic and POS embeddings alongside word and character embeddings. Our experiments are conducted on 7 datasets (4 NER and 3 POS). We show that embedding the information that is encoded in automatically acquired Arabic diacritics improves the performance across all datasets on both tasks. Embedding the information in automatically assigned POS tags further improves performance on the NER task.
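
As an illustration of the idea in the abstract, the following Python sketch shows one way a diacritized word could be split into parallel sequences of base characters and diacritics, so that each sequence could feed its own embedding layer. The function name `split_diacritics` and the use of Unicode combining classes to detect diacritics are assumptions made for this example; the paper's actual preprocessing and model code are not reproduced here.

```python
import unicodedata


def split_diacritics(word):
    """Split a diacritized word into two aligned lists.

    Hypothetical helper (not from the paper). Returns (bases, marks),
    where marks[i] holds the diacritics attached to bases[i], or ""
    if that character carries none.
    """
    bases, marks = [], []
    for ch in word:
        # Arabic diacritics (fatha, damma, kasra, shadda, sukun, tanwin)
        # are combining marks, so their combining class is non-zero.
        if unicodedata.combining(ch) and bases:
            marks[-1] += ch
        else:
            # Base letter, or a stray mark with no preceding letter.
            bases.append(ch)
            marks.append("")
    return bases, marks


# "kataba" written with a fatha on each of its three letters.
word = "\u0643\u064e\u062a\u064e\u0628\u064e"
bases, marks = split_diacritics(word)
for base, mark in zip(bases, marks):
    print(f"U+{ord(base):04X}", [f"U+{ord(m):04X}" for m in mark])
```

The two lists stay aligned position by position, which is what would allow a character embedding and a diacritic embedding to be looked up per position and combined. How the combination is done (for example, concatenation) is a modelling choice that this sketch leaves open.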

History

Publication status

  • Published

File Version

  • Published version

Journal

Proceedings of the Fifth Arabic Natural Language Processing Workshop

Publisher

Association for Computational Linguistics

Page range

139-154

Event name

The Fifth Arabic Natural Language Processing Workshop (WANLP 2020)

Event location

Online

Event type

conference

Event date

12th December 2020

Department affiliated with

  • Informatics Publications

Full text available

  • Yes

Peer reviewed?

  • Yes

Legacy Posted Date

2020-12-02

First Open Access (FOA) Date

2021-01-08

First Compliant Deposit (FCD) Date

2020-12-02
