Embed More Ignore Less (EMIL): enriched representations for Arabic NLP
conference contribution
posted on 2023-06-09, 22:20 authored by Ahmed Mostafa Mohamed Ahmed Younes, Julie Weeds
Our research focuses on the potential improvements gained by exploiting language-specific characteristics in the form of embeddings by neural networks. More specifically, we investigate the capability of neural techniques and embeddings to represent language-specific characteristics in two sequence labeling tasks: named entity recognition (NER) and part-of-speech (POS) tagging. In both tasks, our preprocessing is designed to use an enriched Arabic representation by adding diacritics to undiacritized text. In POS tagging, we test the ability of a neural model to capture syntactic characteristics encoded within these diacritics by incorporating an embedding layer for diacritics alongside embedding layers for words and characters. In NER, our architecture incorporates diacritic and POS embeddings alongside word and character embeddings. Our experiments are conducted on 7 datasets (4 NER and 3 POS). We show that embedding the information that is encoded in automatically acquired Arabic diacritics improves the performance across all datasets on both tasks. Embedding the information in automatically assigned POS tags further improves performance on the NER task.
History
Publication status
- Published
File Version
- Published version
Journal
Proceedings of the Fifth Arabic Natural Language Processing Workshop
Publisher
Association for Computational Linguistics
Page range
139-154
Event name
The Fifth Arabic Natural Language Processing Workshop (WANLP 2020)
Event location
Online
Event type
conference
Event date
12th December 2020
Department affiliated with
- Informatics Publications
Full text available
- Yes
Peer reviewed?
- Yes
Legacy Posted Date
2020-12-02
First Open Access (FOA) Date
2021-01-08
First Compliant Deposit (FCD) Date
2020-12-02