Unsupervised learning of Arabic non-concatenative morphology

Khaliq, Bilal

Khaliq,_Bilal.pdf (3.2 MB)

thesis

posted on 2023-06-08, 20:41 authored by Bilal Khaliq

Unsupervised approaches to learning the morphology of a language play an important role in the computer processing of language, from both a practical and a theoretical perspective, owing to their minimal reliance on manually produced linguistic resources and human annotation. Such approaches have been widely researched for the problem of concatenative affixation, but less attention has been paid to the intercalated (non-concatenative) morphology exhibited by Arabic and other Semitic languages. The aim of this research is to learn the root and pattern morphology of Arabic, with accuracy comparable to manually built morphological analysis systems. The approach is kept free from human supervision or manual parameter settings, assuming only that roots and patterns intertwine to form a word. Promising results were obtained by applying a technique adapted from previous work in concatenative morphology learning, which uses machine learning to determine relatedness between words. The output, with probabilistic relatedness values between words, was then used to rank all possible roots and patterns to form a lexicon. Analysis using trilateral roots resulted in a correct root identification accuracy of approximately 86% for inflected words. Although the machine learning-based approach is effective, it is conceptually complex. An alternative, simpler and more computationally efficient approach was therefore devised to obtain morpheme scores based on comparative counts of roots and patterns. In this approach, root and pattern scores are defined in terms of each other in a mutually recursive relationship, converging to an optimized morpheme ranking. This technique gives slightly better accuracy while being conceptually simpler and more efficient. The approach, after further enhancements, was evaluated on a version of the Quranic Arabic Corpus, attaining a final accuracy of approximately 93%. A comparative evaluation shows this to be superior to two existing, widely used, manually built Arabic stemmers, thus demonstrating the practical feasibility of unsupervised learning of non-concatenative morphology.
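
The idea of scoring roots and patterns in terms of each other can be illustrated with a small sketch. This is not the thesis's actual formulation, which the abstract does not specify. It assumes a simple scheme in which every in-order choice of three letters is a candidate root, the remaining letters (with `-` marking the root slots) form the candidate pattern, and each score is repeatedly recomputed as the normalised sum of the scores of its partners. The function names `candidate_splits`, `rank_morphemes` and `best_split` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def candidate_splits(word, root_len=3):
    """Yield every (root, pattern) pair obtained by picking root_len letters
    in order as the root and marking their slots with '-' in the pattern."""
    for idxs in combinations(range(len(word)), root_len):
        root = "".join(word[i] for i in idxs)
        pattern = "".join("-" if i in idxs else ch for i, ch in enumerate(word))
        yield root, pattern


def rank_morphemes(words, iterations=20):
    """Assumed scoring scheme: a root scores highly if it occurs with
    high-scoring patterns, and vice versa. Scores are renormalised to
    sum to 1 after every update."""
    pairs = {rp for w in set(words) for rp in candidate_splits(w)}
    if not pairs:
        return {}, {}
    root_score = {r: 1.0 for r, _ in pairs}
    pattern_score = {p: 1.0 for _, p in pairs}
    for _ in range(iterations):
        # Update root scores from the current pattern scores.
        new_root = defaultdict(float)
        for r, p in pairs:
            new_root[r] += pattern_score[p]
        total = sum(new_root.values())
        root_score = {r: s / total for r, s in new_root.items()}
        # Update pattern scores from the freshly updated root scores.
        new_pattern = defaultdict(float)
        for r, p in pairs:
            new_pattern[p] += root_score[r]
        total = sum(new_pattern.values())
        pattern_score = {p: s / total for p, s in new_pattern.items()}
    return root_score, pattern_score


def best_split(word, root_score, pattern_score):
    """Return the candidate (root, pattern) with the highest combined score,
    or None if the word is too short to contain a root."""
    return max(
        candidate_splits(word),
        key=lambda rp: root_score.get(rp[0], 0.0) * pattern_score.get(rp[1], 0.0),
        default=None,
    )
```

Given a word list, `rank_morphemes(words)` returns the two score tables, and `best_split(word, roots, patterns)` picks one analysis per word. On a toy list the ranking may well be dominated by spurious candidates; any real use would need a sizeable corpus and the further enhancements the abstract alludes to.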

History

File Version

Published version

Pages

197

Department affiliated with

Informatics Theses

Qualification level

doctoral

Qualification name

PhD

Language

eng

Institution

University of Sussex

Full text available

Yes

Legacy Posted Date

2015-05-12
