Leveraging HTML in free text web named entity recognition

Ashby, Colin and Weir, David (2020) Leveraging HTML in free text web named entity recognition. 28th International Conference on Computational Linguistics, Barcelona, Spain (Online), December 8-13, 2020. Published in: Proceedings of the 28th International Conference on Computational Linguistics. 407-413. International Committee on Computational Linguistics, Barcelona, Spain (Online).

PDF - Published Version
Available under License Creative Commons Attribution.

Abstract

HTML tags are typically discarded in free text Named Entity Recognition from Web pages. We investigate whether these discarded tags might be used to improve NER performance. We compare Text+Tags sentences with their Text-Only equivalents, over five datasets, two free text segmentation granularities and two NER models. We find an increased F1 performance for Text+Tags of between 0.9% and 13.2% over all datasets, variants and models. This performance increase, over datasets of varying entity types, HTML density and construction quality, indicates our method is flexible and adaptable. These findings imply that a similar technique might be of use in other Web-aware NLP tasks, including the enrichment of deep language models.
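As a rough illustration of the Text-Only versus Text+Tags comparison described in the abstract, the sketch below tokenizes one HTML fragment in both ways using Python's standard `html.parser` module. The abstract does not specify how tags are encoded for the NER models, so the tag-token format (`<b>`, `</b>`, attributes dropped), the whitespace tokenization, the class and function names, and the sample snippet are all assumptions made for illustration, not the authors' method.

```python
from html.parser import HTMLParser


class TagAwareTokenizer(HTMLParser):
    """Collects whitespace-separated tokens, optionally keeping tags as tokens.

    Hypothetical helper: the tag-token format is an assumption, not the
    representation used in the paper.
    """

    def __init__(self, keep_tags):
        super().__init__()
        self.keep_tags = keep_tags
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # Attributes are dropped; only the tag name is kept.
        if self.keep_tags:
            self.tokens.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if self.keep_tags:
            self.tokens.append(f"</{tag}>")

    def handle_data(self, data):
        # Free text is split on whitespace in both variants.
        self.tokens.extend(data.split())


def tokenize(html, keep_tags):
    """Return the token list for an HTML fragment."""
    parser = TagAwareTokenizer(keep_tags)
    parser.feed(html)
    parser.close()
    return parser.tokens


# Made-up sample fragment, not taken from any of the paper's datasets.
snippet = "<p>Contact <b>Jane Doe</b> at <a href='/x'>Acme Ltd</a>.</p>"
text_only = tokenize(snippet, keep_tags=False)
text_plus_tags = tokenize(snippet, keep_tags=True)

print(text_only)
print(text_plus_tags)
```

In the Text+Tags variant the entity mentions stay adjacent to markup such as `<b>` and `<a>`, which is the kind of signal the abstract suggests an NER model might exploit; both token lists could then be fed to the same tagger and compared by F1.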

Item Type: Conference Proceedings
Schools and Departments: School of Engineering and Informatics > Informatics
Research Centres and Groups: Data Science Research Group
Depositing User: Colin Ashby
Date Deposited: 11 Dec 2020 09:40
Last Modified: 11 Dec 2020 09:40
URI: http://sro.sussex.ac.uk/id/eprint/95648

Project Name: EPSRC DTP EP/M508172/1
Sussex Project Number: G1644
Funder: EPSRC-ENGINEERING & PHYSICAL SCIENCES RESEARCH COUNCIL
Funder Ref: EP/M508172/1