University of Sussex

Leveraging HTML in free text web named entity recognition

conference contribution
posted on 2023-06-09, 22:26, authored by Colin Ashby, David Weir
HTML tags are typically discarded in free text Named Entity Recognition from Web pages. We investigate whether these discarded tags might be used to improve NER performance. We compare Text+Tags sentences with their Text-Only equivalents, over five datasets, two free text segmentation granularities and two NER models. We find an increased F1 performance for Text+Tags of between 0.9% and 13.2% over all datasets, variants and models. This performance increase, over datasets of varying entity types, HTML density and construction quality, indicates our method is flexible and adaptable. These findings imply that a similar technique might be of use in other Web-aware NLP tasks, including the enrichment of deep language models.
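The contrast between Text-Only and Text+Tags input can be illustrated with a minimal sketch. It assumes, hypothetically, that a Text+Tags sentence keeps each HTML tag as an extra token in place, while the Text-Only sentence drops the tags. The abstract does not specify the tag encoding the authors used, so the class name `SentenceVariants`, the function `build_variants`, the whitespace tokenisation and the sample HTML are all illustrative assumptions, not the paper's method.

```python
from html.parser import HTMLParser


class SentenceVariants(HTMLParser):
    """Collects two token sequences from one HTML fragment (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.text_only = []  # tokens with every tag discarded
        self.text_tags = []  # tokens with tags kept as in-place markers

    def handle_starttag(self, tag, attrs):
        # Assumption: keep only the tag name and drop its attributes.
        self.text_tags.append(f"<{tag}>")

    def handle_endtag(self, tag):
        self.text_tags.append(f"</{tag}>")

    def handle_data(self, data):
        # Assumption: naive whitespace tokenisation stands in for a real tokenizer.
        tokens = data.split()
        self.text_only.extend(tokens)
        self.text_tags.extend(tokens)


def build_variants(html):
    """Return (text_only_tokens, text_plus_tags_tokens) for an HTML fragment."""
    parser = SentenceVariants()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return parser.text_only, parser.text_tags


# Hypothetical input fragment, not taken from the paper's datasets.
sample = '<p>Talk by <a href="/staff/dw">David Weir</a> in <b>Brighton</b></p>'
text_only, text_tags = build_variants(sample)
print(" ".join(text_only))
print(" ".join(text_tags))
```

Both sequences would then be passed to the same NER model, so that any difference in F1 can be attributed to the presence of the tags.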

Funding

EPSRC DTP EP/M508172/1; G1644; EPSRC-ENGINEERING & PHYSICAL SCIENCES RESEARCH COUNCIL; EP/M508172/1

History

Publication status

  • Published

File Version

  • Published version

Journal

Proceedings of the 28th International Conference on Computational Linguistics

Publisher

International Committee on Computational Linguistics

Page range

407-413

Event name

28th International Conference on Computational Linguistics

Event location

Barcelona, Spain (Online)

Event type

conference

Event date

December 8-13, 2020

Place of publication

Barcelona, Spain (Online)

Series

COLING'2020

Department affiliated with

  • Informatics Publications

Research groups affiliated with

  • Data Science Research Group Publications

Full text available

  • Yes

Peer reviewed?

  • Yes

Legacy Posted Date

2020-12-11

First Open Access (FOA) Date

2020-12-11

First Compliant Deposit (FCD) Date

2020-12-10
