Leveraging HTML in free text web named entity recognition
conference contribution
posted on 2023-06-09, 22:26 authored by Colin Ashby, David Weir
HTML tags are typically discarded in free text Named Entity Recognition from Web pages. We investigate whether these discarded tags might be used to improve NER performance. We compare Text+Tags sentences with their Text-Only equivalents, over five datasets, two free text segmentation granularities and two NER models. We find an increased F1 performance for Text+Tags of between 0.9% and 13.2% over all datasets, variants and models. This performance increase, over datasets of varying entity types, HTML density and construction quality, indicates our method is flexible and adaptable. These findings imply that a similar technique might be of use in other Web-aware NLP tasks, including the enrichment of deep language models.
Funding
EPSRC DTP EP/M508172/1; G1644; EPSRC-ENGINEERING & PHYSICAL SCIENCES RESEARCH COUNCIL; EP/M508172/1
History
Publication status
- Published
File Version
- Published version
Journal
Proceedings of the 28th International Conference on Computational Linguistics
Publisher
International Committee on Computational Linguistics
Publisher URL
Page range
407-413
Event name
28th International Conference on Computational Linguistics
Event location
Barcelona, Spain (Online)
Event type
conference
Event date
December 8-13, 2020
Place of publication
Barcelona, Spain (Online)
Series
COLING'2020
Department affiliated with
- Informatics Publications
Research groups affiliated with
- Data Science Research Group Publications
Full text available
- Yes
Peer reviewed?
- Yes
Legacy Posted Date
2020-12-11
First Open Access (FOA) Date
2020-12-11
First Compliant Deposit (FCD) Date
2020-12-10