University of Sussex

Creating longitudinal datasets and cleaning existing data identifiers in a cystic fibrosis registry using a novel Bayesian probabilistic approach from astronomy

journal contribution
posted on 2023-06-09, 22:28 authored by Pete Hurley, Seb Oliver, Anil Mehta
Patient registry data are commonly collected as annual snapshots that need to be amalgamated to understand the longitudinal progress of each patient. However, patient identifiers can change, or may be unavailable for legal reasons, when longitudinal data are collated from patients living in different countries. Here, we apply astronomical statistical matching techniques to link individual patient records; these techniques can be used where identifiers are absent or to validate uncertain identifiers. We adopt a Bayesian model framework used for probabilistically linking records in astronomy. We adapt this and validate it across blinded, annually collected data. This is a high-quality (Danish) sub-set of data held in the European Cystic Fibrosis Society Patient Registry (ECFSPR). Our initial experiments achieved a precision of 0.990 at a recall value of 0.987. However, detailed investigation of the discrepancies uncovered typing errors in 27 of the identifiers in the original Danish sub-set. After fixing these errors to create a new gold standard, our algorithm correctly linked individual records across years, achieving a precision of 0.997 at a recall value of 0.987 without recourse to identifiers. Our Bayesian framework provides the probability that a pair of records belongs to the same patient. Unlike other record linkage approaches, our algorithm can also use physical models, such as body mass index curves, as prior information for record linkage. We have shown that our framework can create longitudinal samples where none existed and validate pre-existing patient identifiers. We have demonstrated that, in this specific case, this automated approach is better than the existing identifiers.
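To make the two quantities in the abstract concrete, the following is a minimal sketch of (a) a posterior match probability for one pair of records and (b) precision and recall over a set of proposed links. It is not the authors' model: it assumes a single numeric attribute whose year-to-year difference is Gaussian for true matches, and the function names, parameters (`sigma`, `prior_match`, `nonmatch_density`) and toy values are all hypothetical.

```python
import math


def match_posterior(delta, sigma, prior_match, nonmatch_density):
    """Posterior probability that two records belong to the same patient.

    delta            -- difference in one measured attribute between the records
    sigma            -- assumed std. dev. of that difference for a true match
    prior_match      -- assumed prior probability that a random pair is a match
    nonmatch_density -- assumed probability density of `delta` for unrelated pairs
    """
    # Gaussian likelihood of the observed difference under the "same patient" hypothesis.
    like_match = math.exp(-0.5 * (delta / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
    # Bayes' theorem over the two hypotheses (match, non-match).
    numerator = like_match * prior_match
    return numerator / (numerator + nonmatch_density * (1.0 - prior_match))


def precision_recall(predicted_links, true_links):
    """Precision and recall of predicted record pairs against a gold standard."""
    predicted, truth = set(predicted_links), set(true_links)
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


if __name__ == "__main__":
    # Toy values only: a small attribute difference yields a higher match probability.
    print(match_posterior(delta=0.5, sigma=1.0, prior_match=0.01, nonmatch_density=0.02))
    print(match_posterior(delta=4.0, sigma=1.0, prior_match=0.01, nonmatch_density=0.02))
    print(precision_recall([("a1", "b1"), ("a2", "b2")], [("a1", "b1"), ("a3", "b3")]))
```

In this sketch, a prior informed by a physical model (for example, an expected change along a body mass index curve) would enter by shifting the expected `delta` or adjusting `prior_match`; how the published framework actually incorporates such priors is described in the paper itself.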

History

Publication status

  • Published

File Version

  • Published version

Journal

PLoS ONE

ISSN

1932-6203

Publisher

Public Library of Science

Issue

7

Volume

13

Page range

1-15

Article number

a0199815

Event location

United States

Department affiliated with

  • Physics and Astronomy Publications

Full text available

  • Yes

Peer reviewed?

  • Yes

Legacy Posted Date

2020-12-16

First Open Access (FOA) Date

2020-12-16

First Compliant Deposit (FCD) Date

2020-12-15
