Causal datasheet for datasets: an evaluation guide for real-world data analysis and data collection design using Bayesian networks

Butcher, Bradley; Huang, Vincent S; Robinson, Christopher; Reffin, Jeremy; Sgaier, Sema K; Charles, Grace; Quadrianto, Novi

journal contribution

posted on 2023-06-09, 23:42 authored by Bradley Butcher, Vincent S Huang, Christopher Robinson, Jeremy Reffin, Sema K Sgaier, Grace Charles, Novi Quadrianto

Developing data-driven solutions that address real-world problems requires an understanding of these problems’ causes and of how their interactions affect the outcome, often with only observational data. Causal Bayesian Networks (BNs) have been proposed as a powerful method for discovering and representing causal relationships from observational data as a Directed Acyclic Graph (DAG). BNs could be especially useful for research in global health in Lower and Middle Income Countries, where there is an increasing abundance of observational data that could be harnessed for policy making, program evaluation, and intervention design. However, BNs have not been widely adopted by global health professionals, and in real-world applications, confidence in the results of BNs generally remains inadequate. This is partially due to the inability to validate against a ground truth, as the true DAG is not available. This is especially problematic if a learned DAG conflicts with pre-existing domain doctrine. Here we conceptualize and demonstrate the idea of a “Causal Datasheet” that could approximate and document BN performance expectations for a given dataset, aiming to provide confidence and sample size requirements to practitioners. To generate results for such a Causal Datasheet, a tool was developed that can generate synthetic Bayesian networks and their associated synthetic datasets to mimic real-world datasets. The results given by well-known structure learning algorithms and a novel implementation of the OrderMCMC method using the Quotient Normalized Maximum Likelihood score were recorded. These results were used to populate the Causal Datasheet, and recommendations could be made depending on whether expected performance met user-defined thresholds. We present our experience in the creation of Causal Datasheets to aid analysis decisions at different stages of the research process.
First, one was deployed to help determine the appropriate sample size of a planned study of sexual and reproductive health in Madhya Pradesh, India. Second, a datasheet was created to estimate the performance of an existing maternal health survey we conducted in Uttar Pradesh, India. Third, we validated generated performance estimates and investigated current limitations on the well-known ALARM dataset. Our experience demonstrates the utility of the Causal Datasheet, which can help global health practitioners gain more confidence when applying BNs.
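
The synthetic-benchmark idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' tool: the function names, the restriction to binary variables, the uniformly random conditional probability tables, and the use of structural Hamming distance as the comparison metric are all assumptions made for illustration. The sketch generates a random DAG, forward-samples a dataset from it, and scores a candidate structure against the known ground truth; a structure learning algorithm would supply the candidate in practice.

```python
import random
from itertools import product

# Hypothetical helper names; none of these come from the paper's tool.

def random_dag(n_nodes, edge_prob, rng):
    """Return directed edges (i, j) with i < j, which guarantees acyclicity."""
    return {(i, j) for i in range(n_nodes) for j in range(i + 1, n_nodes)
            if rng.random() < edge_prob}

def random_cpts(n_nodes, edges, rng):
    """For each binary node, map every parent configuration to P(node = 1)."""
    parents = {j: sorted(i for (i, k) in edges if k == j) for j in range(n_nodes)}
    cpts = {j: {cfg: rng.random()
                for cfg in product((0, 1), repeat=len(parents[j]))}
            for j in range(n_nodes)}
    return parents, cpts

def forward_sample(n_nodes, parents, cpts, n_rows, rng):
    """Draw rows by sampling each node given its already-sampled parents."""
    rows = []
    for _ in range(n_rows):
        row = [0] * n_nodes
        for j in range(n_nodes):  # 0..n-1 is a topological order by construction
            cfg = tuple(row[p] for p in parents[j])
            row[j] = 1 if rng.random() < cpts[j][cfg] else 0
        rows.append(row)
    return rows

def structural_hamming_distance(true_edges, learned_edges):
    """Count missing, extra, and reversed edges (a reversal counts once)."""
    true_skeleton = {frozenset(e) for e in true_edges}
    learned_skeleton = {frozenset(e) for e in learned_edges}
    missing_or_extra = len(true_skeleton ^ learned_skeleton)
    reversed_edges = sum(1 for (i, j) in learned_edges if (j, i) in true_edges)
    return missing_or_extra + reversed_edges

# Example: a 5-node synthetic network and a 200-row synthetic dataset.
rng = random.Random(0)
edges = random_dag(5, 0.4, rng)
parents, cpts = random_cpts(5, edges, rng)
data = forward_sample(5, parents, cpts, 200, rng)
```

Repeating this over many synthetic networks and sample sizes, and recording the distance between learned and true structures, is one way the kind of performance expectation described above could be tabulated.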

Funding

EthicalML: Injecting Ethical and Legal Constraints into Machine Learning Models; G2034; EPSRC-ENGINEERING & PHYSICAL SCIENCES RESEARCH COUNCIL; EP/P03442X/1

BayesianGDPR - Bayesian Models and Algorithms for Fairness and Transparency; G2903; EUROPEAN UNION

History

Publication status

Published

File Version

Published version

Journal

Frontiers in Artificial Intelligence

ISSN

2624-8212

Publisher

Frontiers Media

External DOI

https://doi.org/10.3389/frai.2021.612551

Volume

4

Page range

1-18

Article number

a612551

Department affiliated with

Informatics Publications

Full text available

Yes

Peer reviewed?

Yes

Legacy Posted Date

2021-04-23

First Open Access (FOA) Date

2021-04-23

First Compliant Deposit (FCD) Date

2021-04-23

Keywords

Machine Learning; Bayesian Networks; Datasheets

Licence

CC BY 4.0

