Cross-Care: Assessing the Healthcare Implications of Pre-training Data on Language Model Bias

Chen, Shan; Gallifant, Jack; Gao, Mingye; Moreira, Pedro; Munch, Nikolaj; Muthukkumar, Ajay; Rajan, Arvind; Kolluri, Jaya; Fiske, Amelia; Hastings, Janna; Aerts, Hugo; Anthony, Brian; Celi, Leo Anthony; La Cava, William G.; Bitterman, Danielle S.

arXiv:2405.05506 (cs)

[Submitted on 9 May 2024 (v1), last revised 24 Jun 2024 (this version, v2)]

Abstract: Large language models (LLMs) are increasingly essential in processing natural languages, yet their application is frequently compromised by biases and inaccuracies originating in their training data. In this study, we introduce Cross-Care, the first benchmark framework dedicated to assessing biases and real-world knowledge in LLMs, specifically focusing on the representation of disease prevalence across diverse demographic groups. We systematically evaluate how demographic biases embedded in pre-training corpora like The Pile influence the outputs of LLMs. We expose and quantify discrepancies by juxtaposing these biases against actual disease prevalences in various U.S. demographic groups. Our results highlight substantial misalignment between LLM representation of disease prevalence and real disease prevalence rates across demographic subgroups, indicating a pronounced risk of bias propagation and a lack of real-world grounding for medical applications of LLMs. Furthermore, we observe that various alignment methods minimally resolve inconsistencies in the models' representation of disease prevalence across different languages. For further exploration and analysis, we make all data and a data visualization tool available at: this http URL.

Comments:	Submitted for review, data visualization tool available at: this http URL
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2405.05506 [cs.CL]
	(or arXiv:2405.05506v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2405.05506

Submission history

From: Shan Chen
[v1] Thu, 9 May 2024 02:33:14 UTC (6,850 KB)
[v2] Mon, 24 Jun 2024 23:17:52 UTC (8,430 KB)
