Threshold KNN-Shapley: A Linear-Time and Privacy-Friendly Approach to Data Valuation

Wang, Jiachen T.; Zhu, Yuqing; Wang, Yu-Xiang; Jia, Ruoxi; Mittal, Prateek

Computer Science > Machine Learning

arXiv:2308.15709 (cs)

[Submitted on 30 Aug 2023 (v1), last revised 26 Nov 2023 (this version, v2)]

Abstract: Data valuation aims to quantify the usefulness of individual data sources in training machine learning (ML) models, and is a critical aspect of data-centric ML research. Despite its importance, however, data valuation faces significant yet frequently overlooked privacy challenges. This paper studies these challenges with a focus on KNN-Shapley, one of the most practical data valuation methods currently available. We first highlight the inherent privacy risks of KNN-Shapley and demonstrate the significant technical difficulties in adapting it to accommodate differential privacy (DP). To overcome these challenges, we introduce TKNN-Shapley, a refined variant of KNN-Shapley that is privacy-friendly and allows straightforward modifications to incorporate a DP guarantee (DP-TKNN-Shapley). We show that DP-TKNN-Shapley has several advantages and offers a superior privacy-utility tradeoff compared to naively privatized KNN-Shapley in discerning data quality. Moreover, even non-private TKNN-Shapley achieves performance comparable to that of KNN-Shapley. Overall, our findings suggest that TKNN-Shapley is a promising alternative to KNN-Shapley, particularly for real-world applications involving sensitive data.
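The abstract does not spell out the utility function or the linear-time algorithm, so the following is only a minimal sketch of the general idea under stated assumptions. It assumes a hypothetical threshold-KNN utility (the fraction of training points within a distance `tau` of a single query point that share the query's label, falling back to a random-guess score of `1 / num_classes` when no point is within range) and computes exact Shapley values by brute-force enumeration over subsets. The function names, the 1-D toy data, and the fallback rule are illustrative assumptions, and the exponential-time enumeration is not the paper's method.

```python
from itertools import combinations
from math import comb


def tknn_utility(subset, train, query, tau, num_classes):
    """Hypothetical threshold-KNN utility for one query point.

    Assumption: score = fraction of in-range training points whose label
    matches the query label; 1/num_classes if nothing is within tau.
    """
    qx, qy = query
    neighbors = [train[i] for i in subset if abs(train[i][0] - qx) <= tau]
    if not neighbors:
        return 1.0 / num_classes
    return sum(1 for _, y in neighbors if y == qy) / len(neighbors)


def exact_shapley(train, query, tau, num_classes):
    """Exact Shapley values by enumerating all subsets (exponential time)."""
    n = len(train)
    values = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for k in range(n):
            # Standard Shapley weight for subsets of size k out of n - 1 others.
            weight = 1.0 / (n * comb(n - 1, k))
            for s in combinations(others, k):
                with_i = tknn_utility(s + (i,), train, query, tau, num_classes)
                without_i = tknn_utility(s, train, query, tau, num_classes)
                phi += weight * (with_i - without_i)
        values.append(phi)
    return values


# Toy 1-D data: (feature, label). The third point is a mislabeled neighbor
# and the fourth lies outside the threshold.
train = [(0.0, 0), (0.1, 0), (0.2, 1), (5.0, 1)]
query = (0.05, 0)
values = exact_shapley(train, query, tau=0.5, num_classes=2)
print(values)
```

In this toy setup, the point outside the threshold never changes the utility and so receives a value of zero, while the mislabeled in-range point receives a negative value. This kind of locality, where only points within the threshold matter, is what a threshold-based utility makes possible; how the paper exploits it for linear-time computation and differential privacy is not described in the abstract.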

Comments:	NeurIPS 2023 Spotlight
Subjects:	Machine Learning (cs.LG); Cryptography and Security (cs.CR); Computer Science and Game Theory (cs.GT); Machine Learning (stat.ML)
Cite as:	arXiv:2308.15709 [cs.LG]
	(or arXiv:2308.15709v2 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2308.15709

Submission history

From: Jiachen T. Wang
[v1] Wed, 30 Aug 2023 02:12:00 UTC (5,173 KB)
[v2] Sun, 26 Nov 2023 04:32:25 UTC (5,174 KB)
