Showing 1–9 of 9 results for author: Nargesian, F

  1. arXiv:2404.07354  [pdf, other]

    cs.DB cs.CY cs.LG

    FairEM360: A Suite for Responsible Entity Matching

    Authors: Nima Shahbazi, Mahdi Erfanian, Abolfazl Asudeh, Fatemeh Nargesian, Divesh Srivastava

    Abstract: Entity matching is one of the earliest tasks that occur in the big data pipeline and is alarmingly exposed to unintentional biases that affect the quality of data. Identifying and mitigating the biases that exist in the data or are introduced by the matcher at this stage can contribute to promoting fairness in downstream tasks. This demonstration showcases FairEM360, a framework for 1) auditing the o…

    Submitted 18 July, 2024; v1 submitted 10 April, 2024; originally announced April 2024.

  2. arXiv:2307.02726  [pdf, other]

    cs.DB cs.CY cs.LG

    Through the Fairness Lens: Experimental Analysis and Evaluation of Entity Matching

    Authors: Nima Shahbazi, Nikola Danevski, Fatemeh Nargesian, Abolfazl Asudeh, Divesh Srivastava

    Abstract: Entity matching (EM) is a challenging problem studied by different communities for over half a century. Algorithmic fairness has also become a timely topic to address machine bias and its societal impacts. Despite extensive research on these two topics, little attention has been paid to the fairness of entity matching. Towards addressing this gap, we perform an extensive experimental evaluation…

    Submitted 5 July, 2023; originally announced July 2023.

    Comments: Accepted to VLDB'23

  3. arXiv:2304.10572  [pdf, other]

    cs.DB

    KOIOS: Top-k Semantic Overlap Set Search

    Authors: Pranay Mundra, Jianhao Zhang, Fatemeh Nargesian, Nikolaus Augsten

    Abstract: We study the top-k set similarity search problem using semantic overlap. While vanilla overlap requires exact matches between set elements, semantic overlap allows elements that are syntactically different but semantically related to increase the overlap. The semantic overlap is the maximum matching score of a bipartite graph, where an edge weight between two set elements is defined by a user-defi…

    Submitted 20 April, 2023; originally announced April 2023.

  4. arXiv:2303.00940  [pdf, other]

    cs.DB

    Sampling over Union of Joins

    Authors: Yurong Liu, Yunlong Xu, Fatemeh Nargesian

    Abstract: Data scientists often draw on multiple relational data sources for analysis. A standard assumption in learning and approximate query answering is that the data is a uniform and independent sample of the underlying distribution. To avoid the cost of join and union, given a set of joins, we study the problem of obtaining a random sample from the union of joins without performing the full join and un…

    Submitted 9 March, 2023; v1 submitted 1 March, 2023; originally announced March 2023.

    Comments: 14 pages, 6 figures

  5. arXiv:2301.04901  [pdf, other]

    cs.DB cs.IR

    Pylon: Semantic Table Union Search in Data Lakes

    Authors: Tianji Cong, Fatemeh Nargesian, H. V. Jagadish

    Abstract: The large size and fast growth of data repositories, such as data lakes, have spurred the need for data discovery to help analysts find related data. The problem has become challenging as (i) a user typically does not know what datasets exist in an enormous data repository; and (ii) there is usually a lack of a unified data model to capture the interrelationships between heterogeneous datasets from…

    Submitted 13 January, 2023; v1 submitted 12 January, 2023; originally announced January 2023.

    Comments: Version submitted to the third round of ICDE 2023 on October 8, 2022

  6. arXiv:2011.14460  [pdf, other]

    cs.DS

    AWLCO: All-Window Length Co-Occurrence

    Authors: Joshua Sobel, Noah Bertram, Chen Ding, Fatemeh Nargesian, Daniel Gildea

    Abstract: Analyzing patterns in a sequence of events has applications in text analysis, computer programming, and genomics research. In this paper, we consider the all-window-length analysis model which analyzes a sequence of events with respect to windows of all lengths. We study the exact co-occurrence counting problem for the all-window-length analysis model. Our first algorithm is an offline algorithm t…

    Submitted 29 November, 2020; originally announced November 2020.

    ACM Class: F.2.0; E.m

  7. arXiv:2008.01208  [pdf, other]

    cs.DB

    Knowledge Translation: Extended Technical Report

    Authors: Bahar Ghadiri Bashardoost, Renée J. Miller, Kelly Lyons, Fatemeh Nargesian

    Abstract: We introduce Kensho, a tool for generating mapping rules between two Knowledge Bases (KBs). To create the mapping rules, Kensho starts with a set of correspondences and enriches them with additional semantic information automatically identified from the structure and constraints of the KBs. Our approach works in two phases. In the first phase, semantic associations between resources of each KB are…

    Submitted 3 August, 2020; originally announced August 2020.

    Comments: Extended technical report of "Knowledge Translation" paper, accepted in VLDB 2020

  8. arXiv:1812.07024  [pdf, other]

    cs.DB

    Data Lake Organization

    Authors: Fatemeh Nargesian, Ken Q. Pu, Bahar Ghadiri Bashardoost, Erkang Zhu, Renée J. Miller

    Abstract: We consider the problem of creating a navigation structure that allows a user to most effectively navigate a data lake. We define an organization as a graph that contains nodes representing sets of attributes within a data lake and edges indicating subset relationships among nodes. We present a new probabilistic model of how users interact with an organization and define the likelihood of a user f…

    Submitted 2 March, 2020; v1 submitted 17 December, 2018; originally announced December 2018.

  9. arXiv:1603.07410  [pdf, other]

    cs.DB

    LSH Ensemble: Internet-Scale Domain Search

    Authors: Erkang Zhu, Fatemeh Nargesian, Ken Q. Pu, Renée J. Miller

    Abstract: We study the problem of domain search where a domain is a set of distinct values from an unspecified universe. We use Jaccard set containment, defined as $|Q \cap X|/|Q|$, as the relevance measure of a domain $X$ to a query domain $Q$. Our choice of Jaccard set containment over Jaccard similarity makes our work particularly suitable for searching Open Data and data on the web, as Jaccard similarit…

    Submitted 23 July, 2016; v1 submitted 23 March, 2016; originally announced March 2016.

    Comments: To appear in VLDB 2016

    ACM Class: H.2.5; H.3.3; H.3.1