Guarding Digital Privacy
Guarding Digital Privacy
Abstract
User profiling, the practice of collecting user information for personalized rec-
ommendations, has become widespread, driving progress in technology. However,
this growth poses a threat to user privacy, as devices often collect sensitive
data without their owners’ awareness. This article aims to consolidate knowl-
edge on user profiling, exploring various approaches and associated challenges.
Through the lens of two companies sharing user data and an analysis of 18 pop-
ular Android applications in India across various categories, including Social,
Education, Entertainment, Travel, Shopping and Others, the article unveils pri-
vacy vulnerabilities. Further, the article propose an enhanced machine learning
framework, employing decision trees and neural networks, that improves state-
of-the-art classifiers in detecting personal information exposure. Leveraging the
XAI (explainable artificial intelligence) algorithm LIME (Local Interpretable
Model-agnostic Explanations), it enhances interpretability, crucial for reliably
identifying sensitive data. Results demonstrate a noteworthy performance boost,
achieving a 75.01% accuracy with a reduced training time of 3.62 seconds for
neural networks. Concluding, the paper suggests research directions to strengthen
digital security measures.
Keywords: User profiling, Privacy Leak, Decision tree, Neural network, Explainable
artificial intelligence
1
1 Introduction
Personalized systems are designed to address the issue of information overload for
users searching for specific queries. They achieve this by customizing information to
individual users based on their profiles [1]. A user profile being a compilation of infor-
mation associated with a user, can help to achieve this goal and to attain target of
accurate recommendations and ultimately deliver personalized information.
As of year 2024, a vast array of data companies exists that track people across all
aspects of their lives, both online or offline, collecting and accumulating an unprece-
dented volume of consumer data. This data provides insight into the behaviour and
demographics of consumers and enable companies to analyze the patterns that can
lift the overall customer experience. Consequently this data has become a valuable
economic asset. For decades, regulatory authorities, journalists and civil society have
mentioned the lack of transparent policies by the companies regarding buying and
selling of their customer’s data.
In the data ecosystem, different companies continuously trade digital user profiles
with one another. These data companies combine and link data from various devices
of users such as computers, mobiles and diverse IoT devices. Every click on a web-
site transmits information using invisible script embedded in web pages to hundreds
of third-party companies. These invisible codes, called trackers, record information
about a user visiting a website and collect data on how a user interacts with the visited
pages. When the same tracker is present across numerous websites, it can develop a
detailed profile of user’s online activities and behaviour [2]. Similarly, every swipe on a
smartphone may trigger these hidden data sharing mechanisms, collecting rich infor-
mation about users. This information flow is not only limited to device manufacturers
and application owners but also extends to a significant number of other third-party
companies.
The increasing number of smart devices, has made the collection of personal data
a greater threat to user’s privacy. In order to provide services through smartphones,
applications require access to phone features such as the camera and microphone.
However, many applications ask for permissions to access sensitive system resources
(e.g., sensors like microphone, camera and GPS), personal information of the user
(e.g., email address and contact list), and also unique identifiers (e.g,. IMEI number)
that are not required to provide services but to track users [3]. Vulnerabilities in
the applications or operating systems of the devices can expose personal data. For
instance, whatsApp vulnerability (CVE-2019-3568 [4]) allowed attackers to remotely
install surveillance software on phones by calling the targeted device. This allowed the
installation of spyware, compromising user’s data on device, and potentially enabling
further attacks. These kind of vulnerabilities creates a loophole for hackers and makes
user’s sensitive and personal information more vulnerable.
This study covers everything about how user profiling works and why it matters for
privacy and security. Other studies usually focus on just one part—either user profiling
process [1], [5], [6], [7], [8] or why it might be a problem for privacy [9],[10],[11],[12].
This research provides a complete repository that explains all about user profiling and
talks about how it affects privacy and security. Rest of the article follows the following
structure. Section II covers objectives of this paper and Section III provides a review of
2
literature in the domain of user profiling. Section IV describes user profiling and its use-
cases. Section V illustrates the user profiling process while section VI provides study on
two data brokers. Privacy and security concerns regarding user profiling are presented
in Section VII. Section VIII describes an experimental study and implementation of
framework to detect privacy leakage from a variety of Android applications. Section
IX illustrates some open research areas. Article is concluded in Section X.
2 Objectives
This article conducts an extensive review of recent contributions exploring the latest
practices and their implications. The contributions of this article can be considered in
four parts as outlined below:
i) Review of user profiling, including methods, types, models, and processes, along
with highlighting challenges in current mechanisms.
ii) Analysis of privacy and security issues present in the user profiling process. This
study explores how thousands of commercial organizations collect, trade and make
use of personal data, and influence the lives of billions of people. Based on the review
of literature and articles, a study of two data collecting companies that operate by
collecting, analyzing and selling user’s data is presented.
iii) An experimental study of several mobile applications used in India is covered
to detect leaks of user’s personal data by intercepting the network traffic. Further, ML
classifiers that improve state-of-the-art frameworks in detecting personal information
exposures is presented and then XAI algorithm LIME is used to provide explanations
for the results generated by the classifiers.
iv) To the end, study provides several open research directions for future work.
3 Related Work
We briefly review some of the prominent approaches related to user profiling in this
section, highlighting the existing shortcomings. We short-listed these papers based on
relevance and clarity of understanding. Keyword search of “user profiling”, “privacy
leak”, “personal data leakage” etc, assisted us in the short-listing process to identify
relevance of different existing works to our text.
Some studies have employed diverse platforms like social media and smart devices
to deduce user actions, in order to develop user profiles and provide relevant recom-
mendations. Table 1 provides summary of some of these studies. Table 2 presents a
compilation of several studies aiming to address the security and privacy concerns asso-
ciated with data collection for user profiling. These studies encompass various aspects,
such as detecting potential disclosure of sensitive information from user devices and
identifying embedded trackers in smart devices. Several other studies have focused on
user profiling through surveys, and thus we present comprehensive information about
these investigations.
Stewart et al. [16], conducted a survey on methods for filtering vast amounts of
information from electric sources, including the internet, to create user profiles. These
techniques includes statistical term-based, neural networks and social filtering. [17] dis-
cussed user modelling techniques for constructing and representing profiles for social
3
Table 1: Summary of works in building user profiles
Ref. Objective Data Methodology Gap noted
Source(s)
[13] Build multidi- Twitter • Developed model to represent dynamic connections Scale of the experiments or the
mensional profile • Ranked list of users generated to find relevant diversity of used data is not
of users explained.
information based on individual requirements.
[8] Develop frame- - Used deep learning to predict user’s traits to create • Used small datasets.
work to fuse profile. • Not considered privacy and
diverse informa-
tion from various ethical aspects of social
sources. media data.
[7] Create smart-TV In-built • Detect faces in front of TV. • Unsuitability for public
based recommen- resources of • Generate anonymous and consolidated user and places.
dation system. smart TV. group profiles. • Accuracy issue due to low
• Create item profiles using stored videos, live chan-
nels, EPG data, and other sources. brightness.
• Recommend items to user by comparison between
two types of profiles.
[15] Show encryption IoT devices Separate traffic into packet streams and • Only focused on passive network threat
alone is insuffi- label it by type of device and then corre- model
cient to preserve lating traffic rates with user interactions to • Identify same manufacturer device using
privacy of smart infer consumer behaviour.
homes DNS is limitation.
[10] Check presence of OTT TV Built crawler to interact with OTT chan- • Only focus on Roku Express and Ama-
trackers in OTT streaming nels. zon Fire TV Stick
devices. devices • Crawler has restricted capability.
[12] Identifying PII Android apps. Used Federated learning to classify outgo- • Not handled data and system hetero-
or ad requests in ing HTTP packets. geneity.
HTTP packets. • Didn’t address real-device resource con-
straint problem
• Didn’t consider possibility of attackers
among clients in a sub-network
[11] Reveal PII leak- Popular Automated the disclosure of PII leakage • Used manual labeling and crowd-
age and give users apps on iOS, using machine learning on a cross-platform sourcing from a specific group of users.
control over their Android, and scale. • Relied on a centralized server model,
data. Windows.
raising privacy concerns
media platforms, highlighting their strengths and weaknesses and providing a vision
for future research, for example, creation of more dynamic and intelligent profiles to
receive more appropriate results from users’ profiles. [18] reviewed the latest advance-
ments in user profiling, including methods, features, and taxonomies. It explored
techniques such as data acquisition, feature extraction, and profiling approaches, along
with performance metrics. The survey also addressed challenges like dataset size, the
4
cold start problem, and domain dependencies. Furthermore, it outlined future research
areas, such as developing versatile, dynamic, language-independent user profiles.
[19] reviewed studies related to profiling smartphone users through ordinary apps,
presenting a general framework for learning user information from smartphone applica-
tions. They included the method of data collection, pre-possessing, and user profiling,
with implications and suggestions for improving business services, user experience,
and profits, and developing mobile context-aware tools to improve the quality of life of
users in different aspects. Another work by [20] provided an overview of profiling users
on social networks. Since the data available on the web ranges from semi-structured
to unstructured, various approaches were presented to profile users in online social
networks. These approaches included clustering, face detection, user activities, con-
tent analysis and behavioral analysis. Authors [21] conducted a systematic mapping
study of profiling users based on reviews, presenting the latest trends in user profile
modeling and analysis.
[22] examined intrusion detection and prevention systems from the standpoint
of exploiting behavior including system behaviours that is generated by hosts and
networks and relate to the host activities and network status and user behaviours
that relate to the direct interaction between the user and the system, for example,
typing patterns. These behaviors were examined to determine whether a user was
legitimate to be on the system i.e. review of intrusion detection and prevention systems
for profiling users. In the first step of profiling users, the behavior was analyzed,
then categorised into system behaviors. As a result of this classification, data profiles
were then divided into system profiles and user profiles, with the latter being further
categorized into more specific categories namely biometric and psychometric profiles
based on their characteristics. A summary was then provided of the advantages and
limitations of these specific profiles and related analysis techniques.
In contrast to these studies, our work presents an examination of the user profiling
process pipeline reviewing latest work, starting with deciding the context of profiling
users, collecting data from various platforms using numerous approaches, and finally
constructing the profile using different techniques. This is followed by security and
privacy implications that profiling has on user’s life. Also, a case-study on two com-
panies has been carried out which are involved in profiling process and establishing
links between a profile to an individual on web. The identification of privacy leakage
from android apps justifies the data exfiltration process that occurs from user’s devices
without their prior consent and knowledge. Next we discuss in detail user profiling
process along with its usecases in Section V.
4 User Profiling
User profiling refers to the process of extracting data about the interests, preferences,
behaviors, and needs of a user, which can be used to identify an individual. Various
interpretations of user profiling have been proposed in the literature, depending on
the context in which it is used. For instance, in work by [23], user profiles are designed
on a mobile cloud environment to provide distributed IT services and resources to
users based on context information. Authors of this work manifests user profile as
5
information of user and service consumed by them. Information like user ID, user
name, hobbies, personal desire, and other details are stored in user’s profile, while the
service information part stored data about IT services used, such as service name,
context, provider, and frequency of access, etc.
6
them to target specific section of society for a general cause like women health care
hygiene. This gives governmental sector organizations enormous capabilities for their
e-government strategies [27]. Apart from providing e-services, the government makes
use of user profiling by collecting citizens data for various affairs, such as predict-
ing crimes and illicit activities based on citizens’ behavioural profiles developed over
time. Also, there are organizations that use internal systems to track and recommend
services for its employees based on their profile. For instance, IBM [28] utilizes data
analytics and AI algorithms to match employees’ profile with relevant projects and
opportunities. Accurately representing the user profile is crucial to obtain the best
results that closely align with their preferences. The next section will explain the
details of the user profiling process.
7
Fig. 1: Taxonomy of user profiling
8
and accelerometer sensors of smartphones to gather user data and create a behav-
ioral model. The authors introduced a context-generation method that continuously
updates a trust score, indicating the likelihood of users engaging in activities simi-
lar to their regular patterns. Whereas, [31] used smart watch and smart phone to get
details of user’s activity and accelerometers sensor data to categorize user behaviour.
Web is an important source of data. Within seconds of a user clicking on a website
or interacting with a mobile application, trackers embedded in website/app profile a
user. One usually needs to create an account in order to use services of most web
platforms. User personal information such as their name, email id, birthday, profile
picture, and possibly even the financial information are all linked to his account.
All the interactions a user has while using social media service and e-marketplace,
including the type of content or product user finds interesting and shared by him,
interactions with other users and advertisements the user interacts with and responds
to are identified with this account. A profile can be developed based on the data a
user provides, and also their interactions. [13] and [32] created a user profile from the
personal information that is shared by users on social network sites. Websites can also
study their users’ shopping habits, how did they find their website, and if they are
interested in advertisements on pages they visit. With help of IP addresses, cookies,
and small image files called tracking pixels or web beacons, most of this information
can be reverse engineered. By combining information from different websites, a detailed
profile of a user can be generated. For example, [33] and [34] generated user profiles
from data gathered from user’s web search with an aim to get insights into user
behavior patterns.
Data collected from various platforms can be used by businesses to gain a deep
understanding of their customers and provide them with personalized services.
9
collecting user preferences. They consulted users for their preferences using the explicit
approach, while the implicit approach discovers user preferences through MovieLens
datasets, wherein a mapping between MovieLens genres and user preferences is estab-
lished based on their semantic meanings and relationship. Next, we will expound
upon the methodology of formulating and modelling the user profile utilizing diverse
techniques.
10
nodes represent abstract topics of interest to the user. Profiles are represented as
vectors of weighted features, where features represent concepts instead of individual
words or phrases. [40] proposed the use of hierarchical concepts, which allows for
generalization. The hierarchy of concepts can either be fixed or dynamic, depending
on the user’s interest.
4. Graph-Based: Graph-based representations, however, focus on utilizing graphs or
networks to represent and analyze user-related information without specific empha-
sis on semantic relations or abstract concepts. Nodes in these graphs signify entities
or concepts, while edges represent relationships or connections between them. This
approach captures intricate relationships, dependencies, and interactions within
a user’s profile. For instance, Chen et al. [41] proposed a method for enhancing
user profiling within sequential recommender systems using two layers of graphs: a
global graph to record transitions among various behaviors across all users, and a
personalized graph to model individual user interactions and preferences.
11
content of the item by extracting keywords from the product descriptions. These
are more commonly used in text-intensive domains [6, 43]. Also, analyzing limited
content can be challenging and may reduce the performance of the approach,
particularly in domains that involve multimedia content like images and audio.
• Collaborative filtering: This approach operates on the idea that similar users
will have similar preferences. Therefore, this method clusters similar interest
users into groups. It uses an algorithm that aggregates feedback from multiple
users to suggest items that will appeal to target users by considering similarities
between different users [44]. These algorithms fall into two categories: memory-
based and model-based. Memory-based methods, such as correlation analysis,
compare the active user’s profile with similar ones in the database, relying on the
entire user-item matrix to tailor recommendations. Model-based methods learn
from the entire dataset to offer personalized suggestions without needing the
entire dataset in memory during runtime, providing scalability advantages. For
example, Amazon.com utilizes collaborative filtering to personalize web pages
with tailored recommendations based on individual customer interests.
12
Table 3: Survey of works using Machine learn-
ing techniques
Ref. Model Objective
[46] Random for- High-risk user identifi-
est and neural cation model
network
[47], Boosting Student profile model-
[48] algorithms, ing
Naı̈ve Bayes
[49] k-means clus- Recommendation
tering model
[50] PCA Anomaly detection
Table 4: Survey of works using Deep
learning techniques
13
semantic similarities between the geographical proximity of items and user requests.
Meanwhile, [52] focused on predicting user identity links across social networks to
improve recommendations using multi-layer perceptron. Their approach utilized a
heterogeneous graph representation and multiple attention layers for aggregating
user information.
In a different study, [53] proposed a two-stage collaborative filtering recom-
mendation model. This model incorporated a time-aware attention mechanism
for dynamic user preferences and a matching function learning model based on
deep matrix factorization and multiple-layer perceptron for user-item feature inter-
actions. In personalized recommendations, [54] aimed to enhance user profiling
for point-of-interest recommendations using graph neural network and attention
mechanisms. Their architecture included layers for encoding inputs, aggregating
neighborhood entities, and integrating user/location representations.
[55] introduced a temporal CNN approach for continual user representation
learning across tasks, leveraging partial parameters from previous tasks, while
[56] developed a candidate-aware user modeling framework for personalized news
recommendation, using CNN networks to model local click behavior context.
In other domains, [57] employed stacked autoencoders and clustering for user
behavior analysis for power grid user behavior analysis, and [58] employed a vari-
ational autoencoder-based model for user representation learning using end-to-end
learning algorithm. For user modeling in inconsistent client environments, [59]
proposed a federated learning method.
[60] introduced a user-based RNN model for personalized recommendations. In
student performance modeling, [61] developed a personalized federated learning
framework and proposed user profiling using semantic behavior modeling utilizing
attention-based Gated Recurrent Units (GRU). [62], proposed user profiling by
combining semantic behavior modeling with RNNs.
For user preference modeling, [63] utilized a Transformer-based architecture
with self-attention mechanisms for contextualized item embeddings. While [64]
introduced UserBERT for pre-training user models on unlabeled data, employing
contrastive self-supervision and behavior sequence matching tasks.
Once the user profiles have been constructed, they can be leveraged to provide
personalized services to users. The next section will explore two prominent data brokers
that profiles users and then sell their profiles to its clients.
14
provides insight into their current audience. Additionally, this data can assist in the
development of strategies aimed at expanding a company’s customer base [66].
Data management platforms (DMPs) help brokers integrate customer data with
profiles from third-party sources. The platforms synchronize with each other, for
instance by employing cookie synching, to recognize the user across various tracking
entities and match identifiers from disparate systems. DMPs have capacity to recog-
nize users not only on the Web, but also across different devices, by anonymously
matching data from multiple sources [67] using identifiers discussed in Section 6.3.
These identifiers are then transformed into an alphanumeric string via cryptographic
methods. Thus, a user can be recognized as the same individual across various devices
and platforms, and their profile can be connected to more comprehensive data.
Today, users use multiple platforms and devices, making it challenging for data
companies to match user data accurately. Deterministic matching and probabilistic
matching are two main approaches used for this purpose, with deterministic matching
relying on common identifiers like email addresses or mobile identifiers, and proba-
bilistic matching using algorithms and statistical models based on various data pieces.
After review of existing literature and going through articles, details on two data com-
panies are presented for examination and analysis that collect user’s data from diverse
domains and concatenate them into profiles for sharing it further.
6.1 Acxiom
Acxiom is one of the major players in the field of consumer data brokering and collec-
tion, claiming on their website that their primary focus is on providing the necessary
data, technology, and services to enhance customer experiences worldwide [65]. Acxiom
Marketing Solutions (AMS) functions as Acxiom’s data brokerage, while LiveRamp
is a Software-as-a-Service (SaaS) platform that helps marketers utilize their exist-
ing data sets. AMS gathers information from a variety of sources, including publicly
accessible records (such as data on births, deaths, marriages, divorces, and changes
of name and address), website cookies, surveys, and other digital tracking services.
Data is combined and sorted into various user categories. Sorting in this way goes
beyond simple demographics. For example, Bruce Schneier, a privacy expert, noted
in his book “Data and Goliath” that Acxiom categorizes its data collection into cat-
egories like “potential inheritor,” “adult with senior parent” and “households with a
diabetes focus or seniors needs”. These lists can then be sold to companies looking to
market to these specific groups [68]. LiveRamp’s primary service is “data onboarding”
or “offline matching,” which involves linking offline user records to online identifiers
tied to a browser or device. This enables LiveRamp’s clients to target users based on
various behavioral and attitudinal data, such as purchase history, TV show prefer-
ences, website traffic, and smart device usage [69]. According to LiveRamp’s Service
Policy [69], this approach allows advertising firms to display targeted ads to specific
individuals across email, mobile apps, websites, and addressable TVs.
Unique Identity Resolution: Acxiom’s AbiliTec is a highly effective recognition
solution that provides clients with a unique identity resolution service. By using various
offline and online touchpoints and multiple variables, AbiliTec goes beyond the basic
name and address information to achieve more precise recognition of individuals and
15
households. Acxiom’s Data Services API [70] states that the AbiliTec Link not only
connects consumer records across different databases, platforms, and devices, but also
assigns a single link for each record, allowing for a complete view of the user. As
a result, individuals and their families can be identified instantly using the “hashed
entity representation”, which utilizes phone numbers, email addresses, smartphone
IDs, or any combination of these, as well as name, address, city, and zip code.
6.2 Oracle
One of the largest multinational technology companies, Oracle, has been acquiring
several data brokers and consolidating their operations under the umbrella of Oracle
Data Cloud [71]. By merging information from many other data brokers, Oracle has
amassed an extensive repository of consumer data, which it has integrated into a full
service digital marketing platform [72].
To enhance targeted advertising, Oracle integrates offline data from loyalty card
programs, which are used by retail businesses to gather customer information, with
digital media. This connection between offline behavior and online profiles enables a
deeper understanding of consumers. Online profiles are generated using cookies saved
on a consumer’s device from partner websites. With Oracle Data Cloud, consumer
interests are tracked across both online and offline activities and displayed to adver-
tisers. This allows advertisers to utilize first-party and third-party data to create
personalized marketing campaigns and target specific audiences. Additionally, Oracle
provides data and campaign metrics in one convenient location. By using first-party
data, easily accessible third-party data, and new second-party data, companies can
build comprehensive customer profiles with Oracle. Oracle’s website and privacy pol-
icy indicate that personal information may be shared with third-party companies for
commercial purposes. As per Oracle’s privacy policy, user information is collected
through various methods. In certain instances, data is directly gathered using email
addresses provided by users during interactions with Oracle or its partners. In other
cases, unique identifiers like mobile device identifiers or cookie IDs on browsers are
used to gather user data. This information can be associated with interest segments
or profiles. Interest segments refer to a group of users who share similar preferences
or behavior and are utilized for direct marketing by Oracle’s customers. Profiles com-
prise a collection of attributes about a specific user or device, or a group of users or
devices that share similar attributes, which are used for marketing by Oracle’s clients.
A unique ID and matching process for Consumers: After generating data-driven
audiences, a business must establish connections with them, which can be challenging
due to single user using multiple devices. Each browser has its own set of cookies, while
on mobile devices, all applications share a mobile advertisement ID that functions
similarly to a cookie.
Oracle ID Graph (OIDG) connects these ID sources and validates them with accu-
racy against a high-quality data that is known to be true because it is made up
of verified transaction and subscription data [73]. OIDG consolidates all interactions
across various channels in order to develop a single actionable user profile. To iden-
tify users on both online and offline platforms, all of Oracle’s clients send their match
keys (which can be any unique user ID) to the company. An encrypted email is the
16
most common type of match key, as it can be gathered either online or offline. In
addition to Oracle hashed IDs (normalized SHA-256 hashed email address or phone
number), clients can utilize encrypted or hashed unique user IDs (UUIDs) based on
phone numbers, email addresses, physical addresses, client account numbers, and even
IP addresses. Oracle synchronizes match keys with the network of user profiles that are
connected by OIDG. All Oracle clients use OIDG to manage their IDs and attributes.
The data transmitted by various applications is gathered not only by organization
that owns that applications (for example, Facebook is owned by Meta Platforms, Inc.)
but also by various other data companies that work mainly by gathering, tracking and
linking a user to a profile and then selling further to its clients.
User-Identification Methods
Mobile Web
In-App Android IOS Windows
Browsers
• Cookies are small files that websites send to a user’s browser. These files enable
tracking and monitoring of the websites visited by the user and the specific items
they interact with or click on during their browsing sessions. “First-party” cookies,
are generated by the domain visited by a user, and “third-party” cookies, created by
domains other than the one being visited, represent the two types of cookies. Third-
party cookies allow advertising companies to monitor users’ browsing activities.
Another variation of cookie is “ever cookie”, a persistent type of cookie that is
stored in multiple locations within the user’s browser and device. Even if deleted,
an ever cookie has the ability to restore itself from alternative locations.
• Device fingerprinting is the practice of recognizing and monitoring individuals by
analyzing the unique attributes of their devices. This involves collecting various
details like browser version, operating system, language, installed plugins, and
settings to establish a distinctive identifier.
17
• HTML5 Local Storage provides a means to store and gather user information.
There are two types of local storage: “local storage” for data retention without
an expiration date, and “session storage” for temporary data during a specific ses-
sion. In comparison to cookies, HTML5 local storage offers superior capacity and
accessibility, while eliminating the need for web server call [74].
For mobile-based identification, identifiers used to track users are:
• Cookies as explained in browser-based identification methods, are also used in case
of Mobile-browsers.
• For mobile applications, identifiers used are Google’s advertising ID (which is a 32-
digit string of characters), Apple’s IDFA and Open Device Identification Number
(deprecated).
The last classification is platform-based i.e., based on type of platform being used by
the user:
• Android uses Google’s advertising ID. For example, the advertising ID of one of
our android test devices is “de7d3063-968a-4450-932c-7abf87a0a261” which can be
accessed in the device settings under Google −→ Ads.
• iOS uses Apple’s IDFA. For one of our iOS testing device, identifier is “3F946465-
AE34-4E7E-AF7E-C3C4A4647CCA” and can be acessed using an app named “Get
My IDFA” [75].
• Windows has its own advertising ID.
18
they willingly provide their personal information, thereby facilitating data collection
by these platforms.
Fig. 3: Source: Disconnect.me. This figure illustrates the presence of trackers on the
ICICI bank website. Colored domains signify tracking sites, while gray domains may
also track users. Left part: domains informed when the user opens the website without
logging in. Right part: domains informed when the user has logged in. (Accessed Date:
October, 2021)
Data-driven companies can derive substantial benefits from user profiles in various
areas. However, these profiles can also be used against users. Data brokers offer ser-
vice providers access to customer behavior information, often without the customer’s
knowledge. Armed with this information, these service providers can then target cus-
tomers with highly personalized ads, promotions, and even phone calls, all based on
the data provided by the data broker. As digital technology and data collection become
ubiquitous, such practices are becoming increasingly opaque. Users are often unaware
of what personal data about them is collected, analyzed, and shared or sold. Small
businesses may upload their customer data to data brokers, who combine it with online
data collected in real-time from other sources.
A entire section of business has flourished that focuses on exploiting and monetizing
user’s personal data. Users are often unaware of these companies as they do not interact
with them directly. These data companies capture data from every intersection of the
digital world, creating dynamic and fragmented profiles that are distributed across
19
multiple databases. Google engages in data surveillance by continuously tracking a
user’s various activities such as website visits, Gmail account usage, mobile device
movements, and YouTube usage, among others, without the user being fully aware
of its implications. This monitoring can result in the collection of highly personal
data, as demonstrated by the presence of trackers on a bank website even after a
user has logged in, as shown in Figure 3. Companies make no effort to improve the
existing nontransparent play. The degree of corporate control over the current digital
environment is a cause for concern with regard to information privacy. [77] investigated
the transformative impact of digialization creating privacy challenges. Their analysis
emphasized the need for robust solutions to safeguard patient information and pave
the way for responsible, ethical adoption of digital electron healthcare. Profile data
gathered by companies could be potentially abused for personal, organizational, or
political purposes.
The ubiquitous sharing of personal data from apps, websites, devices, and ser-
vices with third-parties and the subsequent use of this data by various companies has
raised concerns. E-commerce companies rely heavily on distributed computing, which
have their data centers spread globally. Data sets containing significant amounts of
personal information about users are gathered, which can be used legitimately or
exploited. For example, Amazon has operations across 12 regions, each with multiple
data centers that are susceptible to physical attacks and persistent cyber-attacks. In
2014, eBay’s customer data was compromised in a cyber-attack that resulted in the
theft of names, email addresses, physical addresses, telephone numbers, birth dates,
and encrypted passwords. The stakes are higher than ever due to the large number of
people involved in personal data security incidents keeping privacy of user at stake.
At Arkansas University, the professional development system was breached causing
exposure of personally identifiable information of individuals [78]. Table 5 displays
some high impact data breach incidents that occurred between 2021 and 2024. These
incidents create opportunities for cyber-criminals to exploit vulnerabilities, such as
through phishing, leading to severe consequences like unauthorized purchases, fund
theft, or identity theft for individuals.
20
Such personal information, as in the case of the InfoTrax breach, can be exploited to
carry out fraudulent activities. Stolen names, addresses, and SSNs could be used by
identity thieves to apply for credit cards in the victim’s name, which could affect the
victim’s credit score when the identity thief fails to pay the credit card bills. Under-
standing the financial and personal information of potential victims, including their
income and other personal details, can aid criminals in committing severe crimes such
as blackmailing or kidnapping. For instance, patients at Vastaamo clinic in Helsinki
were blackmailed after their data was stolen in two breaches. Attackers demanded a
€450,000 ($530,000) bitcoin ransom and published clinical records on the dark web.
Names, contact details, and therapy notes of hundreds were compromised [87]. User
profiles make it convenient to find users that have similar behaviour, making certain
section of the population vulnerable to criminal activities. For example, gathering pro-
files of young girls can enable criminals to morph their public photos and expose them
to pornography.
In 2016, an analytics company called “Cambridge Analytica” gained unauthorized
access to at least 87 million profiles of Facebook users, and utilized this data to create
voter profiles to influence their voting decisions. This company provided voter data and
analytics to both the Trump campaign and the campaign of former national security
adviser John Bolton during the 2016 election cycle. This breach and exploitation of
personal data was considered detrimental to democracy by various legislators [88].
Therefore, the abundance of user information increases the likelihood of attracting
intruders to the system, which increases the threats.
The data exposed in public forum during a particular security incident can be
combined with other data breaches to create detailed profiles of potential victims
by particularly determined attackers. Using this information, they can conduct much
more convincing phishing and social engineering attacks or even commit identity theft
against those whose information has been exposed. In next section, we explain security
policies in place and their challenges.
21
sensitive data during program execution. The key insight here is that as programs
execute, the state of the cache gets updated constantly. Considering the speed of CPU
caches, these side channels can leak large amounts of data quickly. Covert channels
involve intentional efforts by two parties to share system resources without violating
the underlying security mechanism, such as through shared storage. For example, a
communication can involve directly or indirectly writing data to a storage location
and another process directly or indirectly reading that location as in case of shared
storage. Therefore, while security policies are in place, they may not provide sufficient
protection for user data on their devices.
Maintaining user privacy is crucial when conducting profiling, therefore it is imper-
ative to implement necessary security measures to safeguard personal information.
There are policies being placed to protect user’s data in many countries. The Digital
Personal Data Protection Act 2023 of India, for example, aims to regulate the handling
of digital personal data in a manner that respects both individuals’ right to safeguard
their personal information and the necessity to process such data for lawful purposes,
as well as other related or incidental matters [94]. Next, we present an experimental
analysis of data leakage from mobile phones using these identifiers and trackers.
22
whom the information applies to be reasonably inferred by either direct or indirect
means” [95]. The General Data Protection Regulation (GDPR) states that personal
data should only be collected for explicit, legitimate, and specific purposes, and should
not be processed in any way that is incompatible with those purposes. Numerous stud-
ies in the literature focus on controlling user tracking and preventing the unauthorized
extraction of PII from mobile devices. These studies can be broadly categorized into
three groups: permission-based analysis [96], static and dynamic analysis of source
code [97, 98], and network-based analysis [11, 99–101]. In the following section, a tech-
nique is examined for gathering network traffic from smartphones and identifying any
clear text data that might expose sensitive user information and behavior. The exper-
iment entails two stages: collecting the traffic and detecting instances of plain-text
sensitive information.
23
Fig. 4: Data capturing framework
Fig. 5: Leaks captured from Myntra application (Version: 4.2201.1; Accessed Date:
March, 2023 )
24
judgment as crucial in dataset evaluation, we engaged 5 users in labeling the dataset,
and the task is nearly 60% complete.
25
8.5 PII Detection in network traffic
There are many approaches that can be used to analyze the network traffic of mobile
devices in an efficient and secure manner in the existing state-of-the-art. [11] used
machine learning in order to train C4.5 Decision Tree for detecting leaks in a central-
ized manner. Our work [102] analyzed the dataset provided by ReCon and provide
improved versions of classifiers inorder to enhance detection framework.
8.5.2 Methodology
To start our work we first pre-processed the dataset to select the relevant features using
tf-idf. As mentioned earlier, following the elimination of frequently appearing words
in HTTP flows using tf-idf, we proceed to iterate through all features. Probabilities
calculated to map a particular keyword to PII value is depicted in Figure 6. For each
feature, we calculate the confidence level using heuristics, similar to the approach in
the ReCon’s work.
KP II
Heuristic1 : Ptype,key =
Kall
Here, KP II gives number of times the key appeared in flows with positive PII
leakage and Kall gives number of times the key appeared in all flows. Features having
confidence level greater than chosen threshold are selected for further examination.
Figure 7 shows a snap-shot of features selected for a domain named “ea.com” for
android OS.
26
Algorithm 1: Feature extraction and selection
Input : a) Threshold for word frequency: f reqt ,
b) Stop-word list: stop words,
c) Threshold for tf-idf value: tf idft
Output: Selected features for training the model
Define list of separators:
separators ← {‘,’, ‘;’, ‘{}’, ‘[]’, ‘/’,. . . , ‘()’}
Initialize a 2-D matrix V .
for each HTTP flow i = 1, 2, . . . , n do
Wi ← Split(f lowi , separators)
end for
for each word wj ∈ Wi where 1 ≤ i ≤ n do
if wj appears in f lowi then
mark Vi,j as 1
else
mark Vi,j as 0
end if
end for
Randomize any PII values in Wi to obtain a new set Wi′
n
P
Calculate the word frequency for each word wj across all flows: fj = Vi,j
i=1
for each word wj do
if fj < f reqt and wj ∈ / PII then
Remove wj from the list of features.
end if
if fj < f reqt and wj ∈ PII then
Oversample wj to meet the threshold.
end if
Calculate the tf-idf value for each word wj across all flows.
end for
if tf idfi,j > tf idft or wj ∈ stop words then
remove wj from the list of features.
end if
for each f lowi do
Remove all features that appear adjacent to a PII leak in Wi′ .
end for
Output the selected features as a set of binary vectors Vi,j for each f lowi .
To initiate our experiments, we partitioned the ReCon dataset into two sets: a
training set containing 60 domains and a testing set with 12 domains. The train-
ing set comprises approximately 7200 data inputs and 6500 features. Our initial step
involved utilizing a Decision Tree to identify sensitive information in network traffic
packets. Our goal is to evaluate and improve the performance of the Decision Tree
model compared to previous works in detecting PII. We experimented with feature
manipulation, removing certain features from the dataset using various heuristics.
Detecting PII in network flows often requires recognizing intricate patterns and rela-
tionships among different data points. Leveraging the ability of neural networks to
model non-linear relationships, we decided to apply a neural network model to this
task after the Decision Tree experiments. Our neural network model employs the ReLU
activation function in all hidden layers to address vanishing gradient issues during
27
Fig. 6: a) Shows list of probabilities that particular keyword corresponds to a PII
value. b) Shoes list of probabilities calculated for keys in each domain and OS.
back-propagation. For our binary classification problem, the output layer uses the sig-
moid activation function to ensure the output Yi falls within the range [0, 1]. The loss
function L employed in the current context is the binary cross-entropy function.
n
1X
L=− (Yi · log Ŷi + (1 − Yi ) · log(1 − Ŷi ))
n i=1
28
where, vt is the moving average of squared gradients at tth iteration, γ is decay rate
controlling the weights given to previous squared gradients in the moving average.
Therefore, weight update follows:
α (∂L/∂w)
wt+1 = wt − √
vt + ϵ
where, α is the learning rate and ϵ is a small constant typically set in the range of
10−8 to 10−10 . The choice of α determines the time needed for model to converge. If
it is fixed too small, it slows the convergence, on the contrary, a large value of α will
possibly result in oscillation, preventing the error to decrease below a certain value.
In our study [104], upon examination, we found numerous features that are repeti-
tive or duplicate. For instance, features like ‘connection’ and ‘Connection’ are regarded
as distinct features despite having same meaning. Such duplication could adversely
impact the effectiveness of the detection framework.
Moreover, we observed that certain encoded strings, such as
“QVQwOjEuMDozLjA6NjA6akxKZ2d5RHF5NDhoMFVYbHJnanhqZ3B1eWV
CRXFjbFYyZ3E6NTIxMjQ6bmxpYzU”
are incorporated as features, yet they lack meaningful information. Including these
strings as features merely inflates the total number of features without adding any
value in terms of informative content. Hence, we implemented various techniques to
filter out such features, as elaborated below:
1. Removing duplicates: Removing duplicates involved a two-step process. Initially, we
transformed all feature names to lowercase, ensuring uniformity. Subsequently, we
grouped similar features together, effectively eradicating redundant columns from
the dataset.
2. To decode encoded strings and extract insights, we employed a tool named
CyberChef [105], which employs diverse decoding operations to unveil the underly-
ing information. Despite employing several decoding operations, it became apparent
that many encoded strings either failed to reveal meaningful information or did not
pertain to any PII. Consequently, we excluded such features from further analysis.
3. Stemming: We employed Porter’s Stemmer algorithm for suffix stripping to reduce
features to their root forms, a process known as stemming. Following this, we
clustered together all features that shared a common root form. This approach
enabled us to streamline and organize the features according to their linguistic
similarities. For instance, features ‘attempting’ are grouped under feature ‘attempt’.
4. We noticed that some features had lengths significantly longer than typical words,
while others were very short. We theorized that PII words would likely fall within
a certain length range and conform to English language patterns. For example,
URLs or file names often have longer lengths. After experimentation, we found that
setting a threshold of 70 effectively captured all relevant words. Hence, any features
exceeding this length were omitted from further analysis. Additionally, strings with
a length of 1 were also excluded.
After filtering out the relevant features, all experiments were rerun on the refined
dataset. The results of each experiment are detailed in the subsequent section.
29
Global Local
a) b) c)
Fig. 8: LIME explaining the prediction of model f (·) a) Shows the global context
and decision boundary created by complex non-linear model f (·). Grey point is the
predicted instance for which explanations are needed. b) Shows the decision bound-
ary for simple linear model m(·). Idea here is to see neighbourhood and do simple
explanation of local region. Features that are important in global context may not be
important in local area. c) Shows the perturbed data-points (black points). The expo-
nential kernel (πx ) can be considered as a heat map. The perturbed data-points are
weighted according to distance to the predicted instance
8.5.3 Results
1. Decision tree: We employed a similar decision tree (DT) model as proposed by
ReCon on their dataset. Applying the same model to our filtered dataset resulted
in a 62% accuracy on the test data, indicating that feature filtering did not yield
improvements compared to our previous work [102]. After removing domain and
OS features, the classifier’s performance improved, but feature filtering alone did
30
not prove beneficial. Incorporating heuristics-1 and setting a threshold to 0.2, as
in [102], reduced the features from 6,538 to 3,343 but led to a decline in accuracy.
The same model on the filtered dataset yielded a similar accuracy.
We then expanded our test samples, as in [102], while employing heuristics-1.
This approach improved the model’s performance on the filtered dataset to 68%.
This increase may be attributed to the test sample points now demonstrating a
distribution similar to that of the model’s training data points. However, the issue
of overfitting persisted. To address this, we applied cost complexity pruning to
the model, which raised the accuracy to 72.89% as shown in Figure 9. It shows
how accuracy of the decision tree classifier changes with different levels of pruning
(controlled by cost complexity parameter, ‘ccp alpha’) for both the training and
test datasets.
Fig. 9: Performance of decision tree after using cost complexity pruning. Here, x-
axis represents the values of the cost complexity parameter and y-axis represents the
corresponding accuracy scores
2. Neural network:We constructed a neural network (NN) model with 1 input layer
and 6 hidden layers, each containing 2048, 1024, 512, 256, 128, and 64 neurons
respectively. The output layer comprised 1 neuron for binary classification. Follow-
ing the approach of [102], we integrated 9 top features (including domain and OS)
selected via Chi-Square scoring into the model. This yielded a training accuracy of
52.6% and a testing accuracy of 62%, mirroring the results of [102]. Analysis of the
NN’s weights before and after training indicated the model’s inability to effectively
learn.
Subsequently, using 7 features (after excluding domain and OS from the initially
selected features) instead of 9, the NN maintained the same accuracy of 62% as in
[102]. However, unlike the DT model, which achieved 67% accuracy, this discrepancy
suggested that the NN model wasn’t learning optimally.
Following the implementation of Heuristic-1, the training accuracy surged to
92%, but testing accuracy after feature filtering showed improvement to 57% com-
pared to [102]. Nonetheless, these results suggested overfitting on the given dataset.
To address this issue, we reduced the model’s complexity from 6 to 3 hidden layers,
31
akin to [102]. While the training dataset exhibited a 95% accuracy, testing revealed
a 74.5% accuracy, which is still 2.5
Table 7 shows results of the best performing classifiers and indicate a significant
enhancement in the performance of models as a result of filtering the features.
3. Explainablity of obtained results after filetring using LIME: We utilized
LIME to refine the top classifiers in both categories by identifying crucial features
using the following heuristics.
Given a dataset of nval validation samples (subset that helps to tune hyperpa-
rameters and assess the model’s performance on data it hasn’t seen during training)
and d features, let impi,j represent the importance value of feature j in sam-
ple i, where i = 1 to nval and j = 1 to d. Initially, for each sample i, the set
impi,nonzero = {impi,j |impi,j ̸= 0 for j = 1 to d} is defined as the subset of non-
zero feature importance values. These non-zero importance values are then sorted
in descending order to form impi,sorted .
• Heuristic 2 (H2): A threshold is calculated as P ercentile(|impi,sorted |ni=1
val
, 75).
For each sample i, feature j is retained if |impi,j |>threshold
Tnval and included in
selected f eaturesi . Finally, f inal selected f eatures = i=1 selected f eaturesi
• Heuristic 3 (H3): Let ai = ⌈0.2 × d⌉ be the number of features chosen for sample
i, where 0.2 means selecting the top 20% of features, and top featurei be the set of
top ai features selected for sample i. The set Common f eatures = {top featurei |
countai ≥ 0.75 × nval } contains the features common to 75% or more of the
validation samples.
Once the essential features contributing to classifier predictions were identified, the
classifier was trained using only these selected features.
Using the LIME Tabular Explainer [107], we generated explanations for 200
validation samples, covering all 2202 features. Employing Heuristics-2 and 3, we
selected features of higher importance and retrained the best classifiers from Table
7. The results for each scenario are presented in Table 8. Remarkably, the NN classi-
fier, utilizing Heuristic 2, demonstrates comparable performance to the classifier in
[102] using SHAP, considering both training and test accuracy along with training
time. While there is minimal impact on classifier accuracy with reduced features
(38 in H2 and 124 in H3 ), training time is reduced compared to the variations
proposed in [102] using SHAP. Conversely, the DT classifier doesn’t benefit from
LIME, as decision trees are inherently interpretable, and LIME may not provide
significant additional value in this context.
8.5.4 Discussions
The results indicate that eliminating unnecessary features improves the performance
of machine learning models in predicting outcomes. Initially, our experiments followed
ReCon’s approach, where the DT model achieved a 62% accuracy on test data both
before and after filtering. Subsequently, we tested an NN model, initially matching
the DT’s accuracy. However, after removing two features from the dataset before
training, the DT’s accuracy improved from 62% to 67%, and post Heuristic-1, the
DT model exhibited signs of overfitting. In contrast, the NN model maintained a
32
62% accuracy, failing to adapt or learn due to unchanged model weights. We then
employed Heuristics-1 to enhance our training accuracy, but this resulted in over-
fitting. To address this, we simplified the model’s complexity to mitigate overfitting.
Consequently, our testing accuracy for the NN model increased from 47% to 72% after
this adjustment, as in [102], and from 57% to 74.5% after post-filtering. On the other
hand, applying pruning to the DT model yielded an accuracy of 72.89% on testing
data. After applying LIME for feature selection, the accuracy increased to 75.01%,
with a reduced training time of approximately 3.62 seconds for the NN. However, the
DT model without LIME provided superior results.
33
All first-party and third-party domains, as well as trackers present in applica-
tions that users interact with, can access data transmitted by users’ devices. In
this scenario, implementing role-based access systems can be beneficial by grant-
ing access to requesting domains based on specific conditions. For instance, a user
can share data with an application on their device to enable its functionality, while
limiting the access of embedded trackers and third-party domains to that data.
5. Is it possible to create a profit sharing model between users and entities that sell
users’ data?
The current data ecosystem involves various service providers, advertising agen-
cies, and data gathering companies that profit significantly from sharing and selling
individuals’ personal information. Users are often overlooked in this system. Profit
sharing models could be implemented to support the growth of these data-driven
firms while also providing users with a portion of the profits.
To enable users to take control of their data, it is necessary to establish a framework
that grants them the authority to determine the collection and utilization of their data.
This not only ensures a personalized experience with the service but also enables them
to directly benefit from a portion of the revenue generated by the service provider.
10 Conclusion
This article conducts a thorough analysis of user profiling, delving into its security and
privacy implications. It underscores the pros and cons of user profiling, stressing cau-
tion in granting access to personal information. We propose a taxonomy for building
user profiles, offering a valuable resource for researchers. Identifying gaps in existing
literature, we suggest future research directions. As a POC, we examine sensitive data
leakage in applications across different categories. In regions without data protection
laws, users face potential legal gaps in case of data breaches. It is crucial for users
to weigh the benefits and risks of sharing information on platforms, especially in the
absence of legal remedies. Users should stay vigilant, review terms and conditions,
and make informed decisions about data sharing, given continuous corporate mon-
itoring. Developing techniques that gather user profile information while respecting
privacy is essential. Privacy-preserving methods for profiling can enhance user comfort
with data collection. Ongoing efforts, like Apple’s App Tracking Transparency in iOS
14.5, showcase steps toward safeguarding user privacy. A transparent data ecosystem
that informs users about data sharing is crucial. Collaboration between commercial
organizations and users is essential to build a digital world that leverages modern
technologies while prioritizing privacy and security. Some countries, such as India, are
introducing policies to protect their citizens.
References
[1] Gauch, S., Speretta, M., Chandramouli, A., Micarelli, A.: User profiles for
personalized information access. The adaptive web, 54–89 (2007)
34
[2] AliceWyman, Ilias, C., Rodaro, M., TyDraniu, Joni, Ghelman, M., Rok, Gar-
denhire, L., Jeff, Fabi: Trackers and scripts Firefox blocks in Enhanced Tracking
Protection (2023). https://support.mozilla.org/en-US/kb/trackers-and-scripts-
firefox-blocks-enhanced-track
[3] Solanki, R.K., Laxmi, V., Bezawada, B., Gaur, M.S.: Mapperdroid: Verifying
app capabilities from description to permissions and api calls. Computers &
Security 111, 102493 (2021)
[5] Kanoje, S., Girase, S., Mukhopadhyay, D.: User profiling trends, techniques and
applications. ArXiv (2015)
[6] Gao, M., Liu, K., Wu, Z.: Personalisation in web computing and informatics:
Theories, techniques, applications, and future research. Information Systems
Frontiers 12(5), 607–629 (2010)
[7] Alam, I., Khusro, S.: Tailoring recommendations to groups of viewers on smart
tv: A real-time profile generation approach. IEEE Access 8, 50814–50827 (2020)
[8] Farnadi, G., Tang, J., De Cock, M., Moens, M.-F.: User profiling through
deep multimodal fusion. In: Proceedings of the Eleventh ACM International
Conference on Web Search and Data Mining, pp. 171–179 (2018)
[9] Wood, D., Apthorpe, N., Feamster, N.: Cleartext data transmissions in consumer
iot medical devices. In: Proceedings of the 2017 Workshop on Internet of Things
Security and Privacy, pp. 7–12 (2017)
[10] Moghaddam, H., Acar, G., Burgess, B., Mathur, A., Huang, D., Feamster,
N., Felten, E., Mittal, P., Narayanan, A.: Watching you watch: The tracking
ecosystem of over-the-top tv streaming devices, pp. 131–147 (2019)
[11] Ren, J., Rao, A., Lindorfer, M., Legout, A., Choffnes, D.: Recon: Revealing and
controlling pii leaks in mobile network traffic. In: Proceedings of the 14th Annual
International Conference on Mobile Systems, Applications, and Services, pp.
361–374 (2016)
[12] Bakopoulou, E., Tillman, B., Markopoulou, A.: Fedpacket: A federated learn-
ing approach to mobile packet classification. IEEE Transactions on Mobile
Computing 21(10), 3609–3628 (2021)
[13] Zhou, X., Wang, W., Jin, Q.: Multi-dimensional attributes and measures for
dynamical user profiling in social networking environments. Multimedia Tools
and Applications 74 (2014)
35
[14] Das, A., Borisov, N., Caesar, M.: Tracking mobile web users through motion
sensors: Attacks and defenses. In: NDSS (2016)
[15] Apthorpe, N., Reisman, D., Feamster, N.: A smart home is no castle: Privacy
vulnerabilities of encrypted iot traffic (2017)
[16] Stewart, S., Davies, J.: User profiling techniques : A critical review. Electronic
Workshops in Computing (1997)
[17] Abdel-Hafez, A., Xu, Y.: A survey of user modelling in social media websites.
Computer and Information Science 6(4) (2013)
[18] Eke, C.I., Norman, A.A., Shuib, L., Nweke, H.F.: A survey of user profil-
ing: State-of-the-art, challenges, and solutions. IEEE Access 7, 144907–144924
(2019)
[19] Zhao, S., Li, S., Ramos, J., Luo, Z., Jiang, Z., Dey, A.K., Pan, G.: User profil-
ing from their use of smartphone applications: A survey. Pervasive and Mobile
Computing 59, 101052 (2019)
[20] G. U., V., K., S., Deepa Shenoy, P., K. R., V.: An overview on user profiling
in online social networks. International Journal of Applied Information Systems
11(8), 25–42 (2017)
[21] Dong, X., Li, T., Li, X., Song, R., Ding, Z.: Review-based user profiling: A sys-
tematic mapping study. Enterprise, Business-Process and Information Systems
Modeling, 229–244 (2019)
[22] Peng, J., Choo, K.-K.R., Ashman, H.: User profiling in intrusion detection: A
review. Journal of Network and Computer Applications 72, 14–27 (2016)
[23] Jang, C., Chang, H., Ahn, H., Kang, Y., Choi, E.: Profile for effective ser-
vice management on mobile cloud computing. In: International Conference on
Advanced Communication and Networking, pp. 139–145 (2011). Springer
[25] Statista: Digital advertising market share of major companies worldwide 2023
(2023)
[27] Pieterson, W., Ebbers, W., Dijk, J.: The opportunities and barriers of user profil-
ing in the public sector. In: International Conference on Electronic Government,
pp. 269–280 (2005). Springer
36
employment/careersite/mycareer-at-ibm/
[29] Poo, D., Chng, B., Goh, J.-M.: A hybrid approach for user profiling. In: 36th
Annual Hawaii International Conference on System Sciences, 2003. Proceedings
of The, p. 9 (2003). IEEE
[30] Murmuria, R., Stavrou, A., Barbara, D., Sritapan, V.: Your data in your hands:
Privacy-preserving user behavior models for context computation. In: 2017
IEEE International Conference on Pervasive Computing and Communications
Workshops (PerCom Workshops), pp. 170–175 (2017). IEEE
[31] Kwon, M.-C., Choi, S.: User behavior classification based on smart watch and
machine learning algirithm. Contemporary Engineering Sciences 10(7), 345–352
(2017)
[32] Chen, J., Liu, Y., Zou, M.: Home location profiling for users in social media.
Information & Management 53(1), 135–143 (2016)
[33] Farseev, A., Akbari, M., Samborskii, I., Chua, T.-S.: 360◦ user profiling: past,
future, and applications. ACM SIGWEB Newsletter,(Summer) 10, 2956573–
2956577 (2016)
[34] Yang, Y.C.: Web user behavioral profiling for user identification. Decision
Support Systems 49(3), 261–271 (2010)
[35] Raghu, T., Kannan, P., Rao, H.R., Whinston, A.B.: Dynamic profiling of con-
sumers for customized offerings over the internet: A model and analysis. Decision
Support Systems 32(2), 117–134 (2001)
[36] Farid, M., Elgohary, R., Moawad, I., Roushdy, M.: User profiling approaches,
modeling, and personalization. In: Proceedings of the 11th International Con-
ference on Informatics & Systems (INFOS 2018) (2018)
[37] Fakhfakh, R., Feki, G., Ammar, A.B., Amar, C.B.: Personalizing information
retrieval: A new model for user preferences elicitation. In: 2016 IEEE Interna-
tional Conference on Systems, Man, and Cybernetics (SMC), pp. 002091–002096
(2016). IEEE
[38] Sakagami, H., Kamba, T.: Learning personal preferences on online newspaper
articles from user behaviors. Computer Networks and ISDN Systems 29(8-13),
1447–1455 (1997)
[39] Gentili, G., Micarelli, A., Sciarrone, F.: Infoweb: An adaptive information fil-
tering system for the cultural heritage domain. Applied Artificial Intelligence
17(8-9), 715–744 (2003)
[40] Nanas, N., Uren, V., De Roeck, A.: Building and applying a concept hierarchy
37
representation of a user profile. In: Proceedings of the 26th Annual International
ACM SIGIR Conference on Research and Development in Informaion Retrieval.
SIGIR ’03, pp. 198–204. Association for Computing Machinery, New York, NY,
USA (2003). https://doi.org/10.1145/860435.860473 . https://doi.org/10.1145/
860435.860473
[41] Chen, W., He, M., Ni, Y., Pan, W., Chen, L., Ming, Z.: Global and personal-
ized graphs for heterogeneous sequential recommendation by learning behavior
transitions and user intentions, pp. 268–277 (2022)
[42] Banouar, O., Raghay, S.: User profile construction for personalized access to
multiple data sources through matrix completion method. (2016)
[43] Cufoglu, A.: User profiling-a short review. International Journal of Computer
Applications 108(3) (2014)
[44] Nilashi, M., Jannach, D., Ibrahim, O.b., Ithnin, N.: Clustering- and regression-
based multi-criteria collaborative filtering with incremental updates. Elsevier
(2014)
[45] Li, S., Zhao, H.: A survey on representation learning for user modeling. In: Pro-
ceedings of the Twenty-Ninth International Conference on International Joint
Conferences on Artificial Intelligence, pp. 4997–5003 (2021)
[46] Chen, T., Yin, X., Peng, L., Rong, J., Yang, J., Cong, G.: Monitoring and
recognizing enterprise public opinion from high-risk users based on user portrait
and random forest algorithm. Axioms 10(2) (2021)
[47] Hamim, T., Benabbou, F., Sael, N.: Student profile modeling using boost-
ing algorithms. International Journal of Web-Based Learning and Teaching
Technologies (IJWLTT) 17(5), 1–13 (2022)
[48] Tripathi, A., Yadav, S., Rajan, R.: Naive bayes classification model for the
student performance prediction. In: 2019 2nd International Conference on Intel-
ligent Computing, Instrumentation and Control Technologies (ICICICT), vol. 1,
pp. 1548–1553 (2019)
[49] Ouaftouh, S., Zellou, A., Idri, A.: Social recommendation: A user profile
clustering-based approach. Concurrency and Computation: Practice and Expe-
rience 31(20), 5330 (2019)
[50] Bi, M., Xu, J., Wang, M., Zhou, F.: Anomaly detection model of user behavior
based on principal component analysis. Journal of Ambient Intelligence and
Humanized Computing 7, 547–554 (2016)
[51] Qi, Y., Hu, K., Zhang, B., Cheng, J., Lei, J.: Trilateral spatiotemporal attention
38
network for user behavior modeling in location-based search. In: Proceed-
ings of the 30th ACM International Conference on Information & Knowledge
Management, pp. 3373–3377 (2021)
[52] Wang, R., Zhu, H., Wang, L., Chen, Z., Gao, M., Xin, Y.: User identity link-
age across social networks by heterogeneous graph attention network modeling.
Applied Sciences 10(16), 5478 (2020)
[53] Wang, R., Wu, Z., Lou, J., Jiang, Y.: Attention-based dynamic user mod-
eling and deep collaborative filtering recommendation. Expert Systems with
Applications 188, 116036 (2022)
[54] Wang, X., Sun, G., Fang, X., Yang, J., Wang, S.: Modeling spatio-temporal
neighbourhood for personalized point-of-interest recommendation. In: IJCAI,
pp. 3530–3536 (2022)
[55] Yuan, F., Zhang, G., Karatzoglou, A., Jose, J., Kong, B., Li, Y.: One person, one
model, one world: Learning continual user representation without forgetting. In:
Proceedings of the 44th International ACM SIGIR Conference on Research and
Development in Information Retrieval, pp. 696–705 (2021)
[56] Qi, T., Wu, F., Wu, C., Huang, Y.: News recommendation with candidate-aware
user modeling. In: Proceedings of the 45th International ACM SIGIR Conference
on Research and Development in Information Retrieval, pp. 1917–1921 (2022)
[57] Deng, S., Cai, Q., Zhang, Z., Wu, X.: User behavior analysis based on
stacked autoencoder and clustering in complex power grid environment. IEEE
Transactions on Intelligent Transportation Systems 23(12), 25521–25535 (2021)
[58] Fazelnia, G., Simon, E., Anderson, I., Carterette, B., Lalmas, M.: Variational
user modeling with slow and fast features. Proceedings of the Fifteenth ACM
International Conference on Web Search and Data Mining (2022)
[59] Liu, Q., Wu, J., Huang, Z., Wang, H., Ning, Y., Chen, M., Chen, E., Yi, J., Zhou,
B.: Federated user modeling from hierarchical information. ACM Transactions
on Information Systems 41(2), 1–33 (2023)
[60] Donkers, T., Loepp, B., Ziegler, J.: Sequential user-based recurrent neural net-
work recommendations. In: Proceedings of the Eleventh ACM Conference on
Recommender Systems, pp. 152–160 (2017)
[61] Chu, Y.-W., Hosseinalipour, S., Tenorio, E., Cruz, L., Douglas, K., Lan, A., Brin-
ton, C.: Mitigating biases in student performance prediction via attention-based
personalized federated learning. In: Proceedings of the 31st ACM International
Conference on Information & Knowledge Management, pp. 3033–3042 (2022)
[62] Li, M., Han, X., Sheng, H., Ma, L., Kong, H., Liu, W., Mao, B.: A novel RNN
39
model with enhanced behavior semantic for network user profile. In: Tenth
International Conference on Advanced Cloud and Big Data, CBD 2022, Guilin,
China, November 4-5, 2022, pp. 190–193. IEEE, ??? (2022)
[63] Avny Brosh, T., Livne, A., Sar Shalom, O., Shapira, B., Last, M.: Bruce: Bundle
recommendation using contextualized item embeddings. In: Proceedings of the
16th ACM Conference on Recommender Systems, pp. 237–245 (2022)
[64] Wu, C., Wu, F., Qi, T., Huang, Y.: Userbert: Pre-training user model with con-
trastive self-supervision. In: Proceedings of the 45th International ACM SIGIR
Conference on Research and Development in Information Retrieval. SIGIR ’22,
pp. 2087–2092. Association for Computing Machinery, New York, NY, USA
(2022)
[66] Pinchot, J., Chawdhry, A.A., Paullet, K.: Data privacy issues in the age of data
brokerage: An exploratory literature review. Issues in Information Systems 19(3)
(2018)
[68] CIPPIC: Data Broker Profiles: Acxiom and LiveRamp. Online (2019). https://
databrokers.cippic.ca/2019/01/09/data-broker-profiles-acxiom-and-liveramp/
[72] Privacy Bee: These Are the Largest Data Brokers in America. https://
privacybee.com/blog/these-are-the-largest-data-brokers-in-america/
[74] Zawadziński, M., PRO, P., Sweeney, M.: 10. User Identification (2022). https:
//adtechbook.clearcode.cc/
40
[76] ibm Accessed: (9-September-2023) (2021). https://cloud.ibm.com/docs/
services/personality-insights?topic=personality-insights-about
[77] Paul, M., Maglaras, L., Ferrag, M.A., AlMomani, I.: Digitization of healthcare
sector: A study on privacy and security concerns. ICT Express (2023)
[80] Beeferman, J., Washington, J., Tribune, T.: Almost 2 million Tex-
ans affected by Texas Department of Insurance Data Breach. KXAN
Austin (2022). https://www.kxan.com/news/texas/almost-2-million-texans-
affected-by-texas-department-of-insurance-data-breach/
[81] CS Hub: Iotw: Twitter accused of covering up data breach that affects millions.
CS Hub Attacks News (2023)
[84] Gatlan, S.: 200,000 Facebook Marketplace User Records Leaked on Hacking
Forum. https://www.bleepingcomputer.com/news/security/200-000-facebook-
marketplace-user-records-leaked-on-hacking-forum/
[85] III.org: Facts + statistics: Identity theft and cybercrime (2022). https://
www.iii.org/fact-statistic/facts-statistics-identity-theft-and-cybercrime
[87] Osborne, C.: Finnish mental health patients blackmailed after suspected Data
Breach. The Daily Swig (2020). https://portswigger.net/daily-swig/finnish-
mental-health-patients-blackmailed-after-suspected-data-breach
[88] Confessore, N.: Cambridge Analytica and Facebook: The Scandal and the Fallout
So Far. The New York Times (2018). https://www.nytimes.com/2018/04/04/
us/politics/cambridge-analytica-scandal-fallout.html
[89] Adams, J., Almahmoud, H.: The meaning of privacy in the digital era. Interna-
tional Journal of Security and Privacy in Pervasive Computing (IJSPPC) 15(1),
1–15 (2023)
41
[90] Developers, A.: App Manifest Overview : android developers (2023). https://
developer.android.com/guide/topics/manifest/manifest-intro
[91] Felt, A.P., Ha, E., Egelman, S., Haney, A., Chin, E., Wagner, D.A.: Android
permissions: user attention, comprehension, and behavior. In: SOUPS (2012)
[92] Wijesekera, P., Baokar, A., Tsai, L., Reardon, J., Egelman, S., Wagner, D.,
Beznosov, K.: The feasibility of dynamically granted permissions: Aligning
mobile privacy with user preferences. In: 2017 IEEE Symposium on Security and
Privacy (SP), pp. 1077–1093 (2017)
[93] Reardon, J., Feal, Á., Wijesekera, P., On, A.E.B., Vallina-Rodriguez, N., Egel-
man, S.: 50 ways to leak your data: An exploration of apps’ circumvention of
the android permissions system. In: USENIX Security Symposium (2019)
[96] Pan, E., Ren, J., Lindorfer, M., Wilson, C., Choffnes, D.R.: Panoptispy:
Characterizing audio and video exfiltration from android applications (2018)
[97] Liu, B., Liu, B., Jin, H., Govindan, R.: Efficient privilege de-escalation for
ad libraries in mobile apps. Proceedings of the 13th Annual International
Conference on Mobile Systems, Applications, and Services (2015)
[98] Enck, W., Gilbert, P., Chun, B.-G., Cox, L.P., Jung, J., McDaniel, P., Sheth,
A.N.: Taintdroid: An information-flow tracking system for realtime privacy
monitoring on smartphones. ACM Transactions on Computer Systems (2010)
[99] Razaghpanah, A., Vallina-Rodriguez, N., Sundaresan, S., Kreibich, C., Gill, P.,
Allman, M., Paxson, V.: Haystack: In situ mobile traffic analysis in user space
(2015)
[100] Shuba, A., Bakopoulou, E., Asgari Mehrabadi, M., Choffnes, D., Markopoulou,
A.: Antshield: On-device detection of personal information exposure (2018)
[101] Shuba, A., Le, A., Alimpertis, E., Gjoka, M., Markopoulou, A.: Antmonitor:
System and applications. CoRR abs/1611.04268 (2016)
[102] Kohli, R., Chatterjee, S., Gupta, S., Gaur, M.S.: Tracking pii ex-filtration:
Exploring decision tree and neural network with explainable ai. In: IEEE Inter-
national Conference on Advanced Networks and Telecommunications Systems
(ANTS) (2023). IEEE
[103] Rao, A., Kakhki, A.M., Razaghpanah, A., Tang, A., Wang, S., Sherry, J., Gill,
42
P., Krishnamurthy, A., Legout, A., Mislove, A., Choffnes, D.R.: Using the middle
to meddle with mobile. (2013)
[104] Kohli, R., Gupta, S., Gaur, M.S.: A deep dive into relevant feature identification
for unveiling pii leakage in smartphones. In: International Conference on Signal
Processing and Communications (SPCOM) (2024). IEEE
[106] Ribeiro, M.T., Singh, S., Guestrin, C.: ” why should i trust you?” explaining
the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining, pp. 1135–
1144 (2016)
[108] Cufoglu, A.: Multi-dimensional clustering in user profiling. PhD thesis, Univer-
sity of Westminster (2012)
43
Table 5: Some high impact data breaches reported between 2022 and 2024. [[79], [80], [81], [82],[83],[84]]
Victim OccurredImpact Cause/vulnerability Compromised data
Texas May 1.8 • Security issue with a web application that man- Social security numbers,
Department 2022 million ages workers’ compensation information. addresses,dates of birth, phone
of Insurance numbers and information about
• This vulnerability was due to programming worker’s injuries.
code that allowed internet access to a protected
area of the application.
Twitter July 5.4 • Vulnerability on the social media site. Email addresses and phone num-
2022 million • Allows to obtain Twitter ID without any bers of celebrities, companies, ran-
authentication of any user by submitting a doms, etc.
phone number/email
• Possible even though the user has prohibited
this action in the privacy settings.
44
ChatGPT March ≈ 1% • Vulnerability in the code’s open-source library Credit card information and the
2023 data “Redis”. titles of some chats they initiated.
It was also possible to see another
• Used to cache user information for faster recall active user’s first and last name,
and access. email address, payment address,
the last four digits (only) of a credit
card number, and credit card expi-
ration date.
Norton May 2.5 • Unauthorized individual’s access to the com- Names, contact details, SSN num-
Healthcare 2023 million bers, dates of birth, health and
pany’s network storage devices insurance information, medical ID
numbers, driver’s license numbers,
government ID numbers, financial
account numbers, and digital signa-
tures
Facebook February 0.20 Facebook Marketplace database was stolen using Names, phone numbers, email
Market- 2024 million the ‘algoatson’ Discord handle after hacking the addresses, Facebook IDs, and Face-
place systems of a Meta contractor. book profile information.
Table 6: Summary of private data leaks in Mobile App Traffic (Duration: Oct 2022 -
March 2023 ). 3 These are versions of the applications that have been analyzed during
our experiments. 4 This is as per Indian playstore and here M represents Million and
B represents Billion.
Category App App ver- No. of Location User Identifiers Device Identifiers
sion3 down-
loads4
Pinterest 7.43.1 500M+ - First name, last name IP address, device configurations and
details, Google advertising ID, appli-
Social cation install and last update time
Skype 8.81.0.268 1B+ Location Name, gender, date-of-birth, Device model name
of user email-id, text messages,
and phone number, ID of user
also his and his contacts
contacts
LinkedIn 4.1.710 1B+ - First name, last name, pro- Device identifiers (Android ID)
file picture URL, headline
of profile, session-key and
session-password
WhatsApp 2.22.13.76 5B+ Country Phone number, email-id Device details (device name, OS,
android version), Android ID, device
ID, SDK version
Reddit 2020.8.2 100M+ Country User ID, user name, user Device details (Android version,
password, email-id device name, OS, OS version, hard-
ware ID, device fingerprint id, device
dimension), advertising ID, applica-
tion install and last update time,
device ID, SDK version.
Gradeup 11.07 10M+ - Name, phone number, OTP Device identifiers and configuration
Education used while registering,
email-address
Voot 4.2.7 100M+ - Gender, age, phone number, Device details, Google advertisement
Entertain- email-id ID
ment Zee5 35.1338119.0 100M+ Country, Email-ID, first name, last Device details (Android ver-
region login, gender, date of birth, sion,device name), device ID, install
phone number time.
Cleartrip 22.3.0 10M+ - Username, email-address, Device identifiers, device details
password, phone num- (model, network, device dimensions,
Travel ber, date-of-birth, gender, network type, CPU type, OS version,
registration details (check- screen dpi), advertising IDs.
in date, check-out date,
number of adults and chil-
dren, destination city and
country)
OYO 5.9.2 50M+ - Name of user, date of Device identifiers and google adver-
birth, gender, marital sta- tising ID.
tus, phone number, email-id
Dominos 9.8.18 50M+ User’s First name, last name, Device identifiers, device details
India address mobile number, email-id (model, network, device dimensions,
network type, CPU type, OS version,
Shopping screen DPI), advertising IDs
Myntra 4.2201.1 100M+ User’s First name, last name, Device details, IP address, network
address phone number, gender, first type
login date, OTP for login,
UPI ID, email-id
eBay 6.49.0.3 100M+ Country Email ID Device details (model, dimensions,
OS, hardware, manufacturer, OS
version, physical memory,processor,
architecture, RAM, disk space, pro-
cessor count, processor word size,
system up time, thermal state, time
zone, user language)
Airtel 4.40.10 100M+ Location Name, phone number, alter- siNumber, device details, advertise-
Thanks (not nate number, email-id ment IDs, local IP.
Others exact)
Adobe 22.6.0.22829 500M+ Latitude First name, last name, Device ID, OS, device details, device
Acrobat and email-address, contents of configurations, connection type, local
Reader longitude the PDF opened in applica- IP, advertising ID.
tion, list of files in Google
drive
Flipboard 4.2.93 500M+ - Name, email-id Device brand
Truecaller 12.17.8 500M+ - Phone number and email-id Device ID, device details (OS,
Model) and sim number
Zoom 500M+ Country Phone number Device details (model, dimensions,
OS), Device ID, SDK version.
45
Table 7: Best results obtained.
Model Train Test
DTpost heuristics,↑testing samples 93.79% 72.89% ⋆
N Npost heuristics,↑testing samples 95% 74.5% ⋆⋆
Table 8: Summary of results obtained using LIME
Model Train Test Tr. time (s)
DTbest,w/o LIME 93.79% 72.89% ⋆ a 0.095
DTwith SHAP [102] 99% 67% 0.16
DTwith LIME-H1 92.6% 71.2% 0.004
DTwith LIME-H2 91.86% 71.2% 0.005
N Nbest,w/o LIME 95% 74.5% ⋆⋆ b 16.54
N Nwith SHAP [102] 84.7% 75.32% 5.45
N Nwith LIME-H1 91.01% 73.38% 3.13
N Nwith LIME-H2 95.29% 75.01% 3.615
a
Taken from Table 7 (row having ⋆ marker)
b
Taken from Table 7 (row having ⋆⋆ marker)
46