
Guarding Digital Privacy: Exploring User Profiling and Security Enhancements

Rishika Kohli1*, Shaifu Gupta2 and Manoj Singh Gaur3
1,2 Department of Computer Science and Engineering, Indian Institute of Technology Jammu, Jagti, Jammu, 181221, Jammu and Kashmir, India.
3 Indian Institute of Technology Jammu, Jagti, Jammu, 181221, Jammu and Kashmir, India.


arXiv:2504.07107v1 [cs.IR] 17 Mar 2025

*Corresponding author(s). E-mail(s): [email protected];


Contributing authors: [email protected];
[email protected];

Abstract
User profiling, the practice of collecting user information for personalized rec-
ommendations, has become widespread, driving progress in technology. However,
this growth poses a threat to user privacy, as devices often collect sensitive
data without their owners’ awareness. This article aims to consolidate knowl-
edge on user profiling, exploring various approaches and associated challenges.
Through the lens of two companies sharing user data and an analysis of 18 pop-
ular Android applications in India across various categories, including Social,
Education, Entertainment, Travel, Shopping and Others, the article unveils pri-
vacy vulnerabilities. Further, the article proposes an enhanced machine learning
framework, employing decision trees and neural networks, that improves state-
of-the-art classifiers in detecting personal information exposure. Leveraging the
XAI (explainable artificial intelligence) algorithm LIME (Local Interpretable
Model-agnostic Explanations), it enhances interpretability, crucial for reliably
identifying sensitive data. Results demonstrate a noteworthy performance boost,
achieving a 75.01% accuracy with a reduced training time of 3.62 seconds for
neural networks. Concluding, the paper suggests research directions to strengthen
digital security measures.

Keywords: User profiling, Privacy Leak, Decision tree, Neural network, Explainable
artificial intelligence

1 Introduction
Personalized systems are designed to address the issue of information overload for
users searching for specific queries. They achieve this by customizing information to
individual users based on their profiles [1]. A user profile, being a compilation of information associated with a user, helps to achieve this goal, to attain accurate recommendations, and ultimately to deliver personalized information.
As of 2024, a vast array of data companies exists that track people across all aspects of their lives, both online and offline, collecting and accumulating an unprecedented volume of consumer data. This data provides insight into the behaviour and demographics of consumers and enables companies to analyze patterns that can lift the overall customer experience. Consequently, this data has become a valuable economic asset. For decades, regulatory authorities, journalists and civil society have pointed to the lack of transparent policies by companies regarding the buying and selling of their customers’ data.
In the data ecosystem, different companies continuously trade digital user profiles
with one another. These data companies combine and link data from various devices
of users such as computers, mobiles and diverse IoT devices. Every click on a web-
site transmits information using invisible script embedded in web pages to hundreds
of third-party companies. These invisible codes, called trackers, record information
about a user visiting a website and collect data on how a user interacts with the visited
pages. When the same tracker is present across numerous websites, it can develop a
detailed profile of a user’s online activities and behaviour [2]. Similarly, every swipe on a
smartphone may trigger these hidden data sharing mechanisms, collecting rich infor-
mation about users. This information flow is not only limited to device manufacturers
and application owners but also extends to a significant number of other third-party
companies.
The increasing number of smart devices has made the collection of personal data a greater threat to users’ privacy. In order to provide services through smartphones, applications require access to phone features such as the camera and microphone. However, many applications ask for permissions to access sensitive system resources (e.g., sensors like the microphone, camera and GPS), personal information of the user (e.g., email address and contact list), and also unique identifiers (e.g., IMEI number) that are not required to provide services but to track users [3]. Vulnerabilities in the applications or operating systems of the devices can expose personal data. For instance, a WhatsApp vulnerability (CVE-2019-3568 [4]) allowed attackers to remotely install surveillance software on phones by calling the targeted device. This allowed the installation of spyware, compromising users’ data on the device and potentially enabling further attacks. These kinds of vulnerabilities create a loophole for hackers and make users’ sensitive and personal information more vulnerable.
This study covers how user profiling works and why it matters for privacy and security. Other studies usually focus on just one part: either the user profiling process [1], [5], [6], [7], [8] or its implications for privacy [9], [10], [11], [12]. This research provides a complete repository that explains user profiling and discusses how it affects privacy and security. The rest of the article is structured as follows. Section II covers the objectives of this paper and Section III provides a review of
literature in the domain of user profiling. Section IV describes user profiling and its use-
cases. Section V illustrates the user profiling process while section VI provides study on
two data brokers. Privacy and security concerns regarding user profiling are presented
in Section VII. Section VIII describes an experimental study and implementation of a framework to detect privacy leakage from a variety of Android applications. Section IX illustrates some open research areas. The article is concluded in Section X.

2 Objectives
This article conducts an extensive review of recent contributions exploring the latest
practices and their implications. The contributions of this article can be considered in
four parts as outlined below:
i) Review of user profiling, including methods, types, models, and processes, along
with highlighting challenges in current mechanisms.
ii) Analysis of privacy and security issues present in the user profiling process. This
study explores how thousands of commercial organizations collect, trade and make
use of personal data, and influence the lives of billions of people. Based on the review
of literature and articles, a study of two data collecting companies that operate by
collecting, analyzing and selling user’s data is presented.
iii) An experimental study of several mobile applications used in India is covered to detect leaks of users’ personal data by intercepting the network traffic. Further, ML classifiers that improve state-of-the-art frameworks in detecting personal information exposure are presented, and then the XAI algorithm LIME is used to provide explanations for the results generated by the classifiers (an illustrative sketch of this idea follows this list).
iv) In the end, the study provides several open research directions for future work.
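As an illustration of item iii), the following minimal sketch shows how a classifier’s prediction for a single outgoing request could be explained with LIME. It is a sketch under assumed inputs: the feature matrix, feature names and class labels are invented placeholders, not the dataset, features, or exact models used in this paper.

```python
# Hedged sketch: explaining a PII-exposure classifier with LIME.
# The features, labels and "toy rule" below are hypothetical placeholders.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from lime.lime_tabular import LimeTabularExplainer

rng = np.random.default_rng(0)
X = rng.random((500, 4))                            # e.g. packet-level features
y = (X[:, 0] + X[:, 2] > 1.0).astype(int)           # toy rule: 1 = "PII leak"
feature_names = ["pkt_len", "num_params", "entropy", "is_third_party"]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_tr, y_tr)

explainer = LimeTabularExplainer(
    X_tr,
    feature_names=feature_names,
    class_names=["benign", "pii_leak"],
    discretize_continuous=True,
)

# Explain one test instance: which features pushed it towards "pii_leak"?
explanation = explainer.explain_instance(X_te[0], clf.predict_proba, num_features=4)
print(explanation.as_list())
```

The decision tree here merely stands in for any tabular classifier; swapping in a neural network would only change the line that constructs `clf`.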

3 Related Work
We briefly review some of the prominent approaches related to user profiling in this
section, highlighting the existing shortcomings. We short-listed these papers based on
relevance and clarity of understanding. Keyword searches for “user profiling”, “privacy leak”, “personal data leakage”, etc., assisted us in the short-listing process to identify the relevance of different existing works to our text.
Some studies have employed diverse platforms like social media and smart devices
to deduce user actions, in order to develop user profiles and provide relevant recom-
mendations. Table 1 provides a summary of some of these studies. Table 2 presents a
compilation of several studies aiming to address the security and privacy concerns asso-
ciated with data collection for user profiling. These studies encompass various aspects,
such as detecting potential disclosure of sensitive information from user devices and
identifying embedded trackers in smart devices. Several other studies have focused on
user profiling through surveys, and thus we present comprehensive information about
these investigations.
Stewart et al. [16] conducted a survey on methods for filtering vast amounts of information from electronic sources, including the internet, to create user profiles. These techniques include statistical term-based, neural network and social filtering approaches.

Table 1: Summary of works in building user profiles

[13] Objective: build a multidimensional profile of users. Data source: Twitter. Methodology: developed a model to represent dynamic connections; generated a ranked list of users to find relevant information based on individual requirements. Gap noted: the scale of the experiments and the diversity of the data used are not explained.
[8] Objective: develop a framework to fuse diverse information from various sources. Data source: not specified. Methodology: used deep learning to predict users’ traits to create profiles. Gaps noted: used small datasets; did not consider privacy and ethical aspects of social media data.
[7] Objective: create a smart-TV-based recommendation system. Data source: in-built resources of the smart TV. Methodology: detect faces in front of the TV; generate anonymous and consolidated user and group profiles; create item profiles using stored videos, live channels, EPG data and other sources; recommend items to the user by comparing the two types of profiles. Gaps noted: unsuitability for public places; accuracy issues due to low brightness.

Table 2: Summary of works on privacy and security aspects of data collection

[9] Objective: identify clear-text sensitive information from encrypted communications. Data source: medical IoT devices. Methodology: isolated traffic originating from a fixed set of IP addresses. Gap noted: differentiation of encrypted traffic from compressed clear-text traffic not done.
[14] Objective: identify a smartphone based on the data transmitted via its sensors. Data source: smartphone sensors. Methodology: developed a web page to collect sensor data. Gaps noted: gyroscope lacks manual calibration; complexity of real-world motion sensor tracking not studied; fails to address practical constraints of implementing the proposed defences.
[15] Objective: show that encryption alone is insufficient to preserve the privacy of smart homes. Data source: IoT devices. Methodology: separate traffic into packet streams, label them by device type, and correlate traffic rates with user interactions to infer consumer behaviour. Gaps noted: only focused on a passive network threat model; identifying devices from the same manufacturer using DNS is a limitation.
[10] Objective: check for the presence of trackers in OTT devices. Data source: OTT TV streaming devices. Methodology: built a crawler to interact with OTT channels. Gaps noted: only focuses on Roku Express and Amazon Fire TV Stick; the crawler has restricted capability.
[12] Objective: identify PII or ad requests in HTTP packets. Data source: Android apps. Methodology: used federated learning to classify outgoing HTTP packets. Gaps noted: did not handle data and system heterogeneity; did not address real-device resource constraints; did not consider the possibility of attackers among clients in a sub-network.
[11] Objective: reveal PII leakage and give users control over their data. Data source: popular apps on iOS, Android, and Windows. Methodology: automated the disclosure of PII leakage using machine learning on a cross-platform scale. Gaps noted: used manual labeling and crowd-sourcing from a specific group of users; relied on a centralized server model, raising privacy concerns.

[17] discussed user modelling techniques for constructing and representing profiles for social
media platforms, highlighting their strengths and weaknesses and providing a vision
for future research, for example, creation of more dynamic and intelligent profiles to
receive more appropriate results from users’ profiles. [18] reviewed the latest advance-
ments in user profiling, including methods, features, and taxonomies. It explored
techniques such as data acquisition, feature extraction, and profiling approaches, along
with performance metrics. The survey also addressed challenges like dataset size, the
cold start problem, and domain dependencies. Furthermore, it outlined future research
areas, such as developing versatile, dynamic, language-independent user profiles.
[19] reviewed studies related to profiling smartphone users through ordinary apps,
presenting a general framework for learning user information from smartphone applica-
tions. They included the methods of data collection, pre-processing, and user profiling,
with implications and suggestions for improving business services, user experience,
and profits, and developing mobile context-aware tools to improve the quality of life of
users in different aspects. Another work by [20] provided an overview of profiling users
on social networks. Since the data available on the web ranges from semi-structured
to unstructured, various approaches were presented to profile users in online social
networks. These approaches included clustering, face detection, user activities, con-
tent analysis and behavioral analysis. The authors of [21] conducted a systematic mapping
study of profiling users based on reviews, presenting the latest trends in user profile
modeling and analysis.
[22] examined intrusion detection and prevention systems from the standpoint of exploiting behaviour, covering both system behaviours, which are generated by hosts and networks and relate to host activities and network status, and user behaviours, which relate to the direct interaction between the user and the system (for example, typing patterns). These behaviours were examined to determine whether a user was legitimate to be on the system, i.e., a review of intrusion detection and prevention systems for profiling users. In the first step of profiling users, the behaviour was analyzed and then categorised into system and user behaviours. As a result of this classification, data profiles were divided into system profiles and user profiles, with the latter being further categorized into more specific categories, namely biometric and psychometric profiles, based on their characteristics. A summary was then provided of the advantages and limitations of these specific profiles and related analysis techniques.
In contrast to these studies, our work presents an examination of the user profiling process pipeline, reviewing the latest work, starting with deciding the context of profiling users, collecting data from various platforms using numerous approaches, and finally constructing the profile using different techniques. This is followed by the security and privacy implications that profiling has on users’ lives. Also, a case study has been carried out on two companies that are involved in the profiling process and in establishing links between a profile and an individual on the web. The identification of privacy leakage from Android apps demonstrates the data exfiltration that occurs from users’ devices without their prior consent and knowledge. Next, we discuss the user profiling process in detail along with its use cases in Section V.

4 User Profiling
User profiling refers to the process of extracting data about the interests, preferences,
behaviors, and needs of a user, which can be used to identify an individual. Various
interpretations of user profiling have been proposed in the literature, depending on
the context in which it is used. For instance, in the work by [23], user profiles are designed in a mobile cloud environment to provide distributed IT services and resources to users based on context information. The authors of this work represent a user profile as information about the user and the services consumed by them. Information like user ID, user name, hobbies, personal desires, and other details is stored in the user part of the profile, while the service information part stores data about the IT services used, such as service name, context, provider, and frequency of access.
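A profile of this kind can be viewed as a simple structured record. The sketch below shows one possible layout in Python; the field names follow the description above but are illustrative assumptions, not taken from [23].

```python
# Illustrative layout of a user profile as described above; the field names
# are examples only and are not taken from the cited work.
from dataclasses import dataclass, field
from typing import List

@dataclass
class ServiceRecord:
    name: str              # IT service used
    context: str           # context in which the service was consumed
    provider: str
    access_frequency: int  # how often the service is accessed

@dataclass
class UserProfile:
    user_id: str
    user_name: str
    hobbies: List[str] = field(default_factory=list)
    services: List[ServiceRecord] = field(default_factory=list)

profile = UserProfile("u42", "Alice", hobbies=["hiking"])
profile.services.append(ServiceRecord("cloud-storage", "mobile", "ProviderX", 17))
```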

4.1 The importance of User Profiling


A user profile is a fundamental concept in personalization systems. These systems customize content based on individual behaviour, interests, and preferences. This is used to enhance the user’s experience over the web and determine their intention [18]. In the modern
digital era, personalization is vital, and profiles form the basis of advanced technolo-
gies used to provide benefits to users. For example, insurers in the United States
and the United Kingdom are pushing for wider acceptance of “telematics” devices in
automobiles to gather real-time reports on a driver’s behavior, which can be used to
determine insurance costs [24].
Thus, understanding user needs is vital for businesses to deliver tailored services.
Personalized systems filter information to match user interests and adapt to different
contexts. The next section covers different areas where user profiling has been used
for years.

4.1.1 Monetarily-driven entities


Online behavioral advertising, also known as interest-based advertising, involves col-
lecting data like search history or app usage, from a user’s device to understand their
behavior and preferences. This information is then used to deliver targeted adver-
tisements to users. Many companies, including Google and Facebook, heavily rely on
this practice for revenue generation. In 2023, Google, Facebook, and Amazon held projected shares of 39%, 18%, and 7% of advertising revenues, respectively [25]. Amazon serves as
a notable example of effective user profiling. In just two years, Amazon’s revenue generated through advertising sales went from 11 billion U.S. dollars in 2020 to 31.16 billion in 2021, and it is forecast to reach 64.3 billion by 2026 [26]. The company collects data
from user interactions on its site, considering factors like time spent on pages, items
added to the cart, and purchases made. Amazon also incorporates external datasets,
such as census data, for demographic information. Amazon’s Privacy Notice explains that the company collects information such as personal details, browsing information, and device specifics, allowing Amazon to create comprehensive customer profiles. The company employs collaborative filtering in its recommendation technology, suggest-
ing products based on the preferences of customers with similar profiles. Additionally,
Amazon extends its advertisements to other platforms, creating a holistic view of
customers to enhance their personalized experience.

4.1.2 Non-monetary-driven entities


The public sector also utilizes user profiling to personalize e-government services
offered to citizens. Personalized portals benefit citizens with services they specifically
require, thereby increasing satisfaction levels. Profiling also aids in efficient and effective communication, deducing and anticipating citizens’ behavior and manipulating it to target specific sections of society for general causes such as women’s healthcare and hygiene. This gives governmental sector organizations enormous capabilities for their e-government strategies [27]. Apart from providing e-services, the government makes use of user profiling by collecting citizens’ data for various affairs, such as predicting crimes and illicit activities based on citizens’ behavioural profiles developed over time. Also, there are organizations that use internal systems to track and recommend services for their employees based on their profiles. For instance, IBM [28] utilizes data analytics and AI algorithms to match employees’ profiles with relevant projects and
opportunities. Accurately representing the user profile is crucial to obtain the best
results that closely align with their preferences. The next section will explain the
details of the user profiling process.

5 User Profiling Process


Numerous factors contribute to the creation and usage of a user’s profile. This section covers a study of how a user is profiled. The taxonomy, in the form of the different phases of the profiling process, is depicted in Figure 1. The process of building a user profile starts with data collection, which consists of selecting the context of building the user profile and then selecting the content for the creation of that profile. Raw data about the user is gathered from various platforms using various methods. The second step constitutes building and maintaining a user profile model, which consists of ways to represent a user model and various techniques that can be adopted for its construction and modelling. In the third step, personalized services are provided, which starts with identifying the user on the web and then matching their characteristics to the content of the profile. Finally, customized services are delivered to the matched user. The following sections contain a detailed description of every step of this process.

5.1 Data Collection


5.1.1 Nature of profile
Different types of data can exist in a profile, ranging from a user’s demographic data, skills, needs and goals to behavior, interests and preferences. Depending on the nature of the data constituting the profile, a user profile can be either static or dynamic. A static profile is one that contains user information that does not change or get modified. The work in [29] explained static profiling as a method to evaluate a user’s static and predictable characteristics. This profile keeps user information for a longer duration. Static profiling uses manual information-gathering methods that require the user’s intervention to collect and analyze these static and predictable characteristics. A dynamic profile, in contrast to a static profile, is automatically created by agents or models deployed in the system, and therefore the user attributes and contents get updated with time.

Fig. 1: Taxonomy of user profiling

5.1.2 Data Collection Platforms


Gadgets like phones, watches, TVs, speakers, refrigerators, air conditioners, cars, etc. have now become the most important sources of personal data. In addition to providing a service to their users, these devices collect personal information via diverse applications and model user behavior. As an example, [30] employed the gyroscope and accelerometer sensors of smartphones to gather user data and create a behavioral model. The authors introduced a context-generation method that continuously updates a trust score, indicating the likelihood of users engaging in activities similar to their regular patterns. In contrast, [31] used a smart watch and a smartphone to obtain details of the user’s activity and accelerometer sensor data to categorize user behaviour.
The Web is an important source of data. Within seconds of a user clicking on a website or interacting with a mobile application, trackers embedded in the website/app profile the user. One usually needs to create an account in order to use the services of most web platforms. Personal information such as the user’s name, email ID, birthday, profile picture, and possibly even financial information is linked to this account. All the interactions a user has while using a social media service or e-marketplace, including the type of content or product the user finds interesting and shares, interactions with other users, and the advertisements the user interacts with and responds to, are identified with this account. A profile can be developed based on the data a user provides, and also on their interactions. [13] and [32] created user profiles from the personal information shared by users on social network sites. Websites can also study their users’ shopping habits, how they found the website, and whether they are interested in advertisements on the pages they visit. With the help of IP addresses, cookies, and small image files called tracking pixels or web beacons, most of this information can be reverse engineered. By combining information from different websites, a detailed profile of a user can be generated. For example, [33] and [34] generated user profiles from data gathered from users’ web searches with the aim of gaining insights into user behavior patterns.
Data collected from various platforms can be used by businesses to gain a deep
understanding of their customers and provide them with personalized services.

5.1.3 Data Collection Approaches


The collection of users’ personal data from the diverse platforms mentioned above has always been a challenge, since huge amounts of data are gathered and the most important elements have to be identified and filtered. The collection of data from the user can be achieved through explicit, implicit, or hybrid approaches. Explicit or opt-in data collection
methods use manual techniques that require user intervention, such as personalized
applications that ask users to provide ratings and feedback, or surveys and registration
processes used when registering for a service on a website [35]. Profiles constructed
using an explicit approach are static and only accurate until the user changes their
interests and preferences. The Implicit or Observed data collection approach involves
the use of automatic agents or models to collect data without user intervention. Farid
et al. [36] proposed an agent-based model that builds an implicit user profile from
bookmarks and browsing activities, utilizing web and search logs. This approach aims
to analyze a user’s interests by processing search records, access records, and browsing
histories, and recording user actions such as tagging, bookmarking, and download-
ing while browsing web pages. The hybrid or inferred approach takes advantage of both implicit and explicit techniques to provide accurate recommendations and services. Hybrid approaches result in profiles that are more adequate and accurate, as they periodically update user information. Fakhfakh et al. [37] utilized a hybrid approach for
collecting user preferences. They consulted users for their preferences using the explicit
approach, while the implicit approach discovers user preferences through MovieLens
datasets, wherein a mapping between MovieLens genres and user preferences is estab-
lished based on their semantic meanings and relationship. Next, we will expound
upon the methodology of formulating and modelling the user profile utilizing diverse
techniques.
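As a rough illustration of how explicit and implicit signals might be blended in a hybrid approach, the sketch below combines a user-supplied rating with an observed interaction count into a single preference score. The 0.7/0.3 weighting and the normalisation are assumptions made purely for illustration; they are not the method of [37].

```python
from typing import Optional

# Hedged sketch: blend an explicit rating (1-5, may be absent) with an
# implicit signal (interaction count). Weights and scales are assumptions.
def hybrid_preference(explicit_rating: Optional[float], implicit_views: int,
                      max_views: int = 50) -> float:
    implicit_score = min(implicit_views, max_views) / max_views   # scale to 0..1
    if explicit_rating is None:                                    # no explicit feedback
        return implicit_score
    explicit_score = explicit_rating / 5.0                         # assume a 1-5 scale
    return 0.7 * explicit_score + 0.3 * implicit_score

print(hybrid_preference(4.0, 12))   # explicit rating available
print(hybrid_preference(None, 30))  # only implicit evidence
```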

5.2 User Profile Construction and Modelling


After completing the data collection phase, the subsequent step involves choosing a suitable method to represent user profiles, then creating profiles employing diverse methodologies, followed by the development of a model aimed at predicting user preferences. Detailed descriptions of these processes are provided below.

5.2.1 User Profile Representation


The efficacy and accuracy of user profiling also rely on the appropriate representation of the collected information. In this section, we will discuss the various types of user models
based on their representation.
1. Term-Based (vector-space model): In the vector-based model, each profile is composed of a vector of keywords, where each keyword represents a topic of interest to the user, together with associated weights that numerically represent the user’s interest in those keywords. When a search query is submitted, the system retrieves hypermedia documents present on the web, which are converted into weighted keyword vectors. The Antagonomy system by [38], for instance, used this model to create a personalized online newspaper that learned personal preferences from user behaviors. However, one major drawback of the keyword-based profile is polysemy ambiguity, which means keywords may have multiple meanings. For example, the word “foot” can refer to either a body part or a unit of measurement, which may lead to inaccuracies in the user profile [36] (a minimal sketch of this representation is given after this list).
2. Semantic Network-Based (ontology-based): To address the issue of ambiguity due
to polysemy in keyword-based profiles, weighted semantic networks are used to
create profiles where each node represents a concept. Initially, the network com-
prises a set of disconnected nodes. As additional user information is gathered, the
profile is enhanced by incorporating more keywords related to various concepts.
Subsequently, links are introduced to depict the connections between these con-
cepts. This method offers a way to represent profiles based on descriptive keywords
and knowledge using a consistent system that has been developed over a number
of years. One such system is WordNet, which is an extensive lexical database of
the English language. It stores a knowledge base that organizes English words into
semantic relations known as synonym sets. Users and their semantic preferences are
represented by the information inherent in an existing ontology. The InfoWeb [39]
system, which filters digital library documents online, utilizes semantic networks
to build profiles containing the long-term interests of users.
3. Concept–Based (hierarchy-based): Similar to the semantic network-based approach, in this approach too the user’s interests are represented in the form of a graph. However, here nodes represent abstract topics of interest to the user. Profiles are represented as
vectors of weighted features, where features represent concepts instead of individual
words or phrases. [40] proposed the use of hierarchical concepts, which allows for
generalization. The hierarchy of concepts can either be fixed or dynamic, depending
on the user’s interest.
4. Graph-Based: Graph-based representations, however, focus on utilizing graphs or
networks to represent and analyze user-related information without specific empha-
sis on semantic relations or abstract concepts. Nodes in these graphs signify entities
or concepts, while edges represent relationships or connections between them. This
approach captures intricate relationships, dependencies, and interactions within
a user’s profile. For instance, Chen et al. [41] proposed a method for enhancing
user profiling within sequential recommender systems using two layers of graphs: a
global graph to record transitions among various behaviors across all users, and a
personalized graph to model individual user interactions and preferences.
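To make the term-based representation from point 1 above concrete, the sketch below builds a profile as a weighted keyword vector and matches a document against it with cosine similarity. The keywords and weights are invented for illustration and do not come from any of the cited systems.

```python
# Minimal sketch of a term-based (vector-space) profile: a dict mapping each
# keyword to an interest weight, matched against a document vector with
# cosine similarity. Keywords and weights are invented for illustration.
import math

def cosine(a: dict, b: dict) -> float:
    common = set(a) & set(b)
    dot = sum(a[k] * b[k] for k in common)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

user_profile = {"football": 0.9, "travel": 0.6, "privacy": 0.3}
doc_vector   = {"football": 0.8, "stadium": 0.5, "travel": 0.2}

print(f"profile-document similarity: {cosine(user_profile, doc_vector):.3f}")
```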

5.2.2 User Profile Construction


Numerous techniques exist that can be adopted for constructing user profiles, contin-
gent on the context. The resultant profiles should ideally possess a dynamic disposition,
that is, they must be adaptable enough to accommodate any alterations in the user’s
immediate or long-term interests, and should be updated periodically. The ideal
approach is to opt for a construction technique that requires minimal user interven-
tion and feedback. A few of the construction methods reported in the literature are
discussed below:
1. Weighted Tags: One significant strategy for constructing a user profile is to assign
weights to the terms. Different techniques, such as Boolean or Frequency weighting,
are used for this purpose. The underlying principle behind these techniques is that
the occurrence of terms within documents characterizes the user profile to which
the document belongs. Thus, the more often a term appears in a class, the more
it reflects the characteristics of that class. Similarly, frequent usage of certain tags means higher user interest in the associated topics [42].
2. Filtering techniques: In this approach, user profiling makes use of filtering mech-
anism for retrieving pertinent information that aligns with the user’s specific
requirements. Various filtering techniques that can be employed for this purpose
are given as:
• Rule-based approach: This approach involves selecting pertinent user informa-
tion using a predefined set of “if-then” rules. Pre-defined categories of users are
defined to determine which content can be incorporated into a user’s profile. For
instance, online brokerages usually classify their accounts based on gender and
age and then offer distinct services, products, or benefits based on these cate-
gories. However, the challenge with this approach is that it is difficult to obtain
marketing rules from domain experts to validate the effectiveness of the extracted
rules [6].
• Content-based filtering: Content-based filtering is an approach for creating user
profiles by comparing item profiles with user data. This technique identifies the
content of the item by extracting keywords from the product descriptions. These
are more commonly used in text-intensive domains [6, 43]. Also, analyzing limited
content can be challenging and may reduce the performance of the approach,
particularly in domains that involve multimedia content like images and audio.
• Collaborative filtering: This approach operates on the idea that similar users
will have similar preferences. Therefore, this method clusters users with similar interests into groups. It uses an algorithm that aggregates feedback from multiple
users to suggest items that will appeal to target users by considering similarities
between different users [44]. These algorithms fall into two categories: memory-
based and model-based. Memory-based methods, such as correlation analysis,
compare the active user’s profile with similar ones in the database, relying on the
entire user-item matrix to tailor recommendations. Model-based methods learn
from the entire dataset to offer personalized suggestions without needing the
entire dataset in memory during runtime, providing scalability advantages. For
example, Amazon.com utilizes collaborative filtering to personalize web pages
with tailored recommendations based on individual customer interests.
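The memory-based variant of collaborative filtering described above can be illustrated in a few lines: the sketch below finds, in a fabricated user-item rating matrix, the neighbour most similar to the active user and recommends an unseen item that the neighbour rated highly. The ratings are invented for illustration and do not reflect any real system.

```python
import numpy as np

# Fabricated user-item rating matrix (rows: users, columns: items, 0 = unrated).
ratings = np.array([
    [5, 4, 0, 1],   # active user
    [4, 5, 1, 0],
    [1, 0, 5, 4],
])
active = ratings[0]

def similarity(u, v):
    """Cosine similarity over the items both users have rated."""
    mask = (u > 0) & (v > 0)
    if not mask.any():
        return 0.0
    return float(u[mask] @ v[mask] /
                 (np.linalg.norm(u[mask]) * np.linalg.norm(v[mask])))

# Pick the most similar neighbour and recommend their best unseen item.
sims = [similarity(active, ratings[i]) for i in range(1, len(ratings))]
neighbour = ratings[1 + int(np.argmax(sims))]
unseen = np.where(active == 0)[0]
best_item = unseen[int(np.argmax(neighbour[unseen]))]
print(f"recommend item {best_item} (neighbour rated it {neighbour[best_item]})")
```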

5.2.3 User Profile Modelling


After constructing users’ profiles, the subsequent steps entail developing a compu-
tational model capable of predicting user needs and preferences. Below are several
approaches used for modeling users.
1. Using representation learning: This approach emphasizes modeling users by learn-
ing latent representations for each user through the utilization of items, item
features, and/or user-item response matrices. Here, items refer to any entities that
users interact with, such as products in an e-commerce platform, articles in a news
website, or movies in a streaming service. They represent the objects of inter-
est for users. Representation learning enables the extraction of meaningful latent
features from both static data (e.g., tabular data) and sequential data (e.g., time-
series data). Li et al. [45] conducted an extensive analysis of recent advancements
in user modeling, with a specific emphasis on representation learning techniques.
The study categorizes these techniques into two main types: static and sequen-
tial representation learning. Static learning methods, such as matrix factorization
and deep collaborative filtering, are employed to capture user preferences and item
characteristics within a static context. These methods are fundamental to numer-
ous recommender systems and play a vital role in comprehending user behavior.
The study also discusses sequential learning methods, including recurrent neural
networks, which are designed to capture the dynamic evolution of user preferences
over time.
2. Using machine learning: One way to create user profiles is through the use of
machine learning methods, where the aim is to predict users’ preferences based on a small number of instances of data about users and their behavior. An example of this is the k-NN algorithm, which can be used to classify each user’s preferences from a collection of labelled users’ preferences (a minimal sketch is given at the end of this list). A summary of works using machine learning techniques is presented in Table 3.

Table 3: Survey of works using machine learning techniques

[46] Model: random forest and neural network. Objective: high-risk user identification model.
[47], [48] Model: boosting algorithms, Naïve Bayes. Objective: student profile modeling.
[49] Model: k-means clustering. Objective: recommendation model.
[50] Model: PCA. Objective: anomaly detection.

Table 4: Survey of works using deep learning techniques

[51], [52] Model: attention networks, graph attention networks. Objective: modeling user behavior.
[53], [54] Model: neural network. Objective: modeling user preferences.
[55], [56] Model: convolutional neural network (CNN). Objective: modeling user representations and interests.
[57], [58], [59] Model: autoencoders. Objective: analyzing and modeling user behavior and representations.
[60], [61], [62] Model: recurrent neural network (RNN). Objective: personalized recommendations, student performance prediction, and user behavior modeling.
[63], [64] Model: Transformers. Objective: generating recommendations and modeling user behavior.

[46] analyzed user data from product discussions, employing a supervised random forest model to identify high-risk users
who have posted negative comments on new public opinion topics, supplemented by a
backpropagation neural network. [47] investigated boosting algorithms for student
profile modeling, exploring how student behaviors and traits influence academic per-
formance through adaptive learning support. [48] focused on student performance
prediction, comparing Support Vector Machine (SVM) and Naı̈ve Bayes classi-
fiers. [49] applied partitional clustering algorithms, particularly k-means, to group
profiles based on similar interests and preferences. To improve accuracy of recom-
mendations in e-commerce environment, [50] utilized principal component analysis
(PCA) to extract important features and then build anomaly detection system to
examine user behavior in database systems and web browsing environments. Deep
learning has emerged as a prominent approach for uncovering intricate patterns
within user data. Table 4 offers an overview of research employing deep learn-
ing for user behavior modeling. In [51], user behavior was modeled by exploiting
semantic similarities between the geographical proximity of items and user requests.
Meanwhile, [52] focused on predicting user identity links across social networks to
improve recommendations using multi-layer perceptron. Their approach utilized a
heterogeneous graph representation and multiple attention layers for aggregating
user information.
In a different study, [53] proposed a two-stage collaborative filtering recom-
mendation model. This model incorporated a time-aware attention mechanism
for dynamic user preferences and a matching function learning model based on
deep matrix factorization and multiple-layer perceptron for user-item feature inter-
actions. In personalized recommendations, [54] aimed to enhance user profiling
for point-of-interest recommendations using graph neural network and attention
mechanisms. Their architecture included layers for encoding inputs, aggregating
neighborhood entities, and integrating user/location representations.
[55] introduced a temporal CNN approach for continual user representation
learning across tasks, leveraging partial parameters from previous tasks, while
[56] developed a candidate-aware user modeling framework for personalized news
recommendation, using CNN networks to model local click behavior context.
In other domains, [57] employed stacked autoencoders and clustering for power grid user behavior analysis, and [58] employed a variational autoencoder-based model for user representation learning using an end-to-end
learning algorithm. For user modeling in inconsistent client environments, [59]
proposed a federated learning method.
[60] introduced a user-based RNN model for personalized recommendations. In
student performance modeling, [61] developed a personalized federated learning
framework and proposed user profiling using semantic behavior modeling utilizing
attention-based Gated Recurrent Units (GRU). [62] proposed user profiling by combining semantic behavior modeling with RNNs.
For user preference modeling, [63] utilized a Transformer-based architecture with self-attention mechanisms for contextualized item embeddings, while [64] introduced UserBERT for pre-training user models on unlabeled data, employing contrastive self-supervision and behavior sequence matching tasks.
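As referred to earlier in this subsection, the k-NN idea can be sketched in a few lines. The features (made-up normalised usage signals) and preference labels below are fabricated for illustration, and scikit-learn is used only as a convenient stand-in, not as the tooling of any cited work.

```python
# Minimal k-NN sketch: predict a user's preference class from a handful of
# labelled users. Feature values and preference labels are fabricated.
from sklearn.neighbors import KNeighborsClassifier

labelled_users = [            # [age_norm, hours_online_norm, news_clicks_norm]
    [0.2, 0.8, 0.1],
    [0.3, 0.7, 0.2],
    [0.8, 0.2, 0.9],
    [0.7, 0.1, 0.8],
]
preferences = ["entertainment", "entertainment", "news", "news"]

model = KNeighborsClassifier(n_neighbors=3).fit(labelled_users, preferences)
new_user = [[0.25, 0.75, 0.15]]
print(model.predict(new_user))   # majority vote of the 3 nearest labelled users
```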
Once the user profiles have been constructed, they can be leveraged to provide
personalized services to users. The next section will explore two prominent data brokers that profile users and then sell those profiles to their clients.

6 Data brokers: Understanding the role and practices
A data broker is a business that gathers a large amount of personal information about
people from various online and offline sources and sells it to clients. This information
is used for identity verification, credit assessments, employment decisions, insurance,
housing, and marketing products. Brokers like Acxiom [65] also purchase data from
a variety of other sources, analyze it, and refine it for future marketing purposes. Mar-
keters identify what they can buy from these large data brokers because this data
provides insight into their current audience. Additionally, this data can assist in the
development of strategies aimed at expanding a company’s customer base [66].
Data management platforms (DMPs) help brokers integrate customer data with
profiles from third-party sources. The platforms synchronize with each other, for
instance by employing cookie synching, to recognize the user across various tracking
entities and match identifiers from disparate systems. DMPs have the capacity to recognize users not only on the Web, but also across different devices, by anonymously
matching data from multiple sources [67] using identifiers discussed in Section 6.3.
These identifiers are then transformed into an alphanumeric string via cryptographic
methods. Thus, a user can be recognized as the same individual across various devices
and platforms, and their profile can be connected to more comprehensive data.
Today, users use multiple platforms and devices, making it challenging for data companies to match user data accurately. Deterministic matching and probabilistic matching are the two main approaches used for this purpose, with deterministic matching relying on common identifiers like email addresses or mobile identifiers, and probabilistic matching using algorithms and statistical models based on various data pieces. After a review of the existing literature and articles, details on two data companies that collect users’ data from diverse domains and concatenate it into profiles for further sharing are presented for examination and analysis.
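Deterministic matching of the kind described above (and echoed by the normalised, SHA-256-hashed identifiers mentioned for Oracle in Section 6.2) can be sketched as follows. The normalisation rules and the sample records are illustrative assumptions, not any broker’s actual pipeline.

```python
# Sketch of deterministic identity matching on a normalised, hashed e-mail.
# Normalisation rules and the sample records are illustrative assumptions.
import hashlib

def email_key(email: str) -> str:
    normalised = email.strip().lower()
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()

offline_record = {"email": "Jane.Doe@example.com", "segment": "frequent_traveller"}
online_event   = {"email": "jane.doe@example.com ", "cookie_id": "abc123"}

if email_key(offline_record["email"]) == email_key(online_event["email"]):
    # Same hashed key on both sides: the offline profile can be linked to the
    # online identifier without exchanging the raw address.
    print("match:", online_event["cookie_id"], "->", offline_record["segment"])
```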

6.1 Acxiom
Acxiom is one of the major players in the field of consumer data brokering and collec-
tion, claiming on their website that their primary focus is on providing the necessary
data, technology, and services to enhance customer experiences worldwide [65]. Acxiom
Marketing Solutions (AMS) functions as Acxiom’s data brokerage, while LiveRamp
is a Software-as-a-Service (SaaS) platform that helps marketers utilize their exist-
ing data sets. AMS gathers information from a variety of sources, including publicly
accessible records (such as data on births, deaths, marriages, divorces, and changes
of name and address), website cookies, surveys, and other digital tracking services.
Data is combined and sorted into various user categories. Sorting in this way goes
beyond simple demographics. For example, Bruce Schneier, a privacy expert, noted
in his book “Data and Goliath” that Acxiom categorizes its data collection into cat-
egories like “potential inheritor,” “adult with senior parent” and “households with a
diabetes focus or seniors needs”. These lists can then be sold to companies looking to
market to these specific groups [68]. LiveRamp’s primary service is “data onboarding”
or “offline matching,” which involves linking offline user records to online identifiers
tied to a browser or device. This enables LiveRamp’s clients to target users based on
various behavioral and attitudinal data, such as purchase history, TV show prefer-
ences, website traffic, and smart device usage [69]. According to LiveRamp’s Service
Policy [69], this approach allows advertising firms to display targeted ads to specific
individuals across email, mobile apps, websites, and addressable TVs.
Unique Identity Resolution: Acxiom’s AbiliTec is a highly effective recognition
solution that provides clients with a unique identity resolution service. By using various
offline and online touchpoints and multiple variables, AbiliTec goes beyond the basic
name and address information to achieve more precise recognition of individuals and
households. Acxiom’s Data Services API [70] states that the AbiliTec Link not only
connects consumer records across different databases, platforms, and devices, but also
assigns a single link for each record, allowing for a complete view of the user. As
a result, individuals and their families can be identified instantly using the “hashed
entity representation”, which utilizes phone numbers, email addresses, smartphone
IDs, or any combination of these, as well as name, address, city, and zip code.

6.2 Oracle
One of the largest multinational technology companies, Oracle, has been acquiring
several data brokers and consolidating their operations under the umbrella of Oracle
Data Cloud [71]. By merging information from many other data brokers, Oracle has
amassed an extensive repository of consumer data, which it has integrated into a full
service digital marketing platform [72].
To enhance targeted advertising, Oracle integrates offline data from loyalty card
programs, which are used by retail businesses to gather customer information, with
digital media. This connection between offline behavior and online profiles enables a
deeper understanding of consumers. Online profiles are generated using cookies saved
on a consumer’s device from partner websites. With Oracle Data Cloud, consumer
interests are tracked across both online and offline activities and displayed to adver-
tisers. This allows advertisers to utilize first-party and third-party data to create
personalized marketing campaigns and target specific audiences. Additionally, Oracle
provides data and campaign metrics in one convenient location. By using first-party
data, easily accessible third-party data, and new second-party data, companies can
build comprehensive customer profiles with Oracle. Oracle’s website and privacy pol-
icy indicate that personal information may be shared with third-party companies for
commercial purposes. As per Oracle’s privacy policy, user information is collected
through various methods. In certain instances, data is directly gathered using email
addresses provided by users during interactions with Oracle or its partners. In other
cases, unique identifiers like mobile device identifiers or cookie IDs on browsers are
used to gather user data. This information can be associated with interest segments
or profiles. Interest segments refer to a group of users who share similar preferences
or behavior and are utilized for direct marketing by Oracle’s customers. Profiles com-
prise a collection of attributes about a specific user or device, or a group of users or
devices that share similar attributes, which are used for marketing by Oracle’s clients.
A unique ID and matching process for Consumers: After generating data-driven
audiences, a business must establish connections with them, which can be challenging
due to a single user using multiple devices. Each browser has its own set of cookies, while
on mobile devices, all applications share a mobile advertisement ID that functions
similarly to a cookie.
Oracle ID Graph (OIDG) connects these ID sources and validates them against high-quality data that is known to be true because it is made up of verified transaction and subscription data [73]. OIDG consolidates all interactions
across various channels in order to develop a single actionable user profile. To iden-
tify users on both online and offline platforms, all of Oracle’s clients send their match
keys (which can be any unique user ID) to the company. An encrypted email is the
most common type of match key, as it can be gathered either online or offline. In
addition to Oracle hashed IDs (normalized SHA-256 hashed email address or phone
number), clients can utilize encrypted or hashed unique user IDs (UUIDs) based on
phone numbers, email addresses, physical addresses, client account numbers, and even
IP addresses. Oracle synchronizes match keys with the network of user profiles that are
connected by OIDG. All Oracle clients use OIDG to manage their IDs and attributes.
The data transmitted by various applications is gathered not only by the organization that owns those applications (for example, Facebook is owned by Meta Platforms, Inc.) but also by various other data companies that work mainly by gathering, tracking and linking a user to a profile and then selling it on to their clients.

6.3 Tracking and linking user profiles


In Figure 2, an overview of the identifiers used for tracking users across various
browsers, devices, and platforms is presented. The ability to identify users is primarily
determined by the specific device being used and whether they are accessing content
through a web browser or a mobile app.
Cookies, device fingerprints, HTML local Storage, and ETags are commonly
employed in browser-based identification methods.

Fig. 2: Identifiers used to track users across devices and platforms. The figure groups user-identification methods into browser-based identification (first-party and third-party cookies, evercookies, device fingerprints, HTML local/session storage, and ETags), mobile-based identification (cookies in mobile web browsers; Google’s Advertising ID (GAID), Apple’s Identifier for Advertisers (IDFA) and the Open Device Identification Number (ODIN) for in-app identification), and platform-based identification (Android, iOS and Windows advertising IDs).

• Cookies are small files that websites send to a user’s browser. These files enable
tracking and monitoring of the websites visited by the user and the specific items
they interact with or click on during their browsing sessions. “First-party” cookies,
are generated by the domain visited by a user, and “third-party” cookies, created by
domains other than the one being visited, represent the two types of cookies. Third-
party cookies allow advertising companies to monitor users’ browsing activities.
Another variation of cookie is “ever cookie”, a persistent type of cookie that is
stored in multiple locations within the user’s browser and device. Even if deleted,
an ever cookie has the ability to restore itself from alternative locations.
• Device fingerprinting is the practice of recognizing and monitoring individuals by analyzing the unique attributes of their devices. This involves collecting various details like browser version, operating system, language, installed plugins, and settings to establish a distinctive identifier (a minimal sketch of this idea is given after this list).
• HTML5 Local Storage provides a means to store and gather user information.
There are two types of local storage: “local storage” for data retention without
an expiration date, and “session storage” for temporary data during a specific ses-
sion. In comparison to cookies, HTML5 local storage offers superior capacity and
accessibility, while eliminating the need for a web server call [74].
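As noted in the device-fingerprinting bullet above, such attributes can be combined into a single stable identifier. The sketch below hashes a handful of illustrative browser attributes into a fingerprint; the attribute set is an assumption made for illustration, and real fingerprinting scripts combine far more signals.

```python
# Sketch of a browser fingerprint: hash a few device/browser attributes into
# one identifier. The attribute set is illustrative; real trackers combine
# many more signals (canvas, fonts, audio, screen metrics, etc.).
import hashlib
import json

attributes = {
    "user_agent": "Mozilla/5.0 (X11; Linux x86_64) ...",
    "language": "en-GB",
    "timezone": "Asia/Kolkata",
    "screen": "1920x1080x24",
    "plugins": ["pdf-viewer"],
}

canonical = json.dumps(attributes, sort_keys=True)
fingerprint = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
print(fingerprint[:16])   # stable as long as the attributes do not change
```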
For mobile-based identification, identifiers used to track users are:
• Cookies as explained in browser-based identification methods, are also used in case
of Mobile-browsers.
• For mobile applications, identifiers used are Google’s advertising ID (which is a 32-
digit string of characters), Apple’s IDFA and Open Device Identification Number
(deprecated).
The last classification is platform-based i.e., based on type of platform being used by
the user:
• Android uses Google’s advertising ID. For example, the advertising ID of one of our Android test devices is “de7d3063-968a-4450-932c-7abf87a0a261”, which can be accessed in the device settings under Google → Ads.
• iOS uses Apple’s IDFA. For one of our iOS test devices, the identifier is “3F946465-AE34-4E7E-AF7E-C3C4A4647CCA”, which can be accessed using an app named “Get My IDFA” [75].
• Windows has its own advertising ID.

7 Privacy and Security issues of user profiling


While user profiling can bring benefits to various domains, it also poses potential
security and privacy risks. In the realm of online platforms, data privacy and security
are interrelated concerns. Data privacy concerns the authorized access of user data,
while data security focuses on implementing measures to protect privacy in the event
of any violation.
Online activities such as searches, browsing histories, purchases, and social media
engagement can provide digital cues about an individual’s interests and personality.
For example, IBM’s Personality Insights service can create a detailed user profile from
digital communication, including text messages, emails, tweets, and forums [76]. This
detailing goes beyond just demographics and location data. Online behavior can reveal
various aspects of a user’s personality, such as their level of extroversion, environmen-
tal awareness, political inclination, or travel preferences. Companies collecting large
amounts of data from both online and offline sources can use this information to create
rich user profiles for billions of people.
The hardest part of living in a digital world lies in striking a balance between free service consumption and the disclosure of personal information to businesses for sale.
Platforms such as Google aim to streamline users’ digital and physical lives, offering
convenient and cost-effective services. These platforms collect user data to improve
their services and enhance customer satisfaction. Since users prefer such platforms,
they willingly provide their personal information, thereby facilitating data collection
by these platforms.

7.1 Privacy Issues

Fig. 3: Source: Disconnect.me. This figure illustrates the presence of trackers on the
ICICI bank website. Colored domains signify tracking sites, while gray domains may
also track users. Left part: domains informed when the user opens the website without
logging in. Right part: domains informed when the user has logged in. (Accessed Date:
October, 2021)

Data-driven companies can derive substantial benefits from user profiles in various
areas. However, these profiles can also be used against users. Data brokers offer ser-
vice providers access to customer behavior information, often without the customer’s
knowledge. Armed with this information, these service providers can then target cus-
tomers with highly personalized ads, promotions, and even phone calls, all based on
the data provided by the data broker. As digital technology and data collection become
ubiquitous, such practices are becoming increasingly opaque. Users are often unaware
of what personal data about them is collected, analyzed, and shared or sold. Small
businesses may upload their customer data to data brokers, who combine it with online
data collected in real-time from other sources.
An entire sector of business has flourished that focuses on exploiting and monetizing users’ personal data. Users are often unaware of these companies as they do not interact
with them directly. These data companies capture data from every intersection of the
digital world, creating dynamic and fragmented profiles that are distributed across

multiple databases. Google engages in data surveillance by continuously tracking a
user’s various activities such as website visits, Gmail account usage, mobile device
movements, and YouTube usage, among others, without the user being fully aware
of its implications. This monitoring can result in the collection of highly personal
data, as demonstrated by the presence of trackers on a bank website even after a
user has logged in, as shown in Figure 3. Companies make little effort to improve this
opaque state of affairs. The degree of corporate control over the current digital
environment is a cause for concern with regard to information privacy. [77] investigated
the transformative impact of digitalization and the privacy challenges it creates. Their analysis
emphasized the need for robust solutions to safeguard patient information and pave
the way for responsible, ethical adoption of digital healthcare. Profile data
gathered by companies could be potentially abused for personal, organizational, or
political purposes.
The ubiquitous sharing of personal data from apps, websites, devices, and ser-
vices with third-parties and the subsequent use of this data by various companies has
raised concerns. E-commerce companies rely heavily on distributed computing, with
their data centers spread globally. Data sets containing significant amounts of
personal information about users are gathered, which can be used legitimately or
exploited. For example, Amazon has operations across 12 regions, each with multiple
data centers that are susceptible to physical attacks and persistent cyber-attacks. In
2014, eBay’s customer data was compromised in a cyber-attack that resulted in the
theft of names, email addresses, physical addresses, telephone numbers, birth dates,
and encrypted passwords. The stakes are higher than ever, given the large number of
people affected by personal data security incidents. At Arkansas State University, the
professional development system was breached, exposing the personally identifiable
information of individuals [78]. Table 5 displays
some high impact data breach incidents that occurred between 2021 and 2024. These
incidents create opportunities for cyber-criminals to exploit vulnerabilities, such as
through phishing, leading to severe consequences like unauthorized purchases, fund
theft, or identity theft for individuals.

7.2 Security Issues


The insights obtained from the gathered data are of significant value to criminals.
Illicit trading of individuals’ personal information, such as their financial details, can
assist fraudsters in locating their targets and lead to further harm through techniques
such as fake calls and identity fraud. As per a study by [85], nearly 47% of Americans
experienced financial identity theft in the year 2020, with losses from identity fraud
rising from $502.5 billion in 2019 to $712.4 billion in 2020, marking a 42% increase. In
2014, an intruder took advantage of the vulnerabilities in InfoTrax’s server and a cus-
tomer’s website by deploying a malicious code that enabled remote access to InfoTrax’s
server, allowing them to extract data from systems. By interrogating certain databases
that contained full names, email addresses, phone numbers, addresses, Social Security
Numbers (SSNs), admin IDs and passwords, and distributor user IDs and passwords,
the intruder obtained the personal information of nearly one million customers. The
attacker also obtained access to over 2,300 unique sets of full payment card information [86].

Such personal information, as in the case of the InfoTrax breach, can be exploited to
carry out fraudulent activities. Stolen names, addresses, and SSNs could be used by
identity thieves to apply for credit cards in the victim’s name, which could affect the
victim’s credit score when the identity thief fails to pay the credit card bills. Under-
standing the financial and personal information of potential victims, including their
income and other personal details, can aid criminals in committing severe crimes such
as blackmailing or kidnapping. For instance, patients at Vastaamo clinic in Helsinki
were blackmailed after their data was stolen in two breaches. Attackers demanded a
€450,000 ($530,000) bitcoin ransom and published clinical records on the dark web.
Names, contact details, and therapy notes of hundreds were compromised [87]. User
profiles make it convenient to find users that have similar behaviour, making certain
sections of the population vulnerable to criminal activities. For example, gathering pro-
files of young girls can enable criminals to morph their public photos and expose them
to pornography.
In 2016, an analytics company called “Cambridge Analytica” gained unauthorized
access to at least 87 million profiles of Facebook users, and utilized this data to create
voter profiles to influence their voting decisions. This company provided voter data and
analytics to both the Trump campaign and the campaign of former national security
adviser John Bolton during the 2016 election cycle. This breach and exploitation of
personal data was considered detrimental to democracy by various legislators [88].
Therefore, the abundance of user information increases the likelihood of attracting
intruders to the system, which increases the threats.
The data exposed in public forums during a particular security incident can be
combined with data from other breaches by particularly determined attackers to create
detailed profiles of potential victims. Using this information, they can conduct much
more convincing phishing and social engineering attacks or even commit identity theft
against those whose information has been exposed. In the next section, we explain the
security policies in place and their challenges.

7.3 Existing security policy and its challenges


The right to privacy lacks a clear definition and legal protection, especially in the face
of internet-driven data collection and sharing. New technologies pose fresh challenges
to privacy, calling for a reevaluation of laws and safeguards in the digital age [89].
Many smart devices use a permission-based security model to prevent unauthorized
access to sensitive system resources and user data. For instance, Android, which is
among the most widely used operating systems for smart devices, requires application
developers to request permission from users to access sensitive resources like GPS
sensors, cameras, microphones, and private data via their Android manifest file [90].
This model gives users control over their device by allowing them to decide which apps
can access their resources and data. However, this security mechanism is not entirely
foolproof [91, 92], as intruders can use two common techniques to bypass it: side-channel
and covert-channel attacks, which expose alternative mechanisms for accessing
system resources that, unfortunately, are not audited by the permission mechanism
[93]. Side-channel attacks exploit security vulnerabilities inherent in the system to
access system resources, for example, by leveraging the state of the CPU cache to infer

sensitive data during program execution. The key insight here is that as programs
execute, the state of the cache gets updated constantly. Considering the speed of CPU
caches, these side channels can leak large amounts of data quickly. Covert channels
involve intentional efforts by two parties to share system resources without violating
the underlying security mechanism, such as through shared storage. For example, a
communication can involve directly or indirectly writing data to a storage location
and another process directly or indirectly reading that location as in case of shared
storage. Therefore, while security policies are in place, they may not provide sufficient
protection for user data on their devices.
Maintaining user privacy is crucial when conducting profiling, therefore it is imper-
ative to implement necessary security measures to safeguard personal information.
There are policies being placed to protect user’s data in many countries. The Digital
Personal Data Protection Act 2023 of India, for example, aims to regulate the handling
of digital personal data in a manner that respects both individuals’ right to safeguard
their personal information and the necessity to process such data for lawful purposes,
as well as other related or incidental matters [94]. Next, we present an experimental
analysis of data leakage from mobile phones using these identifiers and trackers.

8 Case study of Privacy Leakage


In a study by [11], the exposure of Personally Identifiable Information (PII) from
widely used applications on iOS, Android, and Windows platforms was uncovered.
[12] identified PII exposure within HTTP packets through the classification of Android
app packets. To classify outgoing HTTP packets, federated learning was applied,
in keeping with the data sensitivity and privacy concerns of the users. As of 2023,
instances of PII exfiltration from mobile apps persist. This section presents a case
study concentrating on identifying possible exfiltration of sensitive personal data from
Android applications installed on smartphones.

8.1 Clear text data transmissions in smartphones


Mobile devices often need access to personal and crucial user information in order
to function effectively and fulfill their intended purposes. However, this necessity can
also expose potential privacy risks, as sensitive data becomes susceptible to unau-
thorized access, misuse, or exploitation. Mobile apps and the third-party libraries
integrated within them might transmit PII to numerous external application servers,
which require this data for service provision or user tracking. Encryption plays a vital
role in ensuring privacy in digital communications. However, data packets transmitted
without encryption can be easily intercepted by adversaries and network observers.
Therefore, if data is transmitted from devices in plain-text format, it may be consid-
ered a significant design flaw. Even if plain-text data is compressed, it can still be
effortlessly restored to its original form by recovering the compressed message and
employing widely used compression algorithms. For instance, [9] identified clear text
sensitive medical information, known as e-PHI, within encrypted communications.
The National Institute of Standards and Technology provides a definition of PII
as “Any representation of information that permits the identity of an individual to

whom the information applies to be reasonably inferred by either direct or indirect
means” [95]. The General Data Protection Regulation (GDPR) states that personal
data should only be collected for explicit, legitimate, and specific purposes, and should
not be processed in any way that is incompatible with those purposes. Numerous stud-
ies in the literature focus on controlling user tracking and preventing the unauthorized
extraction of PII from mobile devices. These studies can be broadly categorized into
three groups: permission-based analysis [96], static and dynamic analysis of source
code [97, 98], and network-based analysis [11, 99–101]. In the following section, a tech-
nique is examined for gathering network traffic from smartphones and identifying any
clear text data that might expose sensitive user information and behavior. The exper-
iment entails two stages: collecting the traffic and detecting instances of plain-text
sensitive information.

8.2 Experimental Design


Figure 4 illustrates the design of our network data capturing framework for Android
devices. The network data capturing and monitoring system comprises a smartphone
and a man-in-the-middle framework or tool. We perform the experiments on an emula-
tor, and the analysis of iOS applications remains a part of our future work. We utilize
the Genymotion emulator to simulate mobile devices due to its wide device variety and its
faster, more lightweight nature compared to other frameworks. To intercept the traffic of mobile
devices, we use Mitmproxy, a command-line tool that acts as an HTTP/HTTPS proxy
and records the traffic.
Mitmproxy connects the smartphone to the Internet, therefore recording all incom-
ing and outgoing traffic. As long as the client trusts mitmproxy’s built-in certificate
authority, Mitmproxy will decrypt the traffic. Most of the time, this means installing
the mitmproxy CA certificate on the client device. So, the certificate is added to the
trusted store of the emulator. For this the device (emulator) is rooted. In this experi-
ment, the applications mentioned in Table 6 that have at least 10 million downloads are
selected. By doing so, it is ensured that we are analyzing the most popular applications
used by the majority of users as of March, 2023. These applications are downloaded
from the Play Store, and we manually sign up and log in to each application because the
tools that generate automatic UI events lack this functionality. The UI/Application Exer-
ciser Monkey is used to generate synthetic user inputs and performs a 10-minute tour
of the application. The traffic captured undergoes preprocessing by employing a set
of decoding operations utilizing the UTF-8 scheme in order to store the traffic in the
appropriate format in log files for final analysis.
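For concreteness, the following is a minimal sketch of this capture step, written as a mitmproxy addon script; the output path, example package name, and Monkey event count are illustrative assumptions rather than the exact scripts used in our experiments.

# capture_flows.py -- minimal mitmproxy addon sketch (run with: mitmdump -s capture_flows.py)
# Synthetic input can be generated separately, e.g.:
#   adb shell monkey -p com.example.app --throttle 500 -v 3000
from mitmproxy import http

LOG_FILE = "captured_flows.log"  # assumed output location

def request(flow: http.HTTPFlow) -> None:
    """Append every intercepted HTTP(S) request, decoded as UTF-8, to a log file."""
    with open(LOG_FILE, "a", encoding="utf-8") as fh:
        fh.write(f"{flow.request.method} {flow.request.pretty_url}\n")
        for name, value in flow.request.headers.items():
            fh.write(f"{name}: {value}\n")
        body = flow.request.get_text(strict=False) or ""  # lenient body decoding
        fh.write(body + "\n---\n")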
Before running the application, we grant it all the permissions it requires. We then
execute each application while capturing the network traffic it generates. In our tests,
certain applications generated no traffic either because they detected our device was
rooted or because they required a SIM card or valid phone number that our test device
lacked. Therefore, we removed these types of applications from our list and analyzed
the remaining ones.

Fig. 4: Data capturing framework

Fig. 5: Leaks captured from Myntra application (Version: 4.2201.1; Accessed Date:
March, 2023 )

8.3 Collected Dataset and Results


In order to gain a comprehensive understanding of mobile tracking and advertising,
we chose to diversify our app categories. Therefore, we run a total of 18 applications
under 6 categories: Social, Education, Entertainment, Travel, Shopping and Others.
Let D_{t,A} be the data transmitted at time t from mobile application A. We manually
check D_{t,A} for the presence of three parameters: location, user identifier, and device
identifier. For instance, Figure 5 shows that user details such as name, gender,
phone number, the OTP required for login, email address, UPI ID, and residential address
are exposed in plain text by the Myntra application. The discovered privacy leaks
have been summarized in Table 6. Among various app categories, social and shopping
apps exhibit the highest degree of user location leakage, while user gender and date of
birth are commonly exfiltrated user identifiers across all app categories. Furthermore,
the exposure of device identifiers reveals the pervasive tracking ecosystem prevalent
within the applications. Results show that a substantial amount of user information
can be extracted from network traffic and is mostly hidden from users. We have opted
for crowdsourcing to label the dataset for the presence of PII. Recognizing human

judgment as crucial in dataset evaluation, we engaged 5 users in labeling the dataset,
and the task is nearly 60% complete.
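A simple way to automate the manual check for these three parameters is a keyword scan over the decoded logs. The sketch below assumes the log format produced by the capture addon above and uses placeholder PII values seeded on the test device; it is illustrative rather than the exact procedure we followed.

import re

# Placeholder values seeded on the test device (illustrative only).
SEEDED_PII = {
    "email": "testuser@example.com",
    "advertising_id": "de7d3063-968a-4450-932c-7abf87a0a261",
    "location": "32.73,74.87",   # assumed lat,lon used during the session
}

def scan_flow(flow_text: str) -> list:
    """Return the labels of seeded PII values found in clear text in one flow."""
    return [label for label, value in SEEDED_PII.items()
            if re.search(re.escape(value), flow_text, flags=re.IGNORECASE)]

with open("captured_flows.log", encoding="utf-8") as fh:
    for idx, flow_text in enumerate(fh.read().split("\n---\n")):
        hits = scan_flow(flow_text)
        if hits:
            print(f"flow {idx}: possible clear-text exposure of {hits}")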

8.4 Privacy Inference Analysis


The purposes of collecting data from various applications can be classified as either
obvious or non-obvious. The obvious purpose of data collection is to ensure proper app
functionality for users, which can be further divided into the categories of “benign”
and “non-benign”. For instance, in the case of Skype, it is obvious and considered
benign to transmit a user’s name from the device for the application to function, while
transmitting and revealing a user’s text messages is non-benign and violates privacy.
Non-obvious purposes of data collection can also be categorized as “benign” and
“non-benign”. Apart from collecting data for the primary app function, some apps collect
more data and permissions than necessary for other purposes. For example, Skype’s
transmission of a user’s static location provided during profile creation is considered
benign, whereas transmitting a user’s phone number is deemed non-benign.
Many apps offer convenient login options that allow users to sign in using their
Google or Facebook credentials. Although this simplifies the login process, it also
means that the app receives the user’s Google or Facebook account information.
LinkedIn traffic analysis has revealed that certain details, such as a user’s Google pro-
file picture, are shared and transmitted through URLs. While these leaks may not
initially appear to be dangerous, they provide enough data for potential eavesdroppers
to create a profile of the user.
An adversary can build a profile of a user through temporal expansion of the
aggregated data. At t = t0, the adversary can infer very little information about the
user; over time t = t0 + ∆t1, it becomes possible to infer more details, and finally
t = t0 + ∆t1 + ∆t2 gives a detailed look into the life of the user. Spatial expansion
is another aspect of generating a user’s profile, involving the concatenation of data
from various applications into a unified profile. Let us consider the data gathered by
an adversary from the Skype application, represented as d_skype. By analyzing data
from applications like Cleartrip (d_cleartrip) and Myntra (d_myntra), the adversary can
create a more comprehensive profile d = d_skype + d_cleartrip + d_myntra of
the user’s behavior. For instance, in a threat model, the adversary may use d_skype
to track the user’s current location, d_cleartrip to determine their vacation plans, and
d_myntra to find the user’s residential address. This type of information could
potentially be exploited by an observer for criminal activities such as burglary when the
user is away on vacation. Emphasizing the significance of strengthening user data
privacy becomes apparent in this threat model, aiming to prevent similar situations
from occurring in the future. For instance, [14] studied the feasibility of fingerprinting
motion sensors. Authors identified the type of smartphone based on the data transmit-
ted by accelerometer and gyroscope. It was concluded that it is feasible to fingerprint
smartphones from its motion sensors and can be accessed by web page publisher or
advertisers without users’ awareness.
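The spatial-expansion step described above amounts to merging per-application observations into a single profile keyed by attribute. The fragment below illustrates the idea; the attribute names and values are made up for illustration and are not drawn from any real user’s data.

def merge_profiles(*per_app_data: dict) -> dict:
    """Spatial expansion: union the attributes observed across applications."""
    profile = {}
    for app_data in per_app_data:
        for attribute, value in app_data.items():
            profile.setdefault(attribute, set()).add(value)
    return profile

# Illustrative per-app observations d_skype, d_cleartrip, d_myntra.
d_skype     = {"current_city": "Jammu"}
d_cleartrip = {"trip_destination": "Goa", "trip_dates": "2023-04-10..2023-04-17"}
d_myntra    = {"residential_address": "House 12, Gandhi Nagar"}

print(merge_profiles(d_skype, d_cleartrip, d_myntra))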

8.5 PII Detection in network traffic
There are many approaches that can be used to analyze the network traffic of mobile
devices in an efficient and secure manner in the existing state-of-the-art. [11] used
machine learning to train a C4.5 decision tree for detecting leaks in a centralized
manner. Our work [102] analyzed the dataset provided by ReCon and provided
improved versions of the classifiers in order to enhance the detection framework.

8.5.1 Dataset Overview and Pre-processing


In ReCon’s work, controlled experiments are conducted using smartphones, followed
by factory resets and connecting the devices to Meddle [103]. With Meddle, all network
traffic is routed through a proxy server using VPN tunnels. Traffic is intercepted and
modified by software middleboxes once it reaches the proxy server. Their raw IP traffic
is logged using tcpdump and HTTP flows are extracted using bro. In the following
steps, they search the captured flows for the PII values that were loaded onto the devices.
Algorithm 1 shows the feature extraction and selection technique: We utilized a
bag-of-words model inspired by ReCon’s approach. This model treats certain charac-
ters (such as , ; [] / ()) as separators, defining words (features) as sequences of characters
between them. A binary vector is created for each flow, marking a word with 1 if it
occurs in the flow and with 0 otherwise. However, the model generates
an extensive output, and several steps are taken to reduce and select relevant features.
Initially, a feature is eliminated if its word frequency is below a predefined threshold.
To preserve rarely occurring PII, oversampling is conducted to surpass the selected
threshold. To prevent the model from utilizing PII values as features, randomization
of PII values in each flow during training is done. Lastly, stop-word-based filtering
using tf-idf is employed to eliminate commonly occurring words in HTTP flows (e.g.,
content-length, en-us, etc.). Only features with low tf-idf values, which did not appear
adjacent to a PII leak in a flow, are considered.

8.5.2 Methodology
To start our work we first pre-processed the dataset to select the relevant features using
tf-idf. As mentioned earlier, following the elimination of frequently appearing words
in HTTP flows using tf-idf, we proceed to iterate through all features. Probabilities
calculated to map a particular keyword to a PII value are depicted in Figure 6. For each
feature, we calculate a confidence level using heuristics, similar to the approach in
ReCon’s work:

Heuristic_1: \quad P_{type,key} = \frac{K_{PII}}{K_{all}}

Here, K_{PII} gives the number of times the key appeared in flows with positive PII
leakage, and K_{all} gives the number of times the key appeared in all flows. Features having
a confidence level greater than the chosen threshold are selected for further examination.
Figure 7 shows a snapshot of the features selected for a domain named “ea.com” for
Android OS.
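The computation behind Heuristic 1 can be sketched as follows; the flow representation (a set of keys plus a binary leak label per flow) and the example values are assumptions made purely for illustration.

def key_confidence(flows):
    """Heuristic 1: P(type, key) = K_PII / K_all for every key observed in the flows."""
    k_all, k_pii = {}, {}
    for keys, leaks_pii in flows:            # flows: iterable of (set_of_keys, bool)
        for key in keys:
            k_all[key] = k_all.get(key, 0) + 1
            if leaks_pii:
                k_pii[key] = k_pii.get(key, 0) + 1
    return {key: k_pii.get(key, 0) / total for key, total in k_all.items()}

THRESHOLD = 0.2   # confidence threshold, as used with Heuristic 1 later in the text
example_flows = [({"uid", "lat", "user-agent"}, True), ({"user-agent", "lang"}, False)]
selected = {k for k, p in key_confidence(example_flows).items() if p > THRESHOLD}
print(selected)   # keys whose confidence exceeds the threshold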

Algorithm 1: Feature extraction and selection
Input : a) Threshold for word frequency: freq_t,
        b) Stop-word list: stop_words,
        c) Threshold for tf-idf value: tfidf_t
Output: Selected features for training the model
Define list of separators: separators ← {‘,’, ‘;’, ‘{}’, ‘[]’, ‘/’, . . . , ‘()’}
Initialize a 2-D matrix V.
for each HTTP flow i = 1, 2, . . . , n do
    W_i ← Split(flow_i, separators)
end for
for each word w_j ∈ W_i where 1 ≤ i ≤ n do
    if w_j appears in flow_i then
        mark V_{i,j} as 1
    else
        mark V_{i,j} as 0
    end if
end for
Randomize any PII values in W_i to obtain a new set W'_i
Calculate the word frequency of each word w_j across all flows: f_j = Σ_{i=1}^{n} V_{i,j}
for each word w_j do
    if f_j < freq_t and w_j ∉ PII then
        Remove w_j from the list of features.
    end if
    if f_j < freq_t and w_j ∈ PII then
        Oversample w_j to meet the threshold.
    end if
    Calculate the tf-idf value of w_j across all flows.
end for
if tfidf_{i,j} > tfidf_t or w_j ∈ stop_words then
    Remove w_j from the list of features.
end if
for each flow_i do
    Remove all features that appear adjacent to a PII leak in W'_i.
end for
Output the selected features as a set of binary vectors V_{i,j} for each flow_i.
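A rough Python rendering of Algorithm 1 is given below. The separator set, the threshold values, and the simplification that tf-idf reduces to idf for binary term frequencies are assumptions made for illustration; the randomization of PII values and the adjacency check are omitted for brevity.

import math
import re

SEPARATORS = r"[,;{}\[\]/()=&?\s]+"   # assumed separator set (Algorithm 1 lists examples)
FREQ_THRESHOLD = 3                     # illustrative word-frequency threshold (freq_t)
TFIDF_THRESHOLD = 2.0                  # illustrative tf-idf threshold (tfidf_t)

def extract_features(flows, pii_words, stop_words=frozenset()):
    """Tokenise flows, keep frequent (or PII) words, drop stop words and words
    whose tf-idf exceeds tfidf_t, and return binary feature vectors."""
    tokens = [set(w.lower() for w in re.split(SEPARATORS, flow) if w) for flow in flows]
    vocab = sorted(set().union(*tokens))
    n = len(flows)
    freq = {w: sum(w in t for t in tokens) for w in vocab}   # word frequency f_j
    idf = {w: math.log(n / freq[w]) for w in vocab}          # binary tf, so tf-idf ~ idf
    features = [w for w in vocab
                if (freq[w] >= FREQ_THRESHOLD or w in pii_words)   # rare PII is retained
                and idf[w] <= TFIDF_THRESHOLD                      # drop words above tfidf_t
                and w not in stop_words]
    vectors = [[1 if w in t else 0 for w in features] for t in tokens]
    return features, vectors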

To initiate our experiments, we partitioned the ReCon dataset into two sets: a
training set containing 60 domains and a testing set with 12 domains. The train-
ing set comprises approximately 7200 data inputs and 6500 features. Our initial step
involved utilizing a Decision Tree to identify sensitive information in network traffic
packets. Our goal is to evaluate and improve the performance of the Decision Tree
model compared to previous works in detecting PII. We experimented with feature
manipulation, removing certain features from the dataset using various heuristics.
Detecting PII in network flows often requires recognizing intricate patterns and rela-
tionships among different data points. Leveraging the ability of neural networks to
model non-linear relationships, we decided to apply a neural network model to this
task after the Decision Tree experiments. Our neural network model employs the ReLU
activation function in all hidden layers to address vanishing gradient issues during

Fig. 6: a) Shows the list of probabilities that a particular keyword corresponds to a PII
value. b) Shows the list of probabilities calculated for keys in each domain and OS.

Fig. 7: Features extracted using Algorithm 1 for a single domain.

back-propagation. For our binary classification problem, the output layer uses the sigmoid
activation function to ensure the predicted output \hat{Y}_i falls within the range [0, 1]. The loss
function L employed in the current context is the binary cross-entropy function:

L = -\frac{1}{n}\sum_{i=1}^{n}\left(Y_i \cdot \log \hat{Y}_i + (1 - Y_i) \cdot \log(1 - \hat{Y}_i)\right)

The RMSprop optimizer is used to minimize L. At iteration t, for weight w, the RMSprop
update is given by:

v_t = \gamma \cdot v_{t-1} + (1 - \gamma) \cdot \left(\frac{\partial L}{\partial w}\right)_t^2

where v_t is the moving average of squared gradients at the t-th iteration and \gamma is the
decay rate controlling the weight given to previous squared gradients in the moving average.
The weight update therefore follows:

w_{t+1} = w_t - \frac{\alpha \, (\partial L / \partial w)_t}{\sqrt{v_t} + \epsilon}

where \alpha is the learning rate and \epsilon is a small constant typically set in the range of
10^{-8} to 10^{-10}. The choice of \alpha determines the time needed for the model to converge:
if it is set too small, convergence is slow; conversely, a large value of \alpha may cause
oscillation, preventing the error from decreasing below a certain value.
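As a concrete illustration, a Keras version of this architecture might look like the sketch below; the hidden-layer sizes follow the six-hidden-layer variant reported in Section 8.5.3, while the learning rate and training settings are assumed values, not the exact configuration of our experiments.

import tensorflow as tf

def build_pii_classifier(num_features: int, learning_rate: float = 1e-3) -> tf.keras.Model:
    """Feed-forward binary classifier: ReLU hidden layers, sigmoid output,
    binary cross-entropy loss, RMSprop optimizer."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(num_features,)),
        tf.keras.layers.Dense(2048, activation="relu"),
        tf.keras.layers.Dense(1024, activation="relu"),
        tf.keras.layers.Dense(512, activation="relu"),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=learning_rate),
                  loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

# Example usage (shapes are illustrative): model = build_pii_classifier(num_features=6500)
# model.fit(X_train, y_train, epochs=20, batch_size=64, validation_split=0.1)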
In our study [104], upon examination, we found numerous features that are repeti-
tive or duplicate. For instance, features like ‘connection’ and ‘Connection’ are regarded
as distinct features despite having the same meaning. Such duplication could adversely
impact the effectiveness of the detection framework.
Moreover, we observed that certain encoded strings, such as
“QVQwOjEuMDozLjA6NjA6akxKZ2d5RHF5NDhoMFVYbHJnanhqZ3B1eWV
CRXFjbFYyZ3E6NTIxMjQ6bmxpYzU”
are incorporated as features, yet they lack meaningful information. Including these
strings as features merely inflates the total number of features without adding any
value in terms of informative content. Hence, we implemented various techniques to
filter out such features, as elaborated below:
1. Removing duplicates: Removing duplicates involved a two-step process. Initially, we
transformed all feature names to lowercase, ensuring uniformity. Subsequently, we
grouped similar features together, effectively eradicating redundant columns from
the dataset.
2. To decode encoded strings and extract insights, we employed a tool named
CyberChef [105], which employs diverse decoding operations to unveil the underly-
ing information. Despite employing several decoding operations, it became apparent
that many encoded strings either failed to reveal meaningful information or did not
pertain to any PII. Consequently, we excluded such features from further analysis.
3. Stemming: We employed Porter’s Stemmer algorithm for suffix stripping to reduce
features to their root forms, a process known as stemming. Following this, we
clustered together all features that shared a common root form. This approach
enabled us to streamline and organize the features according to their linguistic
similarities. For instance, the feature ‘attempting’ is grouped under the feature ‘attempt’.
4. We noticed that some features had lengths significantly longer than typical words,
while others were very short. We theorized that PII words would likely fall within
a certain length range and conform to English language patterns. For example,
URLs or file names often have longer lengths. After experimentation, we found that
setting a threshold of 70 effectively captured all relevant words. Hence, any features
exceeding this length were omitted from further analysis. Additionally, strings with
a length of 1 were also excluded (a sketch of these filtering steps follows this list).
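The sketch below illustrates filtering steps 1, 3, and 4 (case-insensitive de-duplication, Porter stemming, and length-based removal); the decoding of opaque strings in step 2 was done interactively with CyberChef and is not reproduced here. The length bounds follow the text, while the grouping structure is an illustrative choice.

from nltk.stem import PorterStemmer

MAX_LEN, MIN_LEN = 70, 2   # length bounds: drop strings longer than 70 or of length 1

def clean_feature_names(feature_names):
    """Group feature names by lowercased root form after dropping extreme lengths."""
    stemmer = PorterStemmer()
    groups = {}
    for name in feature_names:
        lowered = name.lower()                         # step 1: case-insensitive dedupe
        if not (MIN_LEN <= len(lowered) <= MAX_LEN):   # step 4: drop too-long/too-short names
            continue
        root = stemmer.stem(lowered)                   # step 3: stem to the root form
        groups.setdefault(root, set()).add(lowered)
    return groups

print(clean_feature_names(["Connection", "connection", "attempting", "attempt", "x"]))
# e.g. {'connect': {'connection'}, 'attempt': {'attempt', 'attempting'}}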
After filtering out the relevant features, all experiments were rerun on the refined
dataset. The results of each experiment are detailed in the subsequent section.


Fig. 8: LIME explaining the prediction of model f(·). a) Shows the global context and the
decision boundary created by the complex non-linear model f(·); the grey point is the
predicted instance for which explanations are needed. b) Shows the decision boundary of
the simple linear model m(·); the idea is to examine the neighbourhood and provide a
simple explanation of the local region, since features that are important in the global
context may not be important in the local area. c) Shows the perturbed data points
(black points); the exponential kernel (πx) can be viewed as a heat map, and the perturbed
data points are weighted according to their distance to the predicted instance.

To evaluate the interpretability of our best-performing models (decision tree and


neural network), we utilized LIME (Local Interpretable Model-Agnostic Explanations)
[106]. The main idea behind LIME is that a model is trained to approximate the
predictions of the underlying black-box/complex model. LIME generates a new dataset
consisting of perturbed samples and the corresponding predictions of the complex
model as shown in Figure 8. On this new dataset, LIME then trains an interpretable
model m such as linear model or decision tree that is weighted by the proximity of the
sampled instances to the instance of interest. Let \vec{x} \in \mathbb{R}^d represent the instance being
explained. The weighted loss is defined as:

L(f, m, \pi_x) = \sum_{\vec{p} \in P} \pi_x(\vec{p}) \left( f(\vec{p}) - m(\vec{p}) \right)^2

The function f represents the complex model being explained, with f : \mathbb{R}^d \to \mathbb{R}.
P represents the set of all perturbed instances, and \pi_x(\vec{p}) is the proximity measure
between an instance \vec{p} and \vec{x}, defining the local neighborhood of \vec{x}. The loss is
therefore the sum of squared distances between the predictions of the complex model f(\vec{p})
and those of the simple model m(\vec{p}), weighted by proximity.
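The snippet below illustrates how LIME’s tabular explainer is applied to a trained classifier. The toy data and the random-forest stand-in are assumptions made so that the example is self-contained; in our study the explainer is run on the trained DT and NN models over the real validation flows.

import numpy as np
from lime.lime_tabular import LimeTabularExplainer
from sklearn.ensemble import RandomForestClassifier

# Toy stand-ins for the binary flow vectors and the trained model (illustrative only).
rng = np.random.default_rng(0)
X_train = rng.integers(0, 2, size=(200, 20)).astype(float)
y_train = rng.integers(0, 2, size=200)
feature_names = [f"word_{i}" for i in range(X_train.shape[1])]
clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

explainer = LimeTabularExplainer(
    training_data=X_train,
    feature_names=feature_names,
    class_names=["no_leak", "leak"],
    mode="classification",
)

# Explain one validation-style instance; the weights come from the local surrogate m(.)
explanation = explainer.explain_instance(X_train[0], clf.predict_proba,
                                         num_features=len(feature_names))
for feature, weight in explanation.as_list():
    print(f"{feature:25s} {weight:+.4f}")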

8.5.3 Results
1. Decision tree: We employed a similar decision tree (DT) model as proposed by
ReCon on their dataset. Applying the same model to our filtered dataset resulted
in a 62% accuracy on the test data, indicating that feature filtering did not yield
improvements compared to our previous work [102]. After removing domain and
OS features, the classifier’s performance improved, but feature filtering alone did

not prove beneficial. Incorporating heuristics-1 and setting a threshold to 0.2, as
in [102], reduced the features from 6,538 to 3,343 but led to a decline in accuracy.
The same model on the filtered dataset yielded a similar accuracy.
We then expanded our test samples, as in [102], while employing heuristics-1.
This approach improved the model’s performance on the filtered dataset to 68%.
This increase may be attributed to the test sample points now demonstrating a
distribution similar to that of the model’s training data points. However, the issue
of overfitting persisted. To address this, we applied cost complexity pruning to
the model, which raised the accuracy to 72.89%, as shown in Figure 9. The figure shows
how the accuracy of the decision tree classifier changes with different levels of pruning
(controlled by the cost complexity parameter, ccp_alpha) for both the training and
test datasets.

Fig. 9: Performance of decision tree after using cost complexity pruning. Here, x-
axis represents the values of the cost complexity parameter and y-axis represents the
corresponding accuracy scores

2. Neural network: We constructed a neural network (NN) model with 1 input layer
and 6 hidden layers, each containing 2048, 1024, 512, 256, 128, and 64 neurons
respectively. The output layer comprised 1 neuron for binary classification. Follow-
ing the approach of [102], we integrated 9 top features (including domain and OS)
selected via Chi-Square scoring into the model. This yielded a training accuracy of
52.6% and a testing accuracy of 62%, mirroring the results of [102]. Analysis of the
NN’s weights before and after training indicated the model’s inability to effectively
learn.
Subsequently, using 7 features (after excluding domain and OS from the initially
selected features) instead of 9, the NN maintained the same accuracy of 62% as in
[102]. Unlike the DT model, which achieved 67% accuracy, the NN remained at 62%,
suggesting that it was not learning optimally.
Following the implementation of Heuristic-1, the training accuracy surged to
92%, but testing accuracy after feature filtering showed improvement to 57% com-
pared to [102]. Nonetheless, these results suggested overfitting on the given dataset.
To address this issue, we reduced the model’s complexity from 6 to 3 hidden layers,

akin to [102]. While the training dataset exhibited a 95% accuracy, testing revealed
a 74.5% accuracy, which is still 2.5 percentage points higher than the corresponding result in [102].
Table 7 shows the results of the best performing classifiers and indicates a significant
enhancement in the performance of the models as a result of filtering the features.
3. Explainability of obtained results after filtering using LIME: We utilized
LIME to refine the top classifiers in both categories by identifying crucial features
using the following heuristics.
Given a dataset of n_val validation samples (a subset that helps to tune hyperparameters
and assess the model’s performance on data it has not seen during training) and d features,
let imp_{i,j} represent the importance value of feature j in sample i, where i = 1 to n_val
and j = 1 to d. Initially, for each sample i, the set imp_{i,nonzero} = {imp_{i,j} | imp_{i,j} ≠ 0 for j = 1 to d}
is defined as the subset of non-zero feature importance values. These non-zero importance
values are then sorted in descending order to form imp_{i,sorted}.
• Heuristic 2 (H2): A threshold is calculated as Percentile({|imp_{i,sorted}|}_{i=1}^{n_val}, 75).
For each sample i, feature j is retained in selected_features_i if |imp_{i,j}| > threshold.
Finally, final_selected_features = ⋂_{i=1}^{n_val} selected_features_i.
• Heuristic 3 (H3): Let a_i = ⌈0.2 × d⌉ be the number of features chosen for sample i,
where 0.2 means selecting the top 20% of features, and top_feature_i be the set of top a_i
features selected for sample i. The set Common_features = {top_feature_i | count_{a_i} ≥ 0.75 × n_val}
contains the features common to 75% or more of the validation samples.
Once the essential features contributing to classifier predictions were identified, the
classifier was trained using only these selected features; a sketch of this selection procedure
is given after this results list.
Using the LIME Tabular Explainer [107], we generated explanations for 200
validation samples, covering all 2202 features. Employing Heuristics-2 and 3, we
selected features of higher importance and retrained the best classifiers from Table
7. The results for each scenario are presented in Table 8. Remarkably, the NN classi-
fier, utilizing Heuristic 2, demonstrates comparable performance to the classifier in
[102] using SHAP, considering both training and test accuracy along with training
time. While there is minimal impact on classifier accuracy with reduced features
(38 in H2 and 124 in H3 ), training time is reduced compared to the variations
proposed in [102] using SHAP. Conversely, the DT classifier doesn’t benefit from
LIME, as decision trees are inherently interpretable, and LIME may not provide
significant additional value in this context.
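As referenced above, the following sketch captures the two selection heuristics. The (n_val × d) array layout of LIME importance values and the toy data are assumptions made for illustration.

import numpy as np

def select_features_h2(imp, pct=75):
    """Heuristic 2: 75th-percentile threshold over non-zero |importance| values,
    per-sample selection above the threshold, then intersection over samples."""
    threshold = np.percentile(np.abs(imp[imp != 0]), pct)
    per_sample = [set(np.flatnonzero(np.abs(row) > threshold)) for row in imp]
    return set.intersection(*per_sample) if per_sample else set()

def select_features_h3(imp, top_frac=0.2, common_frac=0.75):
    """Heuristic 3: each sample's top 20% features by |importance|, kept only if
    they appear in at least 75% of the validation samples."""
    n_val, d = imp.shape
    a = int(np.ceil(top_frac * d))
    counts = np.zeros(d, dtype=int)
    for row in imp:
        counts[np.argsort(-np.abs(row))[:a]] += 1
    return set(np.flatnonzero(counts >= common_frac * n_val))

imp = np.random.default_rng(1).normal(size=(200, 50))   # toy importance matrix
print(len(select_features_h2(imp)), len(select_features_h3(imp)))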

8.5.4 Discussions
The results indicate that eliminating unnecessary features improves the performance
of machine learning models in predicting outcomes. Initially, our experiments followed
ReCon’s approach, where the DT model achieved a 62% accuracy on test data both
before and after filtering. Subsequently, we tested an NN model, initially matching
the DT’s accuracy. However, after removing two features from the dataset before
training, the DT’s accuracy improved from 62% to 67%, and post Heuristic-1, the
DT model exhibited signs of overfitting. In contrast, the NN model maintained a

62% accuracy, failing to adapt or learn due to unchanged model weights. We then
employed Heuristics-1 to enhance our training accuracy, but this resulted in over-
fitting. To address this, we simplified the model’s complexity to mitigate overfitting.
Consequently, our testing accuracy for the NN model increased from 47% to 72% after
this adjustment, as in [102], and from 57% to 74.5% after post-filtering. On the other
hand, applying pruning to the DT model yielded an accuracy of 72.89% on testing
data. After applying LIME for feature selection, the accuracy increased to 75.01%,
with a reduced training time of approximately 3.62 seconds for the NN. However, the
DT model without LIME provided superior results.
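To make the pruning step concrete, the sketch below enumerates candidate ccp_alpha values from scikit-learn’s cost-complexity pruning path and evaluates the pruned trees on held-out data, mirroring the sweep shown in Figure 9; the synthetic data is a stand-in for the real flow vectors and is an assumption of this example.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the binary flow feature vectors (illustrative only).
X, y = make_classification(n_samples=2000, n_features=50, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Candidate pruning strengths come from the cost-complexity pruning path.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)

results = []
for alpha in path.ccp_alphas:
    clf = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_tr, y_tr)
    results.append((alpha, clf.score(X_tr, y_tr), clf.score(X_te, y_te)))

# Report the pruning level with the best held-out accuracy (cf. Figure 9).
best_alpha, train_acc, test_acc = max(results, key=lambda r: r[2])
print(f"best ccp_alpha={best_alpha:.5f}  train={train_acc:.3f}  test={test_acc:.3f}")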

9 Open research directions


Extensive research has been conducted on user profiling; however, a disparity remains
between the present technology and future requirements. The most demanding ele-
ments of the user profiling process involve establishing profiles for new users and
consistently updating existing users’ profiles to align with their evolving needs, inter-
ests, and preferences [108]. Therefore, the primary hurdle in developing an adaptive
personalized application lies in constructing an automated user profile.
We highlight several open research areas aimed at enhancing user privacy through
the implementation of security measures.
1. Could user information be shared with third parties in an unclonable format? Can
an unclonable transformation be applied to user information before sharing it? Is it
possible to share user information with first-party domains and then with third-party
domains in a transparent manner, without hiding it from users?
If a user declines to share information with external domains, it should be
prohibited from leaving an internal system and being transmitted to the outside
world. However, if the user needs to share certain information with a domain in
order to access its services, it can be shared in a format that cannot be replicated
or cloned. Before transmitting the data from the user’s device, a transformation
can be applied to prevent the creation of replicas of the user’s data.
2. How can users be informed and provide consent to first-party entities who have
collected their data to share it with third parties?
Is it possible to establish a system that can track the movement of user data
regardless of where it travels? Once a user has given consent to share personal infor-
mation with any domain, those domains should refrain from selling their customers’
data to external third parties. If they do, users should receive notifications.
3. What are the possibilities of sharing information via smart contracts?
The increasing popularity of smart contract technology is due to its advantages,
including security, transparency, cost-effectiveness, and autonomy. Essentially,
smart contracts consist of predefined rules agreed upon by involved parties. Such a
system could be advantageous in the context of data sharing. When users and the
domains they interact with agree on conditions related to data sharing, users gain
greater control over their personal information.
4. Can role-based access be provided to different requesting domains by end users?

All first-party and third-party domains, as well as trackers present in applica-
tions that users interact with, can access data transmitted by users’ devices. In
this scenario, implementing role-based access systems can be beneficial by grant-
ing access to requesting domains based on specific conditions. For instance, a user
can share data with an application on their device to enable its functionality, while
limiting the access of embedded trackers and third-party domains to that data.
5. Is it possible to create a profit sharing model between users and entities that sell
users’ data?
The current data ecosystem involves various service providers, advertising agen-
cies, and data gathering companies that profit significantly from sharing and selling
individuals’ personal information. Users are often overlooked in this system. Profit
sharing models could be implemented to support the growth of these data-driven
firms while also providing users with a portion of the profits.
To enable users to take control of their data, it is necessary to establish a framework
that grants them the authority to determine the collection and utilization of their data.
This not only ensures a personalized experience with the service but also enables them
to directly benefit from a portion of the revenue generated by the service provider.

10 Conclusion
This article conducts a thorough analysis of user profiling, delving into its security and
privacy implications. It underscores the pros and cons of user profiling, stressing cau-
tion in granting access to personal information. We propose a taxonomy for building
user profiles, offering a valuable resource for researchers. Identifying gaps in existing
literature, we suggest future research directions. As a POC, we examine sensitive data
leakage in applications across different categories. In regions without data protection
laws, users face potential legal gaps in case of data breaches. It is crucial for users
to weigh the benefits and risks of sharing information on platforms, especially in the
absence of legal remedies. Users should stay vigilant, review terms and conditions,
and make informed decisions about data sharing, given continuous corporate mon-
itoring. Developing techniques that gather user profile information while respecting
privacy is essential. Privacy-preserving methods for profiling can enhance user comfort
with data collection. Ongoing efforts, like Apple’s App Tracking Transparency in iOS
14.5, showcase steps toward safeguarding user privacy. A transparent data ecosystem
that informs users about data sharing is crucial. Collaboration between commercial
organizations and users is essential to build a digital world that leverages modern
technologies while prioritizing privacy and security. Some countries, such as India, are
introducing policies to protect their citizens.

References
[1] Gauch, S., Speretta, M., Chandramouli, A., Micarelli, A.: User profiles for
personalized information access. The adaptive web, 54–89 (2007)

[2] AliceWyman, Ilias, C., Rodaro, M., TyDraniu, Joni, Ghelman, M., Rok, Gar-
denhire, L., Jeff, Fabi: Trackers and scripts Firefox blocks in Enhanced Tracking
Protection (2023). https://support.mozilla.org/en-US/kb/trackers-and-scripts-
firefox-blocks-enhanced-track

[3] Solanki, R.K., Laxmi, V., Bezawada, B., Gaur, M.S.: Mapperdroid: Verifying
app capabilities from description to permissions and api calls. Computers &
Security 111, 102493 (2021)

[4] CVE-2019-3568. https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2019-3568. Accessed: December 09, 2023 (2019)

[5] Kanoje, S., Girase, S., Mukhopadhyay, D.: User profiling trends, techniques and
applications. ArXiv (2015)

[6] Gao, M., Liu, K., Wu, Z.: Personalisation in web computing and informatics:
Theories, techniques, applications, and future research. Information Systems
Frontiers 12(5), 607–629 (2010)

[7] Alam, I., Khusro, S.: Tailoring recommendations to groups of viewers on smart
tv: A real-time profile generation approach. IEEE Access 8, 50814–50827 (2020)

[8] Farnadi, G., Tang, J., De Cock, M., Moens, M.-F.: User profiling through
deep multimodal fusion. In: Proceedings of the Eleventh ACM International
Conference on Web Search and Data Mining, pp. 171–179 (2018)

[9] Wood, D., Apthorpe, N., Feamster, N.: Cleartext data transmissions in consumer
iot medical devices. In: Proceedings of the 2017 Workshop on Internet of Things
Security and Privacy, pp. 7–12 (2017)

[10] Moghaddam, H., Acar, G., Burgess, B., Mathur, A., Huang, D., Feamster,
N., Felten, E., Mittal, P., Narayanan, A.: Watching you watch: The tracking
ecosystem of over-the-top tv streaming devices, pp. 131–147 (2019)

[11] Ren, J., Rao, A., Lindorfer, M., Legout, A., Choffnes, D.: Recon: Revealing and
controlling pii leaks in mobile network traffic. In: Proceedings of the 14th Annual
International Conference on Mobile Systems, Applications, and Services, pp.
361–374 (2016)

[12] Bakopoulou, E., Tillman, B., Markopoulou, A.: Fedpacket: A federated learn-
ing approach to mobile packet classification. IEEE Transactions on Mobile
Computing 21(10), 3609–3628 (2021)

[13] Zhou, X., Wang, W., Jin, Q.: Multi-dimensional attributes and measures for
dynamical user profiling in social networking environments. Multimedia Tools
and Applications 74 (2014)

[14] Das, A., Borisov, N., Caesar, M.: Tracking mobile web users through motion
sensors: Attacks and defenses. In: NDSS (2016)

[15] Apthorpe, N., Reisman, D., Feamster, N.: A smart home is no castle: Privacy
vulnerabilities of encrypted iot traffic (2017)

[16] Stewart, S., Davies, J.: User profiling techniques : A critical review. Electronic
Workshops in Computing (1997)

[17] Abdel-Hafez, A., Xu, Y.: A survey of user modelling in social media websites.
Computer and Information Science 6(4) (2013)

[18] Eke, C.I., Norman, A.A., Shuib, L., Nweke, H.F.: A survey of user profil-
ing: State-of-the-art, challenges, and solutions. IEEE Access 7, 144907–144924
(2019)

[19] Zhao, S., Li, S., Ramos, J., Luo, Z., Jiang, Z., Dey, A.K., Pan, G.: User profil-
ing from their use of smartphone applications: A survey. Pervasive and Mobile
Computing 59, 101052 (2019)

[20] G. U., V., K., S., Deepa Shenoy, P., K. R., V.: An overview on user profiling
in online social networks. International Journal of Applied Information Systems
11(8), 25–42 (2017)

[21] Dong, X., Li, T., Li, X., Song, R., Ding, Z.: Review-based user profiling: A sys-
tematic mapping study. Enterprise, Business-Process and Information Systems
Modeling, 229–244 (2019)

[22] Peng, J., Choo, K.-K.R., Ashman, H.: User profiling in intrusion detection: A
review. Journal of Network and Computer Applications 72, 14–27 (2016)

[23] Jang, C., Chang, H., Ahn, H., Kang, Y., Choi, E.: Profile for effective ser-
vice management on mobile cloud computing. In: International Conference on
Advanced Communication and Networking, pp. 139–145 (2011). Springer

[24] UK Telematics Online. http://www.uktelematicsonline.co.uk/ (2023)

[25] Statista: Digital advertising market share of major companies worldwide 2023
(2023)

[26] ilsr.org: amazon’s toll road - ilsr.org (2021). https://ilsr.org/wp-content/uploads/2021/11/ILSR-AmazonTollRoad-KeyFindings-Final.pdf

[27] Pieterson, W., Ebbers, W., Dijk, J.: The opportunities and barriers of user profil-
ing in the public sector. In: International Conference on Electronic Government,
pp. 269–280 (2005). Springer

[28] IBM Accessed: (9-September-2023) (2023). https://www.ibm.com/employment/careersite/mycareer-at-ibm/

[29] Poo, D., Chng, B., Goh, J.-M.: A hybrid approach for user profiling. In: 36th
Annual Hawaii International Conference on System Sciences, 2003. Proceedings
of The, p. 9 (2003). IEEE

[30] Murmuria, R., Stavrou, A., Barbara, D., Sritapan, V.: Your data in your hands:
Privacy-preserving user behavior models for context computation. In: 2017
IEEE International Conference on Pervasive Computing and Communications
Workshops (PerCom Workshops), pp. 170–175 (2017). IEEE

[31] Kwon, M.-C., Choi, S.: User behavior classification based on smart watch and
machine learning algorithm. Contemporary Engineering Sciences 10(7), 345–352
(2017)

[32] Chen, J., Liu, Y., Zou, M.: Home location profiling for users in social media.
Information & Management 53(1), 135–143 (2016)

[33] Farseev, A., Akbari, M., Samborskii, I., Chua, T.-S.: 360◦ user profiling: past,
future, and applications. ACM SIGWEB Newsletter,(Summer) 10, 2956573–
2956577 (2016)

[34] Yang, Y.C.: Web user behavioral profiling for user identification. Decision
Support Systems 49(3), 261–271 (2010)

[35] Raghu, T., Kannan, P., Rao, H.R., Whinston, A.B.: Dynamic profiling of con-
sumers for customized offerings over the internet: A model and analysis. Decision
Support Systems 32(2), 117–134 (2001)

[36] Farid, M., Elgohary, R., Moawad, I., Roushdy, M.: User profiling approaches,
modeling, and personalization. In: Proceedings of the 11th International Con-
ference on Informatics & Systems (INFOS 2018) (2018)

[37] Fakhfakh, R., Feki, G., Ammar, A.B., Amar, C.B.: Personalizing information
retrieval: A new model for user preferences elicitation. In: 2016 IEEE Interna-
tional Conference on Systems, Man, and Cybernetics (SMC), pp. 002091–002096
(2016). IEEE

[38] Sakagami, H., Kamba, T.: Learning personal preferences on online newspaper
articles from user behaviors. Computer Networks and ISDN Systems 29(8-13),
1447–1455 (1997)

[39] Gentili, G., Micarelli, A., Sciarrone, F.: Infoweb: An adaptive information fil-
tering system for the cultural heritage domain. Applied Artificial Intelligence
17(8-9), 715–744 (2003)

[40] Nanas, N., Uren, V., De Roeck, A.: Building and applying a concept hierarchy

representation of a user profile. In: Proceedings of the 26th Annual International
ACM SIGIR Conference on Research and Development in Informaion Retrieval.
SIGIR ’03, pp. 198–204. Association for Computing Machinery, New York, NY,
USA (2003). https://doi.org/10.1145/860435.860473

[41] Chen, W., He, M., Ni, Y., Pan, W., Chen, L., Ming, Z.: Global and personal-
ized graphs for heterogeneous sequential recommendation by learning behavior
transitions and user intentions, pp. 268–277 (2022)

[42] Banouar, O., Raghay, S.: User profile construction for personalized access to
multiple data sources through matrix completion method. (2016)

[43] Cufoglu, A.: User profiling-a short review. International Journal of Computer
Applications 108(3) (2014)

[44] Nilashi, M., Jannach, D., Ibrahim, O.b., Ithnin, N.: Clustering- and regression-
based multi-criteria collaborative filtering with incremental updates. Elsevier
(2014)

[45] Li, S., Zhao, H.: A survey on representation learning for user modeling. In: Pro-
ceedings of the Twenty-Ninth International Conference on International Joint
Conferences on Artificial Intelligence, pp. 4997–5003 (2021)

[46] Chen, T., Yin, X., Peng, L., Rong, J., Yang, J., Cong, G.: Monitoring and
recognizing enterprise public opinion from high-risk users based on user portrait
and random forest algorithm. Axioms 10(2) (2021)

[47] Hamim, T., Benabbou, F., Sael, N.: Student profile modeling using boost-
ing algorithms. International Journal of Web-Based Learning and Teaching
Technologies (IJWLTT) 17(5), 1–13 (2022)

[48] Tripathi, A., Yadav, S., Rajan, R.: Naive bayes classification model for the
student performance prediction. In: 2019 2nd International Conference on Intel-
ligent Computing, Instrumentation and Control Technologies (ICICICT), vol. 1,
pp. 1548–1553 (2019)

[49] Ouaftouh, S., Zellou, A., Idri, A.: Social recommendation: A user profile
clustering-based approach. Concurrency and Computation: Practice and Expe-
rience 31(20), 5330 (2019)

[50] Bi, M., Xu, J., Wang, M., Zhou, F.: Anomaly detection model of user behavior
based on principal component analysis. Journal of Ambient Intelligence and
Humanized Computing 7, 547–554 (2016)

[51] Qi, Y., Hu, K., Zhang, B., Cheng, J., Lei, J.: Trilateral spatiotemporal attention

network for user behavior modeling in location-based search. In: Proceed-
ings of the 30th ACM International Conference on Information & Knowledge
Management, pp. 3373–3377 (2021)

[52] Wang, R., Zhu, H., Wang, L., Chen, Z., Gao, M., Xin, Y.: User identity link-
age across social networks by heterogeneous graph attention network modeling.
Applied Sciences 10(16), 5478 (2020)

[53] Wang, R., Wu, Z., Lou, J., Jiang, Y.: Attention-based dynamic user mod-
eling and deep collaborative filtering recommendation. Expert Systems with
Applications 188, 116036 (2022)

[54] Wang, X., Sun, G., Fang, X., Yang, J., Wang, S.: Modeling spatio-temporal
neighbourhood for personalized point-of-interest recommendation. In: IJCAI,
pp. 3530–3536 (2022)

[55] Yuan, F., Zhang, G., Karatzoglou, A., Jose, J., Kong, B., Li, Y.: One person, one
model, one world: Learning continual user representation without forgetting. In:
Proceedings of the 44th International ACM SIGIR Conference on Research and
Development in Information Retrieval, pp. 696–705 (2021)

[56] Qi, T., Wu, F., Wu, C., Huang, Y.: News recommendation with candidate-aware
user modeling. In: Proceedings of the 45th International ACM SIGIR Conference
on Research and Development in Information Retrieval, pp. 1917–1921 (2022)

[57] Deng, S., Cai, Q., Zhang, Z., Wu, X.: User behavior analysis based on
stacked autoencoder and clustering in complex power grid environment. IEEE
Transactions on Intelligent Transportation Systems 23(12), 25521–25535 (2021)

[58] Fazelnia, G., Simon, E., Anderson, I., Carterette, B., Lalmas, M.: Variational
user modeling with slow and fast features. Proceedings of the Fifteenth ACM
International Conference on Web Search and Data Mining (2022)

[59] Liu, Q., Wu, J., Huang, Z., Wang, H., Ning, Y., Chen, M., Chen, E., Yi, J., Zhou,
B.: Federated user modeling from hierarchical information. ACM Transactions
on Information Systems 41(2), 1–33 (2023)

[60] Donkers, T., Loepp, B., Ziegler, J.: Sequential user-based recurrent neural net-
work recommendations. In: Proceedings of the Eleventh ACM Conference on
Recommender Systems, pp. 152–160 (2017)

[61] Chu, Y.-W., Hosseinalipour, S., Tenorio, E., Cruz, L., Douglas, K., Lan, A., Brin-
ton, C.: Mitigating biases in student performance prediction via attention-based
personalized federated learning. In: Proceedings of the 31st ACM International
Conference on Information & Knowledge Management, pp. 3033–3042 (2022)

[62] Li, M., Han, X., Sheng, H., Ma, L., Kong, H., Liu, W., Mao, B.: A novel RNN

model with enhanced behavior semantic for network user profile. In: Tenth
International Conference on Advanced Cloud and Big Data, CBD 2022, Guilin,
China, November 4-5, 2022, pp. 190–193. IEEE, ??? (2022)

[63] Avny Brosh, T., Livne, A., Sar Shalom, O., Shapira, B., Last, M.: Bruce: Bundle
recommendation using contextualized item embeddings. In: Proceedings of the
16th ACM Conference on Recommender Systems, pp. 237–245 (2022)

[64] Wu, C., Wu, F., Qi, T., Huang, Y.: Userbert: Pre-training user model with con-
trastive self-supervision. In: Proceedings of the 45th International ACM SIGIR
Conference on Research and Development in Information Retrieval. SIGIR ’22,
pp. 2087–2092. Association for Computing Machinery, New York, NY, USA
(2022)

[65] Acxiom UK. https://www.acxiom.co.uk/. Accessed: July 09, 2023

[66] Pinchot, J., Chawdhry, A.A., Paullet, K.: Data privacy issues in the age of data
brokerage: An exploratory literature review. Issues in Information Systems 19(3)
(2018)

[67] Christl, W., Spiekermann, S.: Networks of Control: A Report on Corporate


Surveillance, Digital Tracking, Big Data & Privacy. Cracked Labs (2016)

[68] CIPPIC: Data Broker Profiles: Acxiom and LiveRamp. Online (2019). https://
databrokers.cippic.ca/2019/01/09/data-broker-profiles-acxiom-and-liveramp/

[69] LIVERAMP: Ad Targeting: People-Based Targeting With IdentityLink (2020).


https://liveramp.com/blog/ad-targeting-people-based-targeting/

[70] developer.myacxiom.com: Endpoints (2023). https://developer.myacxiom.com/code/api/endpoints/abilitec-link

[71] Oracle Corporation: What is Data Cloud? https://www.oracle.com/in/advertising/what-is-data-cloud/ (Accessed: 2024-02-24)

[72] Privacy Bee: These Are the Largest Data Brokers in America. https://
privacybee.com/blog/these-are-the-largest-data-brokers-in-america/

[73] docs.oracle.com: Oracle Data Cloud Platform Help Center (2021).


https://docs.oracle.com/en/cloud/saas/data-cloud/data-cloud-help-center/IntegratingBlueKaiPlatform/id_management.html

[74] Zawadziński, M., PRO, P., Sweeney, M.: 10. User Identification (2022). https:
//adtechbook.clearcode.cc/

[75] Uno, H.: Get my idfa (2021). https://apps.apple.com/us/app/get-my-idfa/id1580368827

[76] ibm Accessed: (9-September-2023) (2021). https://cloud.ibm.com/docs/services/personality-insights?topic=personality-insights-about

[77] Paul, M., Maglaras, L., Ferrag, M.A., AlMomani, I.: Digitization of healthcare
sector: A study on privacy and security concerns. ICT Express (2023)

[78] University, A.S.: (2014). http://www.astate.edu/news/arkansas-state-reports-data-breach-related-to-dhs-childhood-services

[79] tdi.texas.gov Accessed: (9-August-2023) (2023). https://www.tdi.texas.gov/data-security-event/index.html

[80] Beeferman, J., Washington, J., Tribune, T.: Almost 2 million Tex-
ans affected by Texas Department of Insurance Data Breach. KXAN
Austin (2022). https://www.kxan.com/news/texas/almost-2-million-texans-
affected-by-texas-department-of-insurance-data-breach/

[81] CS Hub: Iotw: Twitter accused of covering up data breach that affects millions.
CS Hub Attacks News (2023)

[82] Intelligence, S.: ChatGPT Confirms Data Breach. https://


securityintelligence.com/articles/chatgpt-confirms-data-breach/

[83] spectrumnews1.com: Attorney weighs in on Norton ransomware attack letter


(2023). https://spectrumnews1.com/ky/louisville/news/2023/12/22/attorney-
weighs-in-on-norton-cyber-attack-letter

[84] Gatlan, S.: 200,000 Facebook Marketplace User Records Leaked on Hacking Forum. https://www.bleepingcomputer.com/news/security/200-000-facebook-marketplace-user-records-leaked-on-hacking-forum/

[85] III.org: Facts + statistics: Identity theft and cybercrime (2022). https://www.iii.org/fact-statistic/facts-statistics-identity-theft-and-cybercrime

[86] FTC: InfoTrax Systems, L.C. (2020). https://www.ftc.gov/enforcement/cases-proceedings/162-3130/infotrax-systems-lc

[87] Osborne, C.: Finnish mental health patients blackmailed after suspected Data Breach. The Daily Swig (2020). https://portswigger.net/daily-swig/finnish-mental-health-patients-blackmailed-after-suspected-data-breach

[88] Confessore, N.: Cambridge Analytica and Facebook: The Scandal and the Fallout So Far. The New York Times (2018). https://www.nytimes.com/2018/04/04/us/politics/cambridge-analytica-scandal-fallout.html

[89] Adams, J., Almahmoud, H.: The meaning of privacy in the digital era. Interna-
tional Journal of Security and Privacy in Pervasive Computing (IJSPPC) 15(1),
1–15 (2023)

[90] Android Developers: App Manifest Overview (2023). https://developer.android.com/guide/topics/manifest/manifest-intro

[91] Felt, A.P., Ha, E., Egelman, S., Haney, A., Chin, E., Wagner, D.A.: Android
permissions: user attention, comprehension, and behavior. In: SOUPS (2012)

[92] Wijesekera, P., Baokar, A., Tsai, L., Reardon, J., Egelman, S., Wagner, D.,
Beznosov, K.: The feasibility of dynamically granted permissions: Aligning
mobile privacy with user preferences. In: 2017 IEEE Symposium on Security and
Privacy (SP), pp. 1077–1093 (2017)

[93] Reardon, J., Feal, Á., Wijesekera, P., On, A.E.B., Vallina-Rodriguez, N., Egel-
man, S.: 50 ways to leak your data: An exploration of apps’ circumvention of
the android permissions system. In: USENIX Security Symposium (2019)

[94] MeitY: Digital Personal Data Protection Act 2023 (2023). https://www.meity.gov.in/content/digital-personal-data-protection-act-2023

[95] csrc.nist.gov: Glossary: PII (2015). https://csrc.nist.gov/glossary/term/PII. Accessed: July 9, 2023

[96] Pan, E., Ren, J., Lindorfer, M., Wilson, C., Choffnes, D.R.: Panoptispy:
Characterizing audio and video exfiltration from android applications (2018)

[97] Liu, B., Liu, B., Jin, H., Govindan, R.: Efficient privilege de-escalation for
ad libraries in mobile apps. Proceedings of the 13th Annual International
Conference on Mobile Systems, Applications, and Services (2015)

[98] Enck, W., Gilbert, P., Chun, B.-G., Cox, L.P., Jung, J., McDaniel, P., Sheth, A.N.: TaintDroid: An information-flow tracking system for realtime privacy monitoring on smartphones. ACM Transactions on Computer Systems (2010)

[99] Razaghpanah, A., Vallina-Rodriguez, N., Sundaresan, S., Kreibich, C., Gill, P.,
Allman, M., Paxson, V.: Haystack: In situ mobile traffic analysis in user space
(2015)

[100] Shuba, A., Bakopoulou, E., Asgari Mehrabadi, M., Choffnes, D., Markopoulou,
A.: Antshield: On-device detection of personal information exposure (2018)

[101] Shuba, A., Le, A., Alimpertis, E., Gjoka, M., Markopoulou, A.: Antmonitor:
System and applications. CoRR abs/1611.04268 (2016)

[102] Kohli, R., Chatterjee, S., Gupta, S., Gaur, M.S.: Tracking PII ex-filtration: Exploring decision tree and neural network with explainable AI. In: IEEE International Conference on Advanced Networks and Telecommunications Systems (ANTS) (2023). IEEE

[103] Rao, A., Kakhki, A.M., Razaghpanah, A., Tang, A., Wang, S., Sherry, J., Gill, P., Krishnamurthy, A., Legout, A., Mislove, A., Choffnes, D.R.: Using the middle to meddle with mobile (2013)

[104] Kohli, R., Gupta, S., Gaur, M.S.: A deep dive into relevant feature identification for unveiling PII leakage in smartphones. In: International Conference on Signal Processing and Communications (SPCOM) (2024). IEEE

[105] CyberChef (gchq.github.io). https://gchq.github.io/CyberChef/. Accessed: 05-04-2024

[106] Ribeiro, M.T., Singh, S., Guestrin, C.: "Why should I trust you?": Explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144 (2016)

[107] Ribeiro, M.T.: LIME Documentation. https://lime-ml.readthedocs.io/en/latest/lime.html#module-lime.lime_tabular

[108] Cufoglu, A.: Multi-dimensional clustering in user profiling. PhD thesis, Univer-
sity of Westminster (2012)

Table 5: Some high-impact data breaches reported between 2022 and 2024 [79–84].

Texas Department of Insurance (occurred May 2022; impact: 1.8 million)
  Cause/vulnerability: A security issue with a web application that manages workers' compensation information; the vulnerability was due to programming code that allowed internet access to a protected area of the application.
  Compromised data: Social security numbers, addresses, dates of birth, phone numbers and information about workers' injuries.

Twitter (occurred July 2022; impact: 5.4 million)
  Cause/vulnerability: A vulnerability on the social media site that allowed anyone to obtain the Twitter ID of a user, without authentication, by submitting a phone number or email; this was possible even when the user had prohibited such lookups in the privacy settings.
  Compromised data: Email addresses and phone numbers of celebrities, companies, randoms, etc.

ChatGPT (occurred March 2023; impact: ≈ 1% of data)
  Cause/vulnerability: A vulnerability in the open-source library "Redis", used in the code to cache user information for faster recall and access.
  Compromised data: Credit card information and the titles of some chats users initiated; it was also possible to see another active user's first and last name, email address, payment address, the last four digits (only) of a credit card number, and credit card expiration date.

Norton Healthcare (occurred May 2023; impact: 2.5 million)
  Cause/vulnerability: An unauthorized individual's access to the company's network storage devices.
  Compromised data: Names, contact details, SSN numbers, dates of birth, health and insurance information, medical ID numbers, driver's license numbers, government ID numbers, financial account numbers, and digital signatures.

Facebook Marketplace (occurred February 2024; impact: 0.20 million)
  Cause/vulnerability: The Facebook Marketplace database was stolen by an attacker using the 'algoatson' Discord handle after hacking the systems of a Meta contractor.
  Compromised data: Names, phone numbers, email addresses, Facebook IDs, and Facebook profile information.
Table 6: Summary of private data leaks in mobile app traffic (duration: Oct 2022 – March 2023). App versions listed are those analyzed during our experiments; download counts are as per the Indian Play Store, where M denotes million and B denotes billion.

Social
  Pinterest (v7.43.1, 500M+ downloads)
    Location: -
    User identifiers: first name, last name
    Device identifiers: IP address, device configurations and details, Google advertising ID, application install and last update time
  Skype (v8.81.0.268, 1B+ downloads)
    Location: location of user and his contacts
    User identifiers: name, gender, date of birth, email ID, text messages, phone number, ID of user and his contacts
    Device identifiers: device model name
  LinkedIn (v4.1.710, 1B+ downloads)
    Location: -
    User identifiers: first name, last name, profile picture URL, headline of profile, session key and session password
    Device identifiers: device identifiers (Android ID)
  WhatsApp (v2.22.13.76, 5B+ downloads)
    Location: country
    User identifiers: phone number, email ID
    Device identifiers: device details (device name, OS, Android version), Android ID, device ID, SDK version
  Reddit (v2020.8.2, 100M+ downloads)
    Location: country
    User identifiers: user ID, user name, user password, email ID
    Device identifiers: device details (Android version, device name, OS, OS version, hardware ID, device fingerprint ID, device dimensions), advertising ID, application install and last update time, device ID, SDK version

Education
  Gradeup (v11.07, 10M+ downloads)
    Location: -
    User identifiers: name, phone number, OTP used while registering, email address
    Device identifiers: device identifiers and configuration

Entertainment
  Voot (v4.2.7, 100M+ downloads)
    Location: -
    User identifiers: gender, age, phone number, email ID
    Device identifiers: device details, Google advertisement ID
  Zee5 (v35.1338119.0, 100M+ downloads)
    Location: country, region
    User identifiers: email ID, first name, last login, gender, date of birth, phone number
    Device identifiers: device details (Android version, device name), device ID, install time

Travel
  Cleartrip (v22.3.0, 10M+ downloads)
    Location: -
    User identifiers: username, email address, password, phone number, date of birth, gender, registration details (check-in date, check-out date, number of adults and children, destination city and country)
    Device identifiers: device identifiers, device details (model, network, device dimensions, network type, CPU type, OS version, screen DPI), advertising IDs
  OYO (v5.9.2, 50M+ downloads)
    Location: -
    User identifiers: name of user, date of birth, gender, marital status, phone number, email ID
    Device identifiers: device identifiers and Google advertising ID

Shopping
  Dominos India (v9.8.18, 50M+ downloads)
    Location: user's address
    User identifiers: first name, last name, mobile number, email ID
    Device identifiers: device identifiers, device details (model, network, device dimensions, network type, CPU type, OS version, screen DPI), advertising IDs
  Myntra (v4.2201.1, 100M+ downloads)
    Location: user's address
    User identifiers: first name, last name, phone number, gender, first login date, OTP for login, UPI ID, email ID
    Device identifiers: device details, IP address, network type
  eBay (v6.49.0.3, 100M+ downloads)
    Location: country
    User identifiers: email ID
    Device identifiers: device details (model, dimensions, OS, hardware, manufacturer, OS version, physical memory, processor, architecture, RAM, disk space, processor count, processor word size, system up time, thermal state, time zone, user language)

Others
  Airtel Thanks (v4.40.10, 100M+ downloads)
    Location: location (not exact)
    User identifiers: name, phone number, alternate number, email ID
    Device identifiers: siNumber, device details, advertisement IDs, local IP
  Adobe Acrobat Reader (v22.6.0.22829, 500M+ downloads)
    Location: latitude and longitude
    User identifiers: first name, last name, email address, contents of the PDF opened in the application, list of files in Google Drive
    Device identifiers: device ID, OS, device details, device configurations, connection type, local IP, advertising ID
  Flipboard (v4.2.93, 500M+ downloads)
    Location: -
    User identifiers: name, email ID
    Device identifiers: device brand
  Truecaller (v12.17.8, 500M+ downloads)
    Location: -
    User identifiers: phone number and email ID
    Device identifiers: device ID, device details (OS, model) and SIM number
  Zoom (500M+ downloads)
    Location: country
    User identifiers: phone number
    Device identifiers: device details (model, dimensions, OS), device ID, SDK version

Table 7: Best results obtained.

Model                                      | Train  | Test
DT (post heuristics, ↑ testing samples)    | 93.79% | 72.89% ⋆
NN (post heuristics, ↑ testing samples)    | 95%    | 74.5% ⋆⋆
Table 8: Summary of results obtained using LIME.

Model                       | Train  | Test         | Tr. time (s)
DT (best, w/o LIME)         | 93.79% | 72.89% ⋆ (a) | 0.095
DT (with SHAP) [102]        | 99%    | 67%          | 0.16
DT (with LIME-H1)           | 92.6%  | 71.2%        | 0.004
DT (with LIME-H2)           | 91.86% | 71.2%        | 0.005
NN (best, w/o LIME)         | 95%    | 74.5% ⋆⋆ (b) | 16.54
NN (with SHAP) [102]        | 84.7%  | 75.32%       | 5.45
NN (with LIME-H1)           | 91.01% | 73.38%       | 3.13
NN (with LIME-H2)           | 95.29% | 75.01%       | 3.615

(a) Taken from Table 7 (row with the ⋆ marker).
(b) Taken from Table 7 (row with the ⋆⋆ marker).
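To make the LIME-based configurations in Table 8 easier to follow, the sketch below shows the general pattern of explaining a tabular classifier with lime.lime_tabular [107], here applied to a scikit-learn decision tree. It is a minimal, hypothetical example: the synthetic data, feature names, and class names are placeholders, not the traffic features or the H1/H2 heuristics used in our experiments.

# Minimal sketch: explaining a tabular classifier with LIME.
# Assumes the `lime` and `scikit-learn` packages are installed; the data
# below is synthetic, not the network-traffic features analyzed in the paper.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from lime.lime_tabular import LimeTabularExplainer

rng = np.random.default_rng(0)
X = rng.random((1000, 6))                      # placeholder feature matrix
y = (X[:, 0] + X[:, 3] > 1.0).astype(int)      # placeholder binary label (leak / no-leak)
feature_names = [f"feat_{i}" for i in range(X.shape[1])]   # hypothetical names

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

clf = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_train, y_train)

explainer = LimeTabularExplainer(
    X_train,
    feature_names=feature_names,
    class_names=["no-leak", "leak"],
    mode="classification",
)

# Explain one test instance: LIME perturbs it locally and fits a sparse
# linear surrogate, returning per-feature weights for the prediction.
exp = explainer.explain_instance(X_test[0], clf.predict_proba, num_features=4)
print(exp.as_list())

The per-feature weights returned by explain_instance can be aggregated across many instances to rank features, which is the general style of analysis that attribution methods such as LIME and SHAP enable; the exact H1/H2 heuristics reported in Table 8 are defined in the main text and are not reproduced in this sketch.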
