CookieGraph: Understanding and Detecting First-Party Tracking Cookies

Shaoor Munir (University of California, Davis; Davis, CA, USA)
Sandra Siby (Imperial College London; London, United Kingdom)
Umar Iqbal (Washington University in St. Louis; St. Louis, MO, USA)
Steven Englehardt (Independent Researcher; Highland Park, NJ, USA)
Zubair Shafiq (University of California, Davis; Davis, CA, USA)
Carmela Troncoso (EPFL; Lausanne, Switzerland)

ABSTRACT
As third-party cookie blocking is becoming the norm in mainstream web browsers, advertisers and trackers have started to use first-party cookies for tracking. To understand this phenomenon, we conduct a differential measurement study with and without third-party cookies. We find that first-party cookies are used to store and exfiltrate identifiers to known trackers even when third-party cookies are blocked.

As opposed to third-party cookie blocking, first-party cookie blocking is not practical because it would result in major breakage of website functionality. We propose CookieGraph, a machine learning-based approach that can accurately and robustly detect and block first-party tracking cookies. CookieGraph detects first-party tracking cookies with 90.18% accuracy, outperforming the state-of-the-art CookieBlock by 17.31%. We show that CookieGraph is robust against cookie name manipulation, while CookieBlock’s accuracy drops by 15.87%. While blocking all first-party cookies results in major breakage on 32% of the sites with SSO logins, and CookieBlock reduces it to 10%, we show that CookieGraph does not cause any major breakage on these sites.

Our deployment of CookieGraph shows that first-party tracking cookies are used on 89.86% of the top-million websites. We find that 96.61% of these first-party tracking cookies are in fact ghostwritten by third-party scripts embedded in the first-party context. We also find evidence of first-party tracking cookies being set by fingerprinting scripts. The most prevalent first-party tracking cookies are set by major advertising entities such as Google, Facebook, and TikTok.

CCS CONCEPTS
• Security and privacy → Privacy protections; Usability in security and privacy; Domain-specific security and privacy architectures; • Computing methodologies → Classification and regression trees.

KEYWORDS
cookies, machine learning, privacy, tracking, web security

ACM Reference Format:
Shaoor Munir, Sandra Siby, Umar Iqbal, Steven Englehardt, Zubair Shafiq, and Carmela Troncoso. 2023. CookieGraph: Understanding and Detecting First-Party Tracking Cookies. In Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security (CCS ’23), November 26–30, 2023, Copenhagen, Denmark. ACM, New York, NY, USA, 15 pages. https://doi.org/10.1145/3576915.3616586

This work is licensed under a Creative Commons Attribution International 4.0 License.
CCS ’23, November 26–30, 2023, Copenhagen, Denmark
© 2023 Copyright held by the owner/author(s).
ACM ISBN 979-8-4007-0050-7/23/11.
https://doi.org/10.1145/3576915.3616586

1 INTRODUCTION
Major browser vendors such as Safari, Firefox, and Google Chrome have either blocked or are in the process of blocking third-party cookies (cookies set on domains that differ from the domain of the site visited by a user) [25, 82, 91]. Because third-party cookies are accessible across different sites that a user visits, they are used for cross-site tracking (i.e., linking a user’s browsing activity across different sites). Due to their ubiquitous use in tracking, the question arises as to how trackers will respond to third-party cookie blocking. First-party cookies (cookies that are set on the same domain as that being visited by a user) are of particular interest to advertisers and trackers because they will still be available in the face of third-party cookie blocking. However, since first-party cookies are only accessible from the setting domain, it remains to be seen how they can be used in lieu of third-party cookies for cross-site tracking.

Prior literature has shown that first-party cookies set by third-party scripts can be exfiltrated to tracking endpoints [44, 54, 77]. Prior work has also shown that trackers use browser fingerprinting to re-spawn first-party cookies [55]. Yet, there is no work studying the full spectrum of tracking possible through first-party cookies; and crucially, no countermeasures exist to specifically detect and block first-party tracking cookies. To fill this gap, we first investigate the use of first-party cookies by known trackers and then use our findings to develop a machine-learning based approach, CookieGraph, to detect and block first-party tracking cookies.

We first perform a differential measurement study comparing the use of first- and third-party cookies on a 20K sample of top-million websites across parallel crawls, with third-party cookies enabled and blocked. We show that third-party cookie blocking does not significantly impact the sharing of identifiers to known


tracking endpoints because major trackers are already using first-party cookies. Our analysis reveals that these trackers store identifiers in first-party cookies based on probabilistic and deterministic information.

Unlike third-party cookies, blocking all first-party cookies is not practical, as some of these cookies might be required for legitimate website functionality. An alternative could be the use of privacy-enhancing request blocking tools [62, 79, 80] that would also block the cookies set by the requested resources. Unfortunately, our evaluation shows that these tools also cause breakage because tracking cookies are often set by domains that also set functional cookies. Researchers have recently started to develop approaches to detect and block (both first-party and third-party) tracking cookies [42, 59]. However, these approaches rely on content-based features such as cookie names and values, which can lead to a high number of false positives (and consequently major website breakage) while also being susceptible to evasion [79].

To address these limitations, we design and implement CookieGraph, a machine-learning approach to specifically detect first-party tracking cookies. Instead of using content-based features, CookieGraph attempts to capture fundamental tracking behaviors exhibited by first-party cookies that we discover in our differential measurement study. CookieGraph is able to detect first-party tracking cookies with 90.18% accuracy, outperforming the state-of-the-art CookieBlock [42] by 17.31%. We also show that blocking all first-party cookies results in major breakage on 32% of the sites with SSO logins, which is improved to 10% by CookieBlock. In contrast, CookieGraph does not cause any major breakage on these sites. Moreover, CookieGraph is robust to evasion through cookie name manipulation, while CookieBlock’s accuracy degrades by 15.87%.

We deploy CookieGraph on a 20K sample of the top-million websites to find 108,947 first-party tracking cookies on 89.86% of the websites. The most prevalent first-party tracking cookies are set by major advertising entities, such as Google, Facebook, and TikTok, and then exfiltrated to a large number of other advertising and tracking endpoints. We find that 96.61% of the first-party tracking cookies are in fact ghostwritten by third-party scripts served from a total of 2,099 distinct third-party domains; 223 of these scripts also conduct fingerprinting.

In summary, our key contributions are as follows:

(1) We conduct a large-scale differential measurement study to understand the usage of first-party cookies by trackers when third-party cookies are blocked. Our analysis shows that blocking third-party cookies does not reduce the number of tracking requests containing identifiers and provides evidence that trackers already use first-party cookies in lieu of third-party cookies for tracking.
(2) We introduce CookieGraph, a machine-learning based countermeasure to detect and block first-party tracking cookies. CookieGraph captures fundamental tracking behaviors of first-party cookies. CookieGraph outperforms the state-of-the-art in terms of accuracy, robustness, and breakage minimization.
(3) We deploy CookieGraph on a 20K sample of the top-million websites to measure the prevalence of first-party tracking cookies. We detect a total of 2,099 distinct domains that set first-party tracking cookies, including major advertising entities such as Google, and show that fingerprinting scripts set first-party cookies on 1,908 sites.

Paper Organization: The rest of this paper is organized as follows: Section 2 provides an overview of the recent developments and related work on cookies. Section 3 describes the threat model of first-party cookies. Section 4 presents our differential measurement study to evaluate the impact of third-party cookie blocking on the use of first-party cookies by trackers. Section 5 describes the design and evaluation of CookieGraph. Section 6 presents results from our deployment of CookieGraph. We discuss the limitations of CookieGraph in Section 7 and conclude in Section 8.

2 BACKGROUND & RELATED WORK

2.1 Adoption of third-party cookies for tracking
Cookies were originally designed to recognize returning users, e.g., to maintain virtual shopping carts [70]. Soon, they were adopted by third-parties to track users across websites and serve targeted ads [7]. Early standardization efforts focused on limiting unintended cookie sharing across domains [47] and, despite well-known privacy concerns [1], largely ignored the misuse of cookies by third-parties for cross-site tracking. Over the years, the use of third-party cookies for cross-site tracking has become prevalent [43, 48, 76, 77]. Prior research shows that the vast majority of third-party cookies are set by advertising and tracking services (ATS) [48] and third-party cookies outnumber first-party cookies by a factor of two [43], and up to four when they contain identifiers [77].

2.2 Countermeasures against third-party cookies
2.2.1 Safari. Since its inception in 2003, Safari has blocked third-party cookies from domains that have not been visited by the user as full-fledged websites [84]. In 2017, Safari introduced Intelligent Tracking Prevention (ITP). ITP uses machine learning to automatically detect third-party trackers. It revoked storage access from classified domains if users did not interact with them on a daily basis [85]. Since 2017, ITP has gone through several iterations, i.e., ITP 1.1 [86], ITP 2.0 [87], ITP 2.1 [88], ITP 2.2 [89] and ITP 2.3 [90], eventually leading to full third-party cookie blocking [91].

2.2.2 Firefox. Firefox experimented with third-party cookie blocking in 2013 [50, 51], but did not ship default-on third-party cookie blocking until the release of Enhanced Tracking Protection (ETP) in 2018 [71]. ETP blocks third-party cookies based on a blocklist of trackers provided by Disconnect [6]. As of 2022, Firefox has launched Total Cookie Protection (TCP), which partitions all third-party cookie access [25]. Partitioning ensures that cookies set by a third party on one site are distinct from those set by the same third party on other websites, eliminating the third party’s ability to track users across those websites.

2.2.3 Internet Explorer and Microsoft Edge. Amongst the mainstream browsers that have deployed countermeasures against third-party cookies, Internet Explorer (IE) and Microsoft Edge have the most permissive protections. IE blocked third-party cookies from domains that did not specify their cookie usage policy with the


P3P response header [2]. However, website owners often misrepresented their own cookie usage policies, which rendered P3P ineffective [68]. Since 2019, Microsoft Edge has blocked access to cookies and storage in a third-party context from some trackers, based on Disconnect’s tracking protection list [6, 15, 82].

2.2.4 Chrome. Google Chrome is the only mainstream browser that does not restrict third-party cookies in any way in its default mode. In 2020, Google announced plans to phase out third-party cookies in Chrome by 2022 [78]. However, the plan has been postponed several times and the latest timeline suggests the phasing out of cookies by late 2024 [56].

2.3 Adoption of first-party cookies for tracking
While third-party cookies are widely considered as the main mechanism for cross-site tracking, trackers have also relied on first-party cookies for various forms of tracking, as described below.

Same-site tracking. As early as 2012, Roesner et al. [76] noted that third-party tracking scripts, embedded on the main webpage (i.e., in first-party context), set first-party cookies. First-party cookies enable same-site tracking, where trackers can determine whether a user is revisiting a website or internal pages of a site. While not as invasive as tracking users across different sites, significant information about a user can be gleaned from tracking their activity on the sites they frequent (e.g., a social media or news site).

Cross-domain same-site tracking. First-party cookies can also be used for cross-domain same-site tracking, where a website’s cookies are shared by trackers to other domains. In 2020, Fouad et al. [54] found that trackers sync first-party cookies to several third-parties on as many as 67.96% of the websites they tested. In 2021, Chen et al. [44] found that more than 90% of the websites contain at least one first-party cookie that is set by a third-party script. Similar to Fouad et al., they found that at least one first-party cookie is exfiltrated to a third-party domain on more than half of the tested websites, raising concerns that these cookies might be used for tracking. Sanchez et al. [77] echoed these concerns, uncovering several instances where first-party cookies were ghostwritten by third-parties and then exfiltrated to other third-parties. They conclude, through a large-scale measurement study of top websites and multiple case studies, that even after blocking third-party cookies, users are still at risk of first-party cookie based tracking.

Cross-domain sharing of first-party cookies presents a bigger privacy issue for users than same-site tracking. While same-site tracking is restricted to domains that are able to set first-party cookies, cross-domain sharing of first-party cookies allows other trackers, which are not collaborating with the first-party domains, to receive information about user activity. This simplifies operations for trackers: instead of collaborating with each publisher to set first-party cookies, they can leverage tracking cookies set by another tracker to monitor user activity. With this practice, not only can the third-party domains that set first-party cookies track users’ activities on the site, but tracking is also extended to other domains that receive these first-party cookies.

Cross-site tracking. While third-party cookies have been used extensively in cross-site tracking, i.e., where a tracker links a user’s activity across sites, the mechanisms by which first-party cookies are used in cross-site tracking have not been studied so far. Oh et al. [72] investigated the sharing of first-party data with trackers in lieu of third-party cookie blockage, determining that identifiers such as email addresses were also shared to popular trackers. Their experiments show that trackers make use of identifiers like email addresses to link user activity across different sites. They make use of this knowledge to perform identity entanglement, where an attacker can make use of an email address or other identifiers to influence the advertisements shown to a victim. This sharing of additional information when third-party cookies are blocked allows trackers to track users across different sites.

Previous research has also shown that it is non-trivial to generate first-party identifiers that are accessible across websites. Prior research has found that trackers often leverage browser fingerprinting to generate first-party tracking cookies [55]. Browser fingerprinting provides unique identifiers that are accessible across websites but drift over time [65]. However, identifiers generated through browser fingerprinting can be stored in cookies that persist even after fingerprints change. In addition to browser fingerprinting, several advertising and tracking services, such as Google Ad Manager [20] and ID5 [29], specify in their documentation that they also use publisher-provided identifiers (PPIDs), such as email addresses, to set first-party cookies.

We note that techniques such as CNAME cloaking also allow advertisers and trackers to use first-party cookies. However, as prior work has extensively studied first-party cookie leaks due to CNAME cloaking, we do not focus on CNAME cloaking in this paper.

2.4 Countermeasures against first-party cookies
2.4.1 Deployed countermeasures. Safari is the only mainstream browser that has deployed protections against first-party tracking cookies. Safari’s ITP expires first-party cookies and storage set by scripts in 7 days if users do not interact with the website [84]. This limit is lowered to 24 hours if ITP detects link decoration being used for tracking [84]. However, first-party cookie tracking does not require link decoration to be effective. In cases where link decoration is not used, trackers can still track users within the 7-day window and beyond if users interact with the website within the 7-day window.

2.4.2 Countermeasures proposed by prior research. There exist two machine-learning based approaches to detect tracking cookies. The approach of Hu et al. [59] uses sub-strings in cookie names (e.g., track, GDPR) as features to detect first-party and third-party tracking cookies. Bollinger et al. [42] proposed CookieBlock. CookieBlock uses several cookie attributes such as the domain name of the setter, cookie name, path, value, expiration, etc. as features to detect first-party and third-party tracking cookies. These approaches rely on hard-coded content features, which makes them susceptible to adversarial evasions (as we show later in Section 5.5.3). Moreover, these approaches mainly rely on self-disclosed cookie labels as ground truth, which can be unreliable [83].

2.4.3 Request blocking approaches. Request blocking through browser extensions, such as Adblock Plus [3], and machine-learning-based tracker detection approaches proposed by prior research, e.g., [79],


can block first-party cookies set by tracking requests. However, request blocking is prone to cause breakage because it blocks access to content or cookies that might be essential for website functionality. We confirm this is the case in Section 5.5.3.

Unique focus of this paper. Prior work has only incidentally measured the usage of first-party tracking cookies, and existing approaches to detect first-party tracking cookies are lacking. In this paper, we fill this void by conducting a large-scale study to measure the prevalence of first-party tracking cookies and develop an accurate and robust machine-learning approach, CookieGraph, aimed at detecting first-party cookies.

3 THREAT MODEL
In this section, we describe the threat model of tracking via first-party cookies.¹

¹ This threat model is informed by prior literature [44, 54, 55, 72, 77] and our case studies of popular tracking services described in Appendix A.1.

First- vs third-party cookies. Before describing the threat model, we define what we mean by first- and third-party cookies. Cookies can either be set by the Set-Cookie HTTP response header or by using document.cookie in JavaScript. Cookies set via response header from the same domain as the first-party are first-party cookies. Similarly, cookies set via response header from a different domain than the first-party are third-party cookies. When cookies are set by a script, their classification depends on whether the script is embedded in a first- or third-party execution context. The cookies set by third-party scripts running in the first-party context are first-party cookies. The cookies set by third-party scripts running in a third-party context (e.g., third-party iframes) are third-party cookies.

There are three main entities in this threat model: users (the victim), trackers (the adversary), and publishers.

We assume that the user:
• visits different websites using one or more desktop/mobile devices that have distinct fingerprints [64]
• is not averse to logging in to those websites and providing PII (personally identifiable information) such as email addresses
• has third-party cookies disabled and first-party cookies enabled

We assume that the publisher:
• controls the content on the site being visited by the user
• embeds the tracker in the first-party context, allowing the tracker to set first-party cookies
• shares email and other deterministic identifiers (e.g., username, phone number) with the tracker, if provided by the user

We assume that the tracker:
• is present in a first-party context on the publisher’s site
• can set and read first-party cookies using document.cookie
• can collect information such as IP addresses, screen resolution etc., which can be used to construct a device fingerprint

Trackers can use the information shared by the publisher, and the fingerprints collected by their own scripts to perform same-site, cross-domain same-site, and cross-site tracking, described below:

Figure 1: Cross-site tracking. The flow of information and identifiers through an identity graph for cross-site tracking. Initially, the user visits publishers 1 and 2 from one device. Tracker A, on publishers 1 and 2, collects and sends fingerprint F to the identity graph. The identity graph returns a UID for all the publisher visits, by matching fingerprints sent by each respective publisher. A publisher-provided ID, PPID, is also sent when visiting publisher 2. The user visits publisher 3 on a different device, thus tracker A is unable to construct a fingerprint that matches F. Publisher 3 sends a publisher-provided ID that matches PPID provided by publisher 2. As a result, the identity graph matches and returns the same UID for publisher 3. This ID is stored in first-party cookies on the user’s device for each respective publisher.

Same-site tracking. A user visits the same publisher’s site multiple times. During the first visit of the user, tracker A sets a first-party cookie on the user’s device. Upon subsequent visits by the user, tracker A can read the first-party cookie set and know that it is the same user who is revisiting the site. When performing same-site tracking, tracker A is able to gather information about the user across the pages maintained by the same publisher.

Same-site cross-domain tracking. After setting a first-party cookie on a user’s device, tracker A also shares the first-party cookie with a different tracker B that is not present in the first-party context (and thus is unable to set a first-party cookie of its own). On each subsequent visit of the user, tracker A shares the first-party cookie and the pages visited by the user with tracker B. Thus, without setting its own first-party cookie and directly colluding with the publisher, tracker B is also able to track the user’s activity on the same site.

Cross-site tracking. Consider a scenario in which a user visits three different sites (publishers 1, 2, 3) where tracker A is embedded in the first-party context. The user visits sites 1 and 2 on one device and site 3 on a different device. Publishers 2 and 3 ask the user for a deterministic identifier (e.g., email address) which we denote as PPID (Publisher-Provided ID). Tracker A also constructs fingerprints on sites 1 and 2, denoted by F_i, where i denotes the publisher visited.

When the user visits sites 1 and 2, tracker A collects fingerprints F_1 and F_2, which are the same (i.e., F_1 = F_2 = F) as they are both constructed for the same device. This allows tracker A to infer that the same user/device is visiting both sites. Tracker A links the deterministic identifiers and fingerprints belonging to the same user/device by constructing an identity graph (refer to Appendix A.1 for examples). The gray edge in Figure 1 shows the link in the


identity graph constructed by tracker A for the fingerprints on sites 1 and 2.

The user then visits site 3 from a different device where tracker A is not able to construct the same fingerprint F. Publisher 3 asks the user for a deterministic identifier (e.g., email address), which is the same as the PPID provided by the user to publisher 2. Based on this additional information, tracker A can add a black edge to the identity graph.

Tracker A is finally able to connect all nodes in the identity graph to the user. Tracker A then assigns all connected nodes in the identity graph the same ID, UID, which it can store in a first-party cookie on each of the sites. On each subsequent visit by the user to any of the sites, tracker A can now simply read the first-party cookie containing UID. Because UID is the same across sites 1, 2, and 3, this allows tracker A to track the user across different sites.

4 MEASUREMENTS
In this section, we conduct a preliminary measurement study to investigate the usage of first-party cookies by advertising and tracking services (ATS) when third-party cookies are blocked.

4.1 Data Collection
Data collection. We use OpenWPM (v0.17.0) and Firefox (v102) [52] to crawl a sample of 20K out of the top-million websites. To ensure that our crawls cover websites of variable popularity, we crawl the top 1K sites (ensuring coverage of the most popular websites), uniformly sample 9K sites from the sites ranked 1K-100K, and another 10K from sites ranked 100K-1M in the Tranco list [74]. To capture behaviors that may be different in the landing and internal pages of a website [40], we perform an interactive crawl that covers both kinds of pages. Specifically, for each site, we crawl its landing page and then select up to 20 internal pages to visit at random. We conduct four parallel crawls: two with third-party cookies enabled (3P-Allowed) and two with third-party cookies blocked (3P-Blocked). Parallelizing the crawls minimizes temporal variations across crawls and mitigates the effect of the dynamic behavior of websites.

We run all crawls in the US to minimize the impact of the EU GDPR and do not interact with cookie banners. We also turn off additional protections against tracking provided by Firefox [10]. We repeat failed crawls up to four times. We successfully conducted the four parallel crawls for 99.31% of the 20K websites.

Labeling tracking activity. To label tracking, we use EasyList [8] and EasyPrivacy [9]. Specifically, we use them to label requests as tracking (ATS) or not tracking (Non-ATS). We label a request as tracking (ATS) if its URL matches the rules in either one of the lists. Otherwise, we label it as not tracking (Non-ATS).

Since the basic premise of tracking is to identify users, we are particularly interested in sharing of identifiers in these tracking requests. In line with prior work [53, 63], we define identifiers as a string that is longer than 8 characters and matches the regex [a-zA-Z0-9_=-]. Using this definition, we look for identifiers in URL query parameters [75] and cookie values [43, 44, 49, 77].

4.2 Tracking When Third-Party Cookies Are Blocked
We first study whether blocking third-party cookies effectively eliminates ATS requests. We compare the number of requests containing identifiers with and without third-party cookies.

[Figure 2: bar chart; y-axis: % of Sites (0 to 60); x-axis: top-10 tracking domains; series: 3P-Allowed (third-party cookies allowed) and 3P-Blocked (third-party cookies blocked)]
Figure 2: Presence of top-10 tracking domains. The plot shows the percentage of sites where at least one request containing an identifier is sent to a tracking domain.

Table 1 shows the average number of requests for two parallel crawls conducted with third-party cookies allowed and blocked. We see that there is only a modest reduction in the overall number of ATS requests when third-party cookies are blocked. The number of ATS requests containing identifiers drops by 10.82%. This is surprising because cookie syncing, which is widely used for same-site cross-domain and cross-site tracking [54, 73], entails sharing third-party identifier cookies in query parameters [43, 44, 49]. With third-party cookies blocked, cookie syncing between third-parties cannot occur, and we would expect to see a larger drop in identifiers shared in ATS requests. We conclude that third-party cookie blocking does not effectively limit the exfiltration of identifiers to trackers.

Table 1: Average number of requests per site in 3P-Allowed and 3P-Blocked configurations

Request Count          3P-Allowed   3P-Blocked    Change
Total                      771.47       766.43    -0.65%
Tracking                   303.46       288.08    -5.07%
Non-Tracking               468.01       478.35     2.21%
Tracking with ID           126.43       112.75   -10.82%
Tracking without ID        177.02       175.32    -0.96%
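
As a quick check on Table 1, the Change column can be recomputed from the two averages. The sketch below assumes that Change is the relative difference (3P-Blocked - 3P-Allowed) / 3P-Allowed, which is our reading of the table; the name percent_change is illustrative and does not come from the paper.

```python
# Averages copied from Table 1: (3P-Allowed, 3P-Blocked) requests per site.
TABLE_1 = {
    "Total": (771.47, 766.43),
    "Tracking": (303.46, 288.08),
    "Non-Tracking": (468.01, 478.35),
    "Tracking with ID": (126.43, 112.75),
    "Tracking without ID": (177.02, 175.32),
}


def percent_change(allowed: float, blocked: float) -> float:
    """Relative change from 3P-Allowed to 3P-Blocked, in percent (assumed formula)."""
    return (blocked - allowed) / allowed * 100.0


if __name__ == "__main__":
    for label, (allowed, blocked) in TABLE_1.items():
        print(f"{label:<20} {percent_change(allowed, blocked):+.2f}%")
```

Under this assumption the recomputed values agree with the Change column to two decimal places, including the 10.82% drop for tracking requests with identifiers.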

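
The identifier definition from Section 4.1 (a string longer than 8 characters drawn from the character class [a-zA-Z0-9_=-]) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: requiring the whole value to match is our interpretation, and the helper names and the example URL are hypothetical.

```python
import re
from urllib.parse import parse_qsl, urlsplit

# Character class taken from the text; {9,} encodes "longer than 8 characters".
# Matching the entire value (fullmatch) is an assumption of this sketch.
ID_PATTERN = re.compile(r"[a-zA-Z0-9_=-]{9,}")


def is_identifier(value: str) -> bool:
    """Return True if the value looks like an identifier under the definition above."""
    return ID_PATTERN.fullmatch(value) is not None


def identifiers_in_url(url: str) -> list:
    """Collect query-parameter values of a URL that look like identifiers."""
    query = urlsplit(url).query
    return [value for _, value in parse_qsl(query, keep_blank_values=True)
            if is_identifier(value)]


if __name__ == "__main__":
    # Hypothetical tracking request used only for illustration.
    print(identifiers_in_url("https://tracker.example/sync?uid=abc123XYZ_9&lang=en"))
```

The same is_identifier check could be applied to cookie values; how the study tokenizes compound cookie values is not described in this section, so that step is left out.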

Next, we analyze whether third-party cookie blocking disparately impacts different ATS domains (eTLD+1). Figure 2 plots the percentage of sites with at least one ATS request with identifiers. Six of the top-10 ATS domains, all owned by Google, show only a negligible reduction in the number of ATS requests with identifiers when third-party cookies are blocked. In contrast, three other ATS domains, owned by Pubmatic, Rubicon, and OpenX, show a significant reduction.

4.3 Tracking Through First-Party Cookies
Table 1 shows that even after blocking third-party cookies, there is only a small decrease in ATS requests containing identifiers (10.82%). The identifiers in these ATS requests are likely originating from some storage mechanism other than third-party cookies. Since recent prior work has shown that ATS are increasingly using first-party cookies [44, 77], we next investigate whether first-party cookies are being used in lieu of third-party cookies to circumvent third-party cookie blockage.

We first compare the average number of first-party cookies in 3P-Allowed and 3P-Blocked crawls in Table 2. We observe only a minor difference in the average number of first-party cookies set with third-party cookies allowed/blocked. It is also noteworthy that 83.15% of the first-party cookies are set by ATS scripts. A further 57.94% of them are identifier cookies. We conclude that the vast majority of first-party cookies are in fact set by ATS and that they are not significantly impacted by third-party cookie blocking.

Table 2: Average number of first-party cookies per site in 3P-Allowed and 3P-Blocked configurations

1P Cookie Count              3P-Allowed   3P-Blocked   Change
Total                        132.65       137.73       -3.84%
Set by Trackers              109.74       114.53       -4.36%
Set by Non-Trackers          22.90        23.20        -1.31%
Set by Trackers with ID      64.09        66.37        -3.55%
Set by Trackers without ID   45.64        48.15        -5.50%

Next, we compare the setting of first- and third-party identifier cookies by ATS domains (eTLD+1 of the setting script URL) to understand if first-party cookie usage is equally prevalent across different ATSes. Figure 3 plots the percentage of sites where at least one first-party and/or third-party identifier cookie is set by a top-10 ATS domain.

[Figure 3 plot: bar chart; y-axis: % of Sites (0 to 70); x-axis: top-10 ATS domains]
Figure 3: Comparison of percentage of sites on which first-party and third-party identifier cookies are set by ATS domains. Series: first-party identifier cookies set when third-party cookies are allowed; first-party identifier cookies set when third-party cookies are blocked; third-party identifier cookies set when third-party cookies are allowed.

For the six Google-owned ATS domains, which showed a negligible difference in requests containing identifiers after blocking third-party cookies, there is also little to no change in the use of first-party identifier cookies across both crawls. These domains do not set a large number of third-party identifier cookies, even when those are allowed, which likely explains why they were not impacted by third-party cookie blocking.

On the contrary, the other set of ATS domains for which we observe a reduction of identifiers (i.e., Pubmatic, Rubicon, and OpenX) do use more third-party identifier cookies than first-party identifier cookies when third-party cookies are authorized. This observation also explains the drastic drop in the number of requests containing identifiers to these other ATS domains after blocking third-party cookies in Figure 2. We conclude that trackers that are not affected by third-party cookie blocking are using first-party cookies as a replacement.

4.4 Takeaway
Our differential measurement study reveals that third-party cookie blocking does not effectively prevent tracking. There is only a negligible reduction in the exfiltration of identifiers to trackers when third-party cookies are blocked. We find that this is because ATSes use first-party cookies in lieu of third-party cookies.

We also find that the impact of third-party cookie blocking is not uniform across different trackers. Some ATS domains show more reduction in the exfiltration of identifiers than others. This disparity exists because some trackers only use first-party cookies regardless of the availability of third-party cookies; while others are using both first-party and third-party cookies to store identifiers.

5 COOKIEGRAPH: DETECTING FIRST-PARTY TRACKING COOKIES
In this section, we describe CookieGraph, a graph-based machine learning approach to detect first-party ATS cookies. CookieGraph creates a graph representation of a webpage’s execution based on HTML, network, JavaScript, and storage information collected by an instrumented browser. In this graph, first-party cookies are represented as storage nodes. CookieGraph extracts distinguishing features of these cookies and uses a random forest classifier to


detect first-party ATS cookies. Figure 4 provides an overview of CookieGraph’s pipeline.

[Figure 4 diagram panels: Webpage crawl using OpenWPM instrumented Firefox browser (third-party cookies blocked); Process crawl data to build a graph representation of page execution; Cookie feature extraction and labeling using filter lists and Cookiepedia; First-party cookie classification (ATS / Non-ATS)]
Figure 4: Overview of CookieGraph pipeline: (1) Webpage crawl using an instrumented browser; (2) Construction of a graph representation to represent the instrumented webpage execution information; (3) Feature extraction for graph nodes that represent first-party cookies; and (4) Classifier training to detect first-party ATS cookies.

5.1 Design and Implementation
Browser instrumentation. CookieGraph uses our extended version of OpenWPM [52] to capture webpage execution information across HTML, network, JavaScript, and the storage layers of a webpage. Our analysis reveals significant usage of localStorage in addition to cookies. In total, we found 217,444 unique first-party cookie names and 99,682 unique localStorage names. In addition to this, we found 13,571 instances where the same first-party cookie was also stored in local storage. Thus, we use the term “storage” to refer to both cookies and localStorage. In most cases, the description for cookies is also applicable to localStorage and vice versa.

CookieGraph captures HTML elements created by scripts, network requests sent by HTML elements (as they are parsed) and scripts, responses received by the browser, exfiltration/infiltration of identifiers in network requests/responses, and read/write operations on the browser’s storage mechanisms.

Graph construction. The nodes in CookieGraph’s graph represent HTML elements, network requests, scripts, and storage elements. When localStorage and first-party cookie nodes share the exact same name, CookieGraph considers them as one storage node. CookieGraph’s edges represent a wide range of interactions among different types of nodes, e.g., scripts sending HTTP requests or scripts setting cookies. In addition to interactions considered by prior work [79], CookieGraph incorporates edges that model the actions associated with tracking using first-party cookies. We identify these actions from the result of our measurement study in Section 4, and the case studies described in Appendix A.1. Cookies are typically set with the values infiltrated with HTTP responses and are exfiltrated via URL parameters and request headers or bodies; CookieGraph captures infiltrations and exfiltrations by linking the cookies read or written by a script in the first-party execution context to the requests of the reader/writer script that contain those cookie values. In addition to plain text cookie values, CookieGraph also monitors Base64-, MD5-, SHA-1-, and SHA-256-encoded cookie values in URLs, headers, and request and response bodies. CookieGraph tracks the value of each cookie and associates the relevant interaction (exfiltration or infiltration) to the element that initiated the interaction. Because of our focus on identifiers, CookieGraph only captures cookie values that are at least 8 characters long (but it would be trivial to extend it to consider smaller cookie values). Figure 5 illustrates how CookieGraph creates a graph representation. In this example, a third-party script from tracker1.com executes in a first-party context on the webpage, example.com. The script first reads infoCookie (1), which contains tracking information such as the publisher ID and a user signature. Then, the script sends the content of the cookie to tracker1.com’s sync endpoint via an HTTP POST request (2). The endpoint returns a user ID (UID) in the response body (3), which is stored in both a first-party cookie and localStorage named IDStore (4). At a later point, the script reads the value from IDStore (5) and exfiltrates the UID to two other tracking endpoints: to tracker2.com via a URL parameter (6) and to tracker3.com via an HTTP header (7).

Figure 6 shows the graph representation that CookieGraph generates for the execution of the example script. The nodes in the graph represent the script, the storage, and the network endpoints. The edge numbers show the actions performed in Figure 5. The dotted and dashed lines in the graph show the infiltration and exfiltration behaviors captured by CookieGraph. CookieGraph is not only able to capture the interactions of the script with the storage and the network endpoints, but is also able to precisely link exfiltration and infiltration of the first-party cookie via an edge from the cookie node to the endpoint.

Feature extraction. We use CookieGraph’s representation to extract two kinds of features.

Structural features represent relationships between nodes in the graph, such as ancestry information and connectivity. Structural features capture the relationships between the first-party cookie nodes and scripts on the page. Examples include how many scripts interacted with a cookie and whether a script that interacted with a cookie also interacted with other cookies.

Flow features represent first-party ATS cookie behavior. We extract three types of flow features. First, we count the number of times a cookie was read or written. Second, we count the number of times a cookie was infiltrated via HTTP responses or exfiltrated via URL parameters, request headers, or request bodies. Third, features


related to the setter of the cookie. Concretely, whether the setter’s domain also acted as an end-point for other cookie exfiltrations, and whether the setter’s domain was involved in redirect chains (since redirects are commonly used in tracking). The intuition behind the third category of features is that domains involved in setting first-party ATS cookies are also involved in sharing information with other ATSes.

CookieGraph does not use content features, such as cookie names, as they can be trivially modified to evade detection [63, 79].

[Figure 5 diagram: on example.com, a JavaScript from tracker1.com interacts with browser storage (infoCookie, IDStore) and three endpoints. Steps: 1. read infoCookie via document.cookie; 2. send infoCookie value to sync point (tracker1.com); 3. receive UID in response payload; 4. store IDStore via document.cookie; 5. read UID via document.cookie; 6. exfiltrate IDStore value in URL (tracker2.com); 7. exfiltrate IDStore value in HTTP header (tracker3.com)]
Figure 5: Example scenario to illustrate CookieGraph’s graph construction (shown in Figure 6).

[Figure 6 diagram: graph with the nodes infoCookie, IDStore, and JavaScript plus network nodes; edges are numbered with the steps of Figure 5]
Figure 6: Graph representation of Figure 5 in CookieGraph. Distinct markers denote network nodes, script nodes, and storage nodes. While the solid lines show the interactions of the script nodes with the storage and request nodes, the dashed (- - -) and dotted (. . .) lines represent the exfiltration and infiltration edges that are captured by CookieGraph.

5.2 Evaluation
Similar to previous work on graph-based webpage modelling [62, 79], we use a random forest classifier to distinguish between ATS and Non-ATS cookies. We first train and test the accuracy of this classifier on a carefully labeled dataset. Then, we deploy it on our 20K website dataset.

5.2.1 Ground truth labeling. We use two complementary approaches to construct our ground truth for first-party ATS cookies. We represent each first-party cookie as a cookie-domain pair since the same cookie name can occur on multiple sites.

Filter lists. We rely on filter lists [8, 9] as previous work has found them to be reasonably reliable in detecting ATS endpoints [62, 79]. Filter lists are designed to label resource URLs, rather than cookies. We adapt them to label cookies by assigning the label of a particular resource to all the cookies set by that resource. Since both ATS and Non-ATS cookies can be set by the same resource, this labeling procedure could result in a non-trivial number of false positives. To limit the number of false positives in our ground truth, we only label Non-ATS cookies based on filter lists: i.e., if a script that sets a cookie is not marked by any of the filter lists, we label these cookies as Non-ATS. Conservatively, if any one of the filter lists marks the cookie’s setter as ATS, we label the cookie as Unknown.

Cookiepedia. Inspired by prior work [42], we use Cookiepedia [14] as an additional source of cookie labels. Cookiepedia is a database of cookies maintained by a well-known Consent Management Platform (CMP) called OneTrust [42, 58]. For each cookie/domain pair, Cookiepedia provides its purpose, defined primarily through the cookie integration with OneTrust. Each cookie is assigned one of four labels: strictly necessary, functional, analytics, and advertising/tracking. As Cookiepedia-reported purposes are self-declared, we adopt a conservative approach: we only label a cookie-domain pair as ATS if a cookie’s purpose is declared as advertising/tracking or analytics in a particular domain. If the declared purpose is strictly necessary or functional, we label the cookie as Unknown, as the cookie might have been, mistakenly or intentionally, mislabeled.

We combine the results of the labeling approaches to obtain a final label for the cookies. If both approaches label a cookie as Unknown, its final label is Unknown. If only one of the approaches has a known label, this is the final label. If Cookiepedia marks a cookie as ATS and filter lists mark it as Non-ATS, we give precedence to the Cookiepedia label and assign the final label as ATS because websites are unlikely to self-declare their Non-ATS cookies as ATS.

Using this labeling process, 82,098 out of 304,162 (26.99%) first-party cookie and domain pairs have a known (ATS or Non-ATS) label and the rest are labeled as Unknown. We observe that cookies set by the same script across two different sites are often labeled ATS in one instance and Unknown in another instance because Cookiepedia does not have data for the latter. As it is unlikely that an ATS script changes purpose across sites, we propagate the ATS label to all instances set by the same script. Using this label propagation, we label 37.92% of the data, with 53,183 (46.10%) ATS and 62,184 (53.90%) Non-ATS labels.

5.2.2 Classification. We train and test the classifier on the labeled dataset using standard 10-fold cross-validation. We ensure that


there is no overlap in the websites used for training and testing in each fold. Similar to Section 5.1, we limit the classifier to cookies whose value is at least 8 characters long. The classifier has 90.07% precision and 92.09% recall, with an overall accuracy of 90.18%, indicating that the classifier is successful in detecting ATS cookies.

5.3 Feature Analysis

Figure 7: Feature distribution of cookie exfiltrations (top) and storage sets (bottom) for ATS and Non-ATS cookies. ATS cookies are exfiltrated and set more than Non-ATS cookies, resulting in flow features based on exfiltrations and sets being helpful for the classifier.

We conduct feature analysis to understand the most influential features in the classification of cookies. We find that the most influential features are the flow features, which capture cookie exfiltrations, set operations, and redirections by cookie setters. Figure 7 shows the distributions for the number of cookie exfiltrations (top) and the number of times a cookie is set (bottom), for ATS and Non-ATS cookies. ATS cookies are much more likely to be exfiltrated than Non-ATS cookies: ATS cookies have a median of 6 exfiltrations (mean/std is 11.11/15.95) as compared to a median of 0 for Non-ATS cookies (mean/std is 0.62/5.29). Also, ATS cookies tend to be set much more frequently by scripts, with a median of 3 set operations (mean/std is 4.86/6.99) as compared to 1 for Non-ATS cookies (mean/std is 2.17/6.08).

Our analysis of 26,242 first-party ATS cookies which were set more than 3 times on the same site in our crawls shows that 50.2% of these cookies were set with the same value but different expiry values. This suggests that trackers periodically re-set a cookie to evade expiry limits enforced by ITP (Safari) [87] and ETP (Firefox) [10]. For the rest of the ATS cookies, it appears the most common use case of re-setting is to update the ID value stored in the cookie. As we described in our threat model and case studies, the ID values stored in these cookies are updated when the ATS obtains more information about the user and finds a new match in the ID graph. Thus, it is not surprising that these ATS cookies are continuously being updated with new, improved identifiers for the user. These findings confirm our conclusions in Section 4: first-party ATS cookies are used to store identifiers which are then exfiltrated to multiple endpoints.

Error analysis. We conduct a manual analysis of CookieGraph’s false positives and false negatives to understand failures.

We find that the cookies that were most misclassified as ATS are those whose publicly available descriptions indicate they are used to track visitors on a page (e.g., __attentive_id, messagesUtk, omnisendAnonymousID) [4, 11, 13]. We also find a few instances of well-known Google Analytics cookies _ga and _gid that are labeled in ground truth as Non-ATS, but are classified by CookieGraph as ATS. Our manual inspection also shows that most false positives are not misclassifications; rather, the tracking cookies flagged by CookieGraph were mislabeled as Non-ATS in the ground truth. In other words, CookieGraph has likely correctly classified these tracking cookies. We note that even after our procedures to improve ground truth labels, there may be cookies that did not have self-disclosed labels or were served from slightly different scripts (thereby missing our hash-based script matching). This is a limitation of our ground truth, as it relies on either the self-declaration of the cookie purpose or a match between the setting scripts to determine if a cookie is ATS. We leave the investigation of methods of improving the ground truth labeling to future work.

Regarding false negatives, i.e., ATS cookies missed by CookieGraph, we mainly observe two cases. First, we have the case of finite coverage of encodings. A representative case is the _pin_unauth cookie. Its value is double-base64-encoded, which is not included in the list of potential encoding schemes used by CookieGraph to detect exfiltration. These false negatives can be averted by using a more comprehensive list of encoding schemes or by performing full-blown information flow tracking instead of approximating exfiltration flows; however, the latter would come at a performance cost, as we discuss in Section 5.4.

Second, we have the case of lack of coverage of actions. Our crawl to create the graphs in CookieGraph may not capture all possible actions on a webpage. If CookieGraph does not capture sufficient activity during webpage execution, some cookies may not be triggered and therefore, the analysis will miss them. We further discuss these cases of false negatives in Section 7.1.

5.4 Comparison with Existing Countermeasures
In this section, we compare CookieGraph with existing countermeasures that are used to restrict the effect of first-party cookies.

Intelligent Tracking Prevention (ITP) is used by Safari as a broad countermeasure against online tracking activities. Under ITP, Safari limits the maximum expiry time of a first-party cookie set through JavaScript and HTTP requests received from IP addresses different from the host website to seven days [84]. In addition, Safari limits this time to only 24 hours for known trackers (see footnote 2). This can be a prudent countermeasure if the first-party tracking cookies were meant to be a storage for the identifier for the repeat visits of the user. However, as we have shown in the previous section, first-party tracking cookies are shared with a large number of other domains immediately after being set. This sharing of identifiers among different trackers is meant to enhance their ability to track

Footnote 2: Firefox also limits the expiry time for cookies set by known trackers to 24 hours [69].


users across different sites. Limiting the amount of time that a cookie is set for will not be able to stop this sharing of information, thus proving ineffective in protecting user privacy.

5.5 Comparison to classifier-based blocking
Next, we compare CookieGraph with state-of-the-art countermeasures against ATS, CookieBlock [42] and WebGraph [79], in terms of detection accuracy, website breakage, and robustness.

CookieBlock [42] is a state-of-the-art approach to classify cookies, including advertising/tracking and analytics. It makes use of both manually curated allow lists and a machine learning classifier, which mainly relies on features based on cookie attributes (cookie names and values).

WebGraph [79] is the state-of-the-art graph-based approach to classify ATS requests. As WebGraph is not designed to directly classify cookies, we adapt it by taking the resources that WebGraph identifies as ATS in 3P-Blocked and generating, for each domain, a block list of the cookies set by those resources. This list is meant to mimic the effect of blocking these resources on first-party ATS cookies.

5.5.1 Detection Accuracy. Table 3 compares the detection accuracy of CookieGraph with CookieBlock and WebGraph. CookieGraph outperforms both approaches in all metrics. The superiority in precision indicates that existing countermeasures result in many more false positives than CookieGraph. These additional false positives mean that previous approaches would block functional first-party cookies, potentially affecting user experience.

Table 3: Classification accuracy of CookieGraph, WebGraph, and CookieBlock

Classifier     Accuracy   Precision   Recall
CookieGraph    90.18%     90.07%      92.09%
WebGraph       79.05%     71.67%      86.17%
CookieBlock    72.87%     70.73%      80.85%

We also compared CookieGraph’s performance against popular filter lists [8, 9]. We found that 52.51% (834) of the third-party script domains that set first-party ATS cookies identified by CookieGraph are not blocked by filter lists. Some of the most common examples include dynamicyield.com, pinimg.com, auryc.com, tinypass.com, and driftt.com. Some of the scripts loaded from these domains might be blocked by filter lists, but our tool finds and blocks tracking cookies from scripts that are either missed by filter lists, or are exempted due to breakage issues. For example, some scripts from assets.adobedtm.com are blocked while other scripts are allowed and set the s_sq tracking cookie. CDNs like CloudFront are a common example of such domains that are used to serve both functional and tracking scripts.

5.5.2 Website Breakage. We manually analyze the breakage caused by CookieGraph, CookieBlock, and WebGraph on 50 sites that are sampled from the 20K sites used in Section 4 (25 sites chosen randomly from the top 100 and the other 25 from the rest; see footnote 3).

We divide our breakage analysis into four categories of typical website usage: navigation (from one page to another), SSO (initiating and maintaining login state), appearance (visual consistency), and miscellaneous functionality (chats, search, shopping cart, etc.). We label breakage as major or minor for each category: major breakage, when it is not possible to use the functionality of the site included in any of the aforementioned categories, and minor breakage, when it is difficult, but not impossible, for the user to make use of the functionality. To assess breakage, we compare a vanilla Chrome browser (with no countermeasures against first-party cookies) with browsers enhanced with an extension that blocks first-party cookies classified as ATS by CookieGraph, enhanced with an extension which blocks all cookies set by resources labeled as ATS by WebGraph, and enhanced with the official CookieBlock extension [22]. In this analysis, we also include two additional configurations: filter lists [8, 9], and a Google Chrome with all cookies blocked. We used two reviewers to perform the breakage analysis to mitigate the impact of biases or subjectivity. Any disagreements between the reviewers were resolved after careful discussion.

Out of the 50 sites, CookieGraph only had major breakage on one site where a cookie popup kept freezing up and preventing navigation around the website due to the deletion of a cookie that stores user preferences. In contrast, WebGraph, CookieBlock, and filter lists cause major breakage in one of the four categories on at least 6% of the sites. For example, WebGraph causes issues with cart functionality on etsy.com, complete homepage breakage on aliexpress.us, and SSO issues on other sites. Most of the breakage issues of CookieBlock relate to SSO logins and additional login-dependent functionality (e.g., missing profile picture). Our results, that CookieBlock causes breakage on 10% of the sites with SSO logins, are similar to the 7-8% breakage reported by the authors [42]. Blocking all cookies results in major breakage on 32% of the sites tested, with SSO and cart functionality proving to be the most recurring issues.

We also find that WebGraph blocks some additional first-party cookies that are important for server-side functionality, but not directly related to user experience and therefore not immediately perceptible. For example, WebGraph blocks essential cookies such as the Bm_sz cookie used by Akamai for bot detection, the XSRF-TOKEN cookie used to prevent CSRF on different sites, and the AWSALB cookies used by Amazon for load balancing. CookieGraph correctly classified these cookies as Non-ATS, and thus does not prevent these measures from being deployed.

Table 4: Website breakage comparison of all three countermeasures, distinguishing no breakage, minor breakage, and major breakage. Each cell represents the percentage of sites on which breakage was observed.

Classifier     Navigation       SSO              Appearance       Miscellaneous
               Minor   Major    Minor   Major    Minor   Major    Minor   Major
CookieGraph    0%      2%       0%      0%       0%      0%       0%      0%
WebGraph       6%      2%       0%      2%       4%      2%       2%      2%
CookieBlock    2%      0%       0%      10%      0%      0%       2%      2%
Filter lists   4%      2%       0%      2%       2%      2%       2%      4%
No Cookies     8%      8%       0%      32%      6%      12%      2%      28%

Footnote 3: The list of sites used in breakage analysis is available at: https://github.com/cookiegraph/CookieGraph
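
The adaptation of WebGraph described in Section 5.5, which turns resource-level ATS labels into a per-site block list of cookies, can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the CookieSet record, the build_block_list function, and the exact-URL matching of setter resources are all hypothetical choices made for this example.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple


class CookieSet(NamedTuple):
    """One observed cookie-set event from a crawl (hypothetical record layout)."""
    site: str         # first-party domain on which the cookie was set
    cookie_name: str  # name of the first-party cookie
    setter_url: str   # URL of the resource (script) that set the cookie


def build_block_list(
    events: Iterable[CookieSet], ats_resources: set[str]
) -> dict[str, set[str]]:
    """Map each site to the cookie names set by resources labeled as ATS."""
    block_list: defaultdict[str, set[str]] = defaultdict(set)
    for event in events:
        # A cookie inherits the label of the resource that set it.
        if event.setter_url in ats_resources:
            block_list[event.site].add(event.cookie_name)
    return dict(block_list)
```

Because every cookie inherits the label of its setter, any functional cookie set by a resource labeled ATS also ends up on the block list.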


5.5.3 Robustness. We compare the robustness to evasion of CookieGraph, CookieBlock, and WebGraph, i.e., to intentional modifications of the cookies to cause the misclassification of ATS cookies as Non-ATS. Since ATS are known to engage in an arms race with privacy-enhancing tools [39, 61, 66], it is important to test whether the detection of first-party ATS cookies is brittle in the face of trivial manipulation attempts such as changing cookie names.

We evaluate robustness on a test set of 2,000 sites from our dataset which also have the required CMP needed by CookieBlock for data collection and training. This translates to a total set of 7,726 first-party cookies. We change the names of the cookies in our test set to randomly generated strings of between 2 and 15 characters. Both CookieGraph and WebGraph are fully robust to manipulation of cookie names, whereas CookieBlock’s accuracy degrades by more than 15.87%, and its precision and recall degrade by 15.23% and 16.79%, respectively. CookieGraph and WebGraph are robust because they do not use any content features (features related to the cookie characteristics, such as cookie name or domain) since these can be somewhat easily manipulated by an adversary aiming to evade classification [79]. On the contrary, the most important feature of CookieBlock depends on the cookie name, i.e., whether the name belongs to the top 500 most common cookie names [41].

CookieGraph’s implementation of flow features can be manipulated by an adversary by using a different encoding than it currently considers, or by changing the domains of exfiltration endpoints. CookieGraph’s robustness to these attacks can be improved by more comprehensive information flow tracking. However, full-blown information flow tracking would incur prohibitively high run-time overheads (up to 100X-1000X [57]) and implementation complexity in the browser [45, 46, 67, 81]. This overhead is significant not only at runtime but also in an offline setting. Optimistically, assuming a 100X overhead, the time required to crawl a single page increases from 60 seconds to 100 minutes. Crawling the landing page and 20 internal pages for one website will thus take 34 hours rather than 20 minutes. In addition to prohibitively large time for website crawls, this delay would also likely impact the fidelity of the page execution itself.

To assess the robustness of CookieGraph against manipulation of these flow features, we remove the features related to the flow of cookie information (exfiltration and infiltration of first-party ATS cookies) and then re-train/test the classifier. We find that CookieGraph’s accuracy drops by only 2% when exfiltration and infiltration features are removed. Our feature analysis using information gain shows that instead of focusing on exfiltration features, CookieGraph shifts focus to other features such as the number of local storage accesses by a script and redirections by cookie setters. While there is a slight performance degradation when these features are removed, CookieGraph is able to adapt

find that 89.86% of sites deploy at least one first-party ATS cookie. Of these sites, the average number of first-party ATS cookies per site is 12.38.

Who sets first-party ATS cookies? The vast majority (96.61%) of the first-party ATS cookies are set by third-party embedded scripts served from a total of 2,099 unique domains. This shows that first-party ATS cookies are in fact set and used by third-parties. These first-party cookies enable third-parties to perform same-site tracking as described in Section 3.

Who sends and receives first-party ATS cookies? Next, we analyze the most prevalent first-party cookies and the third-party entities that actually set them. Table 5 lists the top-25 out of 20,794 first-party ATS cookies (footnote 4) based on their prevalence (footnote 5). Two major advertising entities (Google and Facebook) set first-party ATS cookies on approximately a third of all sites in our dataset. CookieGraph detects the _gid and _ga cookies by Google Analytics as ATS on 62.63% and 53.27% of the sites. The public documentation acknowledges using these two first-party cookies to store user identifiers for tracking [27]. We also find evidence of widespread cross-domain first-party ATS cookie sharing. For example, the _gid and _ga cookies are respectively exfiltrated to 83 and 259 destination domains, more than 95% of which are non-Google domains.

CookieGraph detects the _fbp cookie by Facebook as ATS on 24.82% of the sites. Facebook’s public documentation acknowledges that the Facebook tracking pixel stores unique identifiers in the first-party _fbp cookie [24]. In fact, Facebook made a recent change to include first-party cookie support in its tracking pixel to avoid third-party cookie countermeasures [38]. It is again noteworthy that the _fbp cookie by Facebook is exfiltrated to 423 destination domains, more than 98% of which are non-Facebook domains.

TikTok, a social media app that is known to aggressively harvest sensitive user information [30], also recently added support for setting first-party tracking cookies using TikTok Pixel [35, 37]. TikTok’s first-party _ttp tracking cookie is present on 3.69% of sites, which is considerably lower than Facebook and Google but comparable to more specialized entities such as Criteo. Criteo’s cto_bundle cookie is amongst the most prevalent first-party ATS cookies. We observe that Criteo sets this first-party ATS cookie on 3.19% of the sites in our dataset and that it is exfiltrated to 24 destinations.

The extensive sharing of first-party ATS cookies to other domains enables cross-domain same-site tracking, through which a tracker that is unable to set first-party cookies is still able to track user activity on a site. Similar to results by [54], we find ATS cookies to be extensively shared in redirects. 22% of all first-party cookies exfiltrated were found to be part of redirects. Table 5 highlights extensive cookie-syncing between different ATS, e.g., Yandex and Hubspot’s first-party ATS cookies are shared to Google Analytics.
and still outperforms existing countermeasures by more than 10% CookieGraph makes use of these cookie-syncing based char-
in terms of classification accuracy. acteristics of first-party ATS cookies to detect tracking behavior.
In 15.73% of cases, the number of exfiltrations by a script setting a
6 DEPLOYMENT first-party ATS cookie is the most important classification feature
We deploy CookieGraph to classify first-party cookies in our crawl
of 20% of the top-million sites. 4We report distinct tuples of the cookie name and the setter script’s URL.
5 Prevalence denotes the percentage out of all sites analyzed on which the cookie was
Prevalence of first-party ATS cookies. CookieGraph classifies classified as ATS. Instances where the classification was Non-ATS are excluded from
61.37% of the 108,947 first-party cookies in our dataset as ATS. We the prevalence analysis.
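The crawl-time estimate in the robustness discussion (a 100X slowdown from full information flow tracking) can be checked with a short calculation. The sketch below only uses the figures quoted there (60 seconds per page, an optimistic 100X overhead, a landing page plus 20 internal pages); the function and constant names are illustrative assumptions and are not part of CookieGraph.

```python
# Back-of-the-envelope crawl-time estimate under full information flow tracking.
# All names are illustrative; the inputs are the figures quoted in the text.

BASELINE_SECONDS_PER_PAGE = 60   # time to crawl one page without flow tracking
OVERHEAD_FACTOR = 100            # optimistic end of the 100X-1000X range [57]
PAGES_PER_SITE = 1 + 20          # landing page plus 20 internal pages


def crawl_seconds(pages: int, seconds_per_page: float, overhead: float = 1.0) -> float:
    """Total sequential crawl time, in seconds, for one site."""
    return pages * seconds_per_page * overhead


baseline = crawl_seconds(PAGES_PER_SITE, BASELINE_SECONDS_PER_PAGE)
tracked = crawl_seconds(PAGES_PER_SITE, BASELINE_SECONDS_PER_PAGE, OVERHEAD_FACTOR)

print(f"per page with tracking:  {BASELINE_SECONDS_PER_PAGE * OVERHEAD_FACTOR / 60:.0f} min")
print(f"per site, baseline:      {baseline / 60:.0f} min")
print(f"per site, with tracking: {tracked / 3600:.0f} h")
```

With these inputs a single page takes 100 minutes and a full site takes about 35 hours instead of about 21 minutes, the same order of magnitude as the rounded figures quoted in the text.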

CCS ’23, November 26–30, 2023, Copenhagen, Denmark Shaoor Munir et al.

6, while the number of redirects sent is the most important feature in 6.39% of cases. Table 5 lists the most important feature for the classification of each of the top-25 ATS cookies, which shows the importance of both exfiltration and redirect information in detecting first-party ATS cookies.

Cross-site tracking. As discussed in Section 3, trackers use deterministic (e.g., email address) or probabilistic (fingerprinting) identifiers for cross-site tracking using first-party cookies.7 We show that scripts that set first-party ATS cookies are also involved in fingerprinting.

First, we analyze the first-party cookies set by the scripts from entities known to engage in browser fingerprinting. We use Disconnect’s sublist of fingerprinters [26, 33] from its tracking protection list [6]. We find that 50 (0.42%) distinct domains that set first-party cookies are also known fingerprinters. These domains are responsible for setting 32.17% of all first-party ATS cookies.

Second, we use FP-Inspector [60] to further determine whether first-party ATS cookies are set by fingerprinting scripts. Using FP-Inspector, we find that fingerprinting scripts set first-party cookies on 1,908 out of 20K sites. In total, 632 first-party cookies are set by fingerprinting scripts. 242 out of these 632 cookies, set by 175 different fingerprinting scripts, are classified by CookieGraph as ATS cookies. It is noteworthy that none of these 242 cookies (e.g., adtech_uid, tfstk, bafp, pxde, ssid) are listed as tracking cookies on Cookiepedia. Our manual analysis of the remaining 390 Non-ATS cookies shows that they store non-identifiable information (e.g., domain names, flags for cookie permissions).

7 LIMITATIONS
7.1 Completeness
CookieGraph relies on a graph representation of interactions between different elements during webpage execution. The number of interactions captured depends on the intensity and variety of user activity on a webpage (e.g., scrolling activity, number of internal pages clicked). Thus, CookieGraph may fail to detect certain ATS cookies if user activity is insufficient, because its graph representation would then not have captured particular interactions between different elements in the webpage.

To study the impact of user activity, we recrawl sites performing two to three times more internal page clicks than in the original crawl. We specifically recrawl 238 sites where Criteo’s cto_bundle cookie was originally classified as Non-ATS by CookieGraph. CookieGraph’s deployment on the recrawled sites results in successful detection of Criteo’s cto_bundle cookie as ATS on 121 of the 238 recrawled sites. We find that the average number of infiltrations (exfiltrations) increases from 1.54 to 2.95 (1.13 to 4.01) between the original and recrawled sites. We observed a similar trend for other prevalent first-party ATS cookies in our dataset.

We surmise that while there are cases where CookieGraph incorrectly classifies ATS as Non-ATS due to incompleteness of the graph representation, its decision reflects the behavior of the cookie at the time of classification. As more interaction is captured in the graph, CookieGraph is able to correctly switch the label to ATS. More importantly, CookieGraph never switches labels from ATS to Non-ATS due to increased interaction.

7.2 Deployment Overhead
CookieGraph’s implementation is not suitable for runtime deployment due to the performance overheads associated with the browser instrumentation and machine learning pipeline. We envision CookieGraph being used in an offline setting: (1) first-party ATS cookie-domain pairs are detected using CookieGraph, and (2) the detected cookie-domain pairs are added to a cookie filter list such as those already supported in privacy-enhancing browser extensions (e.g., uBlock Origin [36]) for run-time blocking. We argue that a reasonably frequent (e.g., once a week) deployment of CookieGraph on a large scale would be sufficient to generate the filter list and keep it up to date. This anti-circumvention based approach is frequently used by existing list-based countermeasures, and because CookieGraph does not rely on content features, it prevents evasion by advertisers and trackers. On the other hand, existing countermeasures [42], which make heavy use of cookie name and content features, cannot simply be re-run to generate block lists for updated ATS cookies. While advertisers and trackers can in theory change cookie names at a rate faster than CookieGraph’s periodic deployment, updating cookie names frequently is challenging in practice because setting these first-party ATS cookies across many different sites requires tight coordination between different entities. To illustrate the practical issues associated with changing cookie names, consider the legacy demdex cookie set by Adobe’s embedded script that is then exfiltrated to the demdex.net domain. Adobe’s documentation explains that it is difficult to change the legacy name because “... it is entwined deeply with Audience Manager, the Adobe Experience Cloud ID Service, and our installed user base” [5, 17]. If advertisers or trackers are somehow able to overcome these practical challenges and change cookie names at a much faster pace, an online implementation of CookieGraph for run-time cookie classification would be necessary. Further research is needed for an efficient and effective online implementation of CookieGraph.

8 CONCLUSION
In this paper, we investigated the use of first-party cookies for tracking. Through a large-scale differential measurement, we showed that trackers use first-party cookies to exfiltrate identifiers even when third-party cookies are blocked. We found that third-party cookie blocking is ineffective and blanket first-party cookie blocking is not practical because it results in major functionality breakage on almost one-third of sites. To detect and block first-party tracking cookies, we proposed CookieGraph, a machine-learning approach that captures fundamental tracking behaviors exhibited by first-party cookies. Our evaluation showed that CookieGraph outperformed the state-of-the-art in terms of detection accuracy, minimization of website breakage, and robustness to evasion attacks. Our deployment of CookieGraph on 20K websites provided evidence of widespread use of first-party tracking cookies on 89.86% of the tested sites. These first-party tracking cookies are set by third-party embedded scripts served from 2,099 domains that include major advertising entities such as Google, Facebook, and TikTok.

6 We use treeinterpreter (https://github.com/andosa/treeinterpreter) to determine the most important feature during the classification of ATS cookies.
7 While our automated crawls do not allow us to test the use of deterministic identifiers for cross-site tracking at scale, recent work [72] showed the use of email addresses and other deterministic identifiers by trackers such as Criteo.
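The offline workflow envisioned in Section 7.2 (detect cookie-domain pairs, then publish them in a filter list) can be sketched as follows. This is a minimal illustration, not CookieGraph’s released tooling: the function name and the sample detections are hypothetical, and the rule syntax is modeled on uBlock Origin’s cookie-remover scriptlet [36] as an assumption that should be checked against the extension’s documentation.

```python
from typing import Iterable, List, Tuple


def build_filter_rules(pairs: Iterable[Tuple[str, str]]) -> List[str]:
    """Turn detected (site_domain, cookie_name) pairs into sorted, de-duplicated rules.

    The rule format imitates uBlock Origin's cookie-remover scriptlet; treat it
    as an assumption rather than a verified specification.
    """
    rules = {
        f"{site.strip().lower()}##+js(cookie-remover, {name.strip()})"
        for site, name in pairs
        if site.strip() and name.strip()   # skip incomplete detections
    }
    return sorted(rules)


# Hypothetical output of one periodic (e.g., weekly) offline classification run.
detected = [
    ("shop.example", "_fbp"),
    ("news.example", "_ga"),
    ("shop.example", "_fbp"),   # same pair seen again on an internal page
]

rules = build_filter_rules(detected)
for rule in rules:
    print(rule)
```

Because rules are keyed on the cookie-domain pair, a renamed cookie only needs to be picked up by the next periodic run for the list to catch up.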


Table 5: List of top-25 ATS cookies detected by CookieGraph

| Cookie Name | Script Domain | Org. | Percentage of Sites | Destination Domains | Most Important Feature | Top Destination Domain #1 | Top Destination Domain #2 | Top Destination Domain #3 |
|---|---|---|---|---|---|---|---|---|
| _gid | google-analytics.com | Google | 62.63 | 83 | LocalStorage Sets through JavaScript | google-analytics.com | doubleclick.net | mountain.com |
| _ga | google-analytics.com | Google | 53.27 | 259 | LocalStorage Sets through JavaScript | google-analytics.com | doubleclick.net | google.com |
| _ga | googletagmanager.com | Google | 31.31 | 222 | LocalStorage Sets through JavaScript | google-analytics.com | doubleclick.net | google.com |
| _fbp | facebook.net | Facebook | 24.82 | 423 | Exfiltrations through URL | facebook.com | datadoghq.com | google-analytics.com |
| _gcl_au | googletagmanager.com | Google | 19.05 | 39 | LocalStorage Sets through JavaScript | doubleclick.net | google.com | anytrack.io |
| __gpi | googlesyndication.com | Google | 10.06 | 5 | Redirects by Setting Script | doubleclick.net | googleadservices.com | clicktripz.com |
| __gads | doubleclick.net | Google | 9.47 | 11 | Redirects by Setting Script | doubleclick.net | googleadservices.com | wmcdp.io |
| __gads | googlesyndication.com | Google | 9.26 | 4 | Redirects by Setting Script | doubleclick.net | googleadservices.com | clicktripz.com |
| __gpi | doubleclick.net | Google | 8.61 | 11 | Redirects by Setting Script | doubleclick.net | googleadservices.com | wmcdp.io |
| ln_or | licdn.com | Microsoft | 8.04 | 2 | LocalStorage Sets through JavaScript | tiqcdn.com | tealiumiq.com | |
| _uetsid | bing.com | Microsoft | 7.47 | 115 | LocalStorage Sets through JavaScript | bing.com | clarity.ms | datadoghq.com |
| _uetvid | bing.com | Microsoft | 7.47 | 134 | LocalStorage Sets through JavaScript | bing.com | clarity.ms | datadoghq.com |
| _ym_d | yandex.ru | Yandex | 6.29 | 312 | Redirects by Setting Script | google-analytics.com | doubleclick.net | google.com |
| _ym_uid | yandex.ru | Yandex | 6.29 | 103 | Redirects by Setting Script | google-analytics.com | adfox.ru | doubleclick.net |
| _hjTLDTest | hotjar.com | HotJar | 6.19 | 1955 | Exfiltrations through URL | google-analytics.com | google.com | facebook.com |
| __utmz | google-analytics.com | Google | 5.12 | 6 | LocalStorage Sets through JavaScript | google-analytics.com | retargetly.com | zbj.com |
| __utmb | google-analytics.com | Google | 5.12 | 11 | LocalStorage Sets through JavaScript | google-analytics.com | doubleclick.net | google.com |
| __utma | google-analytics.com | Google | 5.12 | 14 | LocalStorage Sets through JavaScript | google-analytics.com | thedermreview.com | paltalk.com |
| __utmc | google-analytics.com | Google | 5.01 | 26 | LocalStorage Sets through JavaScript | google-analytics.com | yandex.ru | moatads.com |
| OptanonConsent | cookielaw.org | CookieLaw | 4.04 | 1 | LocalStorage Sets through JavaScript | gbqofs.io | | |
| _clck | clarity.ms | Microsoft | 3.97 | 6 | Redirects by Setting Script | ezoic.net | doubleclick.net | tealiumiq.com |
| _clsk | clarity.ms | Microsoft | 3.93 | 5 | Redirects by Setting Script | smart-bdash.com | tealiumiq.com | brightfunnel.com |
| _ttp | tiktok.com | TikTok | 3.69 | 19 | LocalStorage Sets through JavaScript | tiktok.com | tiqcdn.com | uxfeedback.ru |
| __qca | quantserve.com | Quantcast | 3.38 | 67 | LocalStorage Sets through JavaScript | rubiconproject.com | yahoo.com | gumgum.com |
| cto_bundle | criteo.net | Criteo | 3.19 | 24 | Exfiltrations through URL | criteo.com | clarity.ms | akstat.io |
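The "Percentage of Sites" column follows the definitions in footnotes 4 and 5: cookies are counted as distinct tuples of cookie name and setter script, and prevalence is the share of all analyzed sites on which that tuple was classified as ATS. A minimal sketch of that computation is given below; the record layout, the function name, and the toy data are assumptions made for illustration, and the setter script is keyed by its domain as in the table.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# (site, cookie_name, setter_script_domain, label), where label is "ATS" or "Non-ATS"
Record = Tuple[str, str, str, str]


def prevalence(records: Iterable[Record], total_sites: int) -> Dict[Tuple[str, str], float]:
    """Percentage of all analyzed sites on which each (cookie, setter) tuple is ATS."""
    sites_with_ats: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
    for site, cookie, setter, label in records:
        if label == "ATS":   # Non-ATS instances are excluded from the analysis
            sites_with_ats[(cookie, setter)].add(site)
    return {key: 100.0 * len(sites) / total_sites for key, sites in sites_with_ats.items()}


# Hypothetical toy crawl of four sites.
toy = [
    ("a.example", "_ga", "google-analytics.com", "ATS"),
    ("b.example", "_ga", "google-analytics.com", "ATS"),
    ("c.example", "_ga", "google-analytics.com", "Non-ATS"),
    ("a.example", "_ga", "googletagmanager.com", "ATS"),
]

result = prevalence(toy, total_sites=4)
print(result[("_ga", "google-analytics.com")])
print(result[("_ga", "googletagmanager.com")])
```

Keying on the (cookie, setter) tuple is why the same cookie name, such as _ga, can appear in more than one row of the table.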

For reproducibility and to foster follow-up research, CookieGraph’s source code (patch to OpenWPM and the machine learning pipeline) and the detected list of first-party tracking cookies are available at https://github.com/cookiegraph/CookieGraph.

ACKNOWLEDGEMENTS
This work was supported in part by the National Science Foundation under grant numbers 2103439, 2103038, 2138139, and 2127309 (Computing Research Association for the CIFellows 2021 Project). Steven Englehardt was employed by DuckDuckGo during the project, but this work was completed independently.

REFERENCES
[1] 1996. This bug in your PC is a smart cookie. https://archive.org/details/FinancialTimes1996UKEnglish.
[2] 2001. Internet Privacy with IE6 and P3P: A Summary of Findings. http://web.archive.org/web/20200731061208/http://www.spywarewarrior.com/uiuc/ie6-p3p.htm.
[3] 2022. AdBlock Plus. https://adblockplus.org/.
[4] 2022. Attentive cookie. https://docs.attentivemobile.com/pages/developer-guides/third-party-integrations/referral-marketing-platforms/talkable/.
[5] 2022. Cookies and the Experience Cloud Identity Service. https://experienceleague.adobe.com/docs/id-service/using/intro/cookies.html?lang=en.
[6] 2022. Disconnect tracking protection lists. https://disconnect.me/trackerprotection.
[7] 2022. DoubleClick. https://web.archive.org/web/19970405225532/http://www.doubleclick.com/.
[8] 2022. EasyList. https://easylist.to/easylist/easylist.txt.
[9] 2022. EasyPrivacy. https://easylist.to/easylist/easyprivacy.txt.
[10] 2022. Enhanced Tracking Protection in Firefox for desktop. https://support.mozilla.org/en-US/kb/enhanced-tracking-protection-firefox-desktop.
[11] 2022. Hubspot cookie. https://knowledge.hubspot.com/reports/what-cookies-does-hubspot-set-in-a-visitor-s-browser.
[12] 2022. ID5 - First Party IDs and Identity Resolution Methods Explained. https://web.archive.org/web/20220408035339/https://id5.io/news/index.php/2022/03/24/first-party-ids-and-identity-resolution-methods-explained/.
[13] 2022. Omnisend cookie. https://support.omnisend.com/en/articles/1933402-explaining-and-managing-tracking-cookies.
[14] 2022. One Trust. Cookiepedia. https://cookiepedia.co.uk.
[15] 2022. Tracking Prevention in Microsoft Edge. https://docs.microsoft.com/en-us/microsoft-edge/web-platform/tracking-prevention.
[16] 2022. The Trade Desk and LiveRamp to Lead Industry Effort to Bring New Privacy-First Interoperable ID Solution to Meet Emerging Requirements in Europe. Website. https://www.thetradedesk.com/us/news/press-room/the-trade-desk-and-liveramp-to-lead-industry-effort-to-bring-new-privacy-first-interoperable-id-solution-to-meet-emerging-requirements-in-europe.
[17] 2022. Understanding Calls to the Demdex Domain. https://experienceleague.adobe.com/docs/audience-manager/user-guide/reference/demdex-calls.html?lang=en.
[18] 2023. Enhancing identity at scale with ID5. Website. https://www.thetradedesk.com/us/resource-desk/enhancing-identity-at-scale-with-id5.
[19] 2023. An open-source identity solution built for the open internet. Website. https://web.archive.org/web/20230629012449/https://unifiedid.com/.
[20] n.d. About publisher provided identifiers. https://web.archive.org/web/20220614165742/https://support.google.com/admanager/answer/2880055?hl=en.
[21] n.d. Cartographer Identity Graph. https://web.archive.org/web/20220526085916/https://www.lotame.com/solutions/cartographer-identity-graph/.
[22] n.d. CookieBlock. https://github.com/dibollinger/CookieBlock.
[23] n.d. Criteo Online Identification. https://web.archive.org/web/20220819071808/https://filecache.investorroom.com/mr5ir_criteo/977/download/Criteo_Online_Identification_May2020.pdf/.
[24] n.d. fbp and fbc Parameters. https://web.archive.org/web/20220722220344/https://developers.facebook.com/docs/marketing-api/conversions-api/parameters/fbp-and-fbc/.
[25] n.d. Firefox rolls out Total Cookie Protection by default to all users worldwide. https://blog.mozilla.org/en/products/firefox/firefox-rolls-out-total-cookie-protection-by-default-to-all-users-worldwide/.
[26] n.d. Firefox’s protection against fingerprinting. https://support.mozilla.org/en-US/kb/firefox-protection-against-fingerprinting.
[27] n.d. Google Analytics Cookie Usage on Websites. https://web.archive.org/web/20220812222800/https://developers.google.com/analytics/devguides/collection/gtagjs/cookie-usage.
[28] n.d. ID5 Identity Cloud. https://web.archive.org/web/20220727094611/https://www.id5.io/identity-cloud/.
[29] n.d. Identity Guide. https://web.archive.org/web/20220115155115/https://yieldbird.com/identity-guide/.
[30] n.d. It’s their word against their source code - TikTok Report. https://internet2-0.com/whitepaper/its-their-word-against-their-source-code-tiktok-report/.
[31] n.d. Lotame – Data Collection Guide. https://web.archive.org/web/20210730071853/https://my.lotame.com/t/p8hxvnd/data-collection-guide.
[32] n.d. Lotame Lightning Tag. https://web.archive.org/web/20220307010702/https://my.lotame.com/t/m1hxv7l/lotame-lightning-tag.
[33] n.d. Our New Approach to Address the Rise of Fingerprinting. https://blog.disconnect.me/our-new-approach-to-address-the-rise-of-fingerprinting/.


[34] n.d. Panorama ID. https://web.archive.org/web/20220327180718/https://www.lotame.com/panorama/id/.
[35] n.d. TikTok Adds Third-Party Cookies To Its Pixel – And Tries To Eat Facebook’s Lunch. https://web.archive.org/web/20220623232016/https://www.adexchanger.com/online-advertising/tiktok-adds-third-party-cookies-to-its-pixel-and-tries-to-eat-facebooks-lunch/.
[36] n.d. uBlock Origin: Resources Library. https://github.com/gorhill/uBlock/wiki/Resources-Library#cookie-removerjs-.
[37] n.d. Using Cookies with TikTok Pixel. https://web.archive.org/web/20220610074648/https://ads.tiktok.com/help/article?aid=10007540.
[38] n.d. What Facebook’s First-Party Cookie Means for AdTech. https://web.archive.org/web/20220729210450/https://clearcode.cc/blog/facebook-first-party-cookie-adtech/.
[39] Mshabab Alrizah, Sencun Zhu, Xinyu Xing, and Gang Wang. 2019. Errors, Misunderstandings, and Attacks: Analyzing the Crowdsourcing Process of Ad-blocking Systems. In Proceedings of the 2019 Internet Measurement Conference (IMC).
[40] Waqar Aqeel, Balakrishnan Chandrasekaran, Anja Feldmann, and Bruce M Maggs. 2020. On landing and internal web pages: The strange case of jekyll and hyde in web performance measurement. In Proceedings of the ACM Internet Measurement Conference.
[41] Dino Bollinger. n.d. Analyzing Cookies Compliance with the GDPR. https://www.research-collection.ethz.ch/handle/20.500.11850/477333. Thesis, ETH Zurich.
[42] Dino Bollinger, Karel Kubicek, Carlos Cotrini, and David Basin. 2022. Automating Cookie Consent and GDPR Violation Detection. In 31st USENIX Security Symposium (USENIX Security 22). USENIX Association.
[43] Aaron Cahn, Scott Alfeld, Paul Barford, and S. Muthukrishnan. 2016. An Empirical Study of Web Cookies. In Proceedings of the 25th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 891–901.
[44] Quan Chen, Panagiotis Ilia, Michalis Polychronakis, and Alexandros Kapravelos. 2021. Cookie Swap Party: Abusing First-Party Cookies for Web Tracking. In Proceedings of the Web Conference.
[45] Quan Chen and Alexandros Kapravelos. 2018. Mystique: Uncovering information leakage from browser extensions. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security. 1687–1700.
[46] Andrey Chudnov and David A Naumann. 2015. Inlined information flow monitoring for JavaScript. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security. 629–643.
[47] D. Kristol and L. Montulli. 1997. HTTP State Management Mechanism. https://datatracker.ietf.org/doc/html/rfc2109.
[48] Savino Dambra, Iskander Sanchez-Rola, Leyla Bilge, and Davide Balzarotti. 2022. When Sally Met Trackers: Web Tracking From the Users’ Perspective. In USENIX Security Symposium.
[49] Roberto Díaz-Morales. 2015. Cross-Device Tracking: Matching Devices and Cookies. In 2015 IEEE International Conference on Data Mining Workshop (ICDMW). 1699–1704.
[50] Brendan Eich. 2013. C is for Cookie. https://brendaneich.com/2013/05/c-is-for-cookie/.
[51] Brendan Eich. 2013. The Cookie Clearinghouse. https://brendaneich.com/2013/06/the-cookie-clearinghouse/.
[52] Steven Englehardt and Arvind Narayanan. 2016. Online tracking: A 1-million-site measurement and analysis. In Proceedings of ACM CCS 2016.
[53] Steven Englehardt, Dillon Reisman, Christian Eubank, Peter Zimmerman, Jonathan Mayer, Arvind Narayanan, and Edward W. Felten. 2015. Cookies That Give You Away: The Surveillance Implications of Web Tracking. In Proceedings of the 24th International Conference on World Wide Web.
[54] Imane Fouad, Nataliia Bielova, Arnaud Legout, and Natasa Sarafijanovic-Djukic. 2020. Missed by Filter Lists: Detecting Unknown Third-Party Trackers with Invisible Pixels. Proceedings on Privacy Enhancing Technologies 2020 (04 2020), 499–518. https://doi.org/10.2478/popets-2020-0038.
[55] Imane Fouad, Cristiana Santos, Arnaud Legout, and Nataliia Bielova. 2022. My Cookie is a phoenix: detection, measurement, and lawfulness of cookie respawning with browser fingerprinting. In Privacy Enhancing Technologies Symposium (PETS).
[56] Google. n.d. The Privacy Sandbox. https://developer.chrome.com/docs/privacy-sandbox/.
[57] Daniel Hedin, Arnar Birgisson, Luciano Bello, and Andrei Sabelfeld. 2014. JSFlow: Tracking information flow in JavaScript and its APIs. In Proceedings of the 29th Annual ACM Symposium on Applied Computing. 1663–1671.
[58] Maximilian Hils, Daniel W Woods, and Rainer Böhme. 2020. Measuring the emergence of consent management on the web. In Proceedings of the ACM Internet Measurement Conference.
[59] Xuehui Hu, Nishanth Sastry, and Mainack Mondal. 2021. CCCC: Corralling Cookies into Categories with CookieMonster. In 13th ACM Web Science Conference 2021. Association for Computing Machinery, 234–242.
[60] Umar Iqbal, Steven Englehardt, and Zubair Shafiq. 2021. Fingerprinting the Fingerprinters: Learning to Detect Browser Fingerprinting Behaviors. In IEEE Symposium on Security and Privacy (S&P). IEEE.
[61] Umar Iqbal, Zubair Shafiq, and Zhiyun Qian. 2017. The Ad Wars: Retrospective Measurement and Analysis of Anti-Adblock Filter Lists. In IMC.
[62] Umar Iqbal, Peter Snyder, Shitong Zhu, Benjamin Livshits, Zhiyun Qian, and Zubair Shafiq. 2020. AdGraph: A Graph-Based Approach to Ad and Tracker Blocking. In IEEE Symposium on Security and Privacy (S&P). IEEE.
[63] Umar Iqbal, Charlie Wolfe, Charles Nguyen, Steven Englehardt, and Zubair Shafiq. 2022. Khaleesi: Breaker of Advertising and Tracking Request Chains. In USENIX Security Symposium (USENIX).
[64] Pierre Laperdrix, Nataliia Bielova, Benoit Baudry, and Gildas Avoine. 2020. Browser fingerprinting: A survey. ACM Transactions on the Web (TWEB) 14, 2 (2020), 1–33.
[65] Pierre Laperdrix, Walter Rudametkin, and Benoit Baudry. 2016. Beauty and the beast: Diverting modern web browsers to build unique browser fingerprints. In 2016 IEEE Symposium on Security and Privacy (SP).
[66] Hieu Le, Athina Markopoulou, and Zubair Shafiq. 2021. CV-Inspector: Towards Automating Detection of Adblock Circumvention. In Network and Distributed System Security Symposium (NDSS).
[67] Sebastian Lekies, Ben Stock, and Martin Johns. 2013. 25 million flows later: Large-scale detection of DOM-based XSS. In Proceedings of the 2013 ACM SIGSAC Conference on Computer and Communications Security. 1193–1204.
[68] Pedro Giovanni Leon, Lorrie Faith Cranor, Aleecia M McDonald, and Robert McGuire. 2010. Token attempt: the misrepresentation of website privacy policies through the misuse of p3p compact policy tokens. In Proceedings of the 9th Annual ACM Workshop on Privacy in the Electronic Society.
[69] MDN. 2022. Redirect tracking protection. https://developer.mozilla.org/en-US/docs/Mozilla/Firefox/Privacy/Redirect_Tracking_Protection.
[70] Lou Montulli. 2013. The Reasoning Behind Web Cookies. http://montulli.blogspot.com/2013/05/the-reasoning-behind-web-cookies.html.
[71] Nick Nguyen. 2018. Latest Firefox Rolls Out Enhanced Tracking Protection. https://blog.mozilla.org/en/products/firefox/latest-firefox-rolls-out-enhanced-tracking-protection/.
[72] ChangSeok Oh, Chris Kanich, Damon McCoy, and Paul Pearce. 2022. Cart-Ology: Intercepting Targeted Advertising via Ad Network Identity Entanglement. In Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security.
[73] Panagiotis Papadopoulos, Nicolas Kourtellis, and Evangelos P. Markatos. 2019. Cookie Synchronization: Everything You Always Wanted to Know But Were Afraid to Ask. In Proceedings of the World Wide Web (WWW) Conference.
[74] Victor Le Pochat, Tom Van Goethem, Samaneh Tajalizadehkhoob, Maciej Korczyński, and Wouter Joosen. 2018. Tranco: A research-oriented top sites ranking hardened against manipulation. arXiv preprint arXiv:1806.01156 (2018).
[75] Audrey Randall, Peter Snyder, Alisha Ukani, Alex Snoeren, Geoff Voelker, Stefan Savage, and Aaron Schulman. 2022. Trackers Bounce Back: Measuring Evasion of Partitioned Storage in the Wild.
[76] Franziska Roesner, Tadayoshi Kohno, and David Wetherall. 2012. Detecting and Defending Against Third-Party Tracking on the Web. In 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12) (San Jose, CA). 155–168.
[77] Iskander Sanchez-Rola, Matteo Dell’Amico, Davide Balzarotti, Pierre-Antoine Vervier, and Leyla Bilge. 2021. Journey to the center of the cookie ecosystem: Unraveling actors’ roles and relationships. In S&P 2021, 42nd IEEE Symposium on Security & Privacy, 23-27 May 2021, San Francisco, CA, USA.
[78] Justin Schuh. 2020. Building a more private web: A path towards making third party cookies obsolete. https://blog.chromium.org/2020/01/building-more-private-web-path-towards.html.
[79] Sandra Siby, Umar Iqbal, Steven Englehardt, Zubair Shafiq, and Carmela Troncoso. 2022. WebGraph: Capturing Advertising and Tracking Information Flows for Robust Blocking. In 31st USENIX Security Symposium (USENIX Security 22). USENIX Association.
[80] Alexander Sjösten, Peter Snyder, Antonio Pastor, Panagiotis Papadopoulos, and Benjamin Livshits. 2020. Filter List Generation for Underserved Regions. In WWW.
[81] Ben Stock, Sebastian Lekies, Tobias Mueller, Patrick Spiegel, and Martin Johns. 2014. Precise Client-side Protection against DOM-based Cross-Site Scripting. In 23rd USENIX Security Symposium (USENIX Security 14). San Diego, CA, 655–670.
[82] Microsoft Edge Team. 2022. Introducing tracking prevention, now available in Microsoft Edge preview builds. https://blogs.windows.com/msedgedev/2019/06/27/tracking-prevention-microsoft-edge-preview/.
[83] Alessandra Van Veen and AP de Vries. 2021. Cookie Compliance of Dutch Hospital Websites. (2021).
[84] WebKit. 2022. Tracking Prevention in WebKit. https://webkit.org/tracking-prevention/.
[85] John Wilander. 2017. Intelligent Tracking Prevention. https://webkit.org/blog/7675/intelligent-tracking-prevention/.


[86] John Wilander. 2018. Intelligent Tracking Prevention 1.1. https://webkit.org/blog/8142/intelligent-tracking-prevention-1-1/.
[87] John Wilander. 2018. Intelligent Tracking Prevention 2.0. https://webkit.org/blog/8311/intelligent-tracking-prevention-2-0/.
[88] John Wilander. 2019. Intelligent Tracking Prevention 2.1. https://webkit.org/blog/8613/intelligent-tracking-prevention-2-1/.
[89] John Wilander. 2019. Intelligent Tracking Prevention 2.2. https://webkit.org/blog/8828/intelligent-tracking-prevention-2-2/.
[90] John Wilander. 2019. Intelligent Tracking Prevention 2.3. https://webkit.org/blog/9521/intelligent-tracking-prevention-2-3/.
[91] John Wilander. 2020. Full Third-Party Cookie Blocking and More. https://webkit.org/blog/10218/full-third-party-cookie-blocking-and-more/.

A APPENDIX
A.1 Case Studies
In this section, we look at case studies of four popular ATS that make use of first-party tracking cookies: Lotame, ID5, Criteo, and The Trade Desk.

A.1.1 Lotame. Lotame is an identity management solution that claims to provide a single ID to users across multiple browsers, devices, and platforms. Lotame’s Lightning Tag [32] packages the user visit data in a JSON object and sends it to its servers. Code 1 shows an example payload sent to Lotame. The payload includes IDs assigned by the website, third-party identifiers present on the site, certain user behaviors (configured through collaboration between the publisher and Lotame), and other custom rules defined per website [31]. Lotame processes the payload, matches the data against its Cartographer Identity Graph [21], and sends back an ID, called panoramaID [34], which is stored as a first-party cookie or in localStorage.

    data: {
      behaviorIds: [1, 2, 3],
      behaviors: {
        int: ['behaviorName', 'behaviorName2'],
        act: ['behaviorName']
      },
      ruleBuilder: {
        key1: ['value 1a', 'value 1b']
      },
      thirdParty: {
        namespace: 'NAMESPACE',
        value: 'TPID_VALUE'
      }
    }

Code 1: Example of the data structure sent to Lotame during a user’s first visit.

A.1.2 ID5 Universal ID. ID5 provides identity resolution for publishers and advertisers through its Identity Cloud [28]. ID5’s script sends a request to its Identity Cloud with a payload that contains several deterministic identifiers, such as email, usernames, and phone numbers (if available), as well as probabilistic identifiers, such as IP address, user agent, and location of the user [12]. Identity Cloud processes the payload and returns an ID, called universal_id, which is stored as a first-party cookie as well as in local storage. An example payload from ID5 is shown in Code 2. We note that ID5 also provides Partner Graph, a service that enables information sharing among its partners [28]. Partner Graph allows different identity providers to exchange information with each other.

    {
      "created_at": "2022-02-09T11:42:40.817811Z",
      "id5_consent": true,
      "original_uid": "ID5*FnFOGLkYzdJ...Oeg2Ok4VTNc",
      "universal_uid": "ID5*HGH7W7iMpMu3-...szRCJDUkiiu-tv5BQ",
      "signature": "ID5_Ab6tnGgm...JQWlsUEfynB1hBGZc",
      "link_type": 1,
      "cascade_needed": true,
      "privacy": {
        "jurisdiction": "other",
        "id5_consent": true
      }
    }

Code 2: Example of the data structure received from ID5 during a user’s first visit.

A.1.3 Criteo. Criteo provides Criteo Identity Graph for identity resolution [23]. Criteo Identity Graph is built from four different sources: (i) data contributed by advertisers, (ii) data collected from publisher websites, (iii) data provided by Criteo partners such as LiveRamp and Oracle, and (iv) predictions on existing data by Criteo’s machine learning models. Criteo claims that its identity graph is able to stitch together identifiers from more than 2 billion users across the world and that it contains persistent deterministic identifiers for 96% of the users [23]. Similar to other identity resolution services, Criteo generates an ID based on identifiers such as hashed emails, mobile device IDs, and cookie IDs, and stores it in both first-party cookies and localStorage as cto_bundle. As described in Section 5.1, CookieGraph’s graph representation abstracts storage to refer to both cookies and localStorage, and it includes a count of localStorage accesses in the feature set computed from the graph representation to effectively model this particular behavior.

A.1.4 The Trade Desk. The Trade Desk (TTD) is a digital marketing company whose stated aim is to improve digital advertising. Their most relevant initiative is Unified ID 2.0 (UID 2.0) [19], which uses deterministic information such as email address and probabilistic information such as browser/device attributes to create identifiers at the household and individual level. UID 2.0 is unique because of its partnerships with major players and publishers in the digital advertising ecosystem. Notably, ID5 [18] and LiveRamp [16], which specialize in providing alternatives to third-party cookie-based tracking, both collaborate with TTD to integrate with UID 2.0. UID 2.0 works by first collecting hashed email addresses and other deterministic identifiers from users visiting a website, which are then sent to a UID 2.0 operator. The operator matches the hashed email address with the centralized ID graph consisting of information contributed by all UID 2.0 partners. In case of a match, an encrypted user identifier (or token) is sent back to the client side and stored in a first-party cookie. This token is used by TTD’s partners, alongside other deterministic and probabilistic signals, to identify a user through identity graphs as described in Section 3.
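The UID 2.0 flow described in A.1.4 (hash the email, match it at an operator, store the returned token in a first-party cookie) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the UID 2.0 implementation: the simplified normalization (trim and lowercase), the choice of SHA-256 with base64 encoding, the ToyOperator class, the random opaque token standing in for the encrypted identifier, and the cookie name uid_token.

```python
import base64
import hashlib
import secrets


def normalize_email(email: str) -> str:
    """Simplified normalization (assumption): trim whitespace and lowercase."""
    return email.strip().lower()


def hash_email(email: str) -> str:
    """Return the base64-encoded SHA-256 digest of the normalized email."""
    digest = hashlib.sha256(normalize_email(email).encode("utf-8")).digest()
    return base64.b64encode(digest).decode("ascii")


class ToyOperator:
    """Hypothetical stand-in for an operator that holds a central ID graph."""

    def __init__(self, known_emails):
        # The "ID graph" is reduced to a map from hashed email to a token.
        # A random opaque string stands in for the encrypted identifier.
        self._graph = {
            hash_email(e): secrets.token_urlsafe(16) for e in known_emails
        }

    def match(self, hashed_email: str):
        """Return the token on a match, otherwise None."""
        return self._graph.get(hashed_email)


def build_set_cookie(token: str, name: str = "uid_token") -> str:
    """Format a first-party Set-Cookie header value (cookie name is made up)."""
    return f"{name}={token}; Path=/; Secure; SameSite=Lax"


# Client side: only the hash of the email leaves the page.
operator = ToyOperator(["alice@example.com"])
token = operator.match(hash_email("  Alice@Example.com "))
header = build_set_cookie(token) if token is not None else None

# An email the operator has never seen produces no token and no cookie.
missing = operator.match(hash_email("bob@example.com"))
```

Because the identifier is derived from the email rather than from browser state, the same user yields the same lookup key on every site where they log in, which is what lets a first-party cookie carry a cross-site identifier.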
