
Leaky Forms: A Study of Email and Password Exfiltration Before Form Submission

Asuman Senol, imec-COSIC, KU Leuven; Gunes Acar, Radboud University; Mathias Humbert, University of Lausanne; Frederik Zuiderveen Borgesius, Radboud University

https://www.usenix.org/conference/usenixsecurity22/presentation/senol

This paper is included in the Proceedings of the 31st USENIX Security Symposium.
August 10–12, 2022 • Boston, MA, USA
978-1-939133-31-1

Open access to the Proceedings of the 31st USENIX Security Symposium is sponsored by USENIX.

Abstract

Web users enter their email addresses into online forms for a variety of reasons, including signing in or signing up for a service or subscribing to a newsletter. While enabling such functionality, email addresses typed into forms can also be collected by third-party scripts even when users change their minds and leave the site without submitting the form. Email addresses (or identifiers derived from them) are known to be used by data brokers and advertisers for cross-site, cross-platform, and persistent identification of potentially unsuspecting individuals. In order to find out whether access to online forms is misused by online trackers, we present a measurement of email and password collection that occurs before the form submission on the top 100,000 websites. We evaluate the effect of user location, browser configuration, and interaction with consent dialogs by comparing results across two vantage points (EU/US), two browser configurations (desktop/mobile), and three consent modes. Our crawler finds and fills email and password fields, monitors the network traffic for leaks, and intercepts script access to filled input fields. Our analyses show that users’ email addresses are exfiltrated to tracking, marketing and analytics domains before form submission and without giving consent on 1,844 websites in the EU crawl and 2,950 websites in the US crawl. While the majority of email addresses are sent to known tracking domains, we further identify 41 tracker domains that are not listed by any of the popular blocklists. Furthermore, we find incidental password collection on 52 websites by third-party session replay scripts.

1 Introduction

Websites commonly use third-party advertising and marketing services to monetize their content. Those services heavily depend on monitoring users’ online activities, at times without their knowledge and consent. Stateful tracking mechanisms such as cookies are isolated by origins, and limited to the web platform. As users’ online activities are spread over a number of connected devices, tracking users only on websites does not suffice to get a complete view of their profile. The demand for an alternative mechanism to track users across websites and devices has also increased since major browser vendors such as Safari and Firefox have started blocking or partitioning third-party cookies and trackers.

Email addresses are ideal identifiers to fill this gap, since they are unique, persistent, and can even be available in the offline realm, e.g., when a user signs up for a loyalty card. Compared to other personal information such as name or postal address, email addresses are more effective for tracking users across platforms, since they are long-term, unique, and available on many websites and applications to facilitate account login, registration, and newsletter subscriptions. Data brokers and advertisers already use email hashes to identify users, track them across devices, and match their online and offline activities [7, 25, 35].

The demand for a more global and persistent identifier, along with the ongoing phase-out of third-party cookies, makes email addresses typed into online forms an attractive target for collection by trackers. However, prior work on the collection of credentials typed into online forms is limited. Besides, the collection of information before form submission has been even less studied. Only a 2017 news article by Surya Mattu and Kashmir Hill reported that a third party called Navistone was collecting personal information from mortgage calculator forms before the user submitted the form [71]. This is despite the high dropout rates among web users (e.g., in signup forms [27, 29]), which shows that many users indeed leave websites without submitting the form they started filling out. For instance, a survey by The Manifest found that 81% of the 502 respondents have abandoned forms at least once, and 59% abandoned a form in the last month [38].

In this paper, we investigate to what extent third-party trackers collect email addresses, and (incidentally) passwords, even if the user does not submit any form. Unlike prior work, we focus on leaks that occur before form submission, and we analyze the effect of location, of user consent to personal data processing, and of mobile vs. desktop browsing.



In addition, we evaluate the effect of users’ location, of user consent to personal data processing, and of mobile vs. desktop browsing. In particular, we run crawls from two vantage points (EU & US), with desktop and mobile-emulated browsers. In addition, we use three different cookie consent settings to investigate the effect of user consent: accept all, reject all, and no action. Our contributions include the following:

• We develop an interactive, instrumented crawler based on DuckDuckGo’s Tracker Radar Collector [34] to measure email and password exfiltration on Tranco top 100K sites. We fit the crawler with a pre-trained machine-learning (ML) classifier that can robustly detect email fields. Our crawler is further able to fill the email and password fields and to intercept script access to filled input fields (Section 3.1).

• Based on a crawl of 2.8 million pages from the top 100K sites, we find that trackers collect email addresses before form submission on thousands of websites in both EU (1,844 websites) and US (2,950 websites) crawls: 60% more exfiltrations when the same sites are visited from the US. We uncover 41 previously unknown tracker domains that exfiltrate email addresses. We develop a proof-of-concept browser add-on that detects sniff and exfiltration attempts on online forms.

• We discuss whether email exfiltrations by trackers are compliant with the GDPR or not (Section 5). Further, we send GDPR requests to a sample of websites and third parties, asking the purpose of their email collection, retention period and further sharing policies (Section 6).

• Finally, we uncover incidental password collection by session replay providers on 52 websites (Section 4.2). Two third-party trackers with a combined presence of five million websites released fixes to address the issue, thanks to our disclosures.

2 Background and Related Work

2.1 Background

Web tracking is the process of collecting information about users’ online activities across websites. The personal information that can be collected or inferred by the trackers may include personal and sensitive information such as sexual orientation, political and religious beliefs. Tracking may be performed for various purposes including analytics, personalization, and building a behavioral profile for marketing and targeted advertisements.

The most traditional way to track users across websites is to store a unique identifier in users’ cookies. However, in the last decade, more intrusive and persistent tracking mechanisms have emerged. Browser fingerprinting [53], evercookies [13] and cookie syncing [76] are such mechanisms that are harder to control and detect than the traditional cookies. As a reaction to these emergent tracking mechanisms, tracking protection countermeasures such as browser extensions and built-in browser defenses were developed. For instance, Safari’s Intelligent Tracking Prevention and Firefox’s Enhanced Tracking Protection can prevent third-party tracking by identifying trackers and blocking cookies that are used for cross-site tracking [11, 87]. The countermeasures against traditional tracking mechanisms made alternatives such as tracking based on personal identifiers or “people-based marketing” [22] even more necessary.

2.2 Related Work

Online tracking. Several studies investigated stateful [67, 80] and stateless [57, 60, 65] tracking techniques and their evolution over time. Taking an offensive approach, other studies proposed new tracking techniques that are difficult to detect such as canvas and GPU fingerprinting [64, 73]. Analyzing IAB Europe’s Transparency and Consent Framework (TCF) cookie banners, Matte et al. found a widespread violation of the GDPR and the ePrivacy Directive; for instance by registering positive consent when the user has not made a choice [70]. Similar to our discussion on GDPR compliance of email exfiltration practices (Section 5), Mayer and Mitchell presented an overview of regulation that applies to online tracking, but their analysis predates modern privacy laws such as the GDPR [72].

Personal information leaks. Lin et al. presented the first comprehensive study of privacy threats emanating from browsers’ auto-fill functionality [68]. While relevant, auto-fill-related abuse is orthogonal to the types of exfiltration we investigate. Acar et al. studied personal data exfiltration by third parties, uncovering inadvertent password leaks by session replay scripts, and third parties that harvest (hashed) email addresses by injecting invisible login forms that trigger browsers’ login managers [41].

Englehardt et al. built a corpus of emails by signing up to mailing lists, and they found that 30% of emails they received leaked the recipient’s email address to one or more third-party servers when viewed in an email client program or web application [56]. Similar to our study, Englehardt et al. also searched and filled email fields, but their method aimed to identify leaks that occur when reading emails, not when typing email addresses on the page.

Starov et al. studied PII leakage on contact pages of the 100,000 most popular sites on the web [83]. They populated contact forms with a name, surname, email address and a sample contact message. Their results showed that, after removing accidental leakage, 6.1% (1,035) of all contact forms leaked PIIs to third parties after form submission. They also found that PIIs were leaked to third parties before submitting the contact form on 13 websites. Unlike Starov et al.’s work, we ran comparative crawls (mobile/desktop, US/EU, consent modes); and our leak detection method did not require three visits. While not directly comparable, we identified substantially more personal information leaks.

Chatzimpyrros et al. [47] and Dao et al. [49] investigated PII leaks on top 200K websites, and on 307 popular shopping websites respectively. Chandramouli et al. measured the prevalence of email header injection vulnerabilities in web forms, which can be used for phishing, spoofing, and other attacks [46]. Other prior work investigated PII leaks on mobile devices [78, 79], or compared tracking on mobile and desktop devices [89].

Our study differs from these works by focusing on email and password exfiltration during the filling of the forms. We run crawls from multiple vantage points, with different consent modes to evaluate their effect on data exfiltration. We compare email and password collection on mobile and desktop crawls. In addition, we use GDPR requests to reach out to first and third parties to ask for the purposes of email address collection.

Web privacy measurement studies. Many researchers developed their own tools to study web tracking techniques in the wild. In 2012, Mayer and Mitchell implemented FourthParty, a Firefox extension that instrumented browser APIs, HTTP traffic and cookies [72]. Using FourthParty, they examined web tracking techniques on more than 500 websites. FPDetective is based on a modified PhantomJS and Chromium and was used to measure browser fingerprinting on the top million pages [42]. Englehardt and Narayanan developed OpenWPM, which consists of an instrumentation extension and automation code that drives a full-fledged Firefox browser [57]. Jueckstock and Kapravelos contributed VisibleV8, a modified V8 JavaScript engine that logs all native JS function calls and property accesses, without the need to add specific instrumentation [63]. Akhavani et al. inspected 33 Google Chrome, 31 Mozilla Firefox, and 33 Opera browser versions released from 2016 to 2020 by using VisibleV8, and showed that different browser versions have identifiable fingerprints [43]. Recently, DuckDuckGo developed Tracker Radar Collector [34], an instrumented Puppeteer-based crawler that is used to detect trackers through large-scale crawls. We chose to build our crawler by extending Tracker Radar Collector for its simplicity and scalability. We explain the details of this process in the following section.

Login security. Jonker et al. presented a framework called Shepherd, which detects login pages using a combination method of searching for login-based URLs, clickable elements and search engine APIs [62]. Shepherd also interacts with the login forms, and analyzes authentication cookies to determine whether the website is vulnerable to session hijacking. Analyzing the use of web authentication mechanisms on 100,000 domains, Van Acker et al. showed that login pages of certain open-source web frameworks and content management systems are vulnerable to several attacks under various adversary models [85]. They evaluated 51,307 login pages from 100K websites against man-in-the-middle attacks, showing that 62.8% of login pages are vulnerable to adversaries with moderate resources. Van Acker et al.’s study also showed that password leaks to third parties are possible on many websites. Unlike these two studies measuring login page vulnerabilities, we measure the actual misuse by trackers on real-world websites.

3 Methods

3.1 Extending Tracker Radar Collector

Tracker Radar Collector (TRC) is a modular, multi-threaded crawler that is tailored for large-scale web measurements. Using Puppeteer under the hood, TRC takes advantage of all the capabilities of the Chrome DevTools Protocol. TRC uses collectors (modules in charge of capturing tracking-related behavior) that capture browser API accesses, cookies and requests. Unlike OpenWPM’s inline instrumentation [63] that wraps functions and objects with getters, TRC uses Chrome DevTools Protocol to set conditional breakpoints that are evaluated when a certain function is called or a property is accessed. When the debugger hits a breakpoint set by TRC, the condition script collects the JavaScript stack trace and other metadata about the property access or function invocation.

In order to detect email and password exfiltration, we extended TRC by adding a collector that finds and fills email and password fields. Besides, we extended TRC’s network instrumentation to capture WebSocket traffic and HTTP POST payloads, in addition to GET requests which are already being intercepted. We also added instrumentation to intercept JavaScript access to input fields, capturing the access time, input value, and attributes of the accessed input element. A high-level overview of our crawler is shown in Figure 1.

Figure 1: Components of our crawler. We integrate Firefox Relay’s Fathom-based email field classifier [10] and Consent-O-Matic [45] to Tracker Radar Collector (TRC), a web privacy measurement crawler developed by DuckDuckGo [34]. TRC is based on Puppeteer, which uses Chrome Devtools Protocol to interact with the underlying browser. We modify TRC to efficiently discover inner pages, and fill email and password fields.

3.2 Discovering Inner Pages

Our crawler starts to search email and password fields on the landing pages. If no field can be found, it tries to follow links to discover fields in the inner pages. To find links that are more likely to yield email and password fields, we use a combined regular expression pattern that we extract from Firefox’s Password Manager module [15]. The pattern contains several translations of words related to “sign in”, “sign up” and “register”. We search for this pattern in the following attributes of a, button, div, span elements: innerText, title, href, placeholder, id, name and className. We limit ourselves to these four elements since they can be used to create links on the page. We prioritize elements that exactly match the regular expression pattern over elements that partially match the pattern. As a final fallback, we search for links (this time only considering a, button elements) according to their page coordinates (i.e., distance from the top left corner). Based on a pilot crawl of 100K websites, we calculated the median X and Y position of the links that led to pages with email or password fields: 1113px and 64.5px, respectively. Note that, since we used a 1440px-wide viewport in the desktop crawls, this point is very close to the viewport’s top right corner, where sign-in/sign-up links are commonly found. This coordinate-based link detection method increased the number of detected email fields by around 10%. Within each link category (exact match, loose match, coordinate-based match), we prioritize 1) a and button links, 2) links that are in the viewport, 3) links that are on top of other elements (computed via Document.elementFromPoint()). We arrived at these prioritization steps by comparing email and password yields using different methods in pilot crawls.

While clicking the links, we keep a record of the URLs we have visited and we skip links to already visited pages. We continue to click these sorted links until we find and fill an email field, or until we have clicked ten links. We choose ten as the maximum number of links to click, since pilot crawls showed diminishing returns after ten links.

3.3 Identifying Email and Password Fields

After clicking each link, we search for email and password fields on the new page and on all of its iframes. We search for iframes since a pilot crawl of top 1K Tranco sites showed that 3% of email fields are found in iframes. For detecting password fields, we search for input fields with type password (i.e., input[type='password']). However, email input fields do not need to have the email type (i.e., input[type='email']). In fact, through pilot crawls we found that many websites, including popular ones such as facebook.com, use text input elements to accommodate login with phone numbers or other username formats. To address this challenge, we integrated into our crawler a pre-trained email field classifier based on Mozilla Fathom [10]. Fathom is a supervised learning framework specialized to detect webpage parts such as popups [14]. We used the Fathom-based email field detector model used in the Firefox Relay add-on [10]. Firefox Relay is a privacy-focused service from Mozilla that offers free email aliases.¹ Using the Fathom-based detector allowed us to identify 76% more email fields than we would detect by simply searching for input fields with type email. This substantial increase may indicate that earlier studies that relied on email input type could have missed a significant number of email fields.

3.4 Filling Email and Password Fields

We use a unique email address on each page by adding the site domain to the email address after a plus (+) character. This allowed us to uniquely attribute received emails to the websites they are collected on. To address potential bot detection measures, we simulate user typing behavior by using randomized intervals for each key press and dwell times, as well as the delay times between each press. After typing into each field, we simulate pressing the ‘Tab’ key to switch to the next form field, while triggering the blur event on the previously filled element.

Englehardt et al. found that the “Show password” feature, which changes the type of the password field from password to text, caused certain session replay scripts to collect the passwords incidentally [54]. To measure such leaks at large, the crawler changes the password fields’ type from password to text before filling the field. This allows us to simulate the effect of browser extensions such as ShowPassword [26], which displays passwords in cleartext. We then run a follow-up crawl without changing the password input type on websites where we identified password leaks. Overall, our password exfiltration measurements aim to identify the incidental collection, rather than malicious password theft.

¹ Coincidentally, Firefox Relay and similar email alias services can be used as countermeasures against email exfiltration we study in this paper.
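
As a rough illustration of the per-site addressing and simulated typing described in Section 3.4, the following Python sketch builds a plus-tagged address, maps a received email back to the site where the address was typed, and draws randomized key timings. The function names, the base address crawler@mail.test, and the delay ranges are assumptions made for this example; the crawler itself is built on Tracker Radar Collector, and its exact parameters are not given in the text.

```python
import random
from typing import List, Optional, Tuple


def tagged_email(base_email: str, site_domain: str) -> str:
    """Insert the site domain after a plus sign in the local part."""
    local, _, mail_domain = base_email.partition("@")
    return f"{local}+{site_domain}@{mail_domain}"


def site_from_recipient(recipient: str) -> Optional[str]:
    """Recover the site domain from the recipient of a received email."""
    local = recipient.partition("@")[0]
    _, plus, tag = local.partition("+")
    return tag if plus else None


def typing_schedule(text: str, rng: random.Random) -> List[Tuple[str, float, float]]:
    """Return (key, dwell_ms, gap_ms) triples with randomized timings.

    The 50-150 ms dwell and 80-300 ms gap ranges are illustrative guesses,
    not values taken from the paper.
    """
    return [(ch, rng.uniform(50, 150), rng.uniform(80, 300)) for ch in text]


address = tagged_email("crawler@mail.test", "example.com")
print(address)                       # crawler+example.com@mail.test
print(site_from_recipient(address))  # example.com
```

Because the tag sits in the local part of the address, any later email sent to it identifies the site where the address was entered, even if the sender is a different party.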

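
The link selection heuristic of Section 3.2 can be sketched in Python as follows. This is a simplified reading, not the crawler's code: LOGIN_PATTERN is an English-only stand-in for the multilingual pattern taken from Firefox's Password Manager, the Link record is hypothetical, only the visible text is matched (the text lists several more attributes), the check for links that sit on top of other elements is omitted, and using the distance to the reported median point as the final tie-breaker is an assumption.

```python
import math
import re
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical English-only stand-in for the multilingual sign-in pattern.
LOGIN_PATTERN = re.compile(r"sign ?in|log ?in|sign ?up|register", re.IGNORECASE)
# Median position of useful links reported in the text (desktop viewport).
MEDIAN_POINT = (1113.0, 64.5)
LINK_TAGS = ("a", "button")


@dataclass
class Link:
    tag: str            # element name: a, button, div or span
    text: str           # visible text of the element
    x: float            # page coordinates of the element
    y: float
    in_viewport: bool = True


def category(link: Link) -> int:
    """0 = exact match, 1 = partial match, 2 = coordinate-based fallback."""
    text = link.text.strip()
    if LOGIN_PATTERN.fullmatch(text):
        return 0
    if LOGIN_PATTERN.search(text):
        return 1
    return 2


def sort_key(link: Link) -> Tuple[int, bool, bool, float]:
    """Lower tuples are clicked first."""
    return (
        category(link),
        link.tag not in LINK_TAGS,   # prefer a and button elements
        not link.in_viewport,        # prefer links inside the viewport
        math.dist((link.x, link.y), MEDIAN_POINT),
    )


def rank_links(links: List[Link], max_links: int = 10) -> List[Link]:
    """Order candidate links and keep at most max_links of them."""
    # The coordinate fallback only considers a and button elements.
    candidates = [l for l in links if category(l) < 2 or l.tag in LINK_TAGS]
    return sorted(candidates, key=sort_key)[:max_links]
```

With this ordering, an anchor whose text is exactly "Sign in" is clicked before a div reading "Sign in to continue", and unlabeled a or button elements are tried last, nearest to the median point first.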


3.5 Interaction with Consent Management Dialogs

After the introduction of the GDPR in 2018, more websites started to show dialogs to get users’ consent for personal data processing. The acceptance or refusal to give consent may have an effect on how the website and the third parties may collect, process and share users’ personal data. While one expects less tracking and data collection when refusing to give consent, prior research showed that in certain cases the opposite may be true: a recent study by Papadogiannakis et al. found that websites are more likely to use sophisticated tracking techniques such as ID syncing and fingerprinting when users reject cookies [77]. Regardless, web privacy studies such as ours should take consent dialog interaction into account since it may affect how websites and third parties behave.

In order to investigate the effect of users’ consent preferences, we integrate Consent-O-Matic [45] into our crawler. Developed by Nouwens et al. to study dark patterns in consent dialogs, Consent-O-Matic is a browser extension that can recognize and interact (e.g., accept or reject cookies) with various Consent Management Provider (CMP) pop-ups [75]. We configure Consent-O-Matic to log detected CMPs, and perform the following interactions with the CMPs:

• accept-all: Allow processing for all purposes.
• reject-all: Disallow processing for all purposes.
• no-action: Continue without interacting with the CMP dialog, if any.

3.6 Measurement Configuration

We measure email and password exfiltration on the top 100,000 Tranco websites [66].² Initially, we used the Tranco domains without any changes, but we encountered DNS errors even on the most popular websites such as windowsupdate.com, the eighth most popular site in Tranco. To address this problem, we matched Tranco domains to URLs listed in the Chrome User Experience Report [1], which contains actual URLs visited by Chrome users. When matching domains to URLs, we pick the URL with the lower rank (more popular) if there are multiple alternatives. This minor change increased the successfully visited websites from 94,427 (EU pilot crawl) to 99,380 (EU final crawl). We used the March 2021 versions of both Tranco and Chrome UX Report lists.

To compare results based on user location, we run two simultaneous crawls from the EU (Frankfurt) and the US (New York City), both using cloud-based servers hosted on Digital Ocean. For each crawl, we use one server with 16 cores and 32GB RAM.

We limit the maximum crawl duration on a site to 180 seconds and maximum page load time to 90 seconds. After detecting a CMP on a website, we wait 6 seconds for the CMP interaction (accept or reject) to complete. We determined these timeouts and other crawl parameters based on data from 1K pilot crawls. For instance, we measured how long the CMP operations take and set the extra wait time to the 99th percentile of the distribution (6 seconds).

In addition, we run crawls for mobile websites to measure the email and password exfiltration on the mobile web. We emulated a mobile browser by adjusting the viewport dimensions, spoofing touch support, and using a mobile user-agent string. The mobile-specific parameters we used are available in the TRC source code [34]. For mobile crawls, we fill a different email address to distinguish emails we received due to mobile and desktop crawls. We omit experiments with different consent modes for mobile crawls due to limited time and space.

² Available at https://tranco-list.eu/list/6WGX/100000

3.7 Email and Password Leak Detection

Identifying encoded, hashed or obfuscated leaks is a challenge that we need to address to avoid underestimating leaks. This challenge was tackled in different ways in prior work in web privacy measurement studies. Starov et al. compare data from three different crawls to identify PII in HTTP traffic [83]. Since Starov et al.’s method requires more crawls and manual analysis, we prefer Englehardt et al.’s method [56], which involves searching for different encodings and hashes of search terms, including Base64 encoding, and hash functions such as SHA-256. Starting with the email and password we filled, we build a precomputed pool that contains all possible sets of tokens by iteratively applying the hashes and encodings. We then search for the leaks in the referrer header, cookies, URL and POST bodies of the requests, by splitting the contents by potential separator characters, such as ‘=’. We apply all possible decodings and we check whether the decoded result is in the precomputed pool. We repeat this process until we reach a level of three layers of encodings or decodings. We list the hash and encoding algorithms we used in Appendix 10.

We improve upon the original method by Englehardt et al. in several ways. First, in addition to splitting content by separators and decoding the resulting strings, we search for different encodings of the search terms (e.g., email and password values). This enabled us to detect leaks that do not conform to the standard key=value structure. Similar to the precomputed pool mentioned above, we iteratively apply the encodings. Further, we identify two new encodings and one hash method that were not covered by Englehardt et al.’s original detector. The newly discovered encoding methods include a simple substitution cipher that replaces each letter with another based on a fixed mapping. We extract this mapping from a third-party script’s source code and incorporate it into the leak detector. We identified such missed leaks by using the received emails as proof of email collection. We manually analyzed scripts from parties that sent emails but were not detected as collecting email addresses. Using this method, we also found a third party that compresses payloads using lzstring, and another third party that hashes email addresses with a fixed salt, which was hard-coded in their script. Note that using (salted) email hashes may prevent this third party from matching identities with external entities such as data brokers, unless the data broker also uses the same salt for hashing emails.

3.8 Determining Tracker-related Leaks

There may be legitimate reasons why email addresses and, to some extent, passwords are collected before form submission: for instance, checking whether an email/username picked by a user is available before form submission. To avoid counting such cases, we exclude from our analysis all requests that are sent to first-party domains, or third-party domains that are not flagged as trackers. When determining third partyness we make use of Tracker Radar’s entity list [12], which contains a list of domains owned by a company. Using entity-to-domains mapping allows us to better determine the third parties, and prevent overcounting the leaks. In addition, we exclude cases where we filled the email on a page or on an iframe that has a different domain than the crawled website. Note that throughout the study, by domain we mean the registrable domain name or the effective top-level domain plus one (eTLD+1).

Lastly, we only consider requests that are sent to endpoints flagged as a tracker by one of Disconnect [51], Whotracks.me [32], DuckDuckGo [9] blocklists and uBlock Origin [16]. For the Disconnect list, we also consider domains in the “Content” category, which is only blocked if Firefox is in Private Browsing mode. For uBlock Origin, we use the blocklists enabled by default in the add-on. These include EasyList, EasyPrivacy and Peter Lowe’s Ad and tracking server list, among others.

Manual tracker labeling. Additionally, we label the leaky request domains that are not flagged as trackers by any of Disconnect, Whotracks.me, DuckDuckGo and uBlock Origin. For each such domain, we follow a decision algorithm explained in Appendix 10 to determine the tracker status. Thanks to this manual analysis, we uncover 41 tracker domains that are not listed in any of the popular blocklists. Manually labeled domains accounted for an increase of 13.4% and 4.2% in the number of websites with email leaks, in the EU and US crawls, respectively (for no-action, desktop crawls). We plan to share these domains with blocklist providers.

3.9 Dataset

Our main dataset consists of eight crawls, all of which were run in May and June of 2021. A total of six desktop crawls were run from the EU and the US using three consent modes: no-action, accept-all, reject-all. In addition, two mobile crawls were run using the no-action mode from the two locations. In the four no-action crawls (100K websites), we flag the websites where we detected the presence of a CMP (without interacting with it) using Consent-O-Matic. We then use these CMP-detected websites in the accept-all and reject-all crawls. For comparability we use the same 7,720 CMP-detected websites in the accept-all and reject-all crawls on both locations; the 7,720 websites were detected in the EU crawl. While we limit our crawls to the top 100K websites, our dataset contains approximately 2.8M page visits across all crawls considering the inner pages visited when searching for email and password fields. In addition to the HTTP request and response details, our dataset also contains HTML sources, JavaScript instrumentation logs, and screenshots that can be used to debug the crawler. Each 100K website crawl took five days to run. The ethics considerations we took into account during the study can be found in Section 9.

                                    EU                                          US
Crawl Option                        no-action  accept-all  reject-all  mobile   no-action  accept-all  reject-all  mobile
Crawled URLs                        100K       7,720       7,720       100K     100K       7,720       7,720       100K
Successfully loaded websites        99,380     7,716       7,716       99,363   99,437     7,714       7,716       99,409
Crawled pages                       625,143    44,752      40,385      597,791  690,394    51,735      49,260      668,848
Websites where we filled email      52,055     5,076       5,115       47,825   53,038     5,071       5,077       49,615
Websites where we filled password   31,002     2,306       2,342       29,422   31,324     2,263       2,283       30,356

Table 1: Desktop crawl statistics based on servers located in the EU and the US. no-action, accept-all, reject-all indicate consent modes. Crawled pages also include inner pages that we visited.

4 Measurement Results

Results in this section are based on desktop crawls and no-action mode (no interaction with the cookie dialog) unless otherwise specified.

4.1 Email Leaks

Prevalence of leaks. Table 3 shows that email addresses (or their hashes) are sent to a third-party tracker on 1,844 (EU) vs. 2,950 (US) distinct websites. This shows that, on more than a thousand websites, trackers only collect emails when the website is visited from the US.

Table 2 gives a more detailed overview of the most common trackers that emails are leaked to. Prom. stands for promi-



EU crawl
Leak type  Entity name      Tracker domain        Key by key  Num. sites  Prom.   Min. rank
email      Taboola          taboola.com           No          327         302.9   154
email      Adobe            bizible.com           Yes         160         173.0   242
email      FullStory        fullstory.com         Yes         182         75.6    1,311
email      Awin Inc.        zenaps.com*           No          113         48.7    2,043
email      Awin Inc.        awin1.com*            No          112         48.5    2,043
email      Yandex           yandex.com            Yes         121         41.9    1,688
email      AdRoll           adroll.com            No          117         39.6    3,753
email      Glassbox         glassboxdigital.io*   Yes         6           31.9    328
email      Listrak          listrakbi.com         Yes         91          24.9    2,219
email      Oracle           bronto.com            Yes         90          24.6    2,332
email      LiveRamp         rlcdn.com             No          11          20.0    567
email      SaleCycle        salecycle.com         Yes         35          17.5    2,577
email      Automattic       gravatar.com*         Yes         38          16.7    2,048
email      Facebook         facebook.com          Yes         21          14.8    1,153
email      Salesforce       pardot.com*           Yes         36          30.8    2,675
email      Oktopost         okt.to*               Yes         31          11.4    6,589
pswd       Yandex           yandex.com            Yes         37          12.12   4,699
pswd       Yandex           yandex.ru             Yes         7           2.41    12,989
pswd       Mixpanel         mixpanel.com          Yes         1           0.12    84,547
pswd       LogRocket        lr-ingest.io          Yes         1           0.12    82,766

US crawl
Leak type  Entity name      Tracker domain        Key by key  Num. sites  Prom.   Min. rank
email      LiveRamp         rlcdn.com             No          524         553.8   217
email      Taboola          taboola.com           No          383         499.0   95
email      Bounce Exchange  bouncex.net           No          189         224.7   191
email      Adobe            bizible.com           Yes         191         212.0   242
email      Awin             zenaps.com*           No          119         111.2   196
email      Awin             awin1.com*            No          118         110.9   196
email      FullStory        fullstory.com         Yes         230         105.6   1,311
email      Listrak          listrakbi.com         Yes         226         66.0    1,403
email      LiveRamp         pippio.com            No          138         65.1    567
email      SmarterHQ        smarterhq.io*         Yes         32          63.8    556
email      Verizon Media    yahoo.com*            Yes         255         62.3    4,281
email      AdRoll           adroll.com            No          122         48.6    2,343
email      Yandex           yandex.ru             Yes         141         48.1    1,648
email      Criteo SA        criteo.com*           No          134         46.0    1,403
email      Neustar          agkn.com*             No          133         45.9    1,403
email      Oracle           addthis.com           No          133         45.9    1,403
pswd       Yandex           yandex.ru             Yes         45          17.23   1,688
pswd       Mixpanel         mixpanel.com          Yes         1           0.12    84,547
pswd       LogRocket        lr-ingest.io          Yes         1           0.12    82,766

Table 2: Top tracker domains and associated entities that emails or passwords are exfiltrated to in desktop crawls using the no-action mode which was conducted in May ’21. *: Third-party domain is not among the request initiators; that means the leak could have been triggered by another party. Prominence (Prom.) values have been multiplied by 1,000 for readability.
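
The Prom. column of Table 2 can be illustrated with a short sketch. It assumes the definition of prominence by Englehardt and Narayanan [57], where a third party's score is the sum of 1/rank over the distinct sites it appears on, and it applies the factor of 1,000 mentioned in the caption. The site names, ranks, tracker domains and the `prominence` helper are hypothetical and are not taken from the crawl data.

```python
from collections import defaultdict


def prominence(observations, scale=1000):
    """Sum scale/rank over the distinct sites on which each tracker was seen.

    observations: iterable of (site, site_rank, tracker_domain) tuples.
    A (site, tracker) pair is counted once, even if it was observed repeatedly.
    """
    seen = set()
    scores = defaultdict(float)
    for site, rank, tracker in observations:
        if (site, tracker) in seen:
            continue
        seen.add((site, tracker))
        scores[tracker] += scale / rank
    return dict(scores)


# Hypothetical observations: (site, rank of the site, tracker receiving the leak)
observations = [
    ("news.example", 100, "tracker-a.example"),
    ("shop.example", 200, "tracker-a.example"),
    ("shop.example", 200, "tracker-a.example"),  # duplicate, ignored
    ("blog.example", 50000, "tracker-b.example"),
]

scores = prominence(observations)
# Sort trackers from most to least prominent, as done for the rows of Table 2.
ranking = sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Under this definition a tracker embedded on a few highly ranked sites can outrank one that appears on many unpopular sites, which is why the table is not ordered by the number of sites.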

nence, a metric developed by Englehardt and Narayanan [57] which captures both the quantity and popularity of websites a third party is embedded on. We use prominence to sort third parties in Table 2 because it better represents the scale of a given third party’s reach.

                EU                                   US
           All     Third party  Tracking related    All     Third party  Tracking related
Email      4,395   2,633        1,844               5,518   3,790        2,950
Password   89      87           48                  92      87           49

Table 3: The number of distinct websites where email and passwords are sent to first-party domains vs. third-party domains in desktop crawls using the no-action mode.

In the US crawl, rlcdn.com (LiveRamp, formerly Acxiom) is the most prominent tracker domain that collects hashed email addresses. On the WebMD and Fox News websites, LiveRamp collected the MD5, SHA-1 and SHA-256 hashes of the email address typed into the login form. The EU list, on the other hand, is dominated by Taboola, an advertising company that was found to promote clickbait and other problematic content and ads [81, 90]. According to their help pages, Taboola accepts hashed emails to create target audiences [37] based on over 1.4 billion unique visitors they reach every month [30].

Cross-domain email sharing for identity matching. On 565 of the 1,844 distinct websites (EU) where we identified email leaks to tracker domains, no script from the request domain was among the request’s initiators. This means that these requests are initiated by other parties. Analyzing HTTP request initiators and JavaScript stack traces of accesses to input fields, we found that email leaks to yahoo.com, criteo.com and dotomi.com are always initiated by other parties. The email hashes to yahoo.com, for example, are sent by a script from adthrive.com (CafeMedia), a digital publishing and ad monetization network that Yahoo has a partnership with [82]. The Yahoo endpoint (ups.analytics.yahoo.com) that email hashes are sent to is described in Yahoo’s ConnectID API documentation [31]. The documentation mentions that the API can be used for ID matching and is built on Verizon Media’s ID Graph, “delivering a higher find rate of audiences on publishers’ sites [sic] user targeting”. Clickagy(.com), on the other hand, sends email hashes to up to seven other tracker domains including agkn.com (Neustar) and pippio.com (LiveRamp), both of which accept hashed emails for various services according to their public documentation and privacy policies [23, 36].

Our findings showed that email addresses or their hashes are sent to facebook.com on 21 distinct websites in the EU. On 17 of these, Facebook Pixel’s Automatic Advanced Match-



EU crawl
Rank    Website               Third-party          Hash/encoding/compression
154     usatoday.com*         taboola.com          Hash (SHA-256)
242     trello.com*           bizible.com          Encoded (URL)
243     independent.co.uk*    taboola.com          Hash (SHA-256)
300     shopify.com           bizible.com          Encoded (URL)
328     marriott.com          glassboxdigital.io   Encoded (BASE-64)
567     newsweek.com*         rlcdn.com            Hash (MD5, SHA-1, SHA-256)
705     prezi.com*            taboola.com          Hash (SHA-256)
754     branch.io*            bizible.com          Encoded (URL)
1,153   prothomalo.com        facebook.com         Hash (SHA-256)
1,311   codecademy.com        fullstory.com        Unencoded
1,543   azcentral.com*        taboola.com          Hash (SHA-256)

US crawl
Rank    Website               Third-party          Hash/encoding/compression
95      issuu.com             taboola.com          Hash (SHA-256)
128     businessinsider.com   taboola.com          Hash (SHA-256)
154     usatoday.com          taboola.com          Hash (SHA-256)
191     time.com              bouncex.net          Compression (LZW)
196     udemy.com             awin1.com            Hash (SHA-256 with salt)
196     udemy.com             zenaps.com           Hash (SHA-256 with salt)
217     healthline.com        rlcdn.com            Hash (MD5, SHA-1, SHA-256)
234     foxnews.com           rlcdn.com            Hash (MD5, SHA-1, SHA-256)
242     trello.com*           bizible.com          Encoded (URL)
278     theverge.com          rlcdn.com            Hash (MD5, SHA-1, SHA-256)
288     webmd.com             rlcdn.com            Hash (MD5, SHA-1, SHA-256)

Table 4: Top ten websites where the filled email was collected by a tracker before form submission in desktop crawls using the no-action mode. *: Not reproducible anymore as of February 2022.
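
To make the labels in the last column of Table 4 concrete, the sketch below derives a few single-layer representations of an email address (unencoded, URL-encoded, Base64, MD5, SHA-1, SHA-256) and checks which of them occur in a request payload. It is only an illustration under simplifying assumptions: it does not handle salted hashes, LZW compression or layered encodings, and the helper names, the example address and the tracker URL are hypothetical rather than taken from the crawler.

```python
import base64
import hashlib
import urllib.parse


def email_variants(email):
    """Return single-layer encodings and hashes of an email address."""
    raw = email.encode("utf-8")
    return {
        "unencoded": email,
        "url": urllib.parse.quote(email, safe=""),
        "base64": base64.b64encode(raw).decode("ascii"),
        "md5": hashlib.md5(raw).hexdigest(),
        "sha1": hashlib.sha1(raw).hexdigest(),
        "sha256": hashlib.sha256(raw).hexdigest(),
    }


def find_leaks(payload, email):
    """List the variant names whose value appears verbatim in the payload."""
    return sorted(
        name for name, value in email_variants(email).items() if value in payload
    )


# Hypothetical request that carries the SHA-256 hash of the filled address.
filled = "user@example.com"
request_url = (
    "https://tracker.example/collect?eh=" + hashlib.sha256(filled.encode()).hexdigest()
)
detected = find_leaks(request_url, filled)
```

A plain substring search like this only finds the simplest cases; salted hashes such as those listed for awin1.com and zenaps.com would not match without knowing the salt.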

ing feature [21] was responsible for sending the SHA-256 hash of the email address in a SubscribedButtonClick event, despite not clicking any submit button. According to its documentation, Automatic Advanced Matching captures hashed customer data, including email addresses, phone numbers, and first and last names, from checkout, sign-in and registration forms. We believe the leaks are due to Facebook’s script interpreting clicks on irrelevant buttons as “submit button clicked” events.

Website categories. In order to compare email exfiltration across website categories, we query McAfee’s categorization service [6]. Note that a website may have multiple categories. As shown in Table 5, Fashion/Beauty and Online Shopping are the two categories where we detect the most email exfiltrations, considering only the categories with more than 1,000 websites in our 100K sample. On the other hand, websites categorized as Public Information, Government/Military, and Games leaked less than 1% of the filled email addresses. A somewhat surprising result was the following: despite filling email fields on hundreds of websites categorized as Pornography, we did not observe a single email leak. While surprising, this is in line with limited prior research on tracking on adult websites: a 2016 study by Altaweel et al. found that adult websites have relatively fewer third-party trackers compared to non-adult websites with comparable popularity [44].

Effect of website popularity. The number of websites with email leaks follows a close-to-uniform distribution in the US crawl. On the other hand, in the EU crawl, there are substantially fewer sites with email leaks in the Tranco top 5K: only 1.28% of the sites in the top 5K have email leaks, compared to an average of 1.87% for websites with rank >5000 (cf. US top 5K: 2.96%, 5K-100K: 2.95%). Popular websites and trackers may be using questionable data collection methods sparingly in the EU to avoid GDPR fines or investigations.

Top websites with leaks. Table 4 shows the top ten websites with email leaks for each vantage point. We list the third-party tracker found to collect emails on these sites, along with the hashing/encoding method used when exfiltrating the email. News websites such as usatoday.com, foxnews.com and independent.co.uk appear high on the lists. This is in line with prior work which found that news websites contain the highest number of third parties compared to other website categories [57]. Medical news and information websites webmd.com and healthline.com are other notable entries for their sensitive content.

Emails sent key by key. As shown in Table 2, certain third parties send email addresses character by character, as the user types in their address. This behavior appears to be due to session replay scripts that collect users’ interactions with the page, including key presses and mouse movements [41].

HTTP and WebSocket usage. Finally, we observed that the leaked emails are almost always sent over encrypted (HTTPS) connections. We only found 15 and 14 websites where emails are leaked over HTTP in the EU and the US, respectively. In addition, on 67 websites in the EU and on 132 websites in the US, the leaks were sent over the WebSocket protocol to hotjar.com, freshrelevance.com, noibu.com and decibelinsight.net.

4.2 Password Leaks

Recall that we change the type of password elements to text before filling them. To better understand why passwords are collected, we manually analyzed a sample of websites, including leaks to non-tracker third parties. We found that, in some cases, passwords were sent to third parties for checking the password strength. However, we have not found such a use case in leaks to trackers. We found most cases we analyzed to be due to incidental collection by session recording scripts, most prominently by Yandex Metrica.

Password collection without input type swapping. Since our primary findings are based on changing the type of the password field, they only apply to a limited number of users or websites. In order to better characterize password leaks at large, we ran follow-up crawls of the websites where we detected a password leak, but this time we did not change the input type from password to text. We ran two such crawls, one from the EU and one from the US, both desktop crawls. Un-



less otherwise specified, password leaks presented throughout this paper are based on these latter crawls, without input type swapping. We found that passwords are collected by trackers on 52 distinct websites even for users who do not use ShowPassword or similar extensions. An overwhelming majority (50/52) of these leaks were due to Yandex Metrica’s session recording feature. However, a manual analysis of Yandex Metrica’s code showed that it has filters to exclude password fields from the collection. Comparing websites where Yandex collects passwords to websites where it does not, we found that almost all leaky websites were built using the React framework. Note that 7 of the 52 affected websites are in the Tranco top 20K, and some of them are major banks and other highly visible websites such as toyota.ru. We have already reported this problem to Yandex, and reached out to the affected first parties as explained in Section 6.

                     EU/US     EU                            US
Categories           Sites     Filled sites   Leaky sites    Filled sites   Leaky sites
Fashion/Beauty       1,669     1,176          131 (11.1%)    1,179          224 (19.0%)
Online Shopping      5,395     3,658          345 (9.4%)     3,744          567 (15.1%)
General News         7,390     3,579          235 (6.6%)     3,848          392 (10.2%)
Software/Hardware    4,933     2,834          138 (4.9%)     2,855          162 (5.7%)
Business             13,462    7,805          377 (4.8%)     7,924          484 (6.1%)
...                  ...       ...            ...            ...            ...
Games                2,173     925            9 (1.0%)       896            11 (1.2%)
Public Information   2,346     1,049          8 (0.8%)       1,084          27 (2.5%)
Gov’t/Military       3,754     939            5 (0.5%)       974            7 (0.7%)
Uncategorized        1,616     636            3 (0.5%)       646            2 (0.3%)
Pornography          1,388     528            0 (0.0%)       645            0 (0.0%)

Table 5: Per-category number of websites we crawled, filled an email field, and observed an email leak to a tracker domain (based on desktop crawls using the no-action mode). The percentage under the Leaky sites column is based on total websites where we could fill an email field (i.e. 100 * Num. of leaky sites / Num. of filled sites).

4.3 Vantage Points: EU vs. US

In this section, we compare the results from our two crawl vantage points: the EU (Germany) and the US (NYC). The differences in privacy regulations are the main motivation behind this comparison. In the US crawl, the number of websites with email leaks is 60% higher than that of the EU: 2,950 vs. 1,844.

Comparing the websites where we detected an email leak, we find that the 2,950 websites identified in the US crawl are roughly a superset of the 1,844 websites identified in the EU crawl: 94.4% of the 1,844 websites detected in the EU crawl also appear in the list of websites in the US crawl.

Tracker domains such as addthis.com, yahoo.com, doubleclick.net and criteo.com only seem to receive email addresses in the US crawls, perhaps due to stricter data protection regulations in the EU. In addition, the most prominent email collecting tracker across both crawls (rlcdn.com, LiveRamp) is not even among the top ten trackers in the EU in Table 2.3 In certain cases, the same tracking script is served with different content based on the vantage point. For instance, securedvisit.com, the tracker that uses a substitution cipher to encrypt its payload (Section 3.7), serves a slightly different script to EU visitors that disables email collection.

Overall, our results appear to indicate that certain third parties avoid collecting EU visitors’ email addresses. In Section 5, we provide a legal analysis of whether the practice of collecting emails before form submission complies with the GDPR.

4.4 The Effect of Consent

Recall that we found consent popups only on 7,720 (7.7%) sites in the EU and 5,391 (5.4%) sites in the US (of 100K sites). Crawling these websites with three consent modes, we obtain the results in Table 6, which shows the number of websites where we detect CMPs and email leaks to trackers. When we reject all data processing, the number of sites with leaks to trackers decreases by 13% in the US (228 to 199) and by 0.5% in the EU (202 to 201), relative to the no-action mode. The reduction in leaks in both cases is limited, confirming Papadogiannakis et al.’s conclusion that cookie consent choices are not effective in preventing tracking [77]. Almost no reduction in the EU leaks, however, may be counter-intuitive. This is likely due to the limited number of websites where we could detect CMPs and observe leaks.

Consent modes   EU    US
accept-all      239   242
reject-all      201   199
no-action       202   228

Table 6: The number of distinct websites where emails were leaked and a CMP was detected in desktop crawls using the no-action mode.

4.5 Mobile

We detected leaks on 1,745 and 2,744 distinct mobile websites in the EU and US crawls, respectively (Table 7). Although the number of sites with leaks is lower compared to desktop crawls, the ratio of the sites with leaks to the sites where we could fill email is nearly the same in both vantage points.

The mobile and desktop websites where emails are leaked to tracker domains overlap substantially but not completely. The Jaccard similarity of (leaky) desktop and mobile websites is equal to 66% in the EU and 64% in the US. The difference between the desktop and mobile results could be due to web-

3 In fact, LiveRamp sent a 451 HTTP error code (Unavailable For Legal Reasons) in responses to requests made in the EU crawl.
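
The two set comparisons used in Sections 4.3 and 4.5 (the share of EU leaky sites that also leak in the US crawl, and the Jaccard similarity between leaky desktop and mobile sites) can be sketched as follows. The site names are hypothetical and the helper names are our own; the sketch only illustrates the arithmetic, not the actual crawl data.

```python
def jaccard(a, b):
    """Jaccard similarity: size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def share_also_in(a, b):
    """Fraction of the elements of a that also appear in b."""
    a, b = set(a), set(b)
    return len(a & b) / len(a) if a else 0.0


# Hypothetical sets of websites with email leaks
desktop_leaky = {"a.example", "b.example", "c.example", "d.example"}
mobile_leaky = {"c.example", "d.example", "e.example"}

similarity = jaccard(desktop_leaky, mobile_leaky)               # 2 shared / 5 total
mobile_in_desktop = share_also_in(mobile_leaky, desktop_leaky)  # 2 of 3 mobile sites
```

The Jaccard similarity is symmetric, whereas the share computed by `share_also_in` depends on which set is taken as the reference, which is why a near-superset relation (as between the US and EU lists) is better described by the latter.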



site dynamism and the time difference between the mobile and desktop crawls (more than a month).

We also found 18 tracker domains that only received email leaks on mobile crawls, such as yieldify.com, td3x.com and getdrip.com. However, checking the websites associated with these domains did not suggest that they are only targeting mobile web visitors. Further, we found 24 domains that only appear in desktop crawls, further indicating that the difference could be due to factors such as time difference and website dynamism.

          EU: Leaky / Filled sites    US: Leaky / Filled sites
Desktop   1,844 / 60,008 (3.0%)       2,950 / 60,999 (4.8%)
Mobile    1,745 / 55,738 (3.1%)       2,744 / 57,715 (4.8%)

Table 7: The number of sites leaking emails or passwords to trackers, compared to the number of sites where we could fill an email address in desktop and mobile crawls using the no-action mode.

4.6 Emails Received on the Filled Addresses

Since our crawler fills a distinct email address for each website, we are able to attribute the received emails to distinct websites.4 In the six-week period following the crawls, we received 290 emails from 88 distinct sites on the email addresses used in the desktop crawls, despite not submitting any form. Most emails offer a discount, or just invite us back to their site. The sender websites seem to vary by topic and theme. Most notable examples include diabetes.org.uk, mypillow.com, and walmart.com.mx. On the mobile crawl email addresses, we received 187 emails from 71 distinct websites in the four-week period following the crawls (mobile crawls were run two weeks after the desktop crawls).

5 Does Email Exfiltration Comply With the GDPR?

In this section, we discuss how email exfiltration can breach at least three core rules of the General Data Protection Regulation (GDPR) [48]. Roughly speaking, the GDPR could be seen as a Europe-wide data privacy law. Because of length constraints, we focus on three main principles of the GDPR, omitting greater detail.

We discuss email exfiltration in general. We do not discuss to what extent specific companies comply with the GDPR. For such a company-specific analysis, each example of email exfiltration would have to be assessed separately, considering all the circumstances of that case.

Does the GDPR apply? The GDPR applies when ‘personal data’ are processed. Personal data are defined broadly in the GDPR. Essentially, any information that relates to an identifiable person is personal data (Article 4(1)). For instance, an email address, an IP address, a tracking cookie, an identification number, and an ‘online identifier’ are almost always personal data. But even hashed or encrypted email addresses are generally personal data, as far as they contain a unique identifier that can be linked to a person [4]. Moreover, hashed email addresses can often be reversed [40]. ‘Processing’ is defined broadly too in the GDPR: virtually everything that can be done with personal data is a type of processing (Article 4(2)). Hence, if website owners or third parties exfiltrate an email address, they process personal data and the GDPR applies.

An organization that processes personal data is a ‘controller’ in GDPR parlance. The ‘controller’ is responsible for complying with the GDPR, and can be fined for non-compliance. In the case of email exfiltration, the website owner and the third party are typically both responsible (as ‘joint controllers’) [33, 69].

Is the GDPR relevant for companies outside Europe? The territorial scope of the GDPR is complicated, but can be summarized as follows (Article 3 GDPR). If the controller is based in the EU, the GDPR applies. But the GDPR can also apply to controllers based outside the EU. For instance, offering goods or services to Europeans can trigger the GDPR. If a website owner sells something and allows payment in Euros, and processes the personal data of website visitors, the owner must comply with the GDPR. The GDPR also applies to controllers based outside the EU if they ‘monitor’ the behavior of people in the EU. Tracking people online is an example of such monitoring [59]. Hence, if a company uses email exfiltration for tracking web users in the EU, it must comply with the GDPR.

Transparency principle. The GDPR has six overarching principles relating to the processing of personal data. The first principle says that personal data must be processed ‘fairly and in a transparent manner’ (Article 5). The controller must provide comprehensive information about what it does with personal data, in an ‘intelligible and easily accessible form, using clear and plain language’ (Article 12). Moreover, the GDPR requires detailed information about, for instance, the processing ‘purposes’, and the ‘recipients of the personal data’ (Article 13 and 14). Controllers can provide such information in a privacy notice.

Does email exfiltration comply with the transparency principle? If the website does not clearly disclose that it or a third party exfiltrates email addresses, the exfiltration breaches the transparency principle. A phrase such as ‘we share your personal data with selected marketing partners’ does not provide sufficient transparency.

4 A caveat to our method is the following: we did not use separate email addresses for the EU and the US crawls, thus we cannot attribute the received emails to visits from specific locations.
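
The attribution idea in Section 4.6 and footnote 4, filling a distinct address on each website and mapping received emails back to the site, can be sketched as follows. The sketch assumes plus-addressing with the site's domain as the tag; this scheme, the mailbox name, the mail domain and all helper names are hypothetical and are not a description of how the crawler generated its addresses.

```python
from collections import Counter


def address_for(site, mailbox="crawler", mail_domain="mail.example"):
    """Build a per-site address such as crawler+shop.example@mail.example."""
    return f"{mailbox}+{site.lower()}@{mail_domain}"


def site_for(recipient):
    """Recover the site tag from a recipient address, or None if it has no tag."""
    local, _, _ = recipient.partition("@")
    _, plus, tag = local.partition("+")
    return tag if plus else None


def attribute(recipients):
    """Count received emails per site, skipping untagged recipients."""
    return Counter(site for site in map(site_for, recipients) if site)


# Hypothetical inbox: the recipient address of each received email
inbox = [
    address_for("shop.example"),
    address_for("shop.example"),
    address_for("News.example"),
    "crawler@mail.example",  # untagged, cannot be attributed
]
per_site = attribute(inbox)
```

Adding a second tag for the vantage point would be one way to address the caveat in footnote 4, at the cost of filling different addresses in the EU and US crawls.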



Purpose limitation principle. Does email exfiltration comply with the GDPR’s purpose limitation principle? Roughly summarized, the purpose limitation principle says that controllers can only collect personal data if they specify a clear purpose in advance. And the controller is not allowed to use the data for ‘incompatible’ new purposes (Article 5(1)(b)). Suppose that the first purpose is enabling website visitors to manage their website account. The first purpose will be something like ‘remembering the website visitors’ login credentials so that they can open and maintain an account’. Say that the third party uses the exfiltrated email address for behavioral advertising, email marketing or tracking people around the web. Those purposes are incompatible with the original purpose, and thus prohibited.

The requirement for a legal basis such as consent. Another important GDPR requirement is that the controller always needs a ‘legal basis’ to process personal data (Article 6). There are six possible legal bases, including consent. The requirements for valid consent are strict. For instance, a consent request that is hidden in the small print of a contract or privacy notice cannot lead to valid consent. Further, a controller cannot assume consent if people fail to opt out (Article 4(11)). The GDPR does not always require the person’s consent. However, for online tracking and behavioral advertising, the GDPR does require prior consent [3, 86].

To obtain valid consent to collect website visitors’ email addresses before they click submit, the consent request would have to be specific, such as: ‘Do you agree with us collecting your email address and sharing it with companies A, B, and C for email marketing before you click submit?’. Only if the website visitor clearly agrees to such a request does the visitor give valid consent to email exfiltration. If the request was vague, or if the visitor did not clearly express their choice, the consent is invalid.

In certain situations, email exfiltration might be allowed under the GDPR without the website visitor’s consent. Suppose that a security firm (third party) exfiltrates a website visitor’s email address for an extra security check. Assuming that the security firm complies with all the other GDPR norms, the firm could be allowed to exfiltrate the email address without consent (based on Article 6(1)(f)).

Conclusion. Email exfiltration by third parties can breach at least three GDPR requirements. First, if such exfiltration happens surreptitiously, it violates the transparency principle. Second, if such exfiltration is used for purposes such as behavioral advertising, marketing and online tracking, it also breaches the purpose limitation principle. Third, if the email exfiltration is used for behavioral advertising or online tracking, the GDPR typically requires the website visitor’s prior consent. For breaching any of these three rules, controllers can be fined up to 20,000,000 Euro or up to 4% of their total worldwide annual turnover (Article 83(5)).

6 Security Disclosures, GDPR Requests, and Leak Notifications

Our methods allow us to detect email and password leaks from clients to trackers, but what happens after the leaks reach the third parties’ servers is unknown to us. In order to better understand the server-side processing of collected emails, and to disclose cases of password collection, we have reached out to more than a hundred first and third parties. We used the real identity and university email account of one of the authors when reporting the issues or sending the GDPR requests. Moreover, we made it clear that our inquiries were sent within the context of academic research.

Password collection disclosures. Once again we note that we believe all password leaks to third parties mentioned below are incidental. We reached out to all third parties listed in Table 2. Yandex, the most prominent tracker that collects users’ passwords, quickly responded to our disclosure and rolled out a fix to prevent password collection. We have also notified more than 50 websites where passwords were collected. Since the majority of the websites embedding Yandex were in Russian, we enclosed a Russian translation of our message in the notification email, along with our message in English. Mixpanel released an update only two days after we disclosed the issue. With this change, even the users with outdated SDKs were protected from collecting passwords involuntarily. LogRocket, who collected passwords on publicize.co’s login page, never replied to our repeated contact attempts,5 and the password leak remained on Publicize’s website for more than ten weeks before it was fixed.

GDPR requests on email exfiltration. We reached out to 58 first and 28 third parties with GDPR requests. We avoided sending blanket data access requests to minimize the overhead for the entities who were obliged to respond to our GDPR requests. Instead, we asked specific questions about how the collected emails are processed, retained and shared. In addition, we notified the top 33 websites6 where we detected email exfiltration in the US crawl. We sent a friendly notification to these websites about the email exfiltration, rather than a formal GDPR request. We did not get any response from these 33 websites.

When selecting the first parties to send GDPR requests to, we included the most popular websites from the EU crawl for which we could reproduce the email leaks. We asked the first parties if they were aware of the email collection on their websites, how they used the collected email addresses, and how long they retained them.

5 We have also enlisted the help of a contact at the Electronic Frontier Foundation, who tried calling LogRocket’s phone number and emailed their privacy contact address and their cofounder, all to no avail. Our attempts to disclose the issue via LogRocket’s chatbot have also failed. We have also contacted Publicize, and have not heard back.
6 33 out of the top 50 websites for which we could reproduce the exfiltration.



Responses from first parties: About half of the first parties (30/58) responded to our requests.

• fivethirtyeight.com (via Walt Disney’s DPO), trello.com (Atlassian), lever.co, branch.io and cision.com were among the websites that said they had not been aware of the email collection prior to form submission on their websites and removed the behavior.

• Marriott said that the information collected by Glassbox is used for purposes including customer care, technical support, and fraud prevention.

• Tapad, a cross-device tracking company on whose website we found an email leak, said that they have not offered their services to UK & EEA users since August 2021, and that they have deleted all data that they held from these regions.

• stellamccartney.com explained that the emails on their websites were collected before the submission due to a technical issue, which was fixed upon our disclosure. According to their response, the SaleCycle script that collected email addresses had not been visible to their cookie management tool from OneTrust.

Responses from third parties: Roughly half (15/28) of the third parties responded. Eight third parties, including Adobe, FullStory and Yandex, said they are data processors, and asked us to send our GDPR request to the corresponding first parties.

• Taboola said in certain cases they collect users’ email hashes before form submission for ad and content personalization; they keep email hashes for at most 13 months; and they do not share them with other third parties. Taboola also said they only collect email hashes after getting user consent; however, our findings and subsequent manual verification showed that was not always the case.

• ZoomInfo said their “FormComplete” product appends contact details of users to forms, when the user exists in ZoomInfo’s sales and marketing database. They said the ability to capture form data prior to submission can be enabled or disabled by their clients.

• ActiveProspect said their TrustedForm product is used to certify consumers’ consent to be contacted, for compliance with regulations such as the Telephone Consumer Protection Act in the US. They said data captured from abandoned forms are marked for deletion within 72 hours and are not shared with anyone, including the site owner.

submission. While some collection reportedly occurs due to technical glitches, or (surprisingly) for compliance purposes, other responses point to collection for marketing, analytics and identity matching purposes. In certain cases, companies suggested that the email data are not shared with any third parties, while others have not made the same promise. The limited number of responses we received, along with potential response bias, prevent us from making generalizations. Regardless, we note the benefit of reaching out to the respective parties, despite the substantial logistics overhead. Due to limited space, we could only include a selection of the responses. We plan to publish an overview of the responses as part of our dataset.

7 Countermeasures

In recent years, all major browsers except Google Chrome implemented different forms of protection against online tracking. In 2017, Apple introduced Safari Intelligent Tracking Prevention (ITP), which combines machine learning with a rule-based system that prevents cross-site tracking [87]. Since March 2020, Safari blocks all third-party cookies [88]. Mozilla introduced tracking protection in 2018 by stripping cookies from requests to tracker domains, based on a tracker list compiled by Disconnect [51, 74].

In order to find out whether major browsers with anti-tracking features (namely, Safari and Firefox) block the exfiltrations we uncovered, we manually analyzed ten websites, each containing a distinct tracker that we found to exfiltrate email addresses. We manually filled the email fields on these websites and checked whether the exfiltration occurs by inspecting the HTTP request payloads in the devtools interface. We found that neither Safari nor Firefox blocked email exfiltrations to tracking endpoints in our small sample. This result may be expected since both browsers try to strike a balance between minimizing breakage and curtailing cross-site tracking. To this end, they allow requests to tracker domains, but they strip cookies, partition network state [55], or block access to storage that may facilitate cross-site tracking.

Browser vendors may take further steps to protect against scripts that harvest email addresses for tracking purposes. Browsers may block requests to these trackers, prevent their scripts from accessing form fields, or provide them with fake data, e.g., an empty string, similar to how a zero-filled IDFA is returned on iOS devices unless the user has given their consent [2]. Similar solutions are already used by different vendors: Firefox already blocks requests to third parties that use browser fingerprinting for advertisement, analytics and social network tracking [5]. DuckDuckGo’s browser extension uses JavaScript stack traces to block certain tracker cookies [52]. We believe the scale of unconsented data collection
uncovered in our study justifies a similar countermeasure for
We picked the above responses to reflect the diversity of scripts that harvest email addresses.
reasons for which email addresses are collected prior to form Browser extensions such as uBlock Origin [16], and

1824 31st USENIX Security Symposium USENIX Association


browsers such as Brave [24] block requests to tracker domains, which better protects against email exfiltration than countermeasures built into Firefox and Safari. On mobile, users may opt for browsers that support extensions (e.g., Firefox, Safari), or use a privacy-focused mobile browser that blocks trackers such as Brave [24] and DuckDuckGo [39].

Recently, Mozilla [20], Apple [18], and DuckDuckGo [19] started to offer private email relay services that give users the ability to generate and use pseudonymous (alias) email addresses. These privacy-focused services automatically forward emails received at the alias addresses, and allow users to keep their real email address hidden from untrusted online services.

In their study on data exfiltration from contact forms, Starov et al. developed FormLock, an extension that detects and highlights forms that may leak PII. Further, to prevent PII leakage, FormLock temporarily blocks third-party requests and prevents stashing of PII into various storage mechanisms such as cookies, localStorage and indexedDB [84].

LeakInspector. Since none of the available countermeasures allow inspection of sniff and exfiltration attempts, we developed LeakInspector, a proof-of-concept browser add-on that warns users against sniff attempts and blocks requests containing personal information.

While LeakInspector has similarities to FormLock, it also supports detecting form sniff attempts and more precisely detects and prevents leak attempts to trackers. Further, LeakInspector does not require user intervention, and logs technical details of the detected sniff and leak attempts to the console to enable technical audits. The logged information includes the value and XPath of the sniffed element, the origin of the sniffer script, and details of the leaky request such as URL and POST data. LeakInspector has two main features that users may enable:

Sniffer Detector. When this feature is enabled, LeakInspector detects and optionally prevents sniffing of input fields where users may enter personal information such as name, email and credit card details. We use code extracted from Firefox’s autofill field detection heuristics [17] to detect such input fields.

We overwrite the getter method of the HTMLInputElement prototype to intercept input field sniff attempts. We add an event listener for the input event to all auto-fill fields to keep track of their current values. These input field values are then used to detect leaks in outgoing requests. When a script attempts to read a monitored field’s value, LeakInspector processes the JavaScript stack trace and extracts the script addresses. It then highlights the sniffed input field if there is a third-party script in the stack trace categorized as a tracker by DuckDuckGo’s blocklist [9], which we also use in Section 3.8. When determining third-party scripts, LeakInspector takes into account domain-entity relationships [12].

Leak Detector. LeakInspector intercepts HTTP requests and runs the leak detector algorithm presented in Section 3.7. It detects encoded, hashed, compressed or cleartext leaks from the monitored fields. While LeakInspector currently only uses DuckDuckGo’s blocklist [9], it is possible to extend it to use other blocklists.

LeakInspector also features a user interface where recent sniff and leak attempts are listed, along with the tracker domain, company and tracker category. The user interface module is based on DuckDuckGo’s Privacy Essentials add-on [8]. We believe LeakInspector may help publishers and end-users to inspect third parties that harvest personal information from online forms without their knowledge and consent.

8 Limitations

Through an iterative design process, pilot crawls and extensive sanity checking, we built our crawler and analysis processes to be robust and scalable. Where possible, we set the parameters of the crawler, such as timeout duration, based on data from pilot crawls. However, certain limitations apply to our data collection and analysis methods.

Leak detection. While we search for an extensive set of encodings and hashes, and we substantially improved the leak detector module we inherited from the prior work, our leak detection method may still miss leaks that are custom encoded, encrypted, or compressed. Future work may improve leak detection by applying methods such as multi-stage filtering [61] and JavaScript information flow tracking [58].

Shadow DOM and crawl depth. During our pilot crawls we found that we cannot detect email and password fields if they are in the Shadow DOM [28] of other elements. Since we only found two such cases in a pilot crawl of 1K websites, we believe this is an acceptable limitation. Further, our crawler is limited to crawls of one-click depth for simplicity. Input fields that can only be discovered through multiple subsequent clicks may be missed by our crawler. These limitations make our results likely lower bounds.

Blocklists. We use a combination of blocklists from different providers to flag domains as trackers. These lists vary by quality and compilation method (e.g., crowdsourced vs. maintained by a company such as Disconnect). Further, we flag domains as trackers even if they are present in only one of these lists. As such, our results may have both false positives and false negatives due to imperfections in those blocklists.

Domain aliases. Although we only consider leaks to third-party tracker domains, we also analyzed a sample of exfiltration to first-party domains. The use cases we identified included email address verification and self-hosted analytics services. Future work could investigate exfiltrations to CNAME-based trackers that appear as first parties [50].

Bypassing cookie consent banners. During the manual labeling process, we encountered modal GDPR consent dialogs that disallow proceeding without giving or refusing consent. A real user would have to accept or reject data



processing to interact with the page; but our web crawler could have bypassed the consent dialog, depending on how it is implemented. On a random sample of 1,000 websites, we detected 168 modal consent dialogs.

Anti-bot measures. Finally, our crawler might have been served CAPTCHA pages, or treated differently due to crawling from cloud IP addresses. During a 1K website pilot crawl, we identified only three CloudFlare CAPTCHA pages that blocked our crawler.

9 Ethics Considerations

Data collection: When crawling, we took adequate measures to avoid overloading the websites. For instance, we avoided making concurrent visits to the same website.

Disclosures: We reported password leaks both to trackers and to the websites where we detected a password leak. In our emails, we provided technical details and reproduction instructions so that it is easier for the parties to reproduce and address the issue we reported. To the third parties, we sent the list of websites where they caused a password leak. To avoid any misunderstanding, we made it clear to all parties that we did not collect any visitors’ email or password during our study. We did not send GDPR requests to trackers that incidentally collected passwords.

10 Conclusion

We presented a large-scale study of email and password exfiltration by online trackers before form submission. In order to address the challenges of finding and filling input fields, we integrated into our crawler a pre-trained ML classifier that detects email fields. Our results, which are likely lower bounds, show that on thousands of sites email addresses are collected from login, registration and newsletter subscription forms, and sent to trackers before users submit any form or give their consent. Further, we found tens of sites where passwords are incidentally collected by third parties providing session replay services. Comparing results from the EU and the US vantage points, we found that 60% more websites leaked users’ emails to trackers when visited from the US. Measuring the effect of consent choices on the exfiltration, we found their effect to be minimal. Based on our findings, users should assume that the personal information they enter into web forms may be collected by trackers, even if the form is never submitted. Considering its scale, intrusiveness and unintended side-effects, the privacy problem we investigate deserves more attention from browser vendors, privacy tool developers, and data protection agencies.

Code and Data

The source code and the dataset from our study are publicly available at https://github.com/leaky-forms.

Acknowledgments

We thank Alexei Miagkov, Arvind Narayanan, Bart Jacobs, Claudia Diaz, David Roefs, Dorine Gebbink, Galina Bulbul, Gwendal Le Grand, Konrad Dzwinel, Pete Snyder, Sergey Galich, Steve Englehardt, Vincent Toubiana, our shepherd Alexandros Kapravelos, SecWeb and USENIX Security reviewers for their valuable comments and contributions. The idea for measuring email exfiltration before form submission was initially developed with Steve Englehardt and Arvind Narayanan during an earlier study [41]. Asuman Senol was funded by the Cyber-Defence (CYD) Campus of armasuisse Science and Technology. Gunes Acar was initially supported by a postdoctoral fellowship from the Research Foundation Flanders (FWO). The study was supported by CyberSecurity Research Flanders with reference number VR20192203.

References

[1] Adding Rank Magnitude to the CrUX Report in BigQuery. https://developers.google.com/web/updates/2021/03/crux-rank-magnitude.

[2] advertisingIdentifier | Apple Developer Documentation. https://developer.apple.com/documentation/adsupport/asidentifiermanager/1614151-advertisingidentifier.

[3] Article 29 Working Party, ‘Opinion 03/2013 on purpose limitation’ (WP 203), 2 April 2013. https://ec.europa.eu/justice/article-29/documentation/opinion-recommendation/files/2013/wp203_en.pdf.

[4] Article 29 Working Party, ‘Opinion 05/2014 on Anonymisation Techniques’ (WP 216) 10 April 2014. https://ec.europa.eu/justice/article-29/documentation/opinion-recommendation/files/2014/wp216_en.pdf.

[5] Cookie Status: Current Status Of Browser Tracking Prevention | cookiestatus.com. https://www.cookiestatus.com.

[6] Customer URL Ticketing System. https://www.trustedsource.org.

[7] Data Services API: Endpoints. https://developer.myacxiom.com/code/api/endpoints/hashed-entity.

[8] DuckDuckGo Browser Extensions. https://github.com/duckduckgo/duckduckgo-privacy-extension.



[9] DuckDuckGo Tracker Blocklist. https://staticcdn.duckduckgo.com/trackerblocking/v2.1/tds.json.

[10] email_detector.js - Private Relay. https://github.com/mozilla/fx-private-relay/blob/v1.2.2/extension/js/email_detector.js.

[11] Enhanced Tracking Protection in Firefox for desktop. https://support.mozilla.org/en-US/kb/enhanced-tracking-protection-firefox-desktop.

[12] entity_map.json - DuckDuckGo Tracker Radar. https://github.com/duckduckgo/tracker-radar/blob/main/build-data/generated/entity_map.json.

[13] Evercookie - Virtually irrevocable persistent cookies. https://samy.pl/evercookie.

[14] Fathom documentation. https://mozilla.github.io/fathom/.

[15] Firefox Password Manager Module. https://searchfox.org/mozilla-central/source/toolkit/components/passwordmgr/NewPasswordModel.jsm.

[16] gorhill/uBlock: uBlock Origin - An efficient blocker for Chromium and Firefox. Fast and lean. https://github.com/gorhill/uBlock.

[17] heuristicsRegexp.js - Mozilla Autofill. https://searchfox.org/mozilla-central/source/toolkit/components/formautofill/content/heuristicsRegexp.js.

[18] Hide My Email for Sign in with Apple. https://support.apple.com/en-us/HT210425.

[19] Introducing Email Protection: The easy way to block email trackers and hide your address. https://spreadprivacy.com/introducing-email-protection-beta/.

[20] Mozilla Relay | Protect your real email address to help control your inbox. https://relay.firefox.com/.

[21] Optimise: Automatic advanced matching. https://www.facebook.com/business/m/signalshealth/optimize/automatic-advanced-matching.

[22] People-Based Marketing In The Cookiepocalypse. https://dataq.ai/blog/the-rise-of-people-based-marketing/.

[23] Privacy | Neustar. https://www.home.neustar/privacy.

[24] Secure, Fast & Private Web Browser with Adblocker | Brave Browser. https://brave.com/.

[25] Sending oHashes to Oracle Data Cloud platform. https://docs.oracle.com/en/cloud/saas/data-cloud/data-cloud-help-center/IntegratingBlueKaiPlatform/IDManagement/sending_ohashes.html.

[26] ShowPassword - Chrome Web Store. https://chrome.google.com/webstore/detail/showpassword/bbiclfnbhommljbjcoelobnnnibemabl.

[27] Signup Abandonment Emails Case Study: How Drip Increased Trial Signups by 15%. https://www.saasemailmarketing.net/articles/signup-abandonment-emails-increase-trial-signups/.

[28] Using shadow DOM. https://developer.mozilla.org/en-US/docs/Web/Web_Components/Using_shadow_DOM.

[29] What is a “good” conversion rate for your signup flow? https://heap.io/blog/good-conversion-rate-signup-flow.

[30] Why Taboola? https://pubhelp.taboola.com/hc/en-us/articles/360003157074-Why-Taboola-.

[31] yahoo-connectid/sync.spec.js. https://github.com/yahoo/yahoo-connectid/blob/d0b56d47a7/src/sync.spec.js#L33-L34.

[32] whotracks.me | Data from the largest and longest measurement of online tracking. https://github.com/ghostery/whotracks.me, 2017.

[33] Court of Justice of the European Union, Case C-40/17, Fashion ID GmbH & Co. KG v Verbraucherzentrale NRW e.V., judgment of 29 July 2019 (ECLI:EU:C:2019:629). https://curia.europa.eu/juris/liste.jsf?num=C-40/17, 2019.

[34] Tracker Radar Collector. https://github.com/duckduckgo/tracker-radar-collector, 2020.

[35] About the customer matching process - Google Ads Help. https://support.google.com/google-ads/answer/7474263?hl=en, 2021.

[36] Hashing Identifiers. https://docs.liveramp.com/connect/en/hashing-identifiers.html, 2021.

[37] Uploading and Targeting a Customer File. https://help.taboola.com/hc/en-us/articles/360021908874-Uploading-and-Targeting-a-Customer-File, 2021.

[38] 6 Steps for Avoiding Online Form Abandonment. https://themanifest.com/web-design/blog/6-steps-avoid-online-form-abandonment, 2022.



[39] DuckDuckGo Privacy Browser - Apps on Google Play. https://play.google.com/store/apps/details?id=com.duckduckgo.mobile.android, 2022.

[40] Gunes Acar. Four cents to deanonymize: Companies reverse hashed email addresses. https://freedom-to-tinker.com/2018/04/09/four-cents-to-deanonymize-companies-reverse-hashed-email-addresses/, 2018.

[41] Gunes Acar, Steven Englehardt, and Arvind Narayanan. No boundaries: data exfiltration by third parties embedded on web pages. Proceedings on Privacy Enhancing Technologies, (4):220–238, 2020.

[42] Gunes Acar, Marc Juarez, Nick Nikiforakis, Claudia Diaz, Seda Gürses, Frank Piessens, and Bart Preneel. FPDetective: Dusting the Web for Fingerprinters. In Proceedings of the 2013 ACM SIGSAC Conference on Computer and Communications Security, pages 1129–1140, 2013.

[43] Seyed Ali Akhavani, Jordan Jueckstock, Junhua Su, Alexandros Kapravelos, Engin Kirda, and Long Lu. Browserprint: An analysis of the impact of browser features on fingerprintability and web privacy. In International Conference on Information Security, pages 161–176. Springer, 2021.

[44] Ibrahim Altaweel, Maximillian Hils, and Chris Jay Hoofnagle. Privacy on adult websites. In Altaweel et al., Privacy on Adult Websites, Workshop on Technology and Consumer Protection (ConPro), 2017.

[45] Rolf Bagge, Célestin Matte, Éric Daspet, Kaspar Emanuel, Sam Macbeth, and Steven Roeland. Consent-O-Matic. https://github.com/cavi-au/Consent-O-Matic/, 2019.

[46] Sai Prashanth Chandramouli, Pierre-Marie Bajan, Christopher Kruegel, Giovanni Vigna, Ziming Zhao, Adam Doupé, and Gail-Joon Ahn. Measuring E-Mail Header Injections on the World Wide Web. In Proceedings of the 33rd Annual ACM Symposium on Applied Computing, pages 1647–1656, 2018.

[47] Manolis Chatzimpyrros, Konstantinos Solomos, and Sotiris Ioannidis. You Shall Not Register! Detecting Privacy Leaks Across Registration Forms. In Computer Security, pages 91–104. Springer, 2019.

[48] Council of European Union. EU General Data Protection Regulation (GDPR). https://eur-lex.europa.eu/eli/reg/2016/679/oj.

[49] Ha Dao and Kensuke Fukuda. Alternative to third-party cookies: investigating persistent PII leakage-based web tracking. In Proceedings of the 17th International Conference on emerging Networking EXperiments and Technologies, pages 223–229, 2021.

[50] Yana Dimova, Gunes Acar, Lukasz Olejnik, Wouter Joosen, and Tom Van Goethem. The CNAME of the Game: Large-scale Analysis of DNS-based Tracking Evasion. Proceedings on Privacy Enhancing Technologies, (3):394–412, 2021.

[51] Disconnect. Disconnect Tracking Protection. https://github.com/disconnectme/disconnect-tracking-protection.

[52] DuckDuckGo. DuckDuckGo Privacy Essentials browser extension. https://github.com/duckduckgo/duckduckgo-privacy-extension/blob/bfbd47a/shared/js/content-scope/tracking-cookies-1p-protection.js#L30, 2021.

[53] Peter Eckersley. How unique is your web browser? In Proceedings of the 10th International Conference on Privacy Enhancing Technologies (PETS), page 1–18.

[54] Steve Englehardt, Gunes Acar, and Arvind Narayanan. No boundaries for credentials: New password leaks to Mixpanel and Session Replay Companies. https://freedom-to-tinker.com/2018/02/26/, 2018.

[55] Steven Englehardt and Arthur Edelstein. Firefox 85 Cracks Down on Supercookies – Mozilla Security Blog. https://blog.mozilla.org/security/2021/01/26/supercookie-protections, 2021.

[56] Steven Englehardt, Jeffrey Han, and Arvind Narayanan. I never signed up for this! Privacy implications of email tracking. Proceedings on Privacy Enhancing Technologies (PETS), 2018(1):109–126, 2018.

[57] Steven Englehardt and Arvind Narayanan. Online Tracking: A 1-million-site Measurement and Analysis. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 1388–1401, 2016.

[58] Daniel Hedin, Arnar Birgisson, Luciano Bello, and Andrei Sabelfeld. JSFlow: Tracking information flow in JavaScript and its APIs. In Proceedings of the 29th Annual ACM Symposium on Applied Computing, pages 1663–1671, 2014.

[59] Chris Jay Hoofnagle, Bart van der Sloot, and Frederik Zuiderveen Borgesius. The European Union general data protection regulation: what it is and what it means. Information & Communications Technology Law, 28(1):65–98, 2019.



[60] Umar Iqbal, Steven Englehardt, and Zubair Shafiq. Fingerprinting the Fingerprinters: Learning to Detect Browser Fingerprinting Behaviors. In IEEE Symposium on Security and Privacy (SP), pages 1143–1161, 2021.

[61] Sakshi Jain, Mobin Javed, and Vern Paxson. Towards Mining Latent Client Identifiers from Network Traffic. Proceedings on Privacy Enhancing Technologies (PETS), (2):100–114, 2016.

[62] Hugo Jonker, Stefan Karsch, Benjamin Krumnow, and Marc Sleegers. Shepherd: A generic approach to automating website login. In Workshop on Measurements, Attacks, and Defenses for the Web (MADWeb), 2020.

[63] Jordan Jueckstock and Alexandros Kapravelos. VisibleV8: In-browser Monitoring of JavaScript in the Wild. In Proceedings of the Internet Measurement Conference, pages 393–405, 2019.

[64] Tomer Laor, Naif Mehanna, Antonin Durey, Vitaly Dyadyuk, Pierre Laperdrix, Clémentine Maurice, Yossi Oren, Romain Rouvoy, Walter Rudametkin, and Yuval Yarom. DRAWNAPART: A Device Identification Technique based on Remote GPU Fingerprinting. In Network and Distributed System Security Symposium (NDSS), 2022.

[65] Pierre Laperdrix, Nataliia Bielova, Benoit Baudry, and Gildas Avoine. Browser fingerprinting: A survey. ACM Transactions on the Web (TWEB), 14(2):1–33, 2020.

[66] Victor Le Pochat, Tom Van Goethem, Samaneh Tajalizadehkhoob, Maciej Korczyński, and Wouter Joosen. Tranco: A Research-Oriented Top Sites Ranking Hardened Against Manipulation. In Proceedings of the 26th Annual Network and Distributed System Security Symposium (NDSS), 2019.

[67] Adam Lerner, Anna Kornfeld Simpson, Tadayoshi Kohno, and Franziska Roesner. Internet jones and the raiders of the lost trackers: An archaeological study of web tracking from 1996 to 2016. In 25th USENIX Security Symposium, 2016.

[68] Xu Lin, Panagiotis Ilia, and Jason Polakis. Fill in the Blanks: Empirical Analysis of the Privacy Threats of Browser Form Autofill. In Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security (CCS), pages 507–519, 2020.

[69] René Mahieu and Joris Van Hoboken. Fashion-ID: Introducing a phase-oriented approach to data protection? European Law Blog, 2019.

[70] Célestin Matte, Nataliia Bielova, and Cristiana Santos. Do cookie Banners Respect My Choice?: Measuring Legal Compliance of Banners from IAB Europe’s Transparency and Consent Framework. In IEEE Symposium on Security and Privacy (SP), pages 791–809, 2020.

[71] Surya Mattu and Kashmir Hill. Before You Hit ’Submit,’ This Company Has Already Logged Your Personal Data. Gizmodo, 2017. https://gizmodo.com/before-you-hit-submit-this-company-has-already-logge-1795906081.

[72] Jonathan R Mayer and John C Mitchell. Third-Party Web Tracking: Policy and Technology. In 2012 IEEE Symposium on Security and Privacy, pages 413–427. IEEE, 2012.

[73] Keaton Mowery and Hovav Shacham. Pixel Perfect: Fingerprinting Canvas in HTML5. Proceedings of W2SP, 2012.

[74] Nick Nguyen. Changing Our Approach to Anti-tracking – Future Releases. https://blog.mozilla.org/futurereleases/2018/08/30/changing-our-approach-to-anti-tracking.

[75] Midas Nouwens, Ilaria Liccardi, Michael Veale, David Karger, and Lalana Kagal. Dark Patterns after the GDPR: Scraping Consent Pop-Ups and Demonstrating Their Influence. In CHI Conference on Human Factors in Computing Systems, pages 1–13, 2020.

[76] Lukasz Olejnik, Tran Minh-Dung, and Claude Castelluccia. Selling Off Privacy at Auction. In Network and Distributed System Security Symposium (NDSS), 2014.

[77] Emmanouil Papadogiannakis, Panagiotis Papadopoulos, Nicolas Kourtellis, and Evangelos P. Markatos. User Tracking in the Post-cookie Era: How Websites Bypass GDPR Consent to Track Users. In Proceedings of the Web Conference 2021, pages 2130–2141, 2021.

[78] Abbas Razaghpanah, Rishab Nithyanand, Narseo Vallina-Rodriguez, Srikanth Sundaresan, Mark Allman, Christian Kreibich, Phillipa Gill, et al. Apps, trackers, privacy, and regulators: A global study of the mobile tracking ecosystem. In The 25th Annual Network and Distributed System Security Symposium, 2018.

[79] Jingjing Ren, Ashwin Rao, Martina Lindorfer, Arnaud Legout, and David Choffnes. ReCon: Revealing and Controlling PII Leaks in Mobile Network Traffic. In Proceedings of the 14th Annual International Conference on Mobile Systems, Applications, and Services, pages 361–374, 2016.



[80] Franziska Roesner, Tadayoshi Kohno, and David Wetherall. Detecting and Defending Against Third-Party Tracking on the Web. In 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12), pages 155–168, 2012.

[81] Md Main Uddin Rony, Naeemul Hassan, and Mohammad Yousuf. Diving Deep into Clickbaits: Who Use Them to What Extents in Which Topics with What Effects? In Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, pages 232–239, 2017.

[82] Shobha Doshi. CafeMedia integrates with Verizon Media ConnectID. https://cafemedia.com/integrating-with-verizon-media-connectid/.

[83] Oleksii Starov, Phillipa Gill, and Nick Nikiforakis. Are You Sure You Want to Contact Us? Quantifying the Leakage of PII via Website Contact Forms. Proceedings on Privacy Enhancing Technologies (PETS), (1):20–33, 2016.

[84] Oleksii Starov, Phillipa Gill, and Nick Nikiforakis. FormLock. https://github.com/ostarov/Formlock, 2021.

[85] Steven Van Acker, Daniel Hausknecht, and Andrei Sabelfeld. Measuring Login Webpage Security. In Proceedings of the Symposium on Applied Computing, pages 1753–1760, 2017.

[86] Michael Veale and Frederik Zuiderveen Borgesius. Adtech and Real-Time Bidding under European Data Protection Law. German Law Journal, 2021.

[87] John Wilander. Intelligent Tracking Prevention. https://webkit.org/blog/7675/intelligent-tracking-prevention, 2017.

[88] John Wilander. Full Third-Party Cookie Blocking and More. https://webkit.org/blog/10218/full-third-party-cookie-blocking-and-more, 2020.

[89] Zhiju Yang and Chuan Yue. A Comparative Measurement Study of Web Tracking on Mobile and Desktop Environments. Proceedings on Privacy Enhancing Technologies, (2):24–44, 2020.

[90] Eric Zeng, Tadayoshi Kohno, and Franziska Roesner. Bad News: Clickbait and Deceptive Ads on News and Misinformation Websites. In Workshop on Technology and Consumer Protection (ConPro), 2020.

Appendix A Supported Hash and Encoding Methods for Leak Detection

Hashes and Checksums: MD2, MD4, MD5, SHA1, SHA256, SHA224, SHA384, SHA512, SHA3 (224, 256, 384, 512-bit), MurmurHash3 (32, 64, 128-bit), RIPEMD-160, Whirlpool, Salted SHA1 (salt=QX4QkKEU)

Encodings: Base16, Base32, Base58, Base64, Urlencode, Entity, Deflate, Zlib, Gzip, LZstring, Custom Map (kibp8A4EWRMKHa7gvyz1dOPt6UI5xYD3nqhVwZBXfCcFe... 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghi...)

Appendix B Labeling email-collecting 3rd-party domains that are not blocked by blocklists

For each domain:
1. Is the 3rd-party domain owned by the same entity as the first party?
   a. Yes: not tracking-related (first-party exception)
2. Did we receive any email from websites where this domain collected email addresses?
   a. Yes: tracking-related
3. Identify the company website; use the initiator script (URL, source code, copyright preamble, comments) if necessary.
   a. Is the 3rd party used for email validation (check on an example first-party site, taking into account UI messages (e.g., “Invalid email”) and HTTP response content (e.g., “bogus email” when we enter [email protected]))?
      i. Yes: not tracking-related (validation exception)
   b. Identify the business category using BuiltWith and the company website (esp. check for solutions, products, and other marketing materials). Does the business category include one of marketing, advertising, analytics?
      i. Yes: tracking-related
      ii. No: not tracking-related
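
The labeling procedure in Appendix B can be read as a short decision cascade. The Python sketch below is only an illustration of that reading: the `DomainEvidence` record, its field names and the `label_domain` function are hypothetical, the manual judgments (ownership, received emails, validation use, business category) are assumed to have been gathered beforehand, and falling through to the next step on a "No" answer is our assumption where the appendix lists only the "Yes" branch.

```python
from dataclasses import dataclass, field

# Business categories that Appendix B step 3b treats as tracking-related.
TRACKING_CATEGORIES = {"marketing", "advertising", "analytics"}


@dataclass
class DomainEvidence:
    """Manually collected facts about one third-party domain (hypothetical record)."""
    same_entity_as_first_party: bool
    received_email_from_sites: bool
    used_for_email_validation: bool
    business_categories: set = field(default_factory=set)


def label_domain(ev: DomainEvidence) -> str:
    """Apply the Appendix B questions in order; the first match decides the label."""
    # Step 1: same owner as the first party.
    if ev.same_entity_as_first_party:
        return "not tracking-related (first-party exception)"
    # Step 2: emails were received from sites where this domain collected addresses.
    if ev.received_email_from_sites:
        return "tracking-related"
    # Step 3a: the third party is used for email validation.
    if ev.used_for_email_validation:
        return "not tracking-related (validation exception)"
    # Step 3b: business category includes marketing, advertising or analytics.
    categories = {c.lower() for c in ev.business_categories}
    if categories & TRACKING_CATEGORIES:
        return "tracking-related"
    return "not tracking-related"
```

For instance, a domain that is not owned by the first party, sent no emails, is not a validation service, and is categorized under analytics would be labeled tracking-related by this sketch.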

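
Appendix A lists the transformations searched for during leak detection. As a rough illustration of the idea, and not the detector used in the study, the Python sketch below precomputes a handful of those hashes and encodings for a known email address using only the standard library and looks for them in a request URL or body. The names `candidate_tokens` and `find_leaks` and the example address are assumptions made for this sketch; compressed payloads, chained encodings, salted hashes and the remaining methods from Appendix A are not handled.

```python
import base64
import hashlib
import urllib.parse

# Methods whose output is hexadecimal and can be compared case-insensitively.
HEX_METHODS = {"base16", "md5", "sha1", "sha256", "sha512"}


def candidate_tokens(email: str) -> dict:
    """Map a method name to the string that method produces for the email."""
    raw = email.encode("utf-8")
    tokens = {
        "cleartext": email,
        "urlencode": urllib.parse.quote(email, safe=""),
        "base16": base64.b16encode(raw).decode("ascii").lower(),
        # Padding is stripped because senders often drop or escape '='.
        "base32": base64.b32encode(raw).decode("ascii").rstrip("="),
        "base64": base64.b64encode(raw).decode("ascii").rstrip("="),
    }
    for name in ("md5", "sha1", "sha256", "sha512"):
        tokens[name] = hashlib.new(name, raw).hexdigest()
    return tokens


def find_leaks(email: str, url: str, body: str = "") -> list:
    """Return the names of the methods whose output appears in the request."""
    haystack = url + "\n" + body
    haystack_lower = haystack.lower()
    found = []
    for name, token in candidate_tokens(email).items():
        target = haystack_lower if name in HEX_METHODS else haystack
        if token in target:
            found.append(name)
    return found
```

A request whose query string carries the SHA-256 digest of the filled-in address would be reported under `sha256`, while a request that contains none of the precomputed tokens yields an empty list.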
