WebSocket Adoption and the Landscape of the Real-Time Web

Paul Murley, Zane Ma, Joshua Mason, Michael Bailey
University of Illinois at Urbana-Champaign

Amin Kharraz
Florida International University
ABSTRACT

Developers are increasingly deploying web applications which require real-time bidirectional updates, a use case which does not naturally align with the traditional client-server architecture of the web. Many solutions have arisen to address this need over the preceding decades, including HTTP polling, Server-Sent Events, and WebSockets. This paper investigates this ecosystem and reports on the prevalence, benefits, and drawbacks of these technologies, with a particular focus on the adoption of WebSockets. We crawl the Tranco Top 1 Million websites to build a dataset for studying real-time updates in the wild. We find that HTTP polling remains significantly more common than WebSockets, and WebSocket adoption appears to have stagnated in the past two to three years. We investigate some of the possible reasons for this decrease in the rate of adoption, and we contrast the adoption process to that of other web technologies. Our findings further suggest that even when WebSockets are employed, the prescribed best practices for securing them are often disregarded. The dataset is made available in the hopes that it may help inform the development of future real-time solutions for the web.

CCS CONCEPTS
• Information Systems → World Wide Web.

KEYWORDS
WebSocket; Polling; Measurement; Performance; Adoption; Abuse; Security

ACM Reference Format:
Paul Murley, Zane Ma, Joshua Mason, Michael Bailey, and Amin Kharraz. 2021. WebSocket Adoption and the Landscape of the Real-Time Web. In Proceedings of the Web Conference 2021 (WWW ’21), April 19–23, 2021, Ljubljana, Slovenia. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3442381.3450063

This paper is published under the Creative Commons Attribution 4.0 International (CC-BY 4.0) license. Authors reserve their rights to disseminate the work on their personal and corporate Web sites with the appropriate attribution.
WWW ’21, April 19–23, 2021, Ljubljana, Slovenia
© 2021 IW3C2 (International World Wide Web Conference Committee), published under Creative Commons CC-BY 4.0 License.
ACM ISBN 978-1-4503-8312-7/21/04.
https://doi.org/10.1145/3442381.3450063

1 INTRODUCTION

Websites are increasingly dependent on real-time communication between clients and servers. The modern web has expanded to a broad range of applications that require bidirectional updates between parties: online gaming, advertising, and collaborative document editing are a few examples. For years, the client-server model was stretched to provide these capabilities to developers. Practices such as HTTP polling enabled some real-time applications, but these stopgap solutions present significant inefficiencies in the form of excessive headers and wasted requests. HTTP polling gradually evolved and optimized some of these pain points, but the fundamental architectural challenge remained: receiving real-time updates from a server which cannot initiate connections.

WebSockets were designed to offer developers a built-in, performant solution for real-time applications. The WebSocket API directly addresses these needs, providing low overhead and full duplex communication between web servers and their clients, while avoiding large amounts of unnecessary traffic. Having been implemented in major browsers for almost a decade, WebSockets are now a mature, widespread part of the web landscape. Accordingly, it is important to understand how this web technology is deployed and to what degree WebSockets have been successful in improving the web experience. By identifying WebSocket successes and failures, and comparing WebSockets with their alternatives, we strive to inform future efforts to improve WebSockets and similar web technologies.

This paper is an empirical assessment of the real-time web. We visit the Tranco Top 1M [46] websites, gathering data on 71,637 WebSocket connections across 55,805 websites which use them. We make our full dataset available for download (https://bit.ly/2TeUpYx). Using this dataset, we study the types of sites that use real-time technologies, with a particular focus on WebSockets, and we characterize how these sites employ these various technologies. As expected, we find that WebSockets are being used for a diverse set of use cases across popular websites, with some of the most common being chat, analytics, and live updates for sports scores and stock prices. We find that online chat accounts for a majority of WebSocket connections across top sites. Scripts which communicate via the WebSocket API are almost exclusively third-party (94.9%), suggesting that

most websites do not implement their own WebSocket infrastructure. This is in line with previous work [44], which detailed the increasing complexity of websites due to a growing number of third-party inclusions, and particularly third-party scripts.

In order to assess how successful WebSockets have been, we quantify the advantages they provide and identify real scenarios where they could be employed to improve current website implementations. We provide calculations on the overhead and wasted request savings of switching from HTTP polling to WebSockets, and we present real-world data to support our calculations. The results suggest that there are at least as many sites still using HTTP polling as there are sites which use WebSockets. As a case study, we provide a concrete example of the reduction of overhead that a website achieves by deploying WebSockets. Our research points to significant room for improvement on the web through the further deployment of WebSockets.

Like many other technologies on the Internet, WebSockets are commonly misconfigured and/or misused. While we did not uncover any new vulnerabilities, our analysis reveals shortcomings in third-party libraries and other errors in deployment that degrade the security and performance of many applications which use WebSockets. We discuss several of the best practices laid out in the WebSocket RFC [2] and compare them to the ecosystem we observe. For example, we find that 74.4% of WebSocket servers do not check/verify the HTTP Origin header attached to requests to open a connection. 14.1% of the WebSocket servers we observe are accessible over unencrypted (ws://) channels, with 0.8% of them using this configuration by default. While these may not always result in blatant vulnerabilities, they are indicators that developers are frequently failing to follow best practices, leading to a riskier environment overall. Beyond misconfigurations, we uncover evidence that WebSockets are regularly used to facilitate unsavory practices such as user tracking, cryptojacking, and malware delivery. We provide an empirical look at several malicious use cases, along with examples we observed in the wild. Our hope is that this work serves to raise awareness about the importance of real-time technologies such as WebSockets, while also underscoring several problems to be corrected in current, real-world deployments.

2 BACKGROUND

2.1 The Need for Real-Time Communication

The client-server model has long been the architectural backbone of the web: a client requests a resource, and that request is subsequently serviced by a remote server. However, this model has become increasingly inadequate for various applications. As web apps become more dynamic and interactive, servers often require the ability to push messages to the client at will. Examples of these use cases are numerous: browser-based gaming, collaborative document editing such as Google Docs, chat services, continuous updates to sports scores and stock prices, and many more. In these applications, servers may need to update clients multiple times per second, but may also go without sending an update for minutes or hours. Clearly, the web needs mechanisms for bidirectional communication at unpredictable intervals. In this paper, we examine a group of technologies that seek to enable this: HTTP Polling/Streaming, Server-Sent Events (SSE), and WebSockets.

2.2 HTTP Polling

An early solution to server-initiated communication, known as HTTP polling, has been around for more than two decades. In HTTP polling, a client desiring near-real-time updates sends frequent HTTP requests to a server, which usually replies with an empty or baseline response. When an update becomes available, the client receives it from the server with a latency of roughly the time between requests. To further optimize this solution, HTTP offers the keep-alive header. This option allows the reuse of a TCP connection for multiple HTTP requests, reducing the overhead of creating a new TCP connection for each HTTP request. The problems with this approach are straightforward and well known [1]. Even with persistent connections, the server cannot push updates to the client directly. While HTTP polling may be sufficient for applications where data arrives at known times, it will lead to many wasted client requests when servers need to update clients at inconsistent intervals. Each of these requests and responses contains HTTP headers, leading to significant amounts of wasted network traffic. In addition, updates from the server can only arrive at the frequency with which it is polled. An optimization on polling, called “long polling”, uses the ability of servers to hold client queries open until data becomes available. This reduces the delay in delivering updates to clients, since the server will respond to the request at the moment when the data becomes available. Since requests need to be sent much less frequently (only when the server sends data or when a request times out), this solution offers a significant decrease in bytes on the wire.

2.3 HTTP Streaming and Server-Sent Events

A further optimization to long polling, known as “HTTP streaming”, keeps the underlying TCP connection open even after data is delivered in response to a request, meaning that more data can be delivered to the client if/when it becomes available. This is accomplished using a Transfer-Encoding: chunked HTTP header, which causes the browser to hold open the TCP connection even as multiple responses are received, eliminating the need for a reconnection whenever the server sends data.

One particular variety of HTTP streaming, called Server-Sent Events (SSE), is offered by the EventSource JavaScript browser API. SSE essentially implements HTTP streaming in an easy-to-use interface, with reconnection and event firing built in. An SSE response sets the HTTP Content-Type header to the value text/event-stream, indicating to the browser that responses should be delivered as JavaScript events. However, these

solutions still suffer from added overhead due to connection timeouts and HTTP headers attached to each message.

2.4 WebSockets

The WebSocket protocol was standardized in 2011 to address many of the issues described above. WebSockets offer a significant improvement over previous real-time update mechanisms. A single HTTP request/response is required for setup. Subsequently, a full-duplex communications channel is available to both the client and the server. This means a client can receive updates from a server (either text or binary data) in real time, without polling the server. WebSocket frames sent over an existing connection do contain a header, but this header is much smaller than an HTTP header (usually less than 8 bytes total). Thus, WebSockets are valuable for two main categories of use cases that are not sufficiently handled by existing web technologies: 1) Small, frequent data exchanges between client and server which benefit from smaller per-message overhead, and 2) Server-initiated communications with a client which no longer require the client to poll the server.

A WebSocket connection begins when a client initiates a connection by sending an HTTP GET request with the header Upgrade: websocket. If the server is capable of serving a WebSocket connection, it responds with HTTP 101: Switching Protocols, and the connection is established. Both the client and the server may now send data frames at will. According to the WebSocket RFC [2], WebSockets were designed to be “as close to just exposing raw TCP to [a] script as possible”. There is a small header (2 to 14 bytes) attached to every WebSocket frame which contains an opcode, the payload size, and a masking bit, but relative to HTTP headers, the overhead of WebSocket frames is quite small. The browser closes WebSocket connections automatically when the client closes (or navigates away from) the page. Major browsers began implementing experimental versions of WebSockets as early as 2010, and since 2013, all major (desktop and mobile) browsers provide full WebSocket support [13].

2.5 Excluded Technologies

There are a few notable techniques and technologies related to real-time communication on the web which we purposefully exclude from our analysis. We do not examine plugins such as Microsoft Silverlight or Adobe Flash. While these protocols do allow for sockets which can be used in websites, we consider them to be end-of-life, as they are either no longer supported by major browsers (Silverlight), or will be phased out within a year (Flash). We also specifically exclude WebRTC from our measurements. WebRTC is an important protocol for real-time communication on the web, but it usually satisfies different requirements than the protocols we study here. WebRTC is a peer-to-peer protocol which usually uses UDP. In this paper, we are studying client/server interactions, so we omit it from our analysis. Finally, we do not investigate HTTP/2 Server Push. Server push is a technique by which a server can initiate the sending of resources to the client without a request. However, it is not meant as a real-time communication mechanism, but rather as an optimization technique to decrease the number of GET requests the client needs to send on a page load. For this reason, we do not discuss it in the paper.

3 MEASUREMENT

To gather data on the use of real-time technologies in the wild, we crawled the Tranco Top 1M [46] (list: https://tranco-list.eu/list/QJ94) using an instrumented version of the Chromium browser. Our browser harness is written in Go. It uses chromedp [15], a Go interface for the Chrome DevTools Protocol, to drive the browser and collect data. We use a full (non-headless) version of the browser in an in-memory display, and we use a fresh browser instance with a new user data directory for each website visit.

Our crawl of the Tranco Top 1M was conducted across 10 virtual machines, each running Ubuntu 16.04 and Chromium version 81. Each crawler VM ran 12 browser instances simultaneously, and the crawl took approximately four days (8 through 11 May 2020) in total. After retrying each failed crawl a second time, we obtained data for a total of 88.1% of websites in the top million. 7.7% of the listed domain names failed to resolve, and the remaining 4.2% had web servers which failed to respond or reset incoming connections. We remained on each web page for 60 seconds, in order to ensure we could observe real-time updates for a reasonable period of time. We gathered metadata on all resources downloaded for each site, including request initiators and timestamps for each request. We gathered data on each WebSocket and EventSource (Server-Sent Events) connection down to the data sent in each individual frame. Our full dataset is available for download (https://bit.ly/2TeUpYx).

3.1 Polling, Long Polling, and HTTP Streaming

Detecting HTTP polling and streaming is tricky because there are many different techniques, intervals, and libraries that are used to accomplish them in the wild. We take a heuristic approach to the measurement of these techniques. To be considered polling, a site must send three or more requests to the same URL, although we allow the query and fragment to differ. The requests must come from the same initiating script. We also require that the time between these requests remain consistent. To enforce this requirement, we calculate the median time between polling requests for a site and compare it to the number of requests and the amount of time spent on the page. If polling is happening at a consistent interval, we expect the product of median polling interval and the number of requests to be near the time spent on the page. We verify this methodology with manual inspection of a sample of the instances identified, finding no false positives.

Our crawler remains on each page for 60 seconds during our main crawl, but this is often too brief to detect polling or streaming, given that default browser timeouts for HTTP connections are now as high as five minutes [18]. To address this, we conducted an additional crawl of a subset

of websites, in which we remained on each page for one hour, rather than one minute. Our subset of sites for this crawl consisted of the top 1000 sites, along with a random sample of 1000 sites from each of the top 10K, top 100K, and top 1M, for a total sample size of 4000 websites. We found that HTTP polling or long polling is in use on 14.8% of these websites. Higher-ranked sites more commonly use polling, with 19.8% of the top thousand sites leveraging the technique compared to 9.2% of sites in the top million. We observed varying polling intervals, shown in Figure 1. The most common choice is an interval of 30 seconds, with the median being 60 seconds. On some sites, JavaScript forces full or near-full page refreshes periodically, which we consider to be a particularly inefficient variety of polling.

[Figure 1 plot: CDF of websites (y-axis) against seconds between polling requests (x-axis, 1 to 1000).]
Figure 1: Polling Intervals. A CDF of the median intervals at which sites send new polling requests. No standard interval for long polling exists, and we see wide variation in how developers implement their own polling solutions.

To determine whether these instances were regular or long polling, we examined the time between request and response. In long polling, we expect the request/response interval to be close to the interval between requests. In regular polling, the server responds immediately, so the interval is much shorter. We consider any group of polling requests in which the median time for a response is greater than half the median time between requests to be indicative of long polling. Applying this methodology to our dataset, we find long polling to be exceedingly rare in practice, occurring on only a single website out of our 4000-site sample. This was a surprise to us, as we found long polling discussed frequently online as a real-time update technique. One explanation for this is that while regular polling sacrifices latency in updates, it also frees servers from the requirement to hold open connections over a long period of time. As discussed further in our limitations section, this finding is likely also influenced by our data collection method, which only visited the front pages of websites. Nonetheless, it was surprising to find a lack of usage of this technique. HTTP streaming, which can be identified by looking for a “Transfer-Encoding: Chunked” HTTP header on polling requests, was present on 4.5% of all pages, and roughly a quarter of the sites using some form of HTTP polling/streaming.

3.2 Server-Sent Events

Like long polling, Server-Sent Events, which are implemented using the JavaScript EventSource API, were much less prevalent than we expected. We find them in use on only 0.4% of the top thousand and 0.05% of websites in the top million. Of that small percentage in the top million sites, a single advertising/tracking service (media.net) accounts for 62.8% of those instances. Clearly, this is not a technology that has found widespread adoption on the web. Usage of SSE and WebSockets offers an intriguing case study into what happens when a technology is not adopted by all major browsers. Although the EventSource standard was adopted quickly by Chrome (2010), Safari (2010), and Firefox (2012), it was never added to Internet Explorer, and was only added to Microsoft Edge in 2019 [5]. Given that Internet Explorer still has a market share of more than 2%, it is understandable that developers have largely avoided this API in favor of more widely supported real-time solutions. By contrast, Internet Explorer (and all other major browsers) had added support for WebSockets by 2012. As we show below, they have become significantly more common than SSE. While there are undoubtedly other contributing factors, it seems clear that the decision not to add SSE to Internet Explorer has had a significant adverse impact on SSE adoption.

3.3 WebSockets

We found WebSocket usage on 55,805 websites (6.3%) in total. This is a substantial increase from a study in 2018 [22], which found that only 1.6% to 2.5% of sites used WebSockets. Figure 2 shows the prevalence of WebSocket usage relative to site ranking. Extremely highly ranked sites tend to be slightly more likely to use WebSockets (7.3% of the top 1000 sites), but this difference is marginal. WebSockets are clearly favored by developers relative to SSE and the EventSource API, but they continue to lag well behind polling in terms of adoption.

The expanding usage of WebSockets across the Internet is enabled by third-party providers, who account for the vast majority of WebSocket use. Across 71,637 WebSocket connections we observed, 94.9% of connection-initiating scripts and 92.1% of the WebSocket servers to which they connect are third-party (different second-level domain than the base site). This large number of third-party providers means that there is less heterogeneity in deployments than one might expect based on the number of sites using WebSockets. Across all the connections we observed, we find only 7,624 unique WebSocket servers.

3.3.1 Who Uses WebSockets? Next, we investigated the types of websites that use WebSockets and the types of services that are being provided over WebSockets. To perform website categorization, we use the WebShrinker API [19], which provides a mapping from domain names to website categories (Sports, Chat, News, etc.). As shown in Table 1, live chat is the most common use case for WebSockets, accounting for thirteen of the top twenty most common third-party providers. These products are designed to increase user

engagement across many types of websites, especially for online commerce or company product sites. We find that WebSockets are also used to a lesser extent for ads, tracking, and outright malicious purposes as well. We delve deeper into who is using WebSockets in Section 4, focusing on misconfiguration and malicious use.

[Figure 2 plot: % of sites (y-axis, 0 to 35) against Tranco site rank (x-axis, 100 to 10^6); series: WebSockets, HTTP Polling.]
Figure 2: WebSockets and polling by site popularity. We track the cumulative frequency of WebSocket and HTTP Polling use over the Tranco Top 1M. WebSockets are marginally more common in the top 1K sites than the top 1M (7.3% vs. 6.3%), but polling is much more common over those same intervals.

WS Service        Count   Percentage  Category
zopim.com          9640   13.5%       Chat
tawk.to            8500   11.9%       Chat
drift.com          5888    8.2%       Tracking
intercom.io        4924    6.9%       Chat
livechatinc.com    4448    6.2%       Chat
visitors.live      3157    4.4%       Tracking
jivosite.com       1847    2.6%       Chat
hotjar.com         1803    2.5%       Tracking
firebaseio.com     1799    2.5%       Analytics
crisp.chat         1715    2.4%       Chat
Others            27916   39.0%       -

Table 1: Most Common WebSocket Services. The most common third-party WebSocket service providers. The majority of top services are live chat products, accounting for six of the top ten. Tracking and analytics use cases are also quite common, and are mostly provided as third-party services.

Category        # Using WS  % Using WS
Stocks                 572  50.8%
Software              1959  29.0%
Gambling               847  25.0%
Shopping              3327  10.0%
Chat                   907  10.0%
Real Estate            594   8.7%
Sports                 566   6.3%
Adult                  593   4.8%
Social                 288   3.12%
News/Weather          2196   3.1%

Table 2: WebSocket Usage by Category. We list some interesting website categories, along with the rate at which they use WebSockets. WebSockets are deployed across many different types of websites, led by stock sites, which are often updated in real time.

3.3.2 How Have WebSockets Been Adopted Over Time? We leveraged data from the HTTP Archive [16] to study WebSocket prevalence over the last three years, the oldest reliable data on WebSocket use we could find. The HTTP Archive publishes data from historical web crawls. While they do not capture fine-grained information about WebSocket connections such as opcodes and payload data, they have data on WebSocket initiation requests for roughly the last three years, which we plot in Figure 3. The data shows that the top one thousand sites were slower, and perhaps more cautious, in adopting the new technology. This could be because the top thousand sites care more about compatibility with all browsers, including very old ones. The data also shows that WebSocket adoption has been stagnant over the last year, indicating that WebSockets are a mature technology and may be reaching their “high-water mark” less than a decade after their standardization.

[Figure 3 plot: % of sites (y-axis, 0 to 10) against date (x-axis, 07/17 to 07/19); series: 1K, 10K, 100K, 1M.]
Figure 3: WebSocket Adoption Over Time. The percent of sites using WebSockets over the last three years. Perhaps surprisingly, the top thousand sites were slower to adopt WebSockets than the top million as a whole. Overall, WebSocket adoption has mostly leveled off over the last two years, an indication that the technology is relatively mature.

In Figure 4, we compare the adoption rate of WebSockets over the last three years to that of SSE and HTTP Strict Transport Security (HSTS), an HTTP header which causes browsers to enforce using HTTPS on a particular page. While HSTS is not a real-time technology, we thought it worthwhile to provide an adoption comparison since they are both optional web improvements released as RFCs at approximately the same time. All three of these technologies were introduced in a three-year period from 2010 to 2012, but they have seen very different rates of adoption over the last decade. For reasons explained above, SSE has remained almost nonexistent in the wild. HSTS, on the other hand, has seen usage grow significantly, even relative to WebSockets. We hypothesize that this is attributable to the simplicity and ease-of-implementation

20

CDF - Messages
1
HSTS
% of Sites

15 WebSocket 0.8
SSE 0.6
10
0.4
5 0.2 Client Messages
Server Messages
0 0
07/17 01/18 07/18 01/19 07/19 01/20 07/20 1 10 100 1000 10000 100000
Date Bytes per WebSocket Message

Figure 4: Web Tech Adoption Over Time—The rates at which Figure 5: CDF: Bytes Per Message—We show the distribution
WebSockets have been adopted over the last three years rela- of the number of bytes in each WebSocket message payload.
tive to Server-Side Events (SSE) and HTTP Strict Transport The overall median frame size is 85 bytes, while the largest
Security (HSTS). SSE is barely visible on the x-axis here, single message we captured was over 5 megabytes. The small
due to extremely low adoption. HSTS provides a contrast to size of most messages underscores the value of the reduced per-
WebSockets in adoption rate, likely because of the ease and message overhead offered by WebSockets, since WebSocket
simplicity of adoption relative to WebSockets and SSE. headers are generally an order of magnitude smaller than
HTTP headers.

of HSTS for web developers, but we leave investigation of 10000


this for future work.
3.3.3 How Are WebSockets Used? To characterize how devel-
opers are using WebSockets today, we present some statistics 1000

# Server Messages
on the traffic we observed. Figure 6 shows the distribution
of the number of WebSocket messages sent over each con-
nection by clients and servers. This scatter plot shows that 100
while there is diversity in the way messages are sent, it is
clear that it is more common for servers to send multiple
messages for each client message than the reverse. In other 10
words, the directionality of WebSocket traffic leans towards
servers sending more messages to clients. This aligns with
our expectations, given that one of the primary motivations
for WebSockets was the ability for servers to push messages
to clients without a corresponding request. 3.1% of connections
sent zero messages, and the median connection saw between 7 and 8
messages exchanged over the 60 seconds we remained on the page.
Interestingly, there were multiple connections that exchanged
thousands of messages during a single site visit. In the most
extreme case, popsplit.us, a web-based game, sent 31,324 messages
(from server to client) over the course of our visit, averaging
an update every 1.4 milliseconds. While this particular case is
extreme and is likely the result of poor development or
misconfiguration, the fact that the site still functions
highlights the ability of WebSockets to facilitate high-frequency
communication with websites. Despite this capability, only 2,512
connections (2.9%) averaged more than a frame per second,
suggesting that cases requiring high-frequency updates may
actually be relatively rare.

[Figure 6 plot: x-axis "# Client Messages" with ticks 1, 10, 100, 1000; reference line x=y.]
Figure 6: Client and Server Messages per Connection. The number of
messages sent relative to the number of messages received for each
WebSocket connection. The size of each mark scales with the number
of connections with those particular x and y values. Although
connections are quite diverse in their patterns, it is clearly more
common for server messages to outnumber client messages.

On a frame-by-frame level, we observed that most messages are
relatively small, with the median message size being 85 bytes.
However, more than 5% of the messages were larger than a kilobyte,
and there were rare instances of multi-megabyte WebSocket frames.
Manual inspection revealed that some of these were the result of
tracking/analytics scripts sending the content of the DOM back to a
server. Putting aside the privacy concerns, which are discussed in
more detail in Section 4, this seems to be a highly inefficient
method for accomplishing the goals of these companies. Other sites
use large payloads for a variety of reasons. We observe instances
where sites use WebSockets to load resources which would
traditionally be loaded via normal HTTP GET requests. One site,
moviemovie.com.hk, achieved an ever-changing display of movies by
sending megabytes worth of movie poster images over WebSockets.
Sites which load many of their resources over WebSockets may find an
increase in efficiency, but there is a danger of having to
re-implement many features that HTTP provides within the JavaScript
handling these WebSocket connections.
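
The frame-rate threshold used above (more than one frame per second over the 60-second visit) can be illustrated with a minimal Python sketch. The `ConnectionStats` record, the function names, and the sample counts below are hypothetical and exist only for illustration; they are not the schema or data of our measurement pipeline.

```python
from dataclasses import dataclass

VISIT_SECONDS = 60.0  # observation window per page visit, as stated in the text


@dataclass
class ConnectionStats:
    """Hypothetical per-connection record (not the paper's actual schema)."""
    url: str
    client_frames: int
    server_frames: int


def frames_per_second(conn: ConnectionStats, seconds: float = VISIT_SECONDS) -> float:
    """Average number of frames exchanged per second over the window."""
    return (conn.client_frames + conn.server_frames) / seconds


def is_high_frequency(conn: ConnectionStats, threshold: float = 1.0) -> bool:
    """True if the connection averaged more than `threshold` frames per second."""
    return frames_per_second(conn) > threshold


# Made-up sample connections, for illustration only.
sample = [
    ConnectionStats("wss://chat.example/socket", client_frames=3, server_frames=5),
    ConnectionStats("wss://game.example/state", client_frames=120, server_frames=900),
    ConnectionStats("wss://idle.example/ws", client_frames=0, server_frames=0),
]

high = [c.url for c in sample if is_high_frequency(c)]
share = len(high) / len(sample)  # fraction of connections above the threshold
```

Under this definition, a connection needs more than 60 frames in a 60-second visit to count as high frequency, which is why a median of 7 to 8 messages falls well below the threshold.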
WebSocket Adoption and the Landscape of the Real-Time Web WWW ’21, April 19–23, 2021, Ljubljana, Slovenia

The structure of data flowing over WebSockets on the modern web is
relatively homogeneous. WebSockets can transmit data in text or
binary. Clearly, it is more bandwidth-efficient to transmit data in
a binary format. However, we find that 88.4% of messages, and 96.4%
of total bytes, are made up of text data. Further, the majority of
this text data (69%) is JSON, indicating that developers using
WebSockets value ease of development over optimization. We also
observed that some of this JSON contained re-implementation of
features already provided by the WebSocket protocol. We identify 433
sites which are sending PING/PONG messages enclosed in JSON, even
though the WebSocket protocol itself provides dedicated opcodes
specifically for this purpose. All of this together serves to
confirm the anecdote that web developers often introduce
inefficiencies in their code to ease development, or because they
simply lack a firm understanding of the features and tools they use.

3.3.4 How Much Could WebSockets Improve Sites Currently Using
Polling? Previous work has studied the performance benefits of
WebSockets under controlled laboratory settings [52, 53]. Here, we
focus on applying real-world data to understand how changes could
affect websites as they are currently deployed. In schemes such as
HTTP polling and long polling, where a unique HTTP request is
required for each message, HTTP headers become a significant source
of overhead. Across our data, we find that the mean size of HTTP
request headers is 184.9 bytes. Response headers are more than twice
as large on average, at 403.1 bytes. Given that our crawls are
unauthenticated and thus our requests contain significantly fewer
cookies/tokens, these are definitely underestimates of average
header size. As a concrete example to demonstrate the differences in
traffic requirements between polling and WebSockets, consider
tradingview.com, a currency exchange website. This website uses a
single WebSocket connection to push updated currency exchange rates
to clients roughly once per second. As WebSocket use cases go, this
is not a particularly high update frequency. We found that some
WebSockets deliver tens or hundreds of updates per second. However,
the bandwidth savings achieved through WebSocket usage here are
still significant. With WebSockets, updates require an average of 82
bytes sent across the network (an 8 byte WebSocket header plus a
variable sized binary payload averaging 74 bytes). No request is
required: the update is delivered at the moment it becomes available
on the server. Using our data on HTTP header sizes, we estimate that
implementing the same functionality with long polling would require
an average of 499 bytes on the network for each update, given that
both request and response headers would be required and binary data
would need to be base64 encoded at a 4:3 ratio. Therefore, we
conservatively estimate that WebSockets offer a decrease in network
overhead of at least 417 bytes per client per second. Given that
tradingview.com is estimated to have approximately 4,835 visitors on
their page at a time in September 2019 [10], WebSockets represent a
decrease of 16Mbps in required server bandwidth for this relatively
benign traffic load.

4 MISCONFIGURATION AND MISUSE
The security community has long known that misconfiguration and
unintended use of technologies can lead directly to vulnerabilities
[32, 48]. Given the diversity of applications on the web, and the
importance they hold for users, it is valuable to periodically
measure and understand misconfiguration and misuse of web
technologies in the wild. This section looks back at the WebSocket
RFC and reflects on how well real-world deployments have complied
with some best practices and recommendations laid out in the
original standard.

4.1 Misconfiguration

4.1.1 Checking Origin Headers. Since WebSockets are not restricted
by the Same Origin Policy (SOP), a malicious script can use existing
cookies for a host to authenticate in a manner similar to cross-site
request forgery attacks [27]. Consequently, a WebSocket server
should, for many typical use cases, validate the HTTP Origin header,
which is set by the web browser based on the origin of the script
opening the connection. To facilitate this process, major WebSocket
libraries such as socket.io and sockjs provide a specific function
for setting the allowed origins [11, 12]. However, the default
behavior of these libraries is to allow any origin to access the
server.

We measured whether WebSocket servers in the wild are checking
Origin headers by attempting to connect to each of the WebSocket
servers we observed using an arbitrary HTTP Origin. To do this, we
used a dedicated script as a WebSocket client, used the applicable
domain name as the host, and specified an arbitrary (incorrect,
unrelated) domain as the HTTP Origin. We were able to successfully
connect to (and receive data frames from) 74.4% of the 7,624
distinct WebSocket servers we encountered. This widespread lack of
Origin header validation in initial WebSocket requests might sound
shocking at first, and there is no doubt that an overlooked Origin
header check could jeopardize user security and privacy. However,
our analysis showed that the scripts accepting these connections
were mostly trackers, web analytics, and advertising networks that
were loaded during a website visit. These entities operate more
effectively when they provide universal access to their servers and
allow all incoming connections in order to attain more visibility
over the behavior of users across different websites. This behavior
is in line with well-known tracking methodologies such as pixel
tracking [30], where fetching a cross-origin request is necessary to
load a remote image. Servers are often configured to allow
cross-origin requests using Cross-Origin Resource Sharing (CORS) to
allow the image fetching and complete the tracking process
successfully. Trackers and analytics servers often omit a check on
the Origin header, but applications where security is important to
the provider, such as cryptomining and gaming, tend to reject
connections without the proper header.

4.1.2 Unencrypted Connections. The large number of tracking and
analytics scripts we found emphasizes the fact that
WebSocket connections often transmit sensitive data to remote
servers. In this section, we seek to understand whether the adoption
of WebSockets aligns with the push in the web community to move web
traffic to TLS-protected channels. WebSockets can use unencrypted
(ws://) or TLS-protected (wss://) connections [2]. However, in all
modern browsers, if a website is served over an HTTPS connection, an
attempt to open a non-TLS-protected WebSocket connection will fail
[8], so unencrypted WebSocket connections can only be initiated by
websites that are served over HTTP. Of the 55,805 websites we
observed using WebSockets, 438 of them (0.8%) used unencrypted
WebSockets by default. We also attempted to create unencrypted
connections to each of the WebSocket servers over plain HTTP, solely
with the goal of locating servers that do not enforce secure
WebSocket communication on incoming requests. This analysis showed
that 14.1% of the servers allowed WebSocket upgrade requests over a
cleartext communication channel. In the best case, these are
unnecessary services exposed on the Internet, exposing additional
attack surfaces without conferring any apparent benefits. In the
worst case, they represent a vulnerability in that sensitive user
data may be sent across the web unencrypted without users ever
knowing.

4.2 Malicious Use
In the course of our study, we found that a significant percentage
of WebSocket use in the wild was related to tracking, analytics, and
even worse, scams and malware delivery. To be clear, we do not
assert that these practices would not be possible without
WebSockets. However, WebSockets clearly make these harmful practices
easier and more discreet. This section outlines our findings on the
darker side of WebSocket use.

4.2.1 Data Leaks. The topic of web tracking has been well studied,
and the pervasiveness of trackers and their privacy implications
have been extensively documented [26, 31, 34, 37, 42, 43]. Trackers
are known to use intrusive techniques to gather information about
online users and their behavior patterns. Examples include the use
of HTML5 APIs such as Canvas, Battery Status, and Audio context [31]
for device fingerprinting [35, 45, 51], cross-device tracking [25],
and even exfiltrating user data from unsubmitted forms [55].

We extended our experiments in Section 3 to investigate how
WebSockets, as an efficient data transmission mechanism, are used by
trackers. Personally Identifiable Information (PII) is a blanket
term describing information about users that could be used to trace
an individual’s identity. PII includes first and last names, email
address(es), phone number(s), and IP addresses. We define a PII leak
as any instance in which this information is transmitted to a third
party without user consent. Beyond PII, there is a wide variety of
information that scripts can collect which might be considered
objectionable to users on privacy-related grounds. Session
recording, for example, is the process of deep copying the DOM
objects and serializing those objects to a specific format (e.g.,
JSON) for transmission to a remote server. In addition to sending a
full snapshot of the DOM tree at the start of a site visit, the
JavaScript code periodically records the interaction of the user as
a set of snapshots that contain mouse coordinates and keyboard
strokes. Session recording has been studied by other researchers
[29] and is considered by many to be a serious privacy violation.
Session recorders, PII leakers, and other unsavory (but increasingly
common) scripts use WebSockets as an efficient way to extract the
data they collect.

We find that leaked data in the frames’ payloads is usually sent in
a structured format (i.e., key/value pairs) such as password=mypass
or [email protected]. Accordingly, we performed simple string
matching to identify PII being sent inside of WebSocket payloads.
Although values such as username, password, and email address were
often not present because this was an automated crawl, we were able
to measure PII leakage based on clearly-named keys. Table 3
illustrates the type of information sent over WebSocket channels.
While we observed the use of unique identifiers across third-party
code, the data frames also included key/value pairs such as
locations and email addresses. As an example,
https://vapesociety.com, an online shopping website, loads a script
that establishes a WebSocket connection to zopim.com, a marketing
automation entity. The code sent 21 frames (approximately one frame
every two seconds) to the server during the site visit. The
exchanged frames contained information about website visitors (e.g.,
mouse movements, client IP, and geolocation data). Note that PII
leakage was not seen in every instance of the third parties listed
in Table 3. For example, hotjar.com, one of the most frequently
contacted domains, leaked user data in only 79 out of 2,337 cases.
It was also common that these third parties received more than one
tracking parameter (e.g., visitor ID, phone number). The analysis on
the type of PII data sent over WebSockets shows that visitor ID,
email address, and IP address were the most common PII data
extracted among the five libraries listed in Table 3. We identified
211 unique incidents where PII data was extracted as key/value
pairs, and 160 (76%) of these cases contained information about the
visitor ID, email address, or IP address of the visiting user.

4.2.2 Web Tracking. Web tracking has become a well-known practice in
recent years. The canonical web tracking technique assigns an
identifier to the user’s browser for a third-party domain, say
tracker.com, and generates a request to tracker.com using that
identifier when the user visits a webpage that contains a resource
from tracker.com. Figure 7 illustrates a real-world example of
cross-site web tracking based on WebSockets. The third-party code
belonging to truconversion.com, a web tracking entity, collects
information about device properties, the IP address, and the
geographical location of the user. It sends this information (via a
WebSocket) to a remote server. The server sends a periodic heartbeat
which contains a unique 16 digit user ID to check whether the
browser tab is still open on the user’s machine. We identified 54
websites which were using the same script from truconversion.com. We
tested all 54 websites using the same
browser profile and confirmed that the assigned user ID exchanged
over WebSockets was identical in all the websites. If the remote
server receives heartbeat responses from different websites for the
same user ID at the same time, it allows the tracker to create a
list of websites that a user has visited in a specific time period.
We extended this experiment by deleting all cookies of the browser
profile used in the previous experiment and running the experiment
again while monitoring the assigned user ID of the device. Our
analysis showed that clearing cookies would not prevent trackers
from identifying the same device because the same user ID was
assigned to the visiting device across each of the 54 websites.

Library  PII  Location  Fingerprint  Session Recording
cux.io ∙ ∙ ∙ ∙
beusable.com ∙ ∙ ∙
hotjar.com ∙ ∙ ∙ ∙
inspectlet.com ∙ ∙ ∙ ∙
webspectator ∙ ∙ ∙
colpirio.com ∙ ∙ ∙
firecrux.com ∙ ∙ ∙ ∙
zopim.com ∙ ∙
Table 3: Prevalent third-party trackers. Some of the most common
third-party trackers, along with the types of data they export. We
observe that many scripts export data at regular intervals.
WebSockets allow them to accomplish this less conspicuously, since
they avoid sending frequent HTTP requests, especially in cases like
session recording.

[Figure 7 diagram: (1) Visiting Websites: the user visits bestseekers.com, guardianfall.com, and birchgold.com, and each page opens a WebSocket to io.truconversion.com (wss://io.truconversion.com/?token=10235-bestseekers.com, wss://io.truconversion.com/?token=4603-www.guardianfall.com, wss://io.truconversion.com/?token=3907-www.birchgold.com); (2) Aggregating Fingerprinting Data via Concurrent WebSocket Channels; (3) Real-time Cross-site Tracking.]
Figure 7: Example: Cross-Site Tracking. An example where WebSockets
are used to track users across multiple sites. By correlating user
IDs sent at aligning times, a third-party script provider can build
lists of websites which users have visited.

In this experiment, we observed that truconversion.com uses
persistent browser-based fingerprinting to make defending against
online tracking more difficult. WebSockets make this possible by
allowing remote servers to push low-cost heartbeat requests across
different websites. Real-time cross-site tracking can have
significant monetary value [23] due to the rise of Real-Time Bidding
(RTB), in which advertising and tracking companies are incentivized
to collaborate in order to exchange real-time data about users and
facilitate bidding on impressions [23, 33]. Bashir et al. [24]
studied this behavior by implementing a simulation of an online ad
ecosystem, demonstrating how online tracking and analytics
incorporate various techniques to collect users’ browsing patterns
and use them in the RTB market. We found five other tracking
companies that were using similar techniques for real-time
cross-site web tracking.

4.2.3 Exposing Users to Malicious Pages. Prior work [49, 50, 57, 59]
has discussed various techniques that adversaries use to distribute
malicious links across websites, luring users to visit their scam
pages. However, this approach lends itself to rapid identification
and removal of these malicious pages, as discussed in prior work on
emerging online scams [39, 49]. An alternative for adversaries is to
expose users to malicious content via real-time notifications, a
form of server-to-client messaging often built on WebSockets. Open
source PushWoosh [9], Google Firebase notification [6], and Amazon
Simple Notification Service (SNS) [3] are just examples of
implementations which are widely deployed. Using such a service, web
applications can send data to thousands of remote user devices with
relatively small server-side overhead. Real-time notifications are
deployed on top of this service to provide publishers with a
flexible notification platform that is supported across a variety of
end-user devices. Since WebSockets work on all major browsers [14],
they allow for broad code reuse across many browser versions and
device types.

The ability to send real-time notifications to remote targets and
avoid leaving evidence of malicious activity in website source code
allows adversaries to hide from search engines and active web scans
by security researchers. We identified 1,466 websites that were
using third-party libraries to deliver WebSocket-based
notifications. Visitors to the corresponding websites are encouraged
to register for real-time notifications to receive special discounts
on products or software.

We performed an experiment on 400 websites which used these
third-party libraries. Inside of a dedicated virtual machine, we
registered for notifications and monitored traffic from these sites
for 30 days. Of these 400 websites, we identified 32 cases where
push notifications were used to deliver Potentially Unwanted
Programs (PUPs), scam pages, adult, or affiliated websites. Across
these 32 websites, we logged 123 individual payloads.

Table 4 shows the types of malicious payloads distributed by
WebSocket push notifications. While we observed multiple types of
malicious practices, a large number of collected samples (52.7%)
were distributing PUPs such as Amonetize, Mac Keeper, and TotalAV.
While the number of identified cases is not huge, the finding is in
line with prior work [50] where the authors analyzed a large number
of social engineering and web-based attacks and found that PUPs and
extensions
are the most common forms of malicious payloads. We also found 10
cases where pop-up widgets claimed that the visitor’s computer was
infected with malware. These websites are entry points to technical
support scams, an ongoing problem recently explored by other
researchers [49].

Categories                 Dist.
Adult pages                12 (10%)
Affiliate Programs         33 (27%)
Malware                    3 (2.5%)
PUPs                       42 (34%)
Malicious Extensions       23 (18.7%)
Technical Support Scams    10 (8%)
Total                      123 (100%)
Table 4: Malicious payloads delivered by push notifications. The
distribution of malicious payloads we observed being delivered
during our notification experiment. Across 32 out of 400 websites
which delivered unwanted content, we observed 123 total payloads.
Among these payloads were scams and malware which could have serious
negative impacts on individual users.

Binary Type        Dist.        Type
Installcore        3            PUP
Speed Dial         12           Extension
Mac Keeper         20           PUP
Easy Convert       11           Extension
Search Defender    10           PUP
TotalAV            9            PUP
Flash update       3            Malware
Total              69 (100%)    -
Table 5: List of unique downloaded binaries distributed via
WebSocket-based notifications. We observed these downloads as part
of our 30-day experiment on WebSocket-based push notifications.
While most of these downloads simply cause annoyance for users,
there are a handful of examples of actual malware being distributed
through these channels.

5 DISCUSSION AND LIMITATIONS
Many of the canonical difficulties in measuring website
functionality apply to this study. In particular, WebSockets are
difficult to measure in full for two main reasons. First, initiating
WebSocket connections may require specific interactions with a
website, such as pressing a “connect to updates” button. Predicting
these interactions heuristically, either beforehand or during site
visits, is quite difficult, and we do not attempt it in this work.
Second, many of the more interesting WebSocket applications sit
behind some form of authentication, or simply deeper than the front
pages of websites. Measurement of authenticated services is a
long-standing problem in the community, and finding ways to
effectively measure authenticated WebSockets at scale remains an
interesting topic for future research. In particular, we believe
that incorporating a server-side vantage point into measurements
could shed additional light on the value of various real-time
technologies on the modern web.

Adoption of HTTP/2.0 is currently at about 45% among the top million
websites [7] and rising. It is interesting to consider how this will
impact WebSocket usage, given that HTTP/2.0 includes server push, a
feature whereby servers can asynchronously send data to clients
without an explicit request. WebSocket usage is still much more
common than server push. This is not surprising, given that server
push is not supported on older desktop browsers such as Internet
Explorer, or on common mobile browsers such as iOS Safari (both of
which support WebSockets) [4]. WebSockets and server push fill
slightly different roles as well. Server push does not offer a truly
bidirectional channel, and the data it sends is not always directly
accessible to scripts. In fact, in late 2020, Google announced the
removal of HTTP/2.0 and gQUIC server push due to high maintenance
costs and low usage [17]. Considering all of this, our assessment is
that while the proliferation of HTTP/2.0 may replace WebSockets for
some specific use cases, it is unlikely to significantly impact the
number of WebSocket deployments in the foreseeable future.

There is clearly room for improvement in terms of best practices in
the WebSocket ecosystem, but we do not believe this to be
principally (or even primarily) the responsibility of first-party
developers. We fear that web developers are often forced to learn
how to implement security in WebSockets via online forum posts and
other non-official sources. Current documentation for popular
third-party libraries such as socket.io and sockjs lacks clear
explanations on how to implement security features such as rate
limiting, logging, and authentication. Consequently, developers are
often left with powerful functionality but a sparse knowledge base
on creating hardened applications. This seems to be a recipe for
dangerous deployments, and we encourage code providers to improve
documentation related to securing their WebSocket implementations.

This study follows in the footsteps of numerous efforts to
understand how new standards and technologies are changing the web.
Given the complexity of modern browsers and websites, it is
essential that additions to the ecosystem are both motivated by and
evaluated with empirical, evidence-based investigations to
understand the need for new features and the impact of those
features once they have been deployed. Based on our findings, it is
clear that the WebSocket API is providing substantial benefits to
developers and users across a broad range of applications. However,
it is important to constantly weigh the benefits of a technology
against the negative effects it generates. In Section 4, we
presented evidence of less-than-desirable uses of WebSockets, which
should be monitored by the security and privacy community going
forward. In particular, WebSockets significantly increase the
potency
of tracking and analytics scripts, and we highlight them in
particular as a subject for continued research.

6 RELATED WORK
Although the WebSocket protocol was added to major browsers almost a
decade ago, empirical analyses focusing specifically on the
WebSocket ecosystem in the wild remain relatively sparse. Snyder et
al. [54] measured WebSocket usage over the Alexa Top 10K websites,
finding that 5.4% of them used WebSockets, with 64.6% of those
usages being blocked by anti-tracking software. Bashir et al. [22]
showed how trackers and advertisers could use WebSockets to elude ad
blockers. In the process, they also measured WebSocket prevalence
and found less WebSocket usage, stating that only about 2% of sites
used WebSockets. This disagreement on the commonality of WebSockets
is notable. Our data shows a larger WebSocket ecosystem, with 6.3%
of websites in the Tranco Top 1M using WebSockets. This makes the
misuse and vulnerabilities in the WebSocket ecosystem all the more
concerning.

In recent years, browser API usage has become a prevalent way to
study various phenomena on the web. Researchers have used traces of
these browser API calls to forensically reconstruct web-based
attacks [47, 57], analyze malicious extensions [21], better
understand anti-ad and anti-tracking extensions [54], and more. We
see this measurement technique as extremely potent for understanding
dynamic phenomena on the web, and we expect researchers to continue
to expand the applications of these traces. We utilize the powerful
Chrome DevTools protocol interface to drive the browser and gather
fine-grained data.

Our work exists in a larger context of research which tracks abuse
of emerging web technologies. Recent work has studied online scams
[49], scareware [40, 60], PUPs [41, 50, 56], and the identification
of risky websites [36, 58]. Online tracking and privacy violations
have been studied extensively in recent years [20, 28, 30, 38]. In
particular, Bashir et al. explored how WebSockets can assist
trackers in bypassing ad blockers. We expand on this work by
studying a broader set of troubling use cases for WebSockets and
offering a methodology to automatically identify abuse of the API.
We believe that our work informs web developers and other
researchers of the problems with WebSockets, and equips them with
techniques to counter malicious usage.

7 CONCLUSION
This study examined the modern real-time web ecosystem, offering an
up-to-date picture of how WebSockets and other real-time protocols
are currently being used in the wild. Reflecting on the goals of the
WebSocket protocol designers a decade ago, we provided an assessment
of the successes and failures of these technologies from an
empirical perspective. We compared WebSocket use with the use of
other real-time solutions, showing tangible benefits of switching to
WebSockets and highlighting some remaining room for improvement.
When websites do adopt WebSockets, they should be mindful of best
practices which will create safer, more stable applications. We show
that web developers are often failing to follow these practices and,
in some cases, using WebSockets in questionable or malicious ways.
We believe WebSockets provide significant overall benefit to users
on the web, and we advocate for expanded adoption, along with
improved documentation.

ACKNOWLEDGMENTS
We would like to thank the anonymous reviewers for their thoughtful
feedback.

REFERENCES
[1] 2011. rfc6202: Known Issues and Best Practices for the use of
Long Polling and Streaming in Bidirectional HTTP.
https://tools.ietf.org/html/rfc6202. (April 2011).
[2] 2011. rfc6455: The WebSocket Protocol.
https://tools.ietf.org/html/rfc6455. (December 2011).
[3] 2019. Amazon Simple Notification Service.
https://aws.amazon.com/sns/. (September 2019).
[4] 2019. Can I Use: Push API. https://caniuse.com/#feat=push-api.
(October 2019).
[5] 2019. Can I Use: Server-Sent Events.
https://caniuse.com/#feat=server-sent-events. (October 2019).
[6] 2019. Firebase Cloud Messaging.
https://firebase.google.com/docs/cloud-messaging/. (September 2019).
[7] 2019. HTTP/2 + Push Adoption Measurements.
https://http2.netray.io/stats.html. (September 2019).
[8] 2019. Mixed-content WebSockets.
https://bugs.chromium.org/p/chromium/issues/detail?id=85271.
(September 2019).
[9] 2019. PushWoosh: An Open Source Push Notification.
https://pushwoosh.com/. (September 2019).
[10] 2019. SimilarWeb: Traffic Analysis for tradingview.com.
https://www.similarweb.com/website/tradingview.com. (September 2019).
[11] 2019. Sockjs-node. https://github.com/sockjs/sockjs-node.
(September 2019).
[12] 2019. The Socket.io: real-time, bidirectional and event-based
communication. https://socket.io/docs/. (September 2019).
[13] 2019. Web Sockets. https://caniuse.com/#feat=websockets.
(October 2019).
[14] 2019. The WebSocket API (WebSockets).
https://developer.mozilla.org/en-US/docs/Web/API/WebSockets_API.
(September 2019).
[15] 2020. chromedp. https://github.com/chromedp/chromedp. (May 2020).
[16] 2020. The HTTP Archive. https://httparchive.org. (May 2020).
[17] 2020. Intent to Remove: HTTP/2 and gQUIC server push.
https://groups.google.com/a/chromium.org/g/blink-dev/c/K3rYLvmQUBY/m/vOWBKZGoAQAJ.
(November 2020).
[18] 2020. Stop Connection Timeouts from Happening.
https://support.mozilla.org/en-US/questions/998088. (May 2020).
[19] 2020. WebShrinker Website Categorization API.
https://webshrinker.com. (May 2020).
[20] Gunes Acar, Marc Juarez, Nick Nikiforakis, Claudia Diaz, Seda
Gürses, Frank Piessens, and Bart Preneel. [n. d.]. FPDetective:
dusting the web for fingerprinters. In 20th ACM Conference on
Computer and Communications Security (CCS).
[21] Sajjad Arshad, Amin Kharraz, and William Robertson. 2016.
Identifying Extension-based Ad Injection via Fine-grained Web
Content Provenance. In Proceedings of the International Symposium on
Research in Attacks, Intrusions and Defenses (RAID).
[22] Muhammad Ahmad Bashir, Sajjad Arshad, Engin Kirda, William
Robertson, and Christo Wilson. 2018. How Tracking Companies
Circumvented Ad Blockers Using WebSockets. In ACM Internet
Measurement Conference (IMC).
[23] Muhammad Ahmad Bashir, Sajjad Arshad, William Robertson, and
Christo Wilson. 2016. Tracing information flows between ad exchanges
using retargeted ads. In 25th USENIX Security Symposium (USENIX
Security).
[24] Muhammad Ahmad Bashir and Christo Wilson. 2018. Diffusion of
user tracking data in the online advertising ecosystem. Proceedings
on Privacy Enhancing Technologies (PETS) (2018).
[44] Deepak Kumar, Zane Ma, Zakir Durumeric, Ariana Mirian, Joshua
Mason, Michael Bailey, and J. Alex Halderman. 2017. Security
Challenges in an Increasingly Tangled Web. In 26th International
World Wide Web Conference (WWW).
[45] P. Laperdrix, W. Rudametkin, and B. Baudry. 2016. Beauty and
the Beast: Diverting Modern Web Browsers to Build Unique Browser
Fingerprints. In 2016 IEEE Symposium on Security and Privacy (SP).
878–894.
[46] Victor Le Pochat, Tom Van Goethem, Samaneh Tajalizadehkhoob,
Maciej Korczyński, and Wouter Joosen. 2019. Tranco: A
Research-Oriented Top Sites Ranking Hardened Against Manipulation.
In Proceedings of the 26th Annual Network and Distributed System
Security Symposium (NDSS 2019).
https://doi.org/10.14722/ndss.2019.23386
[47] Bo Li, Phani Vadrevu, Kyu Hyung Lee, and Roberto Perdisci.
2018. JSgraph: Enabling Reconstruction of Web Attacks via Efficient
Tracking of Live In-Browser JavaScript Executions. In 25th Network
and Distributed System Security Symposium (NDSS).
[25] Justin Brookman, Phoebe Rouge, Aaron Alva, and Christina Yeung.
2017. Cross-Device Tracking: Measurement and Disclosures.
Proceedings on Privacy Enhancing Technologies 2017, 2 (2017),
133–148.
[26] Aaron Cahn, Scott Alfeld, Paul Barford, and S. Muthukrishnan.
2016. An Empirical Study of Web Cookies. In Proceedings of the 25th
International Conference on World Wide Web (WWW ’16). 891–901.
[27] Jianjun Chen, Jian Jiang, Haixin Duan, Tao Wan, Shuo Chen, Vern
Paxson, and Min Yang. 2018. We Still Don’t Have Secure Cross-Domain
Requests: an Empirical Study of CORS. In 27th USENIX Security
Symposium (USENIX Security).
[28] Anupam Das, Gunes Acar, Nikita Borisov, and Amogh Pradeep.
[n. d.]. The Web’s Sixth Sense: A Study of Scripts Accessing
Smartphone Sensors. In 25th ACM Conference on Computer and
Communications Security (CCS).
[29] Steven Englehardt. [n. d.]. No boundaries: Exfiltration of
personal data by session-replay scripts. https://freedom-to-tinker.com/
2017/11/15/no-boundaries-exfiltration-of-personal-data-by- [48] Yang Liu, Armin Sarabi, Jing Zhang, Parinaz Naghizadeh, Manish
session-replay-scripts/. ([n. d.]). Karir, Michael Bailey, and Mingyan Liu. 2015. Cloudy with a
[30] Steven Englehardt and Arvind Narayanan. 2016. Online Track- chance of breach: Forecasting cyber security incidents. In 24th
ing: A 1-million-site Measurement and Analysis. In 23rd ACM {USENIX} Security Symposium ({USENIX} Security 15). 1009–
Conference on Computer and Communications Security (CCS). 1024.
[31] Steven Englehardt and Arvind Narayanan. 2016. Online Tracking: [49] Najmeh Miramirkhani, Oleksii Starov, and Nick Nikiforakis. 2017.
A 1-million-site Measurement and Analysis. In Proceedings of Dial One for Scam: A Large-Scale Analysis of Technical Sup-
the 2016 ACM SIGSAC Conference on Computer and Com- port Scams. In 24th Network and Distributed System Security
munications Security (CCS ’16). ACM, New York, NY, USA, Symposium (NDSS).
1388–1401. [50] Terry Nelms, Roberto Perdisci, Manos Antonakakis, and Mus-
[32] Sascha Fahl, Yasemin Acar, Henning Perl, and Matthew Smith. taque Ahamad. 2016. Towards Measuring and Mitigating Social
2014. Why eve and mallory (also) love webmasters: a study on Engineering Software Download Attacks. In 25th USENIX Secu-
the root causes of SSL misconfigurations. In Proceedings of the rity Symposium (USENIX Security).
9th ACM symposium on Information, computer and communi- [51] Nick Nikiforakis, Alexandros Kapravelos, Wouter Joosen, Christo-
cations security. ACM, 507–512. pher Kruegel, Frank Piessens, and Giovanni Vigna. 2013. Cook-
[33] Arpita Ghosh and Aaron Roth. 2011. Selling Privacy at Auction. ieless Monster: Exploring the Ecosystem of Web-Based Device
In 12th ACM Conference on Electronic Commerce (EC). Fingerprinting. In Proceedings of the 2013 IEEE Symposium on
[34] Phillipa Gill, Vijay Erramilli, Augustin Chaintreau, Balachander Security and Privacy (SP ’13). IEEE Computer Society, Wash-
Krishnamurthy, Konstantina Papagiannaki, and Pablo Rodriguez. ington, DC, USA, 541–555. https://doi.org/10.1109/SP.2013.43
2013. Best Paper – Follow the Money: Understanding Economics of [52] V. Pimentel and B. G. Nickerson. 2012. Communicating and
Online Aggregation and Advertising. In Proceedings of the 2013 Displaying Real-Time Data with WebSocket. IEEE Internet
Conference on Internet Measurement Conference (IMC ’13). Computing 16, 4 (2012), 45–53.
ACM, New York, NY, USA, 141–148. https://doi.org/10.1145/ [53] D. Skvorc, M. Horvat, and S. Srbljic. 2014. Performance evalua-
2504730.2504768 tion of Websocket protocol for implementation of full-duplex web
[35] Alejandro Gómez-Boix, Pierre Laperdrix, and Benoit Baudry. streams. In 2014 37th International Convention on Information
2018. Hiding in the Crowd: An Analysis of the Effectiveness of and Communication Technology, Electronics and Microelectron-
Browser Fingerprinting at Large Scale. In Proceedings of the 2018 ics (MIPRO). 1003–1008.
World Wide Web Conference (WWW ’18). International World [54] Peter Snyder, Lara Ansari, Cynthia Taylor, and Chris Kanich.
Wide Web Conferences Steering Committee, Republic and Canton 2016. Browser feature usage on the modern web. In 16th ACM
of Geneva, Switzerland, 309–318. Internet Measurement Conference (IMC).
[36] Luca Invernizzi and Paolo Milani Comparetti. 2012. Evilseed: A [55] Oleksii Starov, Phillipa Gill, and Nick Nikiforakis. 2016. Are
guided approach to finding malicious web pages. In 33rd IEEE you sure you want to contact us? quantifying the leakage of pii
Symposium on Security and Privacy (IEEE S&P). via website contact forms. Proceedings on Privacy Enhancing
[37] Sakshi Jain, Mobin Javed, and Vern Paxson. 2015. Towards Technologies 2016, 1 (2016), 20–33.
Mining Latent Client Identifiers from Network Traffic. PoPETs [56] Kurt Thomas, Juan A. Elices Crespo, Ryan Rasti, Jean-Michel
2016 (2015), 100–114. Picod, Cait Phillips, Marc-André Decoste, Chris Sharp, Fabio
[38] Dongseok Jang, Ranjit Jhala, Sorin Lerner, and Hovav Shacham. Tirelo, Ali Tofigh, Marc-Antoine Courteau, Lucas Ballard, Robert
2010. An Empirical Study of Privacy-violating Information Flows Shield, Nav Jagpal, Moheeb Abu Rajab, Panayiotis Mavrom-
in JavaScript Web Applications. In 17th ACM Conference on matis, Niels Provos, Elie Bursztein, and Damon McCoy. 2016.
Computer and Communications Security (CCS). Investigating Commercial Pay-Per-Install and the Distribution
[39] Amin Kharraz, , William Robertson, and Engin Kirda. 2018. of Unwanted Software. In 25th USENIX Security Symposium
Surveylance: Automatically Detecting Online Survey Scams. In (USENIX Security).
2018 IEEE Symposium on Security and Privacy (SP). [57] Phani Vadrevu, Jienan Liu, Bo Li, Babak Rahbarinia, Kyu Hyung
[40] Amin Kharraz, William Robertson, Davide Balzarotti, Leyla Bilge, Lee, and Roberto Perdisci. 2017. Enabling Reconstruction of At-
and Engin Kirda. 2015. Cutting the gordian knot: A look under tacks on Users via Efficient Browsing Snapshots. In 24th Network
the hood of ransomware attacks. In International Conference on and Distributed System Security Symposium (NDSS).
Detection of Intrusions and Malware, and Vulnerability Assess- [58] Thomas Vissers, Wouter Joosen, and Nick Nikiforakis. 2015. Park-
ment (RAID). ing Sensors: Analyzing and Detecting Parked Domains. In 21st
[41] Platon Kotzias, Leyla Bilge, and Juan Caballero. 2016. Measuring Network and Distributed System Security Symposium (NDSS).
PUP Prevalence and PUP Distribution through Pay-Per-Install [59] Xinyu Xing, Wei Meng, Byoungyoung Lee, Udi Weinsberg, Anmol
Services. In 25th USENIX Security Symposium (USENIX Secu- Sheth, Roberto Perdisci, and Wenke Lee. 2015. Understanding
rity). Malvertising Through Ad-Injecting Browser Extensions. In 24th
[42] Balachander Krishnamurthy, Delfina Malandrino, and Craig E. International Conference on the World Wide Web (WWW).
Wills. 2007. Measuring Privacy Loss and the Impact of Privacy [60] Apostolis Zarras, Alexandros Kapravelos, Gianluca Stringhini,
Protection in Web Browsing. In Proceedings of the 3rd Sympo- Thorsten Holz, Christopher Kruegel, and Giovanni Vigna. 2014.
sium on Usable Privacy and Security (SOUPS ’07). ACM, New The dark alleys of madison avenue: Understanding malicious
York, NY, USA, 52–63. advertisements. In 14th ACM Internet Measurement Conference
[43] Balachander Krishnamurthy, Konstantin Naryshkin, and Craig (IMC).
Wills. 2012. Privacy leakage vs. Protection measures: the growing
disconnect. (05 2012), 123–144.
