A Survey of HTTPS Traffic and Services Identification Approaches
All content following this page was uploaded by Wazen Shbair on 20 August 2020.
‡ INRIA Nancy Grand Est, 615 rue du Jardin Botanique, 54600 Villers-les-Nancy, France
Abstract—HTTPS is quickly rising alongside the need of Internet users to benefit from security and privacy when accessing the Web, and it is becoming the predominant application protocol on the Internet. This migration towards a secure Web using HTTPS comes with important challenges related to the management of HTTPS traffic to guarantee basic network properties such as security, QoS, reliability, etc. But encryption undermines the effectiveness of standard monitoring techniques and makes it difficult for ISPs and network administrators to properly identify and manage the services behind HTTPS traffic. This survey details the techniques used to monitor HTTPS traffic, from the most basic level of protocol identification (TLS, HTTPS), to the finest identification of precise services. We show that protocol identification is well mastered while more precise levels remain challenging despite recent advances. We also describe practical solutions that lead us to discuss the trade-off between security and privacy and the research directions to guarantee both of them.

I. INTRODUCTION

The global trend toward an encrypted Web quickly made HTTPS the dominant share of Web traffic [1]. According to the Cisco 2016 annual security report, HTTPS accounted for 57% of all Web traffic in October 2015 [2]. That number is in line with figures from ISPs: in Europe, French ISPs reported that the amount of encrypted traffic reached 50% of Internet traffic in 2015 [3] against only 5% back in 2012, while another ISP based in North America expects 65-70% of HTTPS traffic by the end of 2016 [4]. There are multiple reasons behind this migration towards HTTPS:

• The personalization of the Web and the concern for users' privacy and security [5].
• The development of cloud-based services (online storage, backup servers, etc.) that hold sensitive data [2].
• The wide spread of mobile applications, which generate inherently encrypted traffic [2].
• The rise of programs such as the Electronic Frontier Foundation's "Let's Encrypt" that facilitate the move toward HTTPS through a free, automated, and open SSL Certificate Authority [4].
• The diffusion of high-speed Internet and the development of hardware equipment natively handling encryption to reduce computation overhead [6].
• The rise of Over-The-Top (OTT) content providers that want to keep a maximum level of control over their traffic by obfuscating it from other actors [7].

On the one hand, moving towards a secure web using HTTPS allows users and content providers to benefit from better security and privacy. On the other hand, for ISPs and network administrators, encryption makes them "blind" to their network traffic and curtails their capacity to perform proper network management activities, such as traffic engineering, capacity planning, performance/failure monitoring, or caching [8]. For example, HTTPS prevents operators from applying QoS measures that give priority to critical services, or from using caching techniques to reduce network latency and congestion. From the security side, HTTPS makes security monitoring methods unable to understand the traffic and to identify anomalies or malicious activities that can be hidden in encrypted connections [9]. Bortolameotti et al. [10] have proposed indicators for malicious HTTPS connections that can be used in a data exfiltration scenario, where a compromised enterprise machine transfers sensitive information to an external server controlled by an attacker over an HTTPS channel to circumvent the security monitoring techniques. Therefore, there is a high demand for solutions able to analyse HTTPS traffic.

Previous surveys [11]–[14] are a valuable indicator to understand the evolution of the Internet and how the academic and industrial communities handle its traffic classification. Table I summarizes the main traffic classification goal of these published surveys. The most recent one, by Velan et al. [14], focuses on classification approaches for the identification of encryption protocols over the Internet. They show that in the past years, the classification of encrypted traffic into large protocol categories such as IPsec, SSL/TLS, SSH, BitTorrent, Skype, etc. has been widely investigated in the community. They conclude that simply identifying encrypted traffic is not enough and that classification should be improved to identify the underlying protocol. They also state that much work was conducted on SSH, while TLS should now be at the center of such studies given its importance. In this survey, we take these conclusions and go deeper by focusing on a single type of underlying protocol, HTTPS, while trying to go further in the identification process to identify precise services. We think that this focus is necessary because of the increased amount and complexity of web applications and services run within HTTPS traffic [1], [15], [16].

However, before performing such a deep traffic analysis, the encryption protocol and the top-level application need to be identified first [18]. In our survey, as illustrated in Figure
TABLE I: The published surveys in the field of traffic classification
TABLE III: TLS Record first five bytes contents

(a) TLS Content Types
Type | Hex
ChangeCipherSpec | 0x14
Alert | 0x15
Handshake | 0x16
Application | 0x17

(b) TLS version
Version | Hex
SSLv3 | 0x0300
TLS 1.0 | 0x0301
TLS 1.1 | 0x0302
TLS 1.2 | 0x0303

TABLE IV: Machine learning algorithms performance to identify TLS flows [25]
Metric | AdaBoost | C4.5 | Naive Bayes | RIPPER
Accuracy | 95.69% | 85.13% | 89.26% | 82.59%
FPR TLS | 0.04% | 0.14% | 0.11% | 0.17%
FPR Non-TLS | 0.02% | 0.01% | 0.01% | 0.01%
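The values in Table III can be turned into a simple header check. The following is a minimal sketch, assuming the standard TLS record layout (byte 0 is the content type, bytes 1-2 the version, bytes 3-4 the record length); the function names are hypothetical and the code is not taken from any of the surveyed tools.

```python
import struct

# Content types and versions as listed in Table III.
TLS_CONTENT_TYPES = {
    0x14: "ChangeCipherSpec",
    0x15: "Alert",
    0x16: "Handshake",
    0x17: "Application",
}
TLS_VERSIONS = {
    0x0300: "SSLv3",
    0x0301: "TLS 1.0",
    0x0302: "TLS 1.1",
    0x0303: "TLS 1.2",
}


def parse_tls_record_header(payload: bytes):
    """Return (content_type, version, length) if the first five bytes of
    the payload match a TLS record header, otherwise None."""
    if len(payload) < 5:
        return None
    # Network byte order: 1-byte type, 2-byte version, 2-byte length.
    content_type, version, length = struct.unpack("!BHH", payload[:5])
    if content_type not in TLS_CONTENT_TYPES or version not in TLS_VERSIONS:
        return None
    return TLS_CONTENT_TYPES[content_type], TLS_VERSIONS[version], length


def looks_like_tls(payload: bytes) -> bool:
    """Hypothetical helper: True when the payload starts like a TLS record."""
    return parse_tls_record_header(payload) is not None
```

A check of this kind only inspects one payload; a detector would still have to decide how many packets of a flow, and in which directions, must pass before the flow is labelled as TLS.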
Bernaille et al. [24] are motivated to identify TLS traffic as early as possible, so they use this format to detect Server-Hello packets. As illustrated in Figure 3, the Server-Hello packet is a part of the TLS handshake protocol and it sets the parameters of the TLS connection (e.g., version and encryption algorithm). Therefore, the presence of a valid Server-Hello is a strong indication that the flow is a TLS one.

Authors in [6] propose a "TLS Traffic Detector" to isolate pure TLS flows, which are then used in a deeper identification step to recognize the services behind them. The TLS detector compares the first 5 bytes of the packet payload (i.e., Bytes [0:4] as explained above) with the TLS record format to take a decision. It benefits from the idea that TLS packet payloads start with the same structure. So checking the first few bytes of the payload for any packet in the flow (not just the Server-Hello packet as in [24]) is sufficient to mark a flow as TLS.

Finsterbusch et al. [13] evaluate the OpenDPI¹ approach that has been used for traffic identification based on DPI. OpenDPI is able to classify TLS traffic with an accuracy of 100% by using the information in the TLS Record protocol to identify TLS flows in two phases [27]. In the first phase, it detects a packet that has one Record Protocol Structure in the payload and whose payload length is sufficient to read the content type and the TLS version. In the second phase, OpenDPI intercepts the next packet in the reverse direction; if it has one or more TLS Record protocol structures, then OpenDPI marks that packet as TLS and continues to check all packets in both directions.

Liu et al. [27] present a method to detect TLS traffic, named Double Record Protocol Structure Detection (DRPSD). Their aim is to avoid checking all packets to identify TLS traffic. DRPSD uses only 8 packets to recognize TLS traffic based on detecting the format of the TLS Record Protocol. Using a private dataset, the accuracy comparison shows that OpenDPI achieves 87.69% accuracy and the DRPSD method has an accuracy of 99.17%.

C. Machine learning based method

Machine learning is a type of artificial intelligence that gives computers the ability to learn without being explicitly programmed. The basic requirements are a training dataset (i.e., solved examples), statistical features, algorithms and evaluation techniques. The learning process is divided into three phases: Training, Classification and Validation. In the Training phase, the machine learning algorithms are trained on the statistical features to make predictions. The output of the Training phase is a model used in the Classification phase to identify unseen data. In the Validation phase, the results of classification are validated to measure the performance of the classification model [12].

The machine learning approach has an important advantage related to its applicability to encrypted traffic. Encryption motivates the usage of this approach to address the limitations of legacy methods (i.e., IP address, port-based, DPI) in identifying TLS traffic. Thus, flow statistics such as flow duration, mean packet size, and mean inter-arrival time are used as features to build a statistical profile for the TLS protocol [28]. The feasibility of the machine learning based method in the context of recognizing TLS traffic was evaluated in [25]. Four machine learning algorithms (AdaBoost, C4.5, RIPPER and Naive Bayes) have been evaluated with 22 statistical features (e.g., mean, standard deviation, max, min, etc.) of the packet size and inter-arrival time. The classification results are either Native-TLS, TLS-Tunneled, or Non-TLS. Table IV presents the accuracy and False Positive Rate (FPR) of the machine learning algorithms for identifying TLS flows. The AdaBoost algorithm achieves the highest accuracy, with 95.69% of flows classified correctly; Table IV also lists the corresponding False Positive Rates.

D. Summary

To summarize this section, we notice that the identification of TLS is mainly handled by (1) using the TLS record format; (2) employing a machine learning approach over the encrypted payload, as shown in Table II. Based on experimental results given in the related work, we are able to recognize TLS traffic among other types of encrypted traffic with a high level of accuracy. Hence, it is possible to detect and identify TLS traffic with a high level of confidence and this is no longer an open research topic. However, investigating the protocols run inside TLS is a totally different challenge. In the following section, we overview the usage of the TLS protocol for the HTTP application with the goal of predicting exactly which TLS flows hold HTTP traffic.

¹ https://github.com/thomasbhatia/OpenDPI
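The flow statistics mentioned in Section III-C (flow duration, packet size and inter-arrival time statistics) can be computed from packet timestamps and sizes alone, without touching the encrypted payload. The sketch below is illustrative only: the input layout, the function name and the feature names are assumptions, and it covers a small subset of features rather than the 22-feature set evaluated in [25].

```python
from statistics import mean, pstdev


def flow_features(packets):
    """Compute simple per-flow statistics.

    packets: list of (timestamp_seconds, size_bytes) tuples in arrival
    order (an assumed input layout, not taken from the surveyed papers).
    """
    if len(packets) < 2:
        raise ValueError("need at least two packets to compute inter-arrival times")
    times = [t for t, _ in packets]
    sizes = [s for _, s in packets]
    # Inter-arrival time: gap between consecutive packet timestamps.
    iats = [later - earlier for earlier, later in zip(times, times[1:])]
    return {
        "flow_duration": times[-1] - times[0],
        "n_packets": len(packets),
        "size_mean": mean(sizes),
        "size_std": pstdev(sizes),
        "size_min": min(sizes),
        "size_max": max(sizes),
        "iat_mean": mean(iats),
        "iat_std": pstdev(iats),
        "iat_min": min(iats),
        "iat_max": max(iats),
    }


# Toy flow with three packets (made-up values for illustration).
features = flow_features([(0.0, 100), (0.5, 200), (1.5, 300)])
```

A feature vector of this kind would then be fed, together with a label such as Native-TLS or Non-TLS, to whichever classifier is being trained.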
TABLE V: HTTPS identification methods
Paper | Features | Method | Accuracy | Publish Year
[5], [10], [16], [29]–[33] | Port 443 | Port based | 100%* | 2006-2016
Wright et al. [29] | Packet size, timing, direction | Behaviour based (KNN, HMM) | 100%, 88% | 2006
Haffner et al. [8] | Keywords | Machine learning | 99.2% | 2005
Bernaille et al. [24] | TLS-Format, First 5 packets size | Machine learning | 85% | 2007
Sun et al. [28] | TLS-Format, Packets (size, timing), flow duration, Packets number | Machine learning | 93.13% | 2010
* If user not malicious (i.e., alter port number)
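The accuracy and False Positive Rate (FPR) figures reported in Tables IV and V are usually derived from a confusion matrix. As a reminder, the sketch below uses the standard definitions; the surveyed papers may differ in detail (for example, per-class versus overall figures), and the counts in the example are made up for illustration.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all flows that were classified correctly."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("empty confusion matrix")
    return (tp + tn) / total


def false_positive_rate(fp: int, tn: int) -> float:
    """Fraction of negative flows wrongly labelled as positive."""
    negatives = fp + tn
    if negatives == 0:
        raise ValueError("no negative samples")
    return fp / negatives


# Hypothetical counts: 500 TLS flows and 500 non-TLS flows.
tp, fn = 480, 20   # TLS flows detected / missed
tn, fp = 495, 5    # non-TLS flows rejected / wrongly accepted
acc = accuracy(tp, tn, fp, fn)        # 0.975, i.e. 97.5%
fpr = false_positive_rate(fp, tn)     # 0.01, i.e. 1%
```

Note that a rate can be written either as a fraction (0.01) or as a percentage (1%), which matters when comparing figures across papers.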
IV. IDENTIFICATION OF HTTPS TRAFFIC

Among applications that use the TLS protocol, HTTP is the most used one [14]. Hence, in this section we consider the studies addressing the challenge of detecting HTTP traffic inside TLS (i.e., HTTPS).

A. Port-based method

Basically, we can use port number 443 to identify HTTPS traffic, but port 443 can also be used by malicious applications to hide their activities behind the HTTPS port and give the impression that Web browsing traffic is running [25]. Alternatively, some HTTPS Web servers can be configured to use a different port number [24]. Many approaches have been proposed to cope with the usage of non-standard ports with HTTPS, ranging from behaviour based methods to machine learning ones, as described below. In spite of that, port 443 is widely used in a large body of literature [5], [10], [16], [29]–[33] to collect and build HTTPS datasets for further experiments.

B. Behaviour based method

Wright et al. [29] demonstrate how application behaviour can still be used as a signature to identify the application, even if its traffic is transmitted via HTTPS flows. They use the fact that some information remains intact after encryption, such as packet size, timing, and direction, to identify the common application protocols by using k-Nearest Neighbor (kNN) and Hidden Markov Model (HMM). The kNN detects HTTPS flows with 100% accuracy and the HMM achieves 88% accuracy.

C. Machine learning based method

Haffner et al. [8] propose extracting statistical signatures from the packet payload. In the case of unencrypted traffic they extract ASCII words from the data stream as features. But for HTTPS they extract words from the handshake phase, since it is unencrypted as shown in Figure 3. The existence and the location of such words in the first 64 bytes of a reassembled TCP data stream are encoded in a binary vector and used as input for different machine learning algorithms (Naive Bayes, AdaBoost and Maximum Entropy). The evaluation over a dataset from an ISP shows that AdaBoost identifies HTTPS traffic with 99.2% accuracy.

Authors in [24], [28] share the concept of identifying HTTPS in two steps: in the first step TLS traffic is detected based on the protocol format as discussed in Section III-B, while in the second step HTTP traffic in the TLS channel is recognized using a machine learning method. In [34], the authors use the size of the first five packets of a TCP connection to identify the HTTPS application with an 81.8% accuracy rate. The performance of their classifier has been improved (up to 85%) in [24] by adding a preliminary phase, where they first detect TLS traffic based on the protocol format and then identify the HTTP traffic within TLS. Sun et al. [28] propose a hybrid solution, which first detects the TLS protocol by inspecting the TLS protocol format, then applies a machine learning algorithm to determine the application protocols run within the TLS connection. The Naive Bayes algorithm is used with 8 statistical features: mean, maximum and minimum of packet length; mean, maximum and minimum of inter-arrival time; flow duration; and number of packets. Using a private dataset, results show the ability to recognize over 99% of TLS traffic and to detect HTTPS traffic with 93.13% accuracy.

D. Summary

The related work on the identification of the HTTP application within the TLS protocol, as illustrated in Table V, uses different methods with an acceptable level of accuracy, but each study relies on its own dataset, which prevents a strong and fair comparison of their respective identification accuracy. The situation will remain ambiguous in the absence of a common reference dataset. That leads to another research question about reproducible research and dataset construction [14]. We should also question the representativity of the datasets, since HTTPS nowadays is a multi-purpose protocol (i.e., it can deliver video, music, games, etc.). The next section delves further into HTTPS application traffic itself, to identify the work and the perspectives to name the specific web applications and services that generate HTTPS traffic.

V. IDENTIFICATION OF SERVICES INSIDE HTTPS

This section explores the methods that precisely identify the real source of HTTPS traffic (i.e., Web services). The increased complexity of web applications provides the ability to deliver very different kinds of services such as email, games, online storage, content providers, maps, social media plug-ins, etc., all transmitted via HTTPS flows [6], [16]. The identification of HTTPS services is a serious challenge, since most of the legacy techniques like DPI lose their power when facing encryption.

A. Website fingerprinting method

Identifying the accessed websites over secure connections is known as Website Fingerprinting. This method is presented in most of the relevant work [35]–[38]. Cheng et al. [39] propose one of the earliest methods to identify the pages visited by users over a TLS connection by inspecting the TCP/IP header, which contains the size of the payload and other information. Their technique is based on calculating the size of a page downloaded to the browser, which is often unique among all files in a given site. Moreover, as HTML files cannot be transmitted concurrently with other files, they remain distinguishable. At that time (i.e., 1998 and before) browsers did not use HTTP pipelining, which now hides the size and order of objects. Miller et al. [23] propose a method to identify the accessed page among 500 pages hosted at the same HTTPS website, based on clustering techniques that identify patterns in traffic. Their results show the possibility to identify individual pages from the same website accessed over HTTPS with 89% accuracy. They successfully identify the home page or internal pages of a website, but at the cost of specific learning at the level of a single website's pages, while more effort is needed to identify embedded services in web pages.

B. Behaviour based method

In [40], the authors develop a passive approach for webmail traffic identification in HTTPS in order to understand the shifting usage trend and mail traffic evolution. Three novel features are proposed: (1) service proximity: the presence of a POP, IMAP or SMTP server within a domain is a strong indication that a mail server exists; (2) activity profiles: mail system clients access their e-mail frequently in a scheduled manner, so it is possible to build daily and weekly profiles of such behaviour; (3) periodicity: the usage of application timers, as in AJAX technology, to periodically (e.g., every 5 minutes) check for new messages creates a high-frequency time pattern and gives a strong indication that an email service is running. These features are used with the Support Vector Machine (SVM) algorithm to differentiate between mail and non-mail services within HTTPS flows. The evaluation over a dataset from an ISP shows the ability to identify HTTPS mail servers with 93.2% accuracy.

Chen et al. [41] use the traffic pattern of the AutoComplete function, which populates a list of suggested content with each letter a user enters, as in the Google and Yahoo search engines. Despite HTTPS, this small amount of input data causes state transitions in a web application, which can be used to enumerate all possible inputs to match the triggered traffic pattern. Based on real scenarios, they show how such a method can be applied to leak sensitive information (e.g., search keywords) from top online web applications. V. Berg [42] develops a tool that uses encrypted traffic patterns to identify user activities over Google Maps, which already runs over HTTPS. The tool collects satellite map tiles and builds a database of the image sizes correlated with their (x,y,z)-triplet coordinates. To identify the accessed region over Google Maps, the tool maps the size of images in an HTTPS flow to (x,y,z)-triplets and then clusters the results into a specific region. As a proof of concept, the tool's dataset has been configured with city profiles, where it can correctly detect the transition between such cities.

C. SSL certificate based method

SSL certificates, which are originally used to verify the identities of servers and clients, are also used to recognize the accessed service over HTTPS flows. Kim et al. [6] use the certificate public information to build an SSL/TLS Identification Method (SSIM) to name the services behind HTTPS traffic. The proposed method consists of three modules: (1) the TLS Traffic Detector module isolates pure TLS traffic before beginning the service identification; (2) the service signature module extracts Certificate authority information, Server IP and Session ID from an SSL certificate; (3) the Session ID-IP-based Service Identifier module recognizes flows left unidentified by the previous modules by finding the relation between server IP and session ID. Based on their experiments, they can classify 95% of TLS traffic belonging to Google, Facebook and Kakaotalk with about 90% accuracy for the corresponding services. Authors in [10] use the information in the SSL certificate, such as certificate validity, release dates and the content of the subject alternative name, as features to detect suspicious TLS traffic.

D. Protocol-Structure based method

One widely used technique to identify HTTPS services is based on inspecting a field of the TLS protocol header, namely the Server Name Indication (SNI), which has been recently implemented in many firewall and content filtering solutions. The SNI is mainly used to allow a client to specify the server hostname when the TLS negotiation starts, as shown in Figure 3. The idea is to use the SNI to filter HTTPS traffic, since it indicates the name of the remote service a client intends to access. SNI gives the identification system the power to abort access to prohibited services early.

Bortolameotti et al. [10] used both SNI and SSL certificate information to detect malicious TLS connections by examining (1) the Levenshtein distance between the SNI and the top 100 most visited websites; (2) the structure of the server-name string in the SNI; (3) the format of the server-name string, which is a DNS hostname format. In [43], the authors evaluate the reliability of identifying HTTPS services based on SNI. They found two inherent weaknesses, regarding (1) backward compatibility and (2) multiple services using a single certificate, which can be used by a client to cheat the identification system. As a proof of concept, they develop a web browser plug-in to demonstrate how these weaknesses can be practically used to bypass firewalls relying on SNI to identify HTTPS traffic.

E. Machine learning based method

Recent work [16] argues that page-level identification (i.e., Website Fingerprinting) is too fine-grained, especially in the case of identifying content that is dynamically included in other web pages (video, maps, etc.). Thus, they propose a method for identifying HTTPS traffic at service level thanks to a multi-level framework, without relying on specific header fields, such as the SNI, which can be easily altered, or on the TLS certificate information. The proposed framework uses machine learning algorithms (Random Forest and C4.5) with a statistical profile library of intended HTTPS services. The profile contains statistical measurements (mean, variance, max, min, etc.) over packet sizes and inter-arrival times of TLS flow packets. For evaluation, real traffic collected from their university network (i.e., a private dataset) has been used. Their multi-level framework can identify HTTPS web services with 93% accuracy.

TABLE VI: Services identification inside HTTPS
Paper | Features | Method | Level of Identification | Accuracy | Publish Year
Cheng et al. [39] | Packet size and order | Website Fingerprinting | Internal pages | 96% | 1998
Miller et al. [23] | Packets size and direction | Website Fingerprinting | Internal pages | 89% | 2014
Schatzmann et al. [40] | Service proximity, activity profiles, session duration, and periodicity | Behaviour based | Email services | 94.8% | 2010
Chen et al. [41] | AutoComplete function traffic | Behaviour based | Search keywords | - | 2010
Vincent Berg [42] | Images size | Behaviour based | Google maps activities | - | 2011
Kim et al. [6] | Certificate information | SSL certificate | Services | 90% | 2015
Bortolameotti et al. [10] | Certificate information | SSL certificate | Services | 100%* | 2016
Shbair et al. [43] | SNI | Protocol Structure | Services | 100%* | 2015
Bortolameotti et al. [10] | SNI | Protocol Structure | Malicious connections | 100%* | 2015
Shbair et al. [16] | SNI, packets size and timing profile | Machine learning | Services | 93% | 2016
* If user not malicious (i.e., alter SNI)

F. Summary

Many recent approaches, as summarized in Table VI, intend to identify HTTPS services based on the plain-text information that appears in the TLS handshake phase or based on the statistical signature of HTTPS web services. However, the
reliability of handshake information for identification still needs improvement, as discussed for SNI and SSL certificates. The reliability of machine learning methods, in turn, is challenged by the increased complexity of web applications, which can easily be extended with new functionalities that may change the application behaviour and the statistical signature. This complexity creates an overhead for machine learning based identification methods, which must re-evaluate their statistical features and re-train their classification models regularly to remain effective.

The related work that can precisely name the service behind an HTTPS flow mainly depends on offline/passive analysis, where full HTTPS flows are available for training and classification. However, offline analysis is less sensitive to training time, computation overhead and classification error over time. That opens a research question about the real-time identification of HTTPS services. Real-time detection would help ISPs and administrators to manage HTTPS services at the right time and perform the proper network management activities. A side question is how HTTPS traffic is currently monitored and filtered. The answer to this question is presented in the next section.

VI. PRACTICAL HTTPS MONITORING AND FILTERING

Once HTTPS traffic is identified, it can be monitored and filtered. Monitoring is defined as recording all activities on a network (URLs visited, session duration, bandwidth, etc.) and generating a report that can be used for network management. Some current HTTPS monitoring approaches use techniques similar to those in research papers. For example, some monitoring approaches rely on the information exchanged during the handshake, such as SSL certificates [30], [31], the handshake interaction sequence [44] and, most recently, SNI-based monitoring [43], which has been implemented in solutions such as Clavister Web Content Filtering and Sophos Unified Threat Management (UTM) to monitor accessed websites over HTTPS flows [43]. HTTPS filtering is intended to restrict access by intercepting the data transmitted between a client and a server [45]. To deploy HTTPS filtering, two approaches have been presented: an HTTPS proxy server and acquiring TLS encryption keys.

A. HTTPS Proxy Server

A proxy server is a server that acts as a man in the middle to process and forward client requests towards servers. However, when HTTPS is used the proxy server cannot directly access data transmitted via HTTPS, so it pretends to be the intended remote server and then establishes a secure connection to the real server. As shown in Figure 4, when a client connects to the remote server via an HTTPS proxy, the client connects to the proxy server, which plays the role of the destination server by providing its own SSL certificate. Then the proxy establishes another secure connection with the real remote server. With this method all encrypted web traffic is available to the proxy in clear text, at the expense of users' privacy [43].

Fig. 4: HTTPS Proxy Server

Existing commercial solutions such as Forefront Threat Management Gateway (TMG) 2010 use the HTTPS proxy method for HTTPS inspection, acting as a trusted man-in-the-middle instead of just tunnelling HTTPS connections blindly [46]. The FireEye product also uses the proxy model to provide visibility into untrusted TLS traffic. The product is designed to intercept and forward all desired network traffic for temporarily decrypting, examining and then re-encrypting TLS sessions. FireEye argues that this method responds to the growing number of cyber criminals that use TLS as a cover to get inside organizations and persist undetected [47].

B. Acquiring TLS encryption key

There are at least two methods for acquiring the decryption keys: the Key-Recovery mechanism and the cracking of encryption algorithms. In [48] the author describes the Key-Recovery mechanism, or "key escrow", where all encryption keys are stored with a trusted third party, such as a government or designated private entities. The third party has the right to access keys for authorized law enforcement purposes. As a result, a government may limit access to HTTPS websites that refuse to share their TLS keys with the escrow system [45]. Cracking encryption algorithms differs from the preceding method, as it needs high computation power to be able to break the encryption. One method for cracking is to exploit a flaw in the mathematical algorithm used to encrypt data, such as the factorization underlying widely used public-key cryptosystems. For instance, RSA 768-bit can be broken with a state-of-the-art algorithm and high computation power [49].
Adrian et al. [50] evaluated the security of Diffie-Hellman key not be sufficient to legitimate the exposition of employees
exchange, where they found that 82% of vulnerable servers use private data with HTTPS proxy.
a single 512-bit group, which makes it possible to compromise
connections of 7% of Alexa Top Million HTTPS sites. In conclusion, while challenging, we think that the identifi-
cation of HTTPS services without decryption is the way to go
to provide a viable compromise between the needed network
VII. C ONCLUSION knowledge to ensure proper management and users’ privacy.
The research community should focus on proposing new
HTTPS is quickly becoming the predominant application identification techniques offering both security and privacy.
protocol on the Internet. It answers to the need of Internet
users to benefit from security and privacy when accessing
R EFERENCES
the Web. But the increasing amount of HTTPS traffic comes
with challenges related to its management to guarantee basic [1] D. Naylor, A. Finamore, I. Leontiadis, Y. Grunenberger, M. Mellia,
network properties such as security, QoS, reliability, etc. The M. Munafò, K. Papagiannaki, and P. Steenkiste, “The cost of the S in
HTTPS,” in Proceedings of the 10th ACM International on Conference
encryption undermines the effectiveness of standard monitor- on emerging Networking Experiments and Technologies. ACM, 2014,
ing techniques and makes it difficult for ISPs and network pp. 133–140.
administrators to properly identify services behind HTTPS traffic and to properly apply network management operations.
This survey provides a focused view of HTTPS traffic identification methods, from the identification of the lower-level TLS protocol to the precise identification of HTTPS services. We have found that efficient methods exploiting the standard structures of the TLS protocol can identify TLS traffic among other types with a high level of accuracy, so this is no longer an open research topic. The identification of HTTPS uses different methods with an acceptable detection rate, but the representativeness and diversity of the datasets used are a confounding factor, which prevents any strong and direct comparison of their respective identification accuracy. This leads to important research questions about reproducible research and the best practices for HTTPS dataset construction and dissemination.
Finally, based on the most recent works, we show that some efforts have been made to discriminate between services running within HTTPS traffic. While the results are encouraging, most solutions still suffer from significant drawbacks, ranging from being specialized to the identification of a single application (webmail, maps, etc.) to the inability to operate in real time.
In the previous section of this survey, we noted that there is a very efficient method to monitor and control HTTPS traffic based on an HTTPS proxy. This easy solution has many supporters, from national security agencies [51] to security companies [47], and is even discussed for future Internet technical standards [52]. However, such a method cannot be treated lightly, as it denies the right to privacy for the sake of traffic inspection and therefore creates a paradox between the need for security and users’ privacy. The answer to this conflict is not easy, as both sides may have valuable arguments. Since the Snowden affair, this issue has been known as the “Dilemmas of the Internet age” and has been discussed not only in the academic community but also in society at large and in the human rights space. The author of [53] claims that large-scale monitoring is ineffective, as it can only identify trivial crimes but cannot recognize professional criminals or persons well trained in working under surveillance. Another important question is how to guarantee that an administration in power will never abuse the intercepted information to intimidate its opponents. The author concludes that while online monitoring may fix some problems, it can create even more serious ones. In another domain, even if enterprise owners may have good arguments for monitoring and filtering access to their network for security, productivity or responsibility reasons [54], it may
[2] “Cisco 2016 annual security report,” http://www.cisco.com/c/dam/assets/offers/pdfs/cisco-asr-2016.pdf, 2016, available online, Accessed: 30/05/2016.
[3] “Report 2015-0832 from the French regulatory authority for telecommunications (ARCEP),” [In French], Accessed: 19/04/2015. [Online]. Available: http://www.arcep.fr/uploads/tx_gsavis/15-0832.pdf
[4] “Global internet phenomena spotlight: Encrypted internet traffic,” accessed: 11/05/2016. [Online]. Available: https://www.sandvine.com/downloads/general/global-internet-phenomena/2015/encry
[5] A. Hilts, “Scandinista! analyzing TLS handshake scans and HTTPS browsing by website category,” in Workshop on Surveillance & Technology. PETS, 2015.
[6] S.-M. Kim, Y.-H. Goo, M.-S. Kim, S.-G. Choi, and M.-J. Choi, “A method for service identification of SSL/TLS encrypted traffic with the relation of session ID and Server IP,” in Network Operations and Management Symposium (APNOMS), 2015 17th Asia-Pacific. IEEE, 2015, pp. 487–490.
[7] J. Erman, A. Gerber, K. K. Ramakrishnan, S. Sen, and O. Spatscheck, “Over the top video: The gorilla in cellular networks,” in Proceedings of the 2011 ACM SIGCOMM Conference on Internet Measurement Conference, ser. IMC ’11. New York, NY, USA: ACM, 2011, pp. 127–136. [Online]. Available: http://doi.acm.org/10.1145/2068816.2068829
[8] P. Haffner, S. Sen, O. Spatscheck, and D. Wang, “ACAS: automated construction of application signatures,” in Proceedings of the 2005 ACM SIGCOMM workshop on Mining network data. ACM, 2005, pp. 197–202.
[9] M. Husák, M. Čermák, T. Jirsík, P. Čeleda et al., “HTTPS traffic analysis and client identification using passive SSL/TLS fingerprinting,” 2016.
[10] R. Bortolameotti, A. Peter, M. H. Everts, and D. Bolzoni, “Indicators of malicious SSL connections,” in Network and System Security. Springer, 2015, pp. 162–175.
[11] A. Callado, C. Kamienski, G. Szabó, B. P. Gerö, J. Kelner, S. Fernandes, and D. Sadok, “A survey on internet traffic identification,” Communications Surveys & Tutorials, IEEE, vol. 11, no. 3, pp. 37–52, 2009.
[12] S. Valenti, D. Rossi, A. Dainotti, A. Pescapè, A. Finamore, and M. Mellia, “Reviewing traffic classification,” in Data Traffic Monitoring and Analysis. Springer, 2013, pp. 123–147.
[13] M. Finsterbusch, C. Richter, E. Rocha, J.-A. Muller, and K. Hanssgen, “A survey of payload-based traffic classification approaches,” Communications Surveys & Tutorials, IEEE, vol. 16, no. 2, pp. 1135–1156, 2014.
[14] P. Velan, M. Čermák, P. Čeleda, and M. Drašar, “A survey of methods for encrypted traffic classification and analysis,” International Journal of Network Management, vol. 25, no. 5, pp. 355–374, 2015.
[15] M. Husák, M. Cermak, T. Jirsik, and P. Celeda, “Network-based HTTPS client identification using SSL/TLS fingerprinting,” in Availability, Reliability and Security (ARES), 2015 10th International Conference on. IEEE, 2015, pp. 389–396.
[16] W. Shbair, T. Cholez, J. Francois, and I. Chrisment, “A multi-level framework to identify HTTPS services,” in IEEE/IFIP Network Operations and Management Symposium (NOMS). IEEE/IFIP, 2016.
[17] M. Zhang, W. John, K. Claffy, and N. Brownlee, “State of the art in traffic classification: A research review,” in PAM Student Workshop, 2009, pp. 3–4.
[18] Z. Cao, S. Cao, G. Xiong, and L. Guo, “Progress in study of encrypted traffic classification,” in Trustworthy Computing and Services. Springer, 2013, pp. 78–86.
[19] R. Oppliger, SSL and TLS: Theory and Practice. Artech House, Inc., 2009.
[20] T. Dierks and E. Rescorla, “RFC 5246: The transport layer security (TLS) protocol version 1.2,” 2008. [Online]. Available: https://tools.ietf.org/html/rfc5246
[21] E. Rescorla, “The transport layer security (TLS) protocol version 1.3,” Internet-Draft, IETF, Tech. Rep., October 2015. [Online]. Available: https://tools.ietf.org/html/draft-ietf-tls-tls13-10
[22] “TLS handshake protocol,” https://msdn.microsoft.com/en-us/library/windows/desktop/aa380513(v=vs.85).aspx, Accessed: 31/05/2016.
[23] B. Miller, L. Huang, A. D. Joseph, and J. D. Tygar, “I know why you went to the clinic: Risks and realization of HTTPS traffic analysis,” in Privacy Enhancing Technologies. Springer, 2014, pp. 143–163.
[24] L. Bernaille and R. Teixeira, “Early recognition of encrypted applications,” in Passive and Active Network Measurement. Springer, 2007, pp. 165–175.
[25] C. McCarthy et al., “An investigation on identifying SSL traffic,” in Computational Intelligence for Security and Defense Applications (CISDA), 2011 IEEE Symposium on. IEEE, 2011, pp. 115–122.
[26] G.-L. Sun, Y. Xue, Y. Dong, D. Wang, and C. Li, “An novel hybrid method for effectively classifying encrypted traffic,” in Global Telecommunications Conference (GLOBECOM 2010), 2010 IEEE. IEEE, 2010, pp. 1–5.
[27] C. Liu, G. Sun, and Y. Xue, “DRPSD: An novel method of identifying SSL/TLS traffic,” in World Automation Congress (WAC), 2012. IEEE, 2012, pp. 415–419.
[28] G.-L. Sun, Y. Xue, Y. Dong, D. Wang, and C. Li, “An novel hybrid method for effectively classifying encrypted traffic,” in Global Telecommunications Conference (GLOBECOM 2010), 2010 IEEE. IEEE, 2010, pp. 1–5.
[29] C. V. Wright, F. Monrose, and G. M. Masson, “On inferring application protocol behaviors in encrypted network traffic,” The Journal of Machine Learning Research, vol. 7, pp. 2745–2769, 2006.
[30] R. Holz, L. Braun, N. Kammenhuber, and G. Carle, “The SSL landscape: a thorough analysis of the X.509 PKI using active and passive measurements,” in Proceedings of the 2011 ACM SIGCOMM conference on Internet measurement conference. ACM, 2011, pp. 427–444.
[31] Z. Durumeric, J. Kasten, M. Bailey, and J. A. Halderman, “Analysis of the HTTPS certificate ecosystem,” in Proceedings of the 2013 conference on Internet measurement conference. ACM, 2013, pp. 291–304.
[32] P. Velan, J. Medková, T. Jirsík, and P. Čeleda, “Network traffic characterisation using flow-based statistics,” in IEEE/IFIP Network Operations and Management Symposium, 2016.
[33] M. Husák, M. Čermák, T. Jirsík, and P. Čeleda, “HTTPS traffic analysis and client identification using passive SSL/TLS fingerprinting,” EURASIP J. Inf. Secur., vol. 2016, no. 1, pp. 30:1–30:14, Dec. 2016. [Online]. Available: http://dx.doi.org/10.1186/s13635-016-0030-7
[34] L. Bernaille, R. Teixeira, I. Akodkenou, A. Soule, and K. Salamatian, “Traffic classification on the fly,” ACM SIGCOMM Computer Communication Review, vol. 36, no. 2, pp. 23–26, 2006.
[35] G. D. Bissias, M. Liberatore, D. Jensen, and B. N. Levine, “Privacy vulnerabilities in encrypted HTTP streams,” Lecture notes in computer science, vol. 3856, p. 1, 2006.
[36] D. Herrmann, R. Wendolsky, and H. Federrath, “Website fingerprinting: attacking popular privacy enhancing technologies with the multinomial naïve-bayes classifier,” in Proceedings of the 2009 ACM workshop on Cloud computing security. ACM, 2009, pp. 31–42.
[37] L. Lu, E.-C. Chang, and M. C. Chan, “Website fingerprinting and identification using ordered feature sequences,” in Computer Security–ESORICS 2010. Springer, 2010, pp. 199–214.
[38] A. Pironti, P.-Y. Strub, and K. Bhargavan, “Identifying website users by TLS traffic analysis: New attacks and effective countermeasures,” INRIA, Tech. Rep., 2012.
[39] H. Cheng and R. Avnur, “Traffic analysis of SSL encrypted web browsing,” URL citeseer.ist.psu.edu/656522.html, 1998.
[40] D. Schatzmann, W. Mühlbauer, T. Spyropoulos, and X. Dimitropoulos, “Digging into HTTPS: flow-based classification of webmail traffic,” in Proceedings of the 10th ACM SIGCOMM conference on Internet measurement. ACM, 2010, pp. 322–327.
[41] S. Chen, R. Wang, X. Wang, and K. Zhang, “Side-channel leaks in web applications: A reality today, a challenge tomorrow,” in Security and Privacy (SP), 2010 IEEE Symposium on. IEEE, 2010, pp. 191–206.
[42] V. Berg, “SSL traffic analysis attacks,” http://2011.ruxcon.org.au/2011-talks/ssl-traffic-analysis-attacks/, 2011, Accessed: 31/05/2016.
[43] W. M. Shbair, T. Cholez, A. Goichot, and I. Chrisment, “Efficiently bypassing SNI-based HTTPS filtering,” in Integrated Network Management (IM), 2015 IFIP/IEEE International Symposium on. IEEE, 2015, pp. 990–995.
[44] M. Korczynski and A. Duda, “Markov chain fingerprinting to classify encrypted traffic,” in INFOCOM, 2014 Proceedings IEEE. IEEE, 2014, pp. 781–789.
[45] B. D. McGinnes, “Cleaning a HTTPS feed,” http://www.adversary.org/files/CleanFeedHTTPS-01.pdf, 2010, Accessed: 10/03/2016.
[46] “Configuring HTTPS inspection with forefront threat management gateway (TMG),” http://www.isaserver.org/articles-tutorials/configuration-general/Configuring-HTTPS-Inspection-Forefront-Threat-Management-Gateway-TMG-2010.html, Accessed: 22/05/2016.
[47] “FireEye SSL intercept appliance, expose attacks hiding in SSL traffic,” https://www.fireeye.com/content/dam/fireeye-www/global/en/products/pdfs/ds-ssl-intercept.pdf, datasheet, Accessed: 22/05/2016.
[48] H. Abelson, R. N. Anderson, S. M. Bellovin, J. Benaloh, M. Blaze, W. Diffie, J. Gilmore, P. G. Neumann, R. L. Rivest, J. I. Schiller et al., “The risks of key recovery, key escrow, and trusted third-party encryption,” 1997.
[49] T. Kleinjung, K. Aoki, J. Franke, A. K. Lenstra, E. Thomé, J. W. Bos, P. Gaudry, A. Kruppa, P. L. Montgomery, D. A. Osvik et al., “Factorization of a 768-bit RSA modulus,” in Advances in Cryptology–CRYPTO 2010. Springer, 2010, pp. 333–350.
[50] D. Adrian, K. Bhargavan, Z. Durumeric, P. Gaudry, M. Green, J. A. Halderman, N. Heninger, D. Springall, E. Thomé, L. Valenta et al., “Imperfect forward secrecy: How Diffie-Hellman fails in practice,” in Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security. ACM, 2015, pp. 5–17.
[51] French National Agency for the Security of Information Systems (ANSSI), “Security recommendation regarding the analysis of https traffic.” [Online]. Available: http://www.ssi.gouv.fr/uploads/IMG/pdf/NP_TLS_NoteTech.pdf
[52] IETF HTTP Working Group, “Explicit trusted proxy in http/2.0.” [Online]. Available: https://tools.ietf.org/html/draft-loreto-httpbis-trusted-proxy20-01
[53] M. A. Caloyannides, “Is privacy really constraining security or is this a red herring?” Security & Privacy, IEEE, vol. 2, no. 4, pp. 86–87, 2004.
[54] “Internet filtering and monitoring in your business. available online,” http://www.currentware.com/whitepapers/Internet-Filtering-and-Monitoring-in-your-BusinessWhite-Paper.pdf, 2015.