Smart Devices Information Extraction in Home Wi-Fi Networks
Pan Wang, Feng Ye, Xuejiao Chen
DOI: 10.1002/itl2.42
LETTER
KEYWORDS
deep packet inspection, information extraction, smart home, UDF, user agent
1 INTRODUCTION
In home networks, users connect various smart devices, such as mobile phones, tablet computers (tablets hereafter), smart
televisions (TVs), and so on, to the Internet via home Wi-Fi. With the development of smart homes, even more smart devices
will access the Internet via home Wi-Fi. To provide service subscribers with better network quality-of-service (QoS)
guarantees and business experience, operators and service providers often need to collect a large amount of measurement data,
including security and user behavioral information. Further data processing is then performed, for example, in a cloud, to
identify network management bottlenecks and improve the quality-of-experience (QoE) of users. Extracting smart device
information is one of the crucial tasks in this data processing procedure.
In this paper, we propose a scheme to extract information about smart devices by parsing data traffic from home Wi-Fi networks.
Although the focus is on home Wi-Fi networks, the proposed scheme can be extended to other types of networks. Operators of
traditional cellular networks can obtain most device information by parsing the signaling traffic. For example, an International
Mobile Equipment Identity is normally embedded in the signaling traffic.1 However, such information is not carried in Wi-Fi
data traffic by default. To obtain such information about smart devices in home Wi-Fi networks, one possible method is to
collect and parse user agent (UA) strings in hypertext transfer protocol (HTTP) messages that are exchanged between a smart
device and the cloud.2 An example of a UA is shown in Figure 1. As shown, information such as browser type, operating
system, character set, and so on, can be extracted through the parsing process.
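As a rough illustration of what parsing a UA string involves, the following Python sketch splits a UA into its first parenthesized platform comment and its product/version tokens. The sample string and the function name `split_user_agent` are assumptions made for this example; they are not taken from Figure 1 or from the authors' implementation.

```python
import re

# Illustrative UA string (an assumption, not the one shown in Figure 1).
SAMPLE_UA = (
    "Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) "
    "AppleWebKit/604.1.38 (KHTML, like Gecko) Version/11.0 "
    "Mobile/15A372 Safari/604.1"
)


def split_user_agent(ua):
    """Split a UA string into platform fields and product/version tokens."""
    # The first parenthesized comment usually carries device and OS hints.
    comment = re.search(r"\(([^)]*)\)", ua)
    platform = [f.strip() for f in comment.group(1).split(";")] if comment else []
    # Tokens such as "Mozilla/5.0" or "Safari/604.1".
    products = dict(re.findall(r"([A-Za-z][\w.-]*)/([\w.]+)", ua))
    return {"platform": platform, "products": products}


info = split_user_agent(SAMPLE_UA)
print(info["platform"])
print(info["products"])
```

In this sample the platform fields carry the device and operating system hints, while the product tokens identify the browser engine and version.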
Pan Wang and Xuejiao Chen are visiting scholars at the University of Dayton.
Abbreviations: CPU, central processing unit; DPI, deep packet inspection; HTTP, hypertext transfer protocol; QoE, quality-of-experience; QoS,
quality-of-service; RAM, random access memory; SDIR, smart devices information record; TV, television; UA, user agent; UDF, user-defined function; URL,
Uniform Resource Locator; WURFL, Wireless Universal Resource File.
Internet Technology Letters. 2018;1:e42. wileyonlinelibrary.com/journal/itl2 Copyright © 2018 John Wiley & Sons, Ltd.
https://doi.org/10.1002/itl2.42
FIGURE 1 An example of a UA string (labeled parts: User-Agent, OS, device description)
Although 2 public standards are available to format a device label, most manufacturers ignore them, which makes
information extraction even more difficult. There are 2 existing solutions to this issue. One is the Wireless Universal
Resource File (WURFL),3 an identification method based on unique contents of a UA, for example, the device information.
By matching the uniquely identified content with a predefined profile, device information can be extracted from a web
server. However, due to the advent of new devices and frequent upgrades of existing devices, WURFL cannot guarantee high
accuracy in the long run. The other solution is based on UA string matching. To apply UA string matching, a database is
preset with matching rules that include UA strings and device models. Device information is extracted by mapping
a captured string to the preset values in the database. While simple to implement, this method is inefficient, especially
with massive amounts of user data.4 Our proposed scheme is designed to overcome the drawbacks of the 2 existing methods. In
particular, we use Hadoop in the scheme for fast and efficient processing of massive amounts of data.5 Moreover, a Hive
UDF is applied for data preprocessing in order to unify data formats.6 The rest of this paper is organized as follows. Section 2
presents the framework of the smart devices information extraction scheme. Section 3 presents the data processing technology
based on Hadoop and describes how to realize the UDF. Section 4 shows the experimental results. Section 5 concludes
this work.
2 FRAMEWORK OF SMART DEVICES INFORMATION EXTRACTION
An overview of the proposed information extraction scheme is shown in Figure 2. Overall, the scheme includes 4 parts, that is,
traffic collection, data preprocessing, smart devices information extraction, and smart devices information record (SDIR).
• Traffic collection gathers raw data traffic from the targeted network. Traffic collection points can be deployed in different
locations of a communication network, such as the core network, aggregation layer, remote server-based access nodes, and
the smart home network gateway.(7,8) A fiber splitter is often used for data collection.
• Data preprocessing cleanses and filters the collected traffic data, which can be noisy. After data preprocessing, the core
function (ie, based on DPI9) only processes a small portion of the collected data, which greatly reduces the computational
overhead of the scheme. Most mobile applications (also denoted as Mobile Apps) use HTTP to communicate with their
servers.10 In this paper, we use HTTP to demonstrate the proposed information extraction scheme. The scheme can be
extended, for example, to header/message fields of other proprietary protocols for better compatibility. The filtering
policies used in the proposed scheme are designed based on extensive experiments so that nearly 98% of the raw data can be
cleansed and filtered accurately. The policies include user type, location, communication protocol, and so on.
• Smart devices information extraction is the core function of the proposed scheme. This function matches a captured
UA string against a predefined library that is maintained and updated frequently. Details of the core function are
discussed in the next section.
• SDIR summarizes the output of the extraction scheme. At this step, the required information, such as device type, brand,
model, and so on, has been extracted from raw data traffic. SDIRs are collected for further data statistics, analysis, and
mining. Detailed application of SDIR is beyond the scope of the proposed information extraction scheme.
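The data preprocessing stage described above can be pictured as a filter over collected traffic records. The Python sketch below is only a minimal illustration: the record layout (keys `protocol`, `account`, `url`, `user_agent`) and the function name `preprocess` are hypothetical, and only the protocol policy is shown, whereas the paper's policies also cover user type and location.

```python
def preprocess(records):
    """Keep only HTTP records that carry a non-empty User-Agent.

    Each record is a dict; the field names are assumptions for illustration.
    """
    kept = []
    for rec in records:
        # Protocol policy: this sketch only considers HTTP traffic.
        if (rec.get("protocol") or "").upper() != "HTTP":
            continue
        ua = (rec.get("user_agent") or "").strip()
        if not ua:
            continue  # nothing to extract without a UA string
        kept.append(
            {"account": rec.get("account"), "url": rec.get("url"), "user_agent": ua}
        )
    return kept


sample = [
    {"protocol": "HTTP", "account": "a1", "url": "http://example.com/",
     "user_agent": " Mozilla/5.0 "},
    {"protocol": "DNS", "account": "a2", "url": None, "user_agent": None},
    {"protocol": "http", "account": "a3", "url": "http://example.org/",
     "user_agent": ""},
]
print(preprocess(sample))
```

Only the first sample record survives, which mirrors the idea that the core matching function sees a small fraction of the collected data.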
3 DATA PROCESSING BASED ON HADOOP
After data preprocessing, useful data streams that contain the broadband subscriber account, user access Uniform Resource
Locator (URL), and UA strings can be obtained for information extraction by the proposed UDF-based scheme. To deal
with massive amounts of UA strings, Hadoop-based MapReduce is used to enable parallel processing. The overall data
processing procedure is shown in Figure 3. The 5 steps to be processed by Hadoop are as follows:
1. Collect and preprocess raw traffic at a Hive data warehouse.
2. Cleanse and filter UA string data from preprocessing.
3. Parse and regularize UA string data with a UDF implemented on MapReduce.
4. Create and manage a database of smart device rules with WebMagic.11
5. Finally, extract smart device information by matching UA strings with the database.
FIGURE 2 Overview of the proposed information extraction scheme: arriving IP packets pass through traffic collection, data
preprocessing, and string matching against the UA string list library to produce the smart devices information record (SDIR)
FIGURE 3 Data processing procedure: Hive data warehouse, original UA string data, UA parsing based on UDF of MapReduce,
smart devices rule database, SDIR
FIGURE 4 Flow of the defined UDF: read the regular expression file as a data structure list, read the UA data, and test for a
match; on a match return the matching string, otherwise return null
TABLE 1 Examples of UA regular expressions
windows nt ([0-9]+[0-9]+)
\((ipad);.*;.*\)
.*/.*\((iphone);.*cpu iphone os .*\)
\((iphone);.*;.*\)
Hive, used in step 1, is a Hadoop-based data warehouse that uses HDFS as its data store and provides the HiveQL query
language. In addition to built-in functions, Hive supports user-defined functions (UDFs) to extend data processing. Since
nonstandard UA strings cannot be processed by HiveQL built-in statements, we define a UDF that converts nonstandard data
formats to the expected ones for information matching. An illustration of the defined UDF is shown in Figure 4.
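The paper does not list the UDF source, and Hive UDFs are normally written in Java. As a stand-in, the sketch below uses Hive's TRANSFORM streaming mechanism, which pipes tab-separated rows through an external script. The two-column layout (account, raw UA), the function names, and the choice to lower-case and collapse whitespace are all assumptions for illustration.

```python
import sys


def normalize_ua(raw):
    """Lower-case a raw UA string and collapse runs of whitespace."""
    return " ".join(raw.strip().lower().split())


def transform_lines(lines):
    """Map 'account<TAB>raw_ua' rows to 'account<TAB>normalized_ua' rows."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2:
            continue  # skip malformed rows instead of failing the whole job
        account, raw_ua = fields
        yield account + "\t" + normalize_ua(raw_ua)


def main(stream=sys.stdin, out=sys.stdout):
    """Entry point for streaming use: call main() at the bottom of the script."""
    for row in transform_lines(stream):
        out.write(row + "\n")


demo = ["acct1\tMozilla/5.0  (iPhone;  CPU iPhone OS 11_0)\n", "malformed row\n"]
print(list(transform_lines(demo)))
```

Under these assumptions, a Hive query of the form `SELECT TRANSFORM(account, ua) USING 'python normalize_ua.py' AS (account, ua_norm) FROM ...` would stream each row through the script after it has been registered with `ADD FILE`; the script and column names here are hypothetical.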
To start the regularization process in step 3, a file of UA regular expressions is read into memory as a list.
The regular expressions encode smart device information and are used to parse the raw UA strings. Table 1 shows
examples of the regular expressions. When a raw UA string matches a regular expression, the parsed UA string is
obtained. Table 2 shows a few examples of raw UA strings and the corresponding parsed ones.
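The matching flow of Figure 4 can be sketched in a few lines of Python. The patterns are copied from Table 1 as printed; the function names, the in-memory list standing in for the regular expression file, and the choice to return the first capture group as the "matching string" are assumptions of this sketch. Input UA strings are assumed to be lower-cased already, since the patterns are lower-case.

```python
import re

# Patterns as printed in Table 1 (one per line of the regular expression file).
TABLE_1_PATTERNS = [
    r"windows nt ([0-9]+[0-9]+)",
    r"\((ipad);.*;.*\)",
    r".*/.*\((iphone);.*cpu iphone os .*\)",
    r"\((iphone);.*;.*\)",
]


def load_patterns(lines):
    """Compile each non-empty line into a regular expression (the 'list')."""
    return [re.compile(line.strip()) for line in lines if line.strip()]


def match_ua(ua, patterns):
    """Return the first captured group of the first matching pattern, else None."""
    for pattern in patterns:
        found = pattern.search(ua)
        if found:
            return found.group(1)
    return None  # corresponds to "return null" in Figure 4


patterns = load_patterns(TABLE_1_PATTERNS)
iphone_ua = "mozilla/5.0 (iphone; cpu iphone os 11_0 like mac os x) applewebkit/604.1.38"
print(match_ua(iphone_ua, patterns))
print(match_ua("curl/7.58.0", patterns))
```

Because the list is scanned in order, more specific patterns should appear before more general ones when several could match the same UA string.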
TABLE 2 The comparison between the parsed UA and the original data (device categories covered: TV box, tablet, mobile phone,
laptop)
WebMagic, used in step 4, is a simple and flexible Java web crawler framework. It is deployed in this work to collect smart
device-related information from trusted e-commerce websites (eg, Bestbuy, JD, and so on). WebMagic also updates the
information database automatically and periodically. Final results are stored in the SDIR data structure.
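To make the final matching step concrete, the following Python sketch looks up a parsed UA token in a small rule table and emits an SDIR. The `SDIR` fields follow the attributes named in the text (device type, brand, model) plus the subscriber account; the rule entries, the dictionary layout, and the function name `build_sdir` are hypothetical and do not reproduce the crawled rule database.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SDIR:
    """Smart devices information record (field set assumed for illustration)."""
    account: str
    device_type: str
    brand: str
    model: str


# Hypothetical rule database: parsed UA token -> (device type, brand, model).
RULES = {
    "iphone": ("mobile phone", "Apple", "iPhone"),
    "ipad": ("tablet", "Apple", "iPad"),
}


def build_sdir(account, parsed_ua, rules=RULES) -> Optional[SDIR]:
    """Match a parsed UA token against the rule database; None if unknown."""
    rule = rules.get(parsed_ua)
    if rule is None:
        return None
    device_type, brand, model = rule
    return SDIR(account, device_type, brand, model)


print(build_sdir("acct1", "iphone"))
print(build_sdir("acct2", "unknown-token"))
```

A dictionary lookup keeps each match at constant time, which matters more than the rule format once the number of UA strings grows large.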
4 EXPERIMENTAL RESULTS
In this section, we demonstrate the proposed information extraction scheme with experiments based on real-life network data.
Through our partnership with a network operator in China, raw data for a period of 5 days were collected from a metropolitan
area network (ie, home users). The data were generated by 4 types of smart devices. Specifically, 355 630 mobile phones, 13 609
tablets, 9567 TV boxes, and 6829 smart TVs participated in the experiments. As we can see, mobile phones dominate
the collected data. However, to provide better user QoE, a network operator may require more detailed information such as
device type, device brand, operating system, and so on. Without loss of generality, we show the results of extracting device type
and brand with the proposed scheme. Our virtual computing cluster is equipped with an 8-core central processing
unit (CPU) and 64 GB random access memory (RAM). The entire information extraction process, including data preprocessing,
was completed in 12 hours. Note that the current work is a proof of concept; optimization and further evaluation of the
computing performance will be conducted in our future work.
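The device counts reported above can be turned into shares with a few lines of Python. The counts are taken from the text; the dictionary and function names are chosen for this example only.

```python
# Device counts as reported in the experiment description.
DEVICE_COUNTS = {
    "mobile phone": 355630,
    "tablet": 13609,
    "TV box": 9567,
    "smart TV": 6829,
}


def device_shares(counts):
    """Return each device type's fraction of all observed devices."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


shares = device_shares(DEVICE_COUNTS)
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

With 385 635 devices in total, mobile phones account for roughly 92% of the sample, which quantifies the observation that they dominate the collected data.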
By running our proposed information extraction scheme, the brand of each type of smart device is extracted accurately.
As shown in Figure 5, the accuracy reached over 92% for all types of smart devices in the real-life experiment. The accuracy is
measured as the ratio of the number of correct detections to the total number of smart devices.
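Written as a formula, this accuracy measure reads as follows, where the symbols N_correct (number of correct detections) and N_total (total number of smart devices) are notation introduced here for illustration rather than taken from the paper:

```latex
\[
\text{accuracy} = \frac{N_{\text{correct}}}{N_{\text{total}}}
\]
```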
Device brand can be further extracted for each device, as shown in Figure 6. With such information, network operators
will be able to provide services and manage their network resources more efficiently to improve user QoE. For example,
network operators may allocate different cache sizes and prioritize network flows for video-streaming services depending on
user devices. A smart TV user may be satisfied with a larger cache size and prioritized network flows, while a smartphone user
may be satisfied with a smaller cache size and slightly delayed network flows in exchange for other communication capabilities.
Similar network resource management can be applied to other applications such as gaming, online shopping, social networking,
and so on. To further improve user QoE, smart device manufacturers and application developers may improve their products, for
example, through firmware and application updates, to fully utilize the optimized network resources.
5 CONCLUSIONS
In this paper, we proposed a smart device information extraction scheme. The proposed scheme applies a UDF to handle
nonstandard UA string formats so that the data can be processed on a Hadoop-based platform. Experiments were conducted
on real-life network data. The results demonstrate that the proposed scheme can achieve over 92% accuracy in smart
device information extraction in practical cases. Although this work focused on home Wi-Fi users, the scheme can be
extended to other types of network users. Open questions, such as UA signature database update and maintenance, faster
matching, and so on, will be explored in our future work. We will also conduct performance comparisons with related
future work.
FIGURE 6 The distribution of different smart device brands from the extraction
ACKNOWLEDGMENTS
This work was sponsored by the Jiangsu Provincial Government Scholarship Program, China, and the Jiangsu Overseas Visiting
Scholar Program, China.
ORCID
Feng Ye http://orcid.org/0000-0002-2436-2300
REFERENCES
1. Japertas S, Budnikas A. Identification technology of mobile phone devices using RFF. 2014 11th International Conference on Wireless Information Network and
Systems. WINSYS. Vienna, Austria: IEEE; 2014:1-6.
2. Grill M, Rehak M. Malware detection using HTTP user-agent discrepancy identification. 2014 IEEE International Workshop on Information Forensics and Security
(WIFS). Atlanta, GA, USA: IEEE; 2014:221-226.
3. WURFL. The Power of Device Intelligence Web site. http://wurfl.sourceforge.net/. Accessed December 10, 2017.
4. La VH, Fuentes R, Cavalli AR. Network monitoring using MMT: an application based on the user-agent field in HTTP headers. 2016 IEEE 30th International
Conference on Advanced Information Networking and Applications (AINA). Crans-Montana, Switzerland: IEEE; 2016:147-154.
5. Hadoop. APACHE hadoop Web site. http://hadoop.apache.org/. Accessed December 25, 2017.
6. UDF. Hadoop Web site. https://pig.apache.org/docs/r0.9.1/udf.html. Accessed October 12, 2017.
7. Wang P. Big data plug-in technology for smart router based on multidimensional awareness. J Nanjing Univ Posts Telecommun. 2016;36:18-21.
8. Wang P, Chen X. Co hijacking monitor: collaborative detecting and locating mechanism for HTTP spectral hijacking. The 2017 IEEE Cyber Science and Technology
Congress (CyberSciTech 2017). Orlando, FL, USA: IEEE; 2017.
9. Sen S, Spatscheck O, Wang D. Accurate, scalable in-network identification of P2P traffic using application signatures. Proceedings of the 13th International
Conference on World Wide Web. New York, NY, USA: ACM; 2004:512-521.
10. Xu Q, Liao Y, Miskovic S, et al. Automatic generation of mobile app signatures from traffic observations. 2015 IEEE Conference on Computer Communications
(INFOCOM). Kowloon, Hong Kong: IEEE; 2015:1481-1489.
11. WebMagic. Web Magic Web site. http://webmagic.io/. Accessed October 20, 2017.
How to cite this article: Wang P, Ye F, Chen X. Smart devices information extraction in home Wi-Fi networks. Internet
Technology Letters 2018;1:e42. https://doi.org/10.1002/itl2.42