Received: 13 February 2018 Revised: 14 March 2018 Accepted: 15 March 2018

DOI: 10.1002/itl2.42

LETTER

Smart devices information extraction in home Wi-Fi networks

Pan Wang1 Feng Ye2 Xuejiao Chen3

1 School of Modern Posts, Nanjing University of Posts and Telecommunication, Nanjing, China
2 Department of Electrical and Computer Engineering, University of Dayton, Dayton, Ohio
3 Department of Communication, Nanjing College of Information Technology, Nanjing, China

Correspondence
Feng Ye, Department of Electrical and Computer Engineering, University of Dayton, Dayton, Ohio.
Email: [email protected]

Funding Information
Jiangsu Overseas Visiting Scholar Program, China.
Jiangsu Provincial Government Scholarship Program, China.

With the rapid development of smart homes, more smart devices will be connected to home Wi-Fi for Internet access.
Knowing the exact information of smart devices can help network operators improve network quality-of-service and
help service providers improve security. In this paper, we propose a scheme based on the Hadoop platform and a
user-defined function for smart device information extraction in home Wi-Fi networks. The user-defined function is
developed in the proposed scheme to deal with massive amounts of data that are not formatted according to published
standards. The core of the proposed information extraction scheme is string matching of processed input data against
a prebuilt smart device rule database. Testing has been conducted on a massive dataset that was collected from
real-life home networks. The testing results demonstrate that our proposed method can accurately extract device
information from home Wi-Fi networks.

KEYWORDS

deep packet inspection, information extraction, smart home, UDF, user agent

1 INTRODUCTION

In home networks, users access the Internet via home Wi-Fi with various smart devices, such as mobile phones, tablet
computers (tablets hereafter), smart televisions (TVs), and so on. With the development of smart homes, even more smart devices
will access the Internet via home Wi-Fi. In order to provide service subscribers with a better network quality-of-service (QoS)
guarantee and business experience, operators or service providers often need to collect a large amount of measurement data,
including security and user behavioral information. Further data processing is then performed, for example, in a cloud, to identify
bottlenecks in network management and improve the quality-of-experience (QoE) of users. Extracting smart device information is one
of the crucial tasks in this data processing procedure.
In this paper, we propose a scheme to extract information of smart devices by parsing data traffic from home Wi-Fi networks.
Although the focus is on home Wi-Fi networks, our proposed scheme can be extended to other types of networks. Operators of
traditional cellular networks can obtain most device information by parsing the signaling traffic. For example, an International
Mobile Equipment Identity is normally embedded in the signaling traffic.1 However, such information is not carried in Wi-Fi
data traffic by default. In order to obtain such information of smart devices in home Wi-Fi networks, one possible method is to
collect and parse user agent (UA) strings in hypertext transfer protocol (HTTP) messages that are exchanged between a smart
device and the cloud.2 An example of UA is shown in Figure 1. As it shows, information such as types of browsers, operating
systems, character sets, and so on, can be extracted through the parsing process.
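As a minimal sketch of this parsing step, the Python snippet below splits a Dalvik-style UA string into named fields. The field layout (runtime, version, platform, flag, OS, device description) is an assumption inferred only from the single example in Figure 1, and the names `DALVIK_UA` and `split_user_agent` are hypothetical; UA strings with other layouts simply return `None`.

```python
import re

# Assumed layout, inferred from the one example in Figure 1:
#   <runtime>/<version> (<platform>; <flag>; <os name and version>; <device description>)
DALVIK_UA = re.compile(
    r"^(?P<runtime>[^/\s]+)/(?P<version>\S+)\s+"
    r"\((?P<platform>[^;]+);\s*(?P<flag>[^;]+);\s*(?P<os>[^;]+);\s*(?P<device>[^)]+)\)$"
)


def split_user_agent(ua):
    """Return the named fields of a Dalvik-style UA string, or None if the layout differs."""
    match = DALVIK_UA.match(ua.strip().lower())
    if match is None:
        return None
    return {name: value.strip() for name, value in match.groupdict().items()}


# The UA string from Figure 1.
fields = split_user_agent(
    "dalvik/1.6.0 (linux; u; android 4.4.2; pe-tl10 build/huaweipe-tl10)"
)
```

Here `fields["os"]` holds the OS part and `fields["device"]` holds the device description, which are the two parts labeled in Figure 1.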

Pan Wang and Xuejiao Chen are visiting scholars at the University of Dayton.
Abbreviations: CPU, central processing unit; DPI, deep packet inspection; HTTP, hypertext transfer protocol; QoE, quality-of-experience; QoS,
quality-of-service; RAM, random access memory; SDIR, smart devices information record; TV, television; UA, user agent; UDF, user-defined function; URL,
Uniform Resource Locator; WURFL, Wireless Universal Resource File.

Internet Technology Letters. 2018;1:e42. wileyonlinelibrary.com/journal/itl2 Copyright © 2018 John Wiley & Sons, Ltd.
https://doi.org/10.1002/itl2.42

FIGURE 1 Example of a user agent. The User-Agent string shown is
dalvik/1.6.0 (linux; u; android 4.4.2; pe-tl10 build/huaweipe-tl10)
with the OS and the device description labeled in the string.

Although 2 public standards are available to format a device label, most manufacturers ignore them, which makes
information extraction even more difficult. There are 2 existing solutions to this issue. One is Wireless Universal
Resource File (WURFL),3 which is an identification method based on unique contents of a UA, for example, the device
information. By matching the uniquely identified content with a predefined profile, device information can be extracted from a web
server. However, due to the advent of new devices and frequent upgrades of existing devices, WURFL cannot guarantee high
accuracy in the long run. The other solution is based on UA string matching. To apply UA string matching, a database is
preset with matching rules that include UA strings and device models. Device information is extracted by mapping
a captured string to the preset values in the database. While simple to implement, this method is inefficient, especially
with massive amounts of user data.4 Our proposed scheme is designed to overcome the drawbacks of the 2 existing methods. In
particular, we use Hadoop in the scheme for fast and efficient processing of massive amounts of data.5 Moreover, a Hive
UDF is applied for data preprocessing in order to unify data formats.6
The rest of this paper is organized as follows. Section 2 presents the framework of the smart devices information
extraction scheme. Section 3 presents the data processing technology based on Hadoop and describes how the UDF is
realized. Section 4 shows the experimental results. Section 5 concludes this work.

2 OVERVIEW OF THE PROPOSED SCHEME

An overview of the proposed information extraction scheme is shown in Figure 2. Overall, the scheme includes 4 parts, that is,
traffic collection, data preprocessing, smart devices information extraction, and smart devices information record (SDIR).
• Traffic collection gathers raw data traffic from the targeted network. Traffic collection points can be deployed in different
locations of a communication network, such as the core network, aggregation layer, remote server-based access nodes, and
the smart home network gateway.7,8 A fiber splitter is often applied for data collection.
• Data preprocessing cleanses and filters the collected traffic data, which can be noisy. After data
preprocessing, the core function (ie, based on DPI9) only processes a small portion of the collected data, which greatly reduces
the computational overhead of the scheme. It is found that most mobile applications (also denoted as Mobile Apps) use
HTTP to communicate with the servers.10 In this paper, we use HTTP to demonstrate the proposed information extraction
scheme. Our proposed scheme can be easily extended, for example, by applying header/message fields for better compatibility
with other proprietary protocols. The filtering policies used in the proposed scheme are designed based on extensive
experiments so that nearly 98% of the raw data can be cleansed and filtered accurately. The policies include user type, location,
communication protocol, and so on.
• Smart devices information extraction is the core function of the proposed scheme. This function matches a captured
UA string against a predefined library that is maintained and updated frequently. Details of the core function are
discussed in the next section.
• SDIR is the output record of the extraction scheme. At this step, the required information, such as device type, brand, model, and so
on, has been successfully extracted from raw data traffic. SDIRs will be collected for further data statistics, analysis, and
mining. Detailed application of SDIR is beyond the scope of the proposed information extraction scheme.
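The data preprocessing part can be illustrated with a short Python sketch. It applies only one of the filtering policies named above (communication protocol) and drops records without a UA string; the record fields `protocol`, `account`, `url`, and `user_agent`, the function name `preprocess`, and the `example.com` URLs are assumptions made for illustration, not the format used in the experiments.

```python
def preprocess(records):
    """Keep only HTTP records that carry a non-empty UA string.

    Each record is assumed to be a dict; the UA string is lowercased so that
    later matching does not depend on letter case.
    """
    kept = []
    for record in records:
        if record.get("protocol", "").upper() != "HTTP":
            continue  # protocol policy: only HTTP traffic is kept
        ua = (record.get("user_agent") or "").strip()
        if not ua:
            continue  # nothing to extract without a UA string
        kept.append(
            {
                "account": record.get("account"),
                "url": record.get("url"),
                "user_agent": ua.lower(),
            }
        )
    return kept


# Hypothetical input records.
sample = [
    {
        "protocol": "HTTP",
        "account": "Ad1***",
        "url": "http://example.com/a",
        "user_agent": "dalvik/1.6.0 (linux; u; android 4.4.2; pe-tl10 build/huaweipe-tl10)",
    },
    {"protocol": "DNS", "account": "Ad1***"},
    {"protocol": "HTTP", "account": "Ad2***", "url": "http://example.com/b", "user_agent": ""},
]
cleaned = preprocess(sample)
```

Policies on user type and location would be added as further conditions in the same loop.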

3 UDF-BASED SMART DEVICES INFORMATION EXTRACTION SCHEME

After data preprocessing, useful data streams that contain the broadband subscriber account, user access Uniform Resource
Locator (URL), and UA strings can be obtained for information extraction by the proposed UDF-based scheme. In order to deal
with massive amounts of UA strings, Hadoop-based MapReduce is implemented to enable parallel processing. The overall data
processing procedure is shown in Figure 3. The 5 steps to be processed by Hadoop are as follows:
1. Collect and preprocess raw traffic at a Hive data warehouse.
2. Cleanse and filter UA string data from preprocessing.

FIGURE 2 Overview of the smart devices information extraction scheme. Blocks shown: arriving IP packets, traffic
collection, data preprocessing, string matching against a UA string list library (together forming the smart devices
information extraction), and the smart devices information record (SDIR).

FIGURE 3 Data processing procedure based on Hadoop. Blocks shown: Internet data from network operators, Hive data
warehouse, original UA string data, UA parsing based on UDF of MapReduce, WebMagic (web crawler), smart devices rule
database, and SDIR (smart devices information record). Sample SDIR entries shown in the figure:

Broadband account   Device type   Brand    Model      Operating system
Ad1***              Smart phone   Apple    iPhone 7   iOS
Ad2***              Smart phone   Huawei   Mate 8     Android
Ad1***              …             …        …          …
Ad2***              TV box        Xiaomi   MDZ-16     Android

FIGURE 4 The defined UDF for UA string regulation. Flowchart steps: start; read the regular expression file as a data
structure list; read UA data; test for a match (yes/no); on yes, return the matching string; on no, return null; end.

TABLE 1 Examples of UA regular expressions

UA regular expression
windows nt ([0-9]+[0-9]+)
\((ipad);.*;.*\)
.*/.*\((iphone);.*cpu iphone os .*\)
\((iphone);.*;.*\)

3. Parse and regularize UA string data with a UDF function implemented on MapReduce.
4. Create and manage a database of smart device rules by WebMagic.11
5. Finally, extract smart device information by matching UA strings with the database.
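The division of work in the steps above can be pictured as a map phase that parses each UA string independently and a reduce phase that groups the results per subscriber account. The sketch below is plain Python standing in for a Hadoop job, not the authors' implementation; `map_phase`, `reduce_phase`, and `toy_parse` are hypothetical names, and `toy_parse` is a deliberately simple placeholder that keeps the text after `build/`.

```python
from collections import defaultdict


def toy_parse(ua):
    """Placeholder parser (assumption): return the text after 'build/', or None."""
    marker = "build/"
    start = ua.find(marker)
    if start < 0:
        return None
    return ua[start + len(marker):].rstrip(")").strip()


def map_phase(records, parse):
    """Emit (account, parsed UA) pairs; each record is handled independently."""
    for account, ua in records:
        parsed = parse(ua)
        if parsed is not None:
            yield account, parsed


def reduce_phase(pairs):
    """Group the parsed UA strings by account and drop duplicates."""
    grouped = defaultdict(set)
    for account, parsed in pairs:
        grouped[account].add(parsed)
    return {account: sorted(devices) for account, devices in grouped.items()}


records = [
    ("090010740425", "dalvik/1.6.0 (linux; u; android 4.4.2; pe-tl10 build/huaweipe-tl10)"),
    ("090010740425", "dalvik/1.6.0 (linux; u; android 4.4.2; pe-tl10 build/huaweipe-tl10)"),
    ("090011380427", "a ua string without the marker"),
]
devices_by_account = reduce_phase(map_phase(records, toy_parse))
```

Because the map phase keeps no state between records, the records can be split across workers, which is the property that parallel processing relies on.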

In step 1, the raw traffic is stored in Hive, a Hadoop-based data warehouse that uses HDFS as its data store and provides the
HiveQL query language. In addition to built-in functions, Hive supports user-defined functions (UDFs) to enhance data processing.
Since nonstandard UA strings cannot be processed by HiveQL built-in statements, we define a UDF that converts nonstandard data
formats to the expected ones for information matching. An illustration of the defined UDF is shown in Figure 4.
To start the regulation process in step 3, a regular expression file of UA strings is read into memory as a list data structure.
The regular expressions encode the smart devices information and are used to parse the raw UA strings. Table 1 shows
examples of the regular expressions. When a raw UA string matches one of the regular expressions, the parsed UA string is
returned; otherwise, null is returned. Table 2 shows a few examples of raw UA strings and the corresponding parsed ones.
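The control flow of Figure 4 can be sketched in a few lines of Python. This is an illustration of the logic only, not the Hive UDF itself. The patterns are taken from Table 1, reading the escape character before each literal parenthesis as a backslash; the function names and the choice to return the first capture group are assumptions.

```python
import re

# Patterns from Table 1 (escape characters read as backslashes).
UA_PATTERN_LINES = [
    r"windows nt ([0-9]+[0-9]+)",
    r"\((ipad);.*;.*\)",
    r".*/.*\((iphone);.*cpu iphone os .*\)",
    r"\((iphone);.*;.*\)",
]


def load_patterns(lines):
    """Compile the regular expression file contents into a list, skipping blank lines."""
    return [re.compile(line.strip()) for line in lines if line.strip()]


def parse_ua(ua, patterns):
    """Return the first capture group of the first matching pattern, else None."""
    text = ua.lower()
    for pattern in patterns:
        match = pattern.search(text)
        if match:
            return match.group(1)
    return None  # corresponds to the "return null" branch in Figure 4


patterns = load_patterns(UA_PATTERN_LINES)
# Hypothetical test inputs.
iphone_result = parse_ua("mozilla/5.0 (iphone; cpu iphone os 10_3 like mac os x)", patterns)
unknown_result = parse_ua(
    "dalvik/1.6.0 (linux; u; android 4.4.2; pe-tl10 build/huaweipe-tl10)", patterns
)
```

With only the four patterns of Table 1 loaded, the Dalvik string falls through to the null branch; handling it would require additional rules of the kind that produce the parsed values in Table 2.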

TABLE 2 The comparison between the parsed UA and the original data

Username       Original UA                                                              Parsed UA
090010740425   dalvik/1.6.0 (linux; u; android 4.4.2; pe-tl10 build/huaweipe-tl10)      huaweipe-tl10
090011380427   dalvik/1.6.0 (linux; u; android 4.4.2; huawei p7-l05build/huaweip7-l05)  huaweip7-l05
090011458310   dalvik/1.6.0 (linux; u; android 4.4.4; coolpad 8675-abuild/ktu84p)       coolpad8675-a

FIGURE 5 The accuracy of detected smart device types. Bar chart titled "The accuracy of smart devices type"; categories:
TV box, tablet, mobile phone, laptop; accuracy axis from 91.50% to 94.00%.

In step 4, WebMagic, a simple and flexible Java web crawler framework, is deployed to collect smart
device-related information from trusted e-commerce websites (eg, Bestbuy, JD, and so on). WebMagic also updates the
information database automatically and periodically. Final results are stored in the SDIR data structure.
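The final lookup that turns a parsed UA string into an SDIR can be sketched as follows. The rule entries are hypothetical placeholders for what a crawler-built rule database might hold, the record fields follow those named for the SDIR (device type, brand, model), and exact dictionary lookup is a simplification of the string matching described in this letter.

```python
from typing import NamedTuple, Optional


class SDIR(NamedTuple):
    """Smart devices information record (fields assumed for illustration)."""
    account: str
    device_type: str
    brand: str
    model: str


# Hypothetical rule entries keyed by parsed UA string.
RULES = {
    "huaweipe-tl10": ("smart phone", "Huawei", "PE-TL10"),
    "coolpad8675-a": ("smart phone", "Coolpad", "8675-A"),
}


def extract_sdir(account, parsed_ua, rules=RULES) -> Optional[SDIR]:
    """Look up a parsed UA string in the rule database and build an SDIR, or return None."""
    rule = rules.get(parsed_ua)
    if rule is None:
        return None
    return SDIR(account, *rule)


record = extract_sdir("090010740425", "huaweipe-tl10")
```

Parsed strings that are missing from the rule database return `None`, which is one way to flag devices for which the database still needs an entry.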

4 EXPERIMENTS AND CASE STUDY

In this section, we demonstrate the proposed information extraction scheme with experiments based on real-life network data.
Through our partnership with a network operator in China, raw data for a period of 5 days was collected from a metropolitan area
network (ie, home users). The data was generated by 4 types of smart devices. Specifically, 355 630 mobile phones, 13 609
tablets, 9567 TV boxes, and 6829 smart TVs participated in the experiments. As we can see, smart phones dominate
the collected data. However, to provide better user QoE, a network operator may require more detailed information such as
device type, device brand, operating system, and so on. Without loss of generality, we show the results of extracting device type
and brand with the proposed scheme. In particular, our virtual computing cluster is equipped with an 8-core central processing
unit (CPU) and 64 GB random access memory (RAM). The entire information extraction process, including data preprocessing,
was completed in 12 hours. Note that the current work is a proof of concept; optimization and further evaluation of
computing performance will be conducted in our future work.
By running our proposed information extraction scheme, the brands of each type of smart devices are accurately extracted.
As shown in Figure 5, the accuracy reached over 92% for all types of smart devices in the real-life experiment. The accuracy is
measured as the ratio of the number of correct detections to the total number of smart devices.
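The accuracy measure defined above can be written as a small Python helper. The function name `accuracy` and the sample labels are hypothetical; the labels are made up for illustration and are not data from the experiments.

```python
def accuracy(predicted, actual):
    """Ratio of correct detections to the total number of devices."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("at least one device is required")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Hypothetical labels: 3 of 4 detections are correct.
example = accuracy(
    ["mobile phone", "tablet", "mobile phone", "tv box"],
    ["mobile phone", "tablet", "tv box", "tv box"],
)
```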
Device brand can be further extracted for each device, as shown in Figure 6. With such information, network operators
will be able to provide services and manage their network resources more efficiently to improve user QoE. For example,
network operators may allocate different cache sizes and prioritize network flows for video-streaming services depending on
user devices. A smart TV user may be satisfied with a larger cache size and prioritized network flows, while a smart phone user may
be satisfied with a smaller cache size and slightly delayed network flows in exchange for other communication capabilities. Similar
network resource management can be applied to other applications such as gaming, online shopping, social networking, and
so on. To further improve user QoE, smart device manufacturers and application developers may improve their products, for
example, through firmware and application updates, to fully utilize the optimized network resources accordingly.

FIGURE 6 The distribution of different smart device brands from the extraction. Four panels, with the brands listed in each legend:
Mobile phone brand Top 10 distribution: Apple, Samsung, OPPO, HUAWEI, Xiaomi, Sony, Coolpad, Meizu, Lenovo, Gionee
Tablet brand distribution: Apple, HUAWEI, Lenovo, Samsung, ONDA, Xiaomi, CUBE, Others
TV box brand distribution: Xiaomi, RK, Honer, Unblock, TOGIC, IDER, GIEC, Kaiboer, Skyworth, Others
Smart TV brand distribution: LeTV, Xiaomi, Konka, Sony, Skyworth, Hisense, Whaley, 17TV, TCL, Sharp, Others

5 CONCLUSIONS

In this paper, we proposed a smart device information extraction scheme. The proposed scheme applied UDF to deal with
nonstandard UA string formats so the processing was available using a Hadoop-based platform. Experiments were conducted
based on real-life network data. The results demonstrated that our proposed scheme can achieve over 92% accuracy in smart
device information extraction in practical cases. Although the proposed scheme in this work was focused on home Wi-Fi users,
it can be easily extended to other types of network users. Moreover, some open questions, such as UA signature database update
and maintenance, faster matching, and so on, will be explored in our future work. We will also conduct performance comparisons
with possible future-related works.

ACKNOWLEDGMENTS

The paper is sponsored by Jiangsu Provincial Government Scholarship Program, China, and Jiangsu Overseas Visiting Scholar
Program, China.

ORCID
Feng Ye http://orcid.org/0000-0002-2436-2300

REFERENCES
1. Japertas S, Budnikas A. Identification technology of mobile phone devices using RFF. 2014 11th International Conference on Wireless Information Network and
Systems. WINSYS. Vienna, Austria: IEEE; 2014:1-6.
2. Grill M, Rehak M. Malware detection using HTTP user-agent discrepancy identification. 2014 IEEE International Workshop on Information Forensics and Security
(WIFS). Atlanta, GA, USA: IEEE; 2014:221-226.
3. WURFL. The Power of Device Intelligence Web site. http://wurfl.sourceforge.net/. Accessed December 10, 2017.
4. La VH, Fuentes R, Cavalli AR. Network monitoring using MMT: an application based on the user-agent field in HTTP headers. 2016 IEEE 30th International
Conference on Advanced Information Networking and Applications (AINA). Crans-Montana, Switzerland: IEEE; 2016:147-154.
5. Hadoop. APACHE hadoop Web site. http://hadoop.apache.org/. Accessed December 25, 2017.
6. UDF. Hadoop Web site. https://pig.apache.org/docs/r0.9.1/udf.html. Accessed October 12, 2017.
7. Pan W. Big data plug-in technology for smart router based on multidimensional awareness. J Nanjing Univ Posts Telecommun. 2016;36:18-21.
8. Pan W, Xuejiao C. Co hijacking monitor: collaborative detecting and locating mechanism for HTTP spectral hijacking. The 2017 IEEE Cyber Science and Technology
Congress (CyberSciTech 2017). Orlando, FL, USA: IEEE; 2017.
9. Subhabrata S, Oliver S, Dongmei W. Accurate, scalable in-network identification of P2P traffic using application signatures. Proceedings of the 13th International
Conference on World Wide Web. New York, NY, USA: ACM; 2004:512-521.
10. Xu Q, Liao Y, Miskovic S, et al. Automatic generation of mobile app signatures from traffic observations. 2015 IEEE Conference on Computer Communications
(INFOCOM). Kowloon, Hong Kong: IEEE; 2015:1481–1489.
11. WebMagic. Web Magic Web site. http://webmagic.io/. Accessed October 20, 2017.

How to cite this article: Wang P, Ye F, Chen X. Smart devices information extraction in home Wi-Fi networks. Internet
Technology Letters 2018;1:e42. https://doi.org/10.1002/itl2.42
