
IoT-Based Big Data Storage Systems in Cloud Computing: Perspectives and Challenges

1) The document discusses challenges and perspectives of IoT-based big data storage systems in cloud computing. 2) It proposes a framework that identifies key areas of IoT big data like acquisition, management, processing and mining. 3) The framework also defines functional modules for these areas and describes their characteristics for handling issues like heterogeneous and huge scale dynamic data from IoT applications in cloud platforms.


IEEE INTERNET OF THINGS JOURNAL, VOL. 4, NO. 1, FEBRUARY 2017

IoT-Based Big Data Storage Systems in Cloud Computing: Perspectives and Challenges
Hongming Cai, Senior Member, IEEE, Boyi Xu, Member, IEEE, Lihong Jiang, Member, IEEE,
and Athanasios V. Vasilakos, Senior Member, IEEE


Manuscript received May 6, 2015; accepted October 6, 2016. Date of publication October 19, 2016; date of current version February 8, 2017. This work was supported by the National Natural Science Foundation of China under Grant 61373030 and Grant 71171132.
H. Cai and L. Jiang are with the School of Software, Shanghai Jiao Tong University, Shanghai 200240, China (e-mail: [email protected]; [email protected]).
B. Xu is with the College of Economics and Management, Shanghai Jiao Tong University, Shanghai 200052, China (e-mail: [email protected]).
A. V. Vasilakos is with the Department of Computer Science, Electrical and Space Engineering, Luleå University of Technology, SE-931 87 Skellefteå, Sweden (e-mail: [email protected]).
Digital Object Identifier 10.1109/JIOT.2016.2619369
2327-4662 © 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Abstract—Internet of Things (IoT) related applications have emerged as an important field for both engineers and researchers, reflecting the magnitude and impact of data-related problems to be solved in contemporary business organizations, especially in cloud computing. This paper first provides a functional framework that identifies the acquisition, management, processing, and mining areas of IoT big data, and several associated technical modules are defined and described in terms of their key characteristics and capabilities. Then, current research in IoT applications is analyzed, and the challenges and opportunities associated with IoT big data research are identified. We also report a study of critical IoT application publications and research topics based on related academic and industry publications. Finally, some open issues and some typical examples are given under the proposed IoT-related research framework.

Index Terms—Big data, business intelligence, cloud computing, data management, distributed processing, Internet of Things (IoT) applications, performance isolation.

I. INTRODUCTION
Internet of Things (IoT) technology has been a popular approach to implementing and running business applications in the past years. Since massive data have been generated by huge numbers of distributed sensors, how to acquire, integrate, store, process, and use these data has become an urgent and important problem for enterprises trying to achieve their business goals. As a consequence, both researchers and engineers are faced with the challenge of handling these massive heterogeneous data in highly distributed environments, especially in cloud platforms. Referring to some related research [1], the characteristics of IoT data in cloud platforms can be summarized as follows.
1) Multisource High Heterogeneity Data: IoT applications acquire data from different distributed sensors. The data types vary from integer to character and include semi-structured and unstructured data such as images, audio, and video streams. How to integrate these distributed data from multiple sources is fundamental for application development.
2) Huge Scale Dynamic Data: IoT applications always connect a huge quantity of sensors. Communications between different objects in a large-scale dynamic environment generate a large volume of real-time, high-speed, and uninterrupted data streams. Thus, scalable storage, filtering, and compression schemes are essential for efficient data processing in cloud platforms.
3) Low-Level With Weak Semantics Data: IoT data from sensors are low-level and have weak semantics before they are processed. The relations among these data are temporal-spatial correlations. To support the execution of intelligent systems, complex semantics need to be abstracted, from an event-driven perspective, out of the mass of low-level data.
4) Inaccuracy Data: Some experiments show that, because of unreliable readings, most sensing systems capture only about 1/3 correct data, which makes the data difficult to use directly. Thus, multidimensional data analysis and processing are important for wide adoption of IoT applications.
IoT-based data storage systems in cloud computing face three pairs of conflicting requirements: distributed execution versus unified management of infrastructure resources, multitenant storage versus isolated performance, and scalability versus flexibility. In addition, when a cloud platform is used for IoT data exchange, processing, and integration, different requirements arise for mass, real-time, and unstructured data processing at different levels, such as data representation, data storage, and data analysis.
Based on data processing functions, this paper first provides a functional framework that identifies the acquisition, management, processing, and mining areas of IoT data. Several associated functional modules are defined and described in terms of their key characteristics and capabilities. Then, current research in IoT applications is analyzed to identify the challenges associated with the related functional areas. Based on this research analysis, some future technical tendencies are also proposed.
This paper is organized as follows. Section II gives a framework of IoT in which the data-processing process is presented to give an overview of related studies from the application point of view. Next, we introduce studies that discuss related new technological developments and related
performance considerations covering six areas in Section III. Then, Section IV proposes several open issues related to recent IoT developments and applications. Finally, Section V concludes our survey, highlights challenges, and provides an outlook.

Fig. 1. Framework of IoT-based data storage systems in cloud computing.

II. FUNCTIONAL FRAMEWORK OF CLOUD-BASED IOT APPLICATIONS
A common IoT application framework [2] consists of a perception layer, a network layer, and an application layer. The application layer is critical for IoT-based storage systems in cloud computing because it is composed of middleware and business models. Much work has been done to enable effective and intelligent data processing and analysis in the application layer based on cloud computing. The front-end layer involves radio frequency identification (RFID), wireless sensor networks (WSNs), and other smart things. Based on the processing flow of an IoT application, a framework of IoT-based data storage systems in cloud computing is given, as Fig. 1 shows. The framework consists of several modules: data acquisition and integration, data storage, data management, data processing, data mining, and application optimization. Referring to Fig. 1, related technologies can be divided into several functional modules as follows.
1) Data Acquisition and Integration Module: As an input module, how to acquire and integrate heterogeneous data from distributed and mobile devices is a fundamental problem for the construction of the whole system.
2) Data Storage Module: Considering the different types of IoT data, including structured, semi-structured, and unstructured data of huge quantity, different kinds of databases and file systems, such as XML files in the Hadoop distributed file system (HDFS), relational database management systems (RDBMSs), Not only SQL (NoSQL) databases, and graph DBMSs, should be combined to achieve high efficiency for data storage in cloud platforms. A NoSQL database provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases.
3) Data Management Module: To search and retrieve data from a huge volume of data sources with high efficiency, different approaches, such as data indexes, metadata, semantic relations, and linked data, are realized for data management in different platforms.
4) Data Processing Module: In cloud platforms, mass data processing mechanisms such as MapReduce are constructed for parallel and distributed data processing. Data querying and reasoning can be carried out in a more flexible way to adapt to large volumes of data.
5) Data Mining Module: Considering that data from sensors are always raw and low-level, high-level information needs to be extracted, classified, abstracted, and analyzed for application purposes. Thus, data mining
on IoT data mainly aims to achieve comprehensive views or data analysis results for end-users.
6) Application Optimization Module: Based on application analysis, related algorithms or approaches are required for processing IoT data in cloud platforms under different performance requirements, such as decreased I/O, accelerated convergence, security, scalability, availability, management, decreased cost and price, etc.

III. METHODS AND CHALLENGES
Based on the above framework of the IoT data processing process, related research is divided into the following sections to show current methods and challenges.

A. Data Acquisition and Integration Module
The data acquisition and integration module is designed to get data from different types of sensor devices, such as RFID, ZigBee sensors, GPS devices, temperature sensors, etc. Heterogeneous data bring a big challenge to IoT applications when developers need to integrate massive structured, semi-structured, and unstructured data. From the perspective of data processing in computing, we classify the main methods in the process of data acquisition and integration into three groups: data representation models, multisource data fusion, and data transmission and communication.
1) Data Representation Models: Data representation models are fundamental to IoT data acquisition and integration. There are several different types of sensor data, such as messages, events, pictures, videos, status data, etc. Data representation models should be designed for the application purpose at hand, with a flexible and common format.
The traditional method of transmitting sensor data in proprietary formats is not enough for IoT applications, because different objects of embedded systems differ at either the grammar or the context level. Event data therefore have to be managed in a unified, integrated, and correlated way, because they rely on massive, diverse, and interrelated data sources. Consequently, sensor data must be enriched and transformed into resource description framework (RDF) format for further processing. In [3], a framework called HEP, which integrates the representation of relational and XML event streams, is proposed to support unified event fusing and processing with a general specification. The framework can express almost all types of windows, is easy to understand, and scales well. In [4], a corresponding virtual object model is proposed to enrich sensors with context information so as to support smart-city applications based on cognitive IoT objects. The model represents real objects (sensors) for remote access to mass data, and produces a stream of raw sensor measurements.
User interfaces are becoming increasingly important with the development of IoT applications. To abstract the sensing and actuation primitives of smart devices, a model-based interface description scheme [5] is proposed to generate user interfaces. The user interface description language lets developers produce intuitive interfaces and propose a taxonomy of device controllers. Hasan and Curry [6] proposed an approach to derive semantic similarity and relatedness on a distributional statistical model of semantics. A rule language and an event matcher are used to present an approximate model so as to divide the semantic coupling dimension for further processing. This model has been validated on large sets of events from a real-world smart city. In consideration of amassed heterogeneous data streams, an event information management platform [7] is proposed to collect and analyze data streams coming from heterogeneous sensors. The platform makes applications run as cyber-physical-social systems.
In the field of semantic-level interaction with context, there are some existing research efforts and standards. SensorML, which is an approved Open Geospatial Consortium standard, provides standard models and an XML encoding for describing sensors and measurement processes. The World Wide Web Consortium (W3C) has initiated the Semantic Sensor Networks Community Group to develop the semantic sensor network (SSN) ontology. The SSN ontology can model sensor devices, systems, processes, and observations so as to enable expressive representation of sensors, sensor observations, and knowledge of the environment. The SSN ontology is encoded in the Web Ontology Language and has begun to achieve broad adoption and application within the sensor community. It is currently being used by various organizations, from academia, government, and industry, for improved management of sensor data on the Web, involving annotation, integration, publishing, and search.
2) Multisource Data Fusion: Heterogeneous data from multiple sources means that the data from different sensors have different structures. When heterogeneous and varied sensor data are acquired, multisource data should be merged to create a comprehensive and meaningful view for further use. This data integration process, which unifies data from different data sources, is also called multisource data fusion.
Because heterogeneous data from different sensor devices make the aggregation, integration, and collaboration of the data harder, a heterogeneous data integration model [8] is proposed for data integration and analysis. The model aims to eliminate the heterogeneity of distributed data, to build a customized application view for the upper-layer applications, and to maintain the integrity and consistency of the data. At the protocol level, a conceptual solution for heterogeneous sensor data integration [9] in crowd sensing applications is presented to combine different kinds of protocols for different types of data. Three kinds of protocols, namely HL7 for medical data, BACnet for building monitoring, and the observation and measurement model for environmental data, are integrated into one data model to manage sensors and actions. Aiming to work in a heterogeneous network, a novel approach [10] is proposed to monitor a typical plastic industry environment with WSN, services, and Google Gadgets. It uses micro-injection to make use of a heterogeneous network of wireless sensor nodes, which transmit environment data such as material storage conditions, ambient lab temperature, and humidity. In [11], IoT resources are integrated as a novel automatic resource type on the business process layer so as to accommodate more changes in future enterprise
environments. The research solves the problem of integrating the IoT paradigm and its devices, which come with native software components, as resources. Thing Broker [12] aims to integrate IoT objects with different characteristics, protocols, interfaces, and constraints while maintaining the simplicity and flexibility required for a variety of applications. Thing Broker provides a uniform access interface to different IoT objects. Using a single abstraction to represent IoT objects with their own configurable attributes, Thing Broker covers all sorts of objects, from physical sensors to high-level services.
3) Data Transmission and Communication: Once data acquisition from different sensors is finished, data transmission and communication should be carried out so as to transmit data to the back-end or to communicate with other sensors for business purposes.
There are many protocols for data transmission and communication, such as the User Datagram Protocol (UDP) and TCP. Developers often use UDP to transmit multimedia data because of its real-time characteristics. However, when network congestion and channel noise occur, packets are easily lost under UDP. A new real-time multimedia transmission protocol over UDP, called control over UDP (CoUDP) [13], is designed to solve this problem. CoUDP is reported to perform better than UDP and TCP because it adds rate control and a fast retransmission mechanism on top of UDP and drops the redundant feedback used by TCP. In [14], a layered fault management scheme with fault managing program control and separate layer functions is designed to ensure the reliability of end-to-end transmission for IoT applications. The proposed scheme suits IoT requirements well. It uses fuzzy cognitive maps to realize integrated evaluation and prediction of possible faults, which addresses the problem that existing algorithms are not suitable for complex conditions. In [15], a proximity-based authentication approach is proposed that utilizes the wireless communication interface. Its advantage is that proximity can be inferred within about 1 s from ambient radio signals.
In short, current methods related to data acquisition and integration can be summarized as follows. Because of the great diversity and heterogeneity of data, data representation models in cloud environments do not have a unified form for sensor data acquisition and integration. Multisource data fusion is still a main problem in applications. Although data transfer protocols are mature, there are still some problems in the interoperability between sensor data.

B. Data Storage Module
In IoT applications, the massive data from sensors consume large storage space. Meanwhile, because different roles and tenants require different service and security levels, data should be isolated to meet various requirements. Therefore, how to share and isolate these data in cloud platforms is the main challenge in IoT data storage.
1) Data Storage Types: With regard to data storage in cloud platforms, existing works can be classified into several types: RDBMS, NoSQL DBMS, DBMS based on HDFS, main-memory DBMS, and graph DBMS.
a) RDBMS: Many structured data storage platforms are based on RDBMS. Although massive data are generated rapidly and in varied forms, the relations between these data are always essential for a multitenant data storage system [16]. In that system, traditional relational data and virtual relational data are combined in a single schema, which exports a unified data access view and acts as a multitenant database for different tenants. There is an RDBMS approach called Ultrawrap [17], which encodes a logical representation of each RDBMS as an RDF graph, and uses SPARQL queries to get the data from the existing relational stored views.
b) NoSQL DBMS: Unlike RDBMS, NoSQL DBMS stores and manages unstructured data in a key-value model. A NoSQL DB is schema-free. It can provide properties such as horizontal scalability, distributed storage, dynamic schemas, etc. On the other hand, a NoSQL DB is not good at guaranteeing atomicity, consistency, isolation, and durability of data, and it does not support some distributed queries well. With regard to database access, some work has been done on the potential integration of an ontology-based data access approach in NoSQL stores [18]. It is a new data management style that exploits the semantic information represented in ontologies when searching IoT data stores. Some related work has also tried to integrate features of distributed file systems into IoT data stores.
c) DBMS integrated with HDFS: HDFS can also be extended to a special distributed file repository, which processes massive unstructured files efficiently. In IoT data streams, many data are generated in XML format, and how to deal with these small-sized, huge-volume XML files becomes an important challenge. One approach [19] is to optimize storing and accessing massive small XML files in HDFS. Small XML files are merged into a larger file to reduce the metadata at the name node, so that related mechanisms can be used to improve data store performance. With the help of a new central-indexing service discovery system [20], based on the Hadoop HBase data store, the performance of service discovery is increased.
d) Main-memory DBMS: High performance of I/O stream processing is important for large-scale IoT applications. Lu and Ye [21] implemented a large-scale RFID application in the main-memory database system H2. Their work also provides a multidimensional hash-based index design framework and reports improved performance in its evaluation. Hara et al. [22] analyzed the physical database structure that realizes high-speed database migration, which is an important part of cloud data storage, and proposed several recovery approaches for migratory main-memory databases.
e) Graph DBMS: A graph DBMS is a database that uses graph structures with nodes, edges, and properties to represent and store data and to support semantic queries. With a graph DBMS, the relationships among sensor data can be managed efficiently. Reference [23] provides a high-performance graph DBMS that supports efficient manipulation of large graphs consisting of large numbers of nodes and edges.
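The node, edge, and property structure described for graph DBMSs can be illustrated with a minimal in-memory sketch. It is not taken from [23] or from any particular graph database product; the class `PropertyGraph`, the helper `sensors_in`, the sensor and room identifiers, and the edge labels are all hypothetical names chosen for illustration.

```python
from collections import defaultdict


class PropertyGraph:
    """Tiny in-memory property graph: nodes and edges both carry properties."""

    def __init__(self):
        self.nodes = {}                     # node id -> property dict
        self.out_edges = defaultdict(list)  # node id -> [(label, target id, property dict)]

    def add_node(self, node_id, **props):
        self.nodes[node_id] = props

    def add_edge(self, source, label, target, **props):
        # Edges may only connect nodes that already exist.
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must be added as nodes first")
        self.out_edges[source].append((label, target, props))

    def neighbors(self, node_id, label=None):
        """Return ids reachable over one outgoing edge, optionally filtered by label."""
        return [
            target
            for (edge_label, target, _props) in self.out_edges[node_id]
            if label is None or edge_label == label
        ]


def sensors_in(graph, room_id):
    """Hypothetical query: all nodes with a LOCATED_IN edge to the given room."""
    return sorted(
        node_id
        for node_id in graph.nodes
        if room_id in graph.neighbors(node_id, "LOCATED_IN")
    )


# Illustrative data: two sensors in one room, with a correlation edge between them.
graph = PropertyGraph()
graph.add_node("room-101", kind="room")
graph.add_node("temp-1", kind="sensor", unit="celsius")
graph.add_node("hum-1", kind="sensor", unit="percent")
graph.add_edge("temp-1", "LOCATED_IN", "room-101")
graph.add_edge("hum-1", "LOCATED_IN", "room-101")
graph.add_edge("temp-1", "CORRELATES_WITH", "hum-1", window_s=60)

print(sensors_in(graph, "room-101"))
```

Here relationships are first-class data, so a question such as "which sensors are in this room" becomes an edge traversal rather than a join. A production graph DBMS would add persistence, indexing, and a query language on top of this idea.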
f) RDF-based data storage: RDF is a semi-structured data model for Web information resource management. RDF schema (RDFS) [24] provides an ontology description language for grouping resources into concepts and identifying the relationships among these concepts. Cloud-based RDF data management [25] provides a principled categorization of existing work on RDF data management.
To support the different data types of IoT sensors, different storage types should be combined so as to realize effective data storage. A comparison of different data storage types is given in Table I.

TABLE I
COMPARISON BETWEEN DATA STORAGE TYPES

2) Data Isolation in Cloud Computing: Another challenge in IoT data storage comes from resource elasticity in the IoT cloud computing framework. Differences in data authority and data provision performance give rise to the requirement of data isolation in cloud platforms. To improve resource utilization, most cloud platforms allow tenants to share the same computing resources. This approach may cause inconsistency and latency in data content, and low efficiency in data performance.
a) Multitenant isolation: In terms of storage realization, the common multitenant data isolation methods can be classified into four types: the shared table pattern, the dependent table pattern, the dependent database pattern, and the dependent virtual machine pattern. The shared table pattern in [26] reduces the cost of switching between different hosted owners. The dependent table pattern, dependent database pattern, and dependent virtual machine pattern mean that different tenants have their own target table, database, or virtual machine.
Cai et al. [27] proposed a database management model that supports multiple database integration and unified data access. The model also integrates a multitenant data isolation mechanism. In addition, in [27], a layered reference model is proposed for IoT data management. The model consists of a data cleaning layer, an event processing layer, and a data storage and analysis layer. It provides a layered view of data management in the IoT area.
b) Performance isolation: For resource provision, a tenant-based resource allocation model for resource management in cloud computing environments is badly needed. Some related work has been done in this area. The work in [28] provides formal measurements for provisioning of virtualized resources in cloud environments, and provides a resource allocation model and a multitenant configuration environment for applications on the cloud platform. In order to maintain data consistency, [29] designed a distributed caching and scheduling middleware, MicroFuge, which can provide performance isolation for storage systems. Based on an empirically driven performance model, MicroFuge uses an adaptive deadline-aware cache eviction module to reduce deadline misses. A similar idea is also used in noninvasive and energy-efficient performance isolation in virtualized servers. The NINEPIN framework combines self-adaptive machine learning and a robust target tracking predictive model [30]. It outperforms an existing performance isolation approach on a virtualized server cluster.
In general, to adapt to the high heterogeneity of IoT data from distributed data sources, it is popular to combine different data storage types, such as RDBMS integrated with HDFS, so as to construct scalable data storage in cloud environments. However, there is still a great contradiction between user authority and performance flexibility. Performance isolation has to be implemented at different levels with consideration of different data types. The problem of sharing and isolating these data in cloud platforms is still a main challenge in IoT data storage when the characteristics of different applications are considered.

C. Data Management Module
The data management module forms an intelligent and effective database for further distributed or parallel IoT applications. In the area of data management, related works can be classified into three aspects: 1) metadata management; 2) semantic annotation; and 3) data indexing.
1) Data Management Based on Metadata: Metadata management is generally defined as the end-to-end process and the governance framework for creating, controlling, enhancing, attributing, defining, and managing a metadata schema, model, or other structured aggregation system, either independently or within a repository, together with the associated supporting processes. It makes data easy for users to organize and understand without their being involved in every detail of the access solution.
Many storage systems, such as HDFS, decouple metadata management from file data access. Based on HDFS in cloud computing, [31] proposes a metadata management scheme. This scheme employs multiple name nodes and divides metadata into "buckets," which can be migrated among name nodes dynamically according to the system workload. Also, in order to maintain reliability, metadata is replicated in different
name nodes with log replication technology. The Paxos algorithm is used to keep the replicas consistent.
In [32], an efficient distributed metadata management scheme is proposed for cloud computing. It can deliver high-performance and scalable metadata services by using a metadata distribution method based on the parent directory path ID, a mimic hierarchical directory structure, improved Chord, a cooperative double-layer cache mechanism algorithm, and the application of a database to metadata. In order to access distributed data with reduced latency, a dynamic metadata model in databases for cloud computing [33] is proposed to reduce the overhead that occurs while data are being retrieved from the data server.
Automatic metadata generation using associative networks [34] provides an automatic metadata generation system that leverages resource relationships generated from existing metadata as a medium for propagation from metadata-rich to metadata-poor resources. By means of a discrete-form spreading activation algorithm, metadata associated with metadata-rich resources is propagated to metadata-poor resources for different substrates of the associative network.
2) Semantic Annotation: With the rapid development of cloud computing, users expect a better experience through cloud computing, such as service computing and multimedia computing. Many IoT applications depend heavily on understanding the data, so large-scale data annotation has received intensive attention in recent years. As an important part of retrieval technology, semantic annotation determines the retrieval results through its accuracy.
A new service model is proposed in [35], which is closely related to previous distributed computing methods, such as Web services and grid computing. By encapsulating two functions, registering services with a semantic description and searching for services that meet the required expectations, the researchers implement a semantically enhanced platform to assist the process of cloud service discovery.
In [36], a new approach to semantic annotation with linked data is presented for document enrichment in the domain of education. Unlike traditional semantic annotation, which connects a relevant term of the document with an instance of the ontology, the approach connects relevant terms to graphs of the ontology. After its expansion process, only relevant and contextualized information is included. Since the document is annotated with a set of interconnected graphs, students can access and navigate through these contents in the docu-

searching every row in the database tables every time when database tables are accessed. In the area of data indexing, existing related work can be classified into three types: bitmap index, complex data structure index, and inverted index.
a) Bitmap index: A bitmap index is a kind of database index that uses the bitmap data structure. It is considered to work well for low-cardinality columns and is suitable for analytical processing with less data storage space. To reduce the load of the retrieval process in cloud storage, a method of bucketization [38] is proposed based on the traditional bitmap index. Tuples with attributes of interest are divided into buckets according to their attribute values, and the original attribute values are then hidden behind the corresponding bucket indexes. To help trace objects along supply chains in the area of IoT, Li et al. [20] proposed a storage schema that uses the event time-stamp as the column identifier and the event index content as the cell value to build a central-indexing device service system.
b) Complex data structure index: Because bitmap indexes are weak in transaction processing, index types based on other data structures are used to improve performance. Ma et al. [39] used the B+ Tree and the R-Tree to build efficient indexes for the massive data of IoT in cloud computing environments. Also, in [40], an object store called Walnut is proposed, which was developed at Yahoo! and uses the bLSM index, based on a special data structure and the LSM-Tree, for various cloud data management systems.
c) Inverted index: An inverted index is a kind of index data structure that stores a mapping from content to its locations in order to improve full-text searching. It is the most popular data structure used in document retrieval systems for large-scale search engines. Indexing word sequences for ranked retrieval [41] presents and analyzes a new index structure designed to improve query efficiency in dependency retrieval models. The authors present a novel approach to estimating n-gram statistics for information retrieval tasks. The index structure is scalable in both query processing time and space requirements, and it supports the exploration of n-grams as query features. In [42], a data scheme is proposed that extends the inverted index approach and combines it with techniques for the design of SSE. To reduce the size of inverted indexes, Vishwakarma et al. [43] proposed an approach that prunes whole documents from the index based on their importance and relevance to the top-k results.
ment so as to deepen the topics. This approach provides a The elimination is taken based on the scores of individual
better description, moreover considering the semantic nature documents.
of linked data and is more suitable in the domain of education. A comparison of different data index methods is given
Tao et al. [37] presented a scheme for image annotation in Table II.
on the cloud, which transmits mobile images compressed With the rapid increase in the amount of data and their cor-
by Hamming compressed sensing to the cloud and conducts relation, automatic metadata generation, ontology generation
semantic annotation through a special support vector machine and evolution, and efficient, low cost and dynamic-updating
on the cloud. data indexing have attracted great attention. How to imple-
3) Data Indexing Strategy: Data indexing can make data ment a unified pervasive data management model that solves
retrieval operations more effective at the cost of additional the contradiction between secured sharing and performance
writing operations and extra data storage spaces for data isolation is the emphasis and difficulty in current study of
indexes. With indexes, DBMS can quickly locate data without data management.
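As a brief illustration of the inverted index described under c) above, the following Python sketch maps each term to the set of documents that contain it and answers a conjunctive query by intersecting the posting sets. The function names `build_inverted_index` and `search_all`, the whitespace tokenizer, and the sample sensor documents are assumptions made for illustration only; they are not taken from [41]-[43].

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased term to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search_all(index, query):
    """Return the sorted IDs of documents that contain every query term."""
    terms = query.lower().split()
    if not terms:
        return []
    # index.get avoids creating empty entries for unknown terms
    postings = [index.get(term, set()) for term in terms]
    return sorted(set.intersection(*postings))


# Hypothetical sample documents (not from the surveyed systems)
docs = {
    1: "temperature sensor reading from smart factory",
    2: "humidity sensor reading from smart house",
    3: "video stream from smart city camera",
}
index = build_inverted_index(docs)
print(search_all(index, "sensor reading"))  # [1, 2]
print(search_all(index, "smart city"))      # [3]
```

The extra writes performed in `build_inverted_index` and the memory held by `index` correspond to the indexing cost noted at the start of this subsection; pruning approaches such as [43] aim to reduce the second of these.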

TABLE II
COMPARISON BETWEEN DATA INDEXING METHODS

D. Data Processing Module
In cloud-based parallel or distributed data processing, MapReduce [44], with its open-source implementation Hadoop, is one of the most popular parallel processing methods on cloud platforms. For the purposes of parallelization, scalability, load balancing, and fault tolerance, MapReduce is widely used in query processing for data analysis tasks on cloud platforms. However, MapReduce does not directly support more complex operations such as joins. More research on high-level, declarative management of complex data such as RDF is required for massively parallel processing of IoT data in the cloud.
1) Parallel Processing Methods for Complex Operations: In [45], a processing framework called wave is designed for bulk data processing, incremental computing, and iterative processing. Wave uses an implicit mechanism to synchronize the execution of parallel programs without any user specification; programmers use events and trigger reactions to process the data. Selective embedded just-in-time specialization (SEJITS) [46] is used to execute complex analytic queries on massive semantic graphs in big-data analytics. A domain-specific language is implemented to enable flexible filtering and customization of graph algorithms without sacrificing performance, using SEJITS selective compilation techniques.
2) Parallel Processing Methods for Semi-Structural Data: For RDF data processing, an efficient and customizable data partitioning framework, SPA [47], which targets distributed processing of big RDF data, is presented to support fast processing across different sizes and levels of complexity. A MapReduce framework [48] is designed to carry out SPARQL query processing. Thus, RDFS reasoning can be involved in deductive databases, and recursive query processing techniques are therefore implemented.
3) Parallel Processing Methods for Data Stream: On cloud platforms, data sometimes arrive as a stream, and the processing algorithm must work on the data without explicitly storing them. Existing parallel frameworks in the cloud, such as MapReduce and its variations, are unable to support complex parallel processing effectively. Traditional sequential pattern algorithms may face scalability challenges when dealing with big data. To address the problem of optimizing parallel data mining, a heuristic cloud bursting algorithm, maximally overlapped bin-packing driven bursting, is developed, which considers the time overlap to improve data mining parallelization [49]. Spreitzer et al. [50] presented Ripple, a middleware that is built on iterated MapReduce for distributed data analytics with the support of various styles of analytics in the same platform and on the same data.
In general, distributed processing in cloud environments is still mainly based on MapReduce, which can be conducted after the expansion of different types of (structured, semi-structured, and unstructured) data. However, considering some demerits of MapReduce, such as high communication cost, redundant processing, and lack of interactivity in real-time processing, high-performance distributed data processing methods that do not rely on MapReduce are required in some applications involving complex processing, such as approximate reasoning or highly interactive processing.

E. Data Mining Module
Although many data mining methods are employed in cloud computing, the algorithms cannot simply be transplanted into IoT applications. Because IoT data are highly dynamic and widely distributed, an effective and efficient way to process huge amounts of data is needed. There are mainly three kinds of data mining for IoT applications.
1) Data Mining in Parallel Programming: Data mining algorithms on cloud platforms mainly follow a parallel pattern, in contrast to those on other platforms. The research in [51] focuses on the classification problem in data mining on cloud platforms, motivated by the identification of birds, and proposes a modern cyclic approach to solve the classification problem, which is verified to be quite efficient. The work in [52] is more interested in the clustering problem and proposes a parallel K-means algorithm based on the Hadoop platform. Others find it more important to mine the association rules of big data or to make predictions based on the mining results. Yu and Zhu [53] proposed a dynamic resource-provisioning algorithm to predict resource utilization.
2) Data Mining in Mobile Computing: Mobile computing has been another hot topic in recent years and may be another direction of development for IoT data analysis, since IoT also involves plenty of smart devices. Research on data mining in mobile environments can serve as a good reference for IoT data mining. The research in [54] proposes a novel algorithm named wireless heterogeneous data mining (WHDM) to find frequent patterns or knowledge from big data, and WHDM is shown to be efficient. In [55], distributed Hoeffding trees are used to classify streaming data in mobile computing environments. Personal mobile commerce pattern mining and prediction [56] develops pattern mining

and prediction techniques that explore the correlation between the moving behavior and purchasing transactions of mobile users.
3) Data Mining for Graphs: Among different events, some exhibit strong correlations with the network structure, while others do not. Such structural correlations shed light on the viral influence existing in the corresponding network. Unfortunately, the traditional association mining concept is not applicable to graphs because it only works on homogeneous data sets such as transactions and baskets. A special kind of data mining, graph mining, is usually used to explore frequent patterns in networks or databases. The special structure of graphs makes graph mining distinctive in data mining. In addition, it is usually more efficient in some specific areas than common cloud data mining methods. Chen et al. [57] proposed a new approach called cloud-based SpiderMine that uses cloud computing to do graph mining. Similarly, Lai et al. [58] make use of cloud computing and propose a robust and efficient MapReduce-based graph mining tool. The research in [59] proposes a novel measure for assessing structural correlations in heterogeneous graph data sets with events.
Considering the different methods of IoT data mining on cloud platforms, a comparison is given in Table III.

TABLE III
COMPARISON BETWEEN DATA MINING METHODS

In general, the massive IoT data on cloud platforms are highly dynamic and time-related. Most of them take the form of dataflows. Traditional data mining methods are unable to process such data intelligently. This is the weakest as well as the most difficult link in current applications. Dataflow-oriented processing technologies for real-time intelligent applications are the main challenge of IoT data mining on cloud platforms.

F. Comprehensive Application Optimization Module
The efficiency optimization of the application layer can be viewed from the following three aspects.
1) Architecture Optimization: Based on application analysis, architecture optimization is an effective way to achieve good performance. In [60], an efficient scheduling model with a caching mechanism is proposed for the gateway of distributed sensors in smart living. Peng et al. [61] analyzed the characteristics of data transmission and proposed a message-oriented middleware data processing model for IoT so as to make data transmission more efficient and more convenient. Using machine learning-based analysis, a middleware approach is implemented [62] to link scenario information to the network and derive meaningful communication predictions for efficient wireless communication.
2) Data Storage Optimization: Wang et al. [63] proposed a storage approach that uses a hierarchical extended storage mechanism to handle massive dynamic data. It can store data separately according to the data type and add storage nodes dynamically. Data deduplication is a process that breaks data streams into smaller data chunks and removes duplicate chunks. Tan et al. [64] pointed out that the removal of duplicate data chunks leads to a de-linearization of data placement, which sometimes affects read performance, throughput, and efficiency. An effective way is proposed to reduce the de-linearization of data placement at the cost of only a small reduction in compression ratio. IoT data are often stored in the form of small files. Dong et al. [65] proposed different approaches to improve the efficiency of storing and accessing small files on cloud storage, including file grouping or file merging and prefetching schemes. Ning et al. [66] designed a new virtualized socket library that applies shared memory to data transmission and uses a buffer to store I/O requests.
3) Data Operation Optimization: Because the effectiveness and scalability of MapReduce-based implementations of complex data-intensive tasks depend on an even redistribution of data between map and reduce tasks, complex redistribution approaches are necessary to achieve load balancing among all reduce tasks to be executed in parallel for skewed data. Kolb et al. [67] proposed two approaches for skew handling and load balancing that reduce the search space of entity resolution, utilize a preprocessing MapReduce job to analyze the data distribution, and distribute the entities of large blocks among multiple reduce tasks. The current practices of static slot configuration and fair resource sharing may not utilize resources efficiently. When high-priority jobs share resources with lower-priority jobs, fair sharing conflicts with priority-based scheduling. P2P-MapReduce [68] provides a more reliable MapReduce middleware that can be effectively applied in dynamic cloud infrastructures. To provide a more reliable MapReduce middleware, an adaptive MapReduce framework named P2P-MapReduce is designed

which exploits a peer-to-peer model to manage node churn and provides a good level of fault tolerance. Lu et al. [69] studied the optimization of resource utilization in Hadoop and presented a nonintrusive slot layering solution that uses two tiers of slots (active and passive) to increase the degree of concurrency with minimal performance interference.
The optimization of IoT applications can be approached from many aspects. A currently feasible and important method, which attracts much attention, is to divide applications into compute-intensive and storage-intensive types and to optimize the related algorithms for each type on the basis of architecture optimization. However, the architectural features of different kinds of applications vary with the level of data content and the storage type.
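To make the chunk-level deduplication mentioned under Data Storage Optimization more concrete, the sketch below splits a byte stream into fixed-size chunks, stores each distinct chunk once under its SHA-256 digest, and keeps an ordered list of digests (a "recipe") for reconstruction. Fixed-size chunking, the 4-byte chunk size, and the names `deduplicate` and `restore` are assumptions chosen for brevity; this is not the scheme of [64], and practical systems often use content-defined chunking instead.

```python
import hashlib


def deduplicate(data: bytes, chunk_size: int = 4):
    """Split data into fixed-size chunks and store each distinct chunk once.

    Returns (store, recipe): store maps a SHA-256 hex digest to the chunk
    bytes, and recipe lists the digests in their original order.
    """
    store = {}
    recipe = []
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()
        store.setdefault(digest, chunk)  # keep only the first copy
        recipe.append(digest)
    return store, recipe


def restore(store, recipe) -> bytes:
    """Rebuild the original byte stream from the chunk store and recipe."""
    return b"".join(store[digest] for digest in recipe)


# Hypothetical sample stream with repeated chunks
stream = b"AAAABBBBAAAACCCCBBBB"
store, recipe = deduplicate(stream)
print(len(recipe), len(store))  # 5 3
assert restore(store, recipe) == stream
```

Because the recipe refers to chunks by digest, restoring a stream may read chunks that are no longer stored contiguously, which is the de-linearization effect pointed out by Tan et al. [64].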

IV. NEW TECHNOLOGY IN FUTURE PROSPECTS
Over the past ten years, we have stepped from Web 1.0, which focused on one-way information creation and passive customers, to Web 2.0, which features information co-creation and active customers. Now, we are moving from Web 2.0 toward Web 3.0, the stage of the ubiquitous computing Web. Based on the analysis of the characteristics of IoT data in Section I, we present some open issues in different technical areas for handling these challenges, so as to describe the future prospects of IoT storage systems on cloud platforms. These issues cover a new architecture for the seamless integration of cloud computing with IoT objects, smart things for contextual and real-time event processing, big linked data for massive semantic storage and management, a new parallel and dynamic processing pattern for high performance, and data stream mining for dynamic and uncertain decision making.

A. Cloud of Things
With the rapid development of IoT applications on cloud platforms, the number of connected devices has increased very rapidly. It has been said that devices outnumbered people on Earth in 2011, and the number of connected devices is expected to reach 24 billion by 2020. All these devices will be connected via cloud platforms for different applications. The integration of IoT and cloud computing forms a new paradigm, which has been termed the cloud of things (CoT) [70].
In CoT, IoT objects are extended from sensors to every front-end thing in the Internet, and distributed sites are connected as a whole, such as the smart house, smart factory, smart city, and smart planet. Based on CoT, a logical architecture of the smart city [71] is given. With the merging of cloud platforms and IoT, CoT is required to enrich the interaction and interoperability of massive numbers of devices so as to support smart and intelligent applications. CoT will play an increasingly important role in different industries and research areas. Some issues, such as resource allocation balancing energy and efficiency, IPv6-based identity management, quality-of-service provisioning, architecture for data storage, security and privacy, and unnecessary communication of data, will be involved in CoT [2]. Based on ontology-based multitenant data management in cloud environments, a typical CoT platform is also given for IoT applications [72], as shown in Fig. 2.

Fig. 2. Ontology-based CoT platform [72].

When the number of devices increases, heterogeneous data and services will be involved in CoT. Rather than viewing data and resources from the single perspective of the cloud or of current IoT applications, CoT will pay more attention to business insight. The issues related to the integration of IoT with cloud computing require a smart gateway to perform the rich tasks and preprocessing, because traditional sensors are not capable of fulfilling the task of data storage and processing.

B. Smart Things With Contextual Data Representation
Sensor data as a service faces the issues of interoperability and reusability for massive heterogeneous sensor data and data services. Therefore, how to integrate these data acquisition modules with flexible performance and, especially, how to develop a smart device as an intelligent and self-organizing device on the cloud platform are important open issues.
Since connected devices are rapidly increasing, there will be a lot of data as well. Massive unstructured and semi-structured data, such as video, images, and XML files, are generated by the devices in IoT applications. How to handle these big data is therefore the main challenge.
Considering that online object association could act as a user interface of smart things and provide a semantic

relationship for intelligent applications, an intelligence and contextualization model is important and fundamental for constructing a unified semantic view for further processing. However, it is difficult to maintain consistency and achieve intelligence in a massive heterogeneous data environment. From the viewpoint of data processing, a typical semantic case is Lilliput [73], an ontology-based platform for the IoT. By providing semantic information, including online social networks and contextual information such as the location of smart things, as a social graph, Lilliput enables unified access control with less effort and good consistency, supporting intelligent IoT applications without requiring an understanding of device details.
An architecture supporting semantic data processing is given for knowledge acquisition and management [74]. The architecture aims to find high-level information from heterogeneous Web resources, such as RDF files. In it, unstructured content analysis capabilities, such as UIMA, are integrated in a coordinated environment supporting the processing, transformation, and projection of the produced metadata into RDF semantic repositories. Although current IoT semantic association is able to reflect the up-to-date physical/online context, it is limited in representing temporal social relationships among people, objects, and places, which adds cost when various machine learning methods are used. With the development of intelligent IoT applications, enhanced intelligence and contextualization models would enrich IoT with more expressive semantic associations and support social interaction reasoning between smart things. This will help smart things form convenient and powerful devices or environments for intelligent IoT applications.

C. Big Linked Data for Semantic Data Management
Because complex data associations are generated from different sources or complex data structures, extracting relevant information in a multilingual context from massive amounts of unstructured, structured, and semi-structured data is a challenging task. Various theories have been developed and applied to ease access to multicultural and multilingual resources. Linked data [75] is defined as typed links between data from different data sources, such as different databases or data nodes, by means of the Web. Big linked data, named Blinked data [76], is an instance of big data that is the union of big data and linked data. For the purpose of effective data management, semantic annotation based on linked data raises a new issue in massive, complexly associated, and contextual application scenarios. These associated and contextual data play a critical role in intelligent applications.
Driven by semantic technologies such as linked data and ontology, we predict that semantic data processing approaches will improve greatly in the near future, and that more natural and meaningful ways of working with high-level information will become common in different IoT areas. Combined with natural language processing, semantic technology will be used to create more intelligent applications. A reference linked data processing platform is also given for social networks (see Fig. 3).

Fig. 3. Reference architecture of a social network application based on linked data [77].

D. Data Stream Mining for Intelligent Application
Unstructured data such as video cannot be stored in a structured database system for analysis. Data mining on dataflows from different data sources with nonpersistent associations is a new but important issue. There are several different directions for processing dataflows with dynamic methods, for example, extracting features from a continuous dataflow so as to build data associations, or processing a whole fragment of a dataflow by function transformation. Data stream processing in dynamic and decentralized peer-to-peer networks [78] processes data streams from different active data sources. The approach involves three areas: data source management, continuous query distribution, and distributed query management.
Data stream mining involves uncertain reasoning based on partitioned data and the use of intermediate results for high efficiency. When unstructured and semi-structured data are also involved in the processing, many research and technical problems remain to be solved.

E. New Pattern for Parallel and Dynamic Data Processing
In a big IoT data environment, data change in type, state, and analysis purpose. Rather than centralized master-server implementations, a parallel and particle data processing framework is needed to enable the execution of the MapReduce pattern in dynamic cloud infrastructures. An approach that uses the MapReduce framework for large-scale graph data processing is given in [79]. The approach relies on density-based partitioning to build balanced partitions of a graph database over a set of machines. The experiments show that the performance and scalability are satisfactory for large-scale data processing.
The scale of big data spread over many rapidly changing sources creates obstacles to finding useful information in these data. Therefore, these rebuild and

re-execution data mining algorithms are not applicable to big data analysis systems. We need new dynamic data mining algorithms that work on dataflows rather than on complete structured data.
However, despite its evident merits such as scalability, fault tolerance, ease of programming, and flexibility, MapReduce has limitations in interactive or real-time processing when handling IoT data. MapReduce is not perfect for every large-scale analytical task, and its high communication cost and redundant processing pose a big challenge for IoT applications. In [80], a technical framework for improving MapReduce is given. Based on these weaknesses and the current solutions, we give the optimization requirements on IoT data for large-scale processing (see Table IV).

TABLE IV
WEAKNESSES IN MAPREDUCE AND SOLVING TECHNOLOGY FOR IOT BIG DATA DISPOSING

There is an urgent need to design new data processing patterns other than the MapReduce pattern. We predict that the next generation of parallel data processing systems for massive data sets should combine the merits of existing approaches to support complex operations and unstructured data in a parallel and dynamic way.

V. CONCLUSION
As IoT technologies evolve, a substantial number of their applications have been found in many industries. This paper is a timely study that overviews current and potential IoT big data storage systems in cloud computing and at the same time surveys the state of the art in the literature from the viewpoint of the data processing process.
The IoT storage system enables tracking of essential information about items as they move through cloud platforms. It shows significant value for IoT applications by providing accurate knowledge of current IoT data processing, which results in higher availability and flexible resource provision.
Data storage systems supporting IoT devices can be utilized to improve the overall data processing efficiency and offer a huge competitive advantage to IoT applications. It has been shown that semantic relationships among IoT data will lead to greater global intelligence and interoperability capabilities (contextual business scenes, semantic annotation, multidevice cooperation, etc.). IoT data storage systems will enable enterprises to acquire such capabilities.

REFERENCES
[1] M. Ma, P. Wang, and C.-H. Chu, “Data management for Internet of Things: Challenges, approaches and opportunities,” in Proc. IEEE Int. Conf. IEEE Cyber Phys. Soc. Comput. Green Comput. Commun. (GreenCom) IEEE Internet Things (iThings/CPSCom), Beijing, China, 2013, pp. 1144–1151.
[2] M. Aazam, I. Khan, A. A. Alsaffar, and E.-N. Huh, “Cloud of things: Integrating Internet of Things and cloud computing and the issues involved,” in Proc. 11th Int. Bhurban Conf. Appl. Sci. Technol. (IBCAST), Islamabad, Pakistan, Jan. 2014, pp. 414–419.
[3] W. Wang and D. Guo, “Towards unified heterogeneous event processing for the Internet of Things,” in Proc. 3rd Int. Conf. Internet Things (IOT), Wuxi, China, 2012, pp. 84–91.
[4] A. Somov, C. Dupont, and R. Giaffreda, “Supporting smart-city mobility with cognitive Internet of Things,” in Proc. Future Netw. Mobile Summit (FutureNetworkSummit), Lisbon, Portugal, 2013, pp. 1–10.
[5] S. Mayer, A. Tschofen, A. K. Dey, and F. Mattern, “User interfaces for smart things—A generative approach with semantic interaction descriptions,” ACM Trans. Comput. Human Interact., vol. 21, no. 2, 2014, Art. no. 12.
[6] S. Hasan and E. Curry, “Approximate semantic matching of events for the Internet of Things,” ACM Trans. Internet Technol., vol. 14, no. 1, 2014, Art. no. 2.
[7] M.-S. Dao et al., “A real-time complex event discovery platform for cyber-physical-social systems,” in Proc. Int. Conf. Multimedia Retrieval, Glasgow, U.K., 2014, p. 201.
[8] H. Liu, Y. Liu, Q. Wu, and S. Ma, Geo-Informatics in Resource Management and Sustainable Ecosystem (Communications in Computer and Information Science). Heidelberg, Germany: Springer, 2013, pp. 298–312.
[9] S. Villarroya et al., “Heterogeneous sensor data integration for crowdsensing applications,” in Proc. 18th Int. Database Eng. Appl. Symp., Porto, Portugal, 2014, pp. 270–273.
[10] U. Raza, B. Whiteside, and F. Hu, “An enterprise service bus (ESB) and Google gadgets based micro-injection moulding process monitoring system,” in Proc. IET Conf. Wireless Sensor Syst. (WSS), 2012, pp. 1–6.
[11] S. Meyer, A. Ruppen, and C. Magerkurth, “Internet of Things-aware process modeling: Integrating IoT devices as business process resources,” in Proc. Int. Conf. Adv. Inf. Syst. Eng., Valencia, Spain, 2013, pp. 84–98.
[12] R. A. P. de Almeida et al., “Thing broker: A Twitter for things,” in Proc. ACM Conf. Pervasive Ubiquitous Comput., Zürich, Switzerland, 2013, pp. 1545–1554.
[13] W. Jiang and L. Meng, “Design of real time multimedia platform and protocol to the Internet of Things,” in Proc. IEEE 11th Int. Conf. Trust Security Privacy Comput. Commun., Liverpool, U.K., 2012, pp. 1805–1810.
[14] X. Li, H. Ji, and Y. Li, “Layered fault management scheme for end-to-end transmission in Internet of Things,” Mobile Netw. Appl., vol. 18, no. 2, pp. 195–205, 2013.
[15] H. Shafagh and A. Hithnawi, “Poster: Come closer: Proximity-based authentication for the Internet of Things,” in Proc. 20th Annu. Int. Conf. Mobile Comput. Netw., Maui, HI, USA, 2014, pp. 421–424.

[16] H. M. Yaish, M. L. Goyal, and J. G. Feuerlicht, “Multi-tenant elastic extension tables data management,” Proc. Comput. Sci., vol. 29, pp. 2168–2181, Jun. 2014.
[17] J. F. Sequeda and D. P. Miranker, “Ultrawrap: SPARQL execution on relational data,” Web Semant. Sci. Services Agents World Wide Web, vol. 22, pp. 19–39, Oct. 2013.
[18] O. Curé, F. Kerdjoudj, D. Faye, C. L. Duc, and M. Lamolle, “On the potential integration of an ontology-based data access approach in NoSQL stores,” Int. J. Distrib. Syst. Technol., vol. 4, no. 3, pp. 17–30, 2013.
[19] Y. Zhang, W. Han, W. Wang, and C. Lei, “Optimizing the storage of massive electronic pedigrees in HDFS,” in Proc. 3rd Int. Conf. Internet Things (IOT), Wuxi, China, 2012, pp. 68–75.
[20] M. Li, Z. Zhu, and G. Chen, “A scalable and high-efficiency discovery service using a new storage,” in Proc. IEEE 37th Annu. Comput. Softw. Appl. Conf. (COMPSAC), Kyoto, Japan, 2013, pp. 754–759.
[21] Y.-F. Lu and S.-S. Ye, “A multi-dimension hash index design for main-memory RFID database applications,” in Proc. Int. Conf. Inf. Security Intell. Control (ISIC), 2012, pp. 61–64.
[22] T. Hara, K. Harumoto, M. Tsukamoto, S. Nishio, and J. Okui, “Main memory database for supporting database migration,” in Proc. IEEE Pac. Rim Conf. Commun. Comput. Signal Process. 10 Years PACRIM 1987–1997 Netw. Pac. Rim, vol. 1. Victoria, BC, Canada, 1997, pp. 231–234.
[23] N. Martinez-Bazan, S. Gomez-Villamor, and F. Escale-Claveras, “DEX: A high-performance graph database management system,” in Proc. IEEE 27th Int. Conf. Data Eng. Workshops (ICDEW), Hanover, Germany, 2011, pp. 124–127.
[24] D. Brickley and R. V. Guha, RDF Vocabulary Description Language 1.0: RDF Schema, 2004. [Online]. Available: http://www.w3.org/2001/sw/RDFCore/Schema/200203
[25] Z. Kaoudi and I. Manolescu, “Cloud-based RDF data management,” in Proc. ACM SIGMOD Int. Conf. Manag. Data, 2014, pp. 725–729.
[26] M. Grund, M. Schapranow, J. Krueger, J. Schaffner, and A. Bog, “Shared table access pattern analysis for multi-tenant applications,” in Proc. IEEE Symp. Adv. Manag. Inf. Glob. Enterprises (AMIGE), Tianjin, China, 2008, pp. 1–5.
[27] H. Cai et al., “IoT-based configurable information service platform for product lifecycle management,” IEEE Trans. Ind. Informat., vol. 10, no. 2, pp. 1558–1567, May 2014.
[40] J. Chen et al., “Walnut: A unified cloud object store,” in Proc. ACM SIGMOD Int. Conf. Manag. Data, Scottsdale, AZ, USA, 2012, pp. 743–754.
[41] S. Huston, J. S. Culpepper, and W. B. Croft, “Indexing word sequences for ranked retrieval,” ACM Trans. Inf. Syst., vol. 32, no. 1, 2014, Art. no. 3.
[42] S. Kamara, C. Papamanthou, and T. Roeder, “Dynamic searchable symmetric encryption,” in Proc. ACM Conf. Comput. Commun. Security, Hangzhou, China, 2012, pp. 965–976.
[43] S. K. Vishwakarma, K. I. Lakhtaria, D. Bhatnagar, and A. K. Sharma, “An efficient approach for inverted index pruning based on document relevance,” in Proc. 4th Int. Conf. Commun. Syst. Netw. Technol. (CSNT), Bhopal, India, 2014, pp. 487–490.
[44] S. Blanas et al., “A comparison of join algorithms for log processing in MapReduce,” in Proc. ACM SIGMOD Int. Conf. Manag. Data, Indianapolis, IN, USA, 2010, pp. 975–986.
[45] K. Lu et al., “Wave: Trigger based synchronous data process system,” in Proc. 14th IEEE/ACM Int. Symp. Cluster Cloud Grid Comput. (CCGrid), Chicago, IL, USA, 2014, pp. 540–541.
[46] A. Lugowski et al., “Parallel processing of filtered queries in attributed semantic graphs,” J. Parallel Distrib. Comput., vols. 79–80, pp. 115–131, May 2015.
[47] K. Lee, L. Liu, Y. Tang, Q. Zhang, and Y. Zhou, “Efficient and customizable data partitioning framework for distributed big RDF data processing in the cloud,” in Proc. IEEE CLOUD, Santa Clara, CA, USA, 2013, pp. 327–334.
[48] F. N. Afrati and J. D. Ullman, “Optimizing multiway joins in a map-reduce environment,” IEEE Trans. Knowl. Data Eng., vol. 23, no. 9, pp. 1282–1298, Sep. 2011.
[49] G. Jung, N. Gnanasambandam, and T. Mukherjee, “Synchronous parallel processing of big-data analytics services to optimize performance in federated clouds,” in Proc. IEEE 5th Int. Conf. Cloud Comput. (CLOUD), Honolulu, HI, USA, 2012, pp. 811–818.
[50] M. Spreitzer, M. Steinder, and I. Whalley, “Ripple: Improved architecture and programming model for bulk synchronous parallel style of analytics,” in Proc. IEEE 33rd Int. Conf. Distrib. Comput. Syst. (ICDCS), Philadelphia, PA, USA, 2013, pp. 460–469.
[51] D. B. Babu, R. S. R. Prasad, and N. K. Chakravarthy, “A modern cyclic approach to solve a classification problem in cloud environment,” in Proc. Int. Conf. Adv. Comput. Sci. Appl. Technol. (ACSAT), Kuching, Malaysia, 2013, pp. 207–212.
[28] J. Espadas et al., “A tenant-based resource allocation model for scaling
[52] X. Zhengqiao and Z. Dewei, “Research on clustering algorithm for mas-
software-as-a-service applications over cloud computing infrastructures,”
sive data based on Hadoop platform,” in Proc. Int. Conf. Comput. Sci.
Future Gener. Comput. Syst., vol. 29, no. 1, pp. 273–286, 2013.
Service Syst. (CSSS), Nanjing, China, 2012, pp. 43–45.
[29] A. K. Singh, X. Cui, B. Cassell, B. Wong, and K. Daudjee, “MicroFuge: [53] J. Yu and T. Zhu, “Towards dynamic resource provisioning for traffic
A middleware approach to providing performance isolation in cloud mining service cloud,” in Proc. IEEE Internet Things (iThings/CPSCom)
storage systems,” in Proc. IEEE 34th Int. Conf. Distrib. Comput. IEEE Int. Conf. IEEE Cyber Phys. Soc. Comput. Green Comput.
Syst. (ICDCS), Madrid, Spain, 2014, pp. 503–513. Commun. (GreenCom), Beijing, China, 2013, pp. 1296–1301.
[30] P. Lama and X. Zhou, “NINEPIN: Non-invasive and energy efficient [54] S. Pandey, N. Gupta, and A. K. Dubey, “A novel wireless heterogeneous
performance isolation in virtualized servers,” in Proc. IEEE/IFIP Int. data mining (WHDM) environment based on mobile computing environ-
Conf. Depend. Syst. Netw. (DSN), Boston, MA, USA, 2012, pp. 1–12. ments,” in Proc. Int. Conf. Commun. Syst. Netw. Technol. (CSNT), 2011,
[31] B. Li, Y. He, and K. Xu, “Distributed metadata management scheme pp. 298–302.
in cloud computing,” in Proc. 6th Int. Conf. Pervasive Comput. [55] F. T. Stahl, M. M. Gaber, M. Bramer, and P. S. Yu, “Distributed hoeffding
Appl. (ICPCA), Port Elizabeth, South Africa, 2011, pp. 32–38. trees for pocket data mining,” in Proc. Int. Conf. High Perform. Comput.
[32] Y. Wang and H. Lv, “Efficient metadata management in cloud com- Simulat. (HPCS), Istanbul, Turkey, 2011, pp. 686–692.
puting,” in Proc. IEEE 3rd Int. Conf. Commun. Softw. Netw. (ICCSN), [56] E. H.-C. Lu, W.-C. Lee, and V. S.-M. Tseng, “A framework for personal
Xi’an, China, 2011, pp. 514–519. mobile commerce pattern mining and prediction,” IEEE Trans. Knowl.
[33] R. Anitha and S. Mukherjee, Global Trends in Information Systems and Data Eng., vol. 24, no. 5, pp. 769–782, May 2012.
Software Applications (Communications in Computer and Information [57] C.-C. Chen, K.-W. Lee, C.-C. Chang, D.-N. Yang, and M.-S. Chen,
Science). Heidelberg, Germany: Springer, 2012, pp. 13–21. “Efficient large graph pattern mining for big data in the cloud,” in Proc.
[34] M. A. Rodriguez, J. Bollen, and H. V. D. Sompel, “Automatic metadata IEEE Int. Conf. Big Data, Silicon Valley, CA, USA, 2013, pp. 531–536.
generation using associative networks,” ACM Trans. Inf. Syst., vol. 27, [58] H.-C. Lai, C.-T. Li, Y.-C. Lo, and S.-D. Lin, “Exploiting and evaluat-
no. 2, 2009, Art. no. 7. ing MapReduce for large-scale graph mining,” in Proc. IEEE/ACM Int.
[35] M. Á. Rodríguez-García, R. Valencia-García, F. García-Sánchez, and Conf. Adv. Soc. Netw. Anal. Min. (ASONAM), Istanbul, Turkey, 2012,
J. J. Samper-Zapater, “Ontology-based annotation and retrieval of pp. 434–441.
services in the cloud,” Knowl. Based Syst., vol. 56, pp. 15–25, Jan. 2014. [59] J. Wu, Z. Guan, Q. Zhang, A. K. Singh, and X. Yan, “Static and
[36] J. C. Vidal, M. Lama, E. Otero-García, and A. Bugarín, “Graph-based dynamic structural correlations in graphs,” IEEE Trans. Knowl. Data
semantic annotation for enriching educational content with linked data,” Eng., vol. 25, no. 9, pp. 2147–2160, Sep. 2013.
Knowl. Based Syst., vol. 55, pp. 29–42, Jan. 2014. [60] Y. Lyu et al., “High-performance scheduling model for multisensor gate-
[37] D. Tao, L. Jin, W. Liu, and X. Li, “Hessian regularized support vec- way of cloud sensor system-based smart-living,” Inf. Fusion, vol. 21,
tor machines for mobile image annotation on the cloud,” IEEE Trans. pp. 42–56, Jan. 2015.
Multimedia, vol. 15, no. 4, pp. 833–844, Jun. 2013. [61] Z. Peng, Z. Jingling, and L. Qing, “Message oriented middleware data
[38] L. Xu and X. Wu, “Hub: Heterogeneous bucketization for database processing model in Internet of Things,” in Proc. 2nd Int. Conf. Comput.
outsourcing,” in Proc. Int. Workshop Security Cloud Comput., Wuhan, Sci. Netw. Technol. (ICCSNT), Changchun, China, 2012, pp. 94–97.
China, 2013, pp. 47–54. [62] F. Nordemann, “A communication-optimizing middleware for effi-
[39] Y. Ma et al., “An efficient index for massive IoT data in cloud envi- cient wireless communication in rural environments,” in Proc. 9th
ronment,” in Proc. 21st ACM Int. Conf. Inf. Knowl. Manag., Maui, HI, Middleware Doctoral Symp. 13th ACM/IFIP/USENIX Int. Middleware
USA, 2012, pp. 2129–2133. Conf., Montreal, QC, Canada, 2012, p. 3.

Boyi Xu (M’14) received the B.S. degree in industrial automation and the Ph.D. degree in management science from Tianjin University, Tianjin, China, in 1987 and 1996, respectively.
He is currently an Associate Professor with the College of Economics and Management, Shanghai Jiao Tong University, Shanghai, China. His current research interests include enterprise information systems, Internet of Things, and business intelligence.

Lihong Jiang (M’10) received the B.S., M.S., and Ph.D. degrees from Tianjin University, Tianjin, China, in 1989, 1992, and 1996, respectively.
From 1992 to 1993, she was an Assistant Professor with the Department of Computer, Qingdao Ocean University, Qingdao, China. From 1996 to 1998, she was a Post-Doctoral Research Fellow with the School of Management, Fudan University, Shanghai, China. She is currently an Associate Professor with the School of Software, Shanghai Jiao Tong University, Shanghai.

Hongming Cai (M’10–SM’15) received the B.S., M.S., and Ph.D. degrees from Northwestern Polytechnical University, Xi’an, China, in 1996, 1999, and 2002, respectively.
He is currently an Associate Professor with the School of Software, Shanghai Jiao Tong University. He served as a Post-Doctoral Research Fellow with the Computer Science and Technology Department, Shanghai Jiao Tong University, Shanghai, China, from 2002 to 2004. He served as a Visiting Professor with the Business Information Technology Institute, University of Mannheim, Mannheim, Germany, from 2008 to 2009. His visiting scholarship was appointed and sponsored by Alfried Krupp von Bohlen und Halbach Foundation, Germany.
Dr. Cai was a recipient of the National Outstanding Scientific and Technological Workers by China Association for Science and Technology in 2012. He is the Standing Director of China Graphics Society and a Senior Member of the ACM.

Athanasios V. Vasilakos (M’00–SM’11) is currently a Professor with the Luleå University of Technology, Luleå, Sweden.
Prof. Vasilakos has served or is serving as an Editor for many technical journals, such as the IEEE Transactions on Network and Service Management, the IEEE Transactions on Cloud Computing, the IEEE Transactions on Information Forensics and Security, the IEEE Transactions on Cybernetics, the IEEE Transactions on NanoBioscience, the IEEE Transactions on Information Technology in Biomedicine, ACM Transactions on Autonomous and Adaptive Systems, and the IEEE Journal on Selected Areas in Communications. He is also the General Chair of the European Alliances for Innovation.
