Research IN BIG Data - AN: Dr. S.Vijayarani and Ms. S.Sharmila
ABSTRACT
Big data is a prominent term that characterizes the growth and availability of data in all three
formats: structured, unstructured and semi-structured. Structured data is located in a fixed field
of a record or file and is found in relational databases and spreadsheets, whereas unstructured
data includes text and multimedia content. The primary objective of the big data concept is to
describe extremely large data sets, both structured and unstructured. It is further defined by three
“V” dimensions, namely Volume, Velocity and Variety, to which two more have been added: Value and
Veracity. Volume denotes the size of the data, Velocity the speed of data processing, Variety the
types of data, Value the business value derived from the data, and Veracity the quality and
understandability of the data. Big data has become a distinct and preferred research area in the
field of computer science. Many open research problems exist in big data and researchers have
proposed good solutions, yet many new techniques and algorithms for big data analysis are still
needed to obtain optimal solutions. This paper presents a detailed study of big data: its basic
concepts, history, applications, techniques, research issues and tools.
KEYWORDS:
Big data, Technologies, Visualization, Classification, Clustering
1. INTRODUCTION
Big data is associated with large data sets whose size is beyond the ability of common
database software tools to capture, store, manage and analyze [1][2]. Big data analysis is essential
for analysts, researchers and business people to make better decisions that were previously not
attainable. Figure 1 explains the structure of big data, which contains five dimensions, namely
volume, velocity, variety, value and veracity [2][3]. Volume refers to the size of the data and
mainly concerns how to handle large-scale and high-dimensional databases and their
processing needs. Velocity describes the continuous arrival of data streams from which useful
information is obtained. Furthermore, big data has driven improvements in the throughput,
connectivity and computing speed of digital devices, which has sped up the retrieval, processing
and production of data.
Veracity determines the quality of information drawn from various sources. Variety describes the
different types of data to be handled; for example, source data includes not only structured
traditional relational data but also quasi-structured, semi-structured and unstructured data such
as text, sensor data, audio, video, graphs and many more types. Value is essential for obtaining
the economic value of different data, which varies significantly. The primary challenge is to
identify which data are valuable, how to transform them and which technique to apply for data
analysis [1].
DOI : 10.5121/ieij.2016.4301
Informatics Engineering, an International Journal (IEIJ), Vol.4, No.3, September 2016
Big data has three types of knowledge discovery: novelty discovery, class discovery and
association discovery. Novelty discovery finds new, rare, previously undiscovered or unknown
items among a billion or a trillion objects or events [2]. Class discovery finds new classes of
objects and behaviors, and association discovery finds unusual co-occurring associations. Big data
is changing our world through these innovative methods. This innovation is being driven by
several factors: a proliferation of sensors, the creation of almost all information in
digital form, dramatic cost reductions in storage, a remarkable increase in network bandwidth,
impressive cost reductions and scalability improvements in computation, and efficient algorithmic
breakthroughs in machine learning and other areas [2]. Analysis of big data helps to reduce
fraud and to improve scientific research and field development. Figure 1 illustrates the structure
of big data [1].
A typical characteristic of big data is the integration of structured, semi-structured
and unstructured data. Big data addresses speed and measurability, quality and security,
flexibility and stability. Another important advantage of big data is data analytics. Big data
analytics refers to the process of collecting, organizing and analyzing large sets of data to
discover patterns and other useful information. Table 1 shows a comparative study of different
types of data based on size, characteristics, tools and methods [1][3].
The remainder of the paper is organized as follows. Section 2 gives the need for big
data, its applications, advantages and characteristics. Big data tools and technologies are
discussed in Section 3. Section 4 provides a detailed description of big data. Section 5 presents
big data challenges. Finally, Section 6 concludes the paper and discusses recent trends.
Chukwa
Chukwa monitors large distributed systems; it adds the required semantics for log
collection and uses an end-to-end delivery model [5].
Some sources belonging to this class may fall into the category of "administrative data", i.e. data
produced by public agencies and medical or health records [5]. Data produced by businesses include
commercial transaction data, banking/stock records, e-commerce, credit cards, etc. The last
class is the Internet of Things (machine-generated data), derived from the phenomenal
growth in the number of sensors and machines used to measure and record events and
situations in the physical world. The output of these sensors is machine-generated data and, from
simple sensor records to complex computer logs, it is well structured [6]. As sensors proliferate
and data volumes grow, it is becoming an increasingly important component of the information
stored and processed by many businesses. Its well-structured nature is suitable for computer
processing, but its size and speed are beyond traditional approaches. Sensor data sources include
fixed sensors, home automation, weather/pollution sensors, traffic sensors/webcams, scientific
sensors, videos, mobile sensors (tracking) such as mobile phone location, cars, satellite images,
and data from computer system logs and web logs [5][6].
Big data are classified into different categories to understand their characteristics. The
classification is based on five aspects: data sources, content format, data stores, data staging and
data processing. This is represented in Figure 2 [5]. Each classification requires new algorithms
and techniques for performing classification tasks efficiently in the big data domain.
Data sources are the origins from which data is collected. Some of the important data
sources are the web and social media, machine-generated data, sensor data, transaction data and
the Internet of Things (IoT). Social media holds a large volume of information that is generated
through URLs (Uniform Resource Locators) to share or exchange information in virtual communities
and networks, for example Facebook, Twitter and blogs. Machine-generated data is produced
automatically by hardware and software, for example computers and medical
devices. Sensor data is collected from various sensing devices that measure
physical quantities [7]. Transaction data, for example financial and business data, includes a
time dimension that describes the data. Finally, IoT represents a set of objects that are uniquely
identified as part of the Internet, e.g. smartphones and digital cameras.
Content format has three forms, namely structured, unstructured and semi-structured. Structured
data is often managed with SQL and resides in a fixed field within a record or a file.
Unstructured data, the opposite of structured data, often includes text and multimedia content.
Semi-structured data does not reside in a relational database [7]; it might include XML
documents and NoSQL databases. Data stores are classified into four categories: document-oriented,
key-value, column-based and graph-based. Document-oriented data stores are designed to store
and collect information and support complex data, whereas column-based data stores hold data in
row and column format. A key-value data store is an alternative to a relational database that is
designed to scale to very large data sets and whose data can be stored and accessed easily.
Finally, graph-based data stores are designed to represent a graph model with edges, nodes and
properties that are related to one another [8].
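As a rough illustration of the four data-store categories described above, the following Python
sketch models a hypothetical customer record in each style using only built-in dictionaries and
lists. The record, the field names and the helper function neighbours() are assumptions made for
illustration; they do not correspond to any particular database product.

```python
# All names and values below are hypothetical and purely illustrative.

# Key-value: an opaque value looked up by a single key.
key_value_store = {"customer:1": '{"name": "Asha", "city": "Chennai"}'}

# Document-oriented: nested, self-describing documents.
document_store = {
    "customers": [
        {"_id": 1, "name": "Asha", "orders": [{"item": "book", "qty": 2}]},
    ]
}

# Column-based: the values of one column are stored together.
column_store = {
    "name": ["Asha", "Ravi"],
    "city": ["Chennai", "Madurai"],
}

# Graph-based: nodes and edges, each carrying properties.
graph_store = {
    "nodes": {
        1: {"label": "Customer", "name": "Asha"},
        2: {"label": "Product", "name": "book"},
    },
    "edges": [{"src": 1, "dst": 2, "type": "BOUGHT", "qty": 2}],
}


def neighbours(graph, node_id):
    """Return the ids of nodes reachable from node_id by one outgoing edge."""
    return [e["dst"] for e in graph["edges"] if e["src"] == node_id]
```

The same information takes a different shape in each store: a single lookup key, a nested
document, per-column lists, or explicit relationships that can be traversed.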
Data staging takes three forms: cleaning, transformation and normalization. Cleaning
identifies incomplete data. Normalization is a method that minimizes redundancy.
Transformation converts data into a suitable form. Finally, data processing is of
two types, namely batch and real-time [9]. From the above analysis it is observed that content
format is suitable for all types of data: structured, unstructured and semi-structured.
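The three data staging steps can be sketched in a few lines of Python. This is a minimal
illustration under stated assumptions rather than a reference implementation: the record layout,
the function names clean(), transform() and normalize(), and the choice to drop incomplete
records once they are identified are all hypothetical.

```python
def clean(records, required):
    """Cleaning: identify incomplete records and keep only the complete ones."""
    return [r for r in records
            if all(r.get(f) not in (None, "") for f in required)]


def transform(records):
    """Transformation: bring field values into one uniform, suitable form."""
    return [{"name": r["name"].strip().title(), "city": r["city"].strip().title()}
            for r in records]


def normalize(records):
    """Normalization: reduce redundancy by storing each city once
    and referring to it by an integer id."""
    cities = {}
    people = []
    for r in records:
        city_id = cities.setdefault(r["city"], len(cities))
        people.append({"name": r["name"], "city_id": city_id})
    return people, cities


raw = [
    {"name": " asha ", "city": "chennai"},
    {"name": "ravi", "city": None},          # incomplete: filtered out by clean()
    {"name": "meena", "city": "Chennai "},
]
people, cities = normalize(transform(clean(raw, ["name", "city"])))
```

After the three steps, the city name is stored once and both remaining records refer to it by id.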
Volume: refers to the ability of a clustering algorithm to deal with a large amount of data. It is
the most distinct feature of big data and places specific requirements on all classical
technologies and tools used. To guide the selection of a suitable clustering algorithm with
respect to the Volume property, the following criteria are considered: size of the dataset,
handling of high dimensionality and handling of outliers/noisy data.
Variety: refers to the ability of a clustering algorithm to handle different types of data
(numerical, categorical and hierarchical). It deals with the complexity of big data [7]. To guide
the selection of a suitable clustering algorithm with respect to the Variety property, the
following criteria are considered: type of dataset and cluster shape. Velocity: refers to the
speed of a clustering algorithm on big data. Big data are generated at high speed. To guide the
selection of a suitable clustering algorithm with respect to the Velocity property, the complexity
of the algorithm is considered. Many clustering algorithms are available; a few are listed below
[5][6][7].
K-means
Gaussian mixture models
Kernel K-means
Spectral Clustering
Nearest neighbor
Latent Dirichlet Allocation.
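To make the first entry in the list concrete, the sketch below implements K-means with Lloyd's
algorithm in plain Python. It is a minimal illustration, not a big-data-ready implementation: the
function name kmeans(), its parameters and the toy data are assumptions, and the whole dataset is
held in memory.

```python
import random


def kmeans(points, k, init=None, iterations=20, seed=0):
    """Lloyd's algorithm for K-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Start from the given centroids, or from k randomly chosen data points.
    centroids = list(init) if init is not None else rng.sample(points, k)
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for i, p in enumerate(points):
            assignment[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignment


data = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
centroids, labels = kmeans(data, k=2, init=[data[0], data[3]])
```

Each iteration touches every point once per centroid, so the cost grows with the size of the
dataset, the number of clusters and the dimensionality, which are the Volume and Velocity
criteria discussed above.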
5.2. ZingChart
ZingChart is a powerful charting library that can create charts, dashboards and
infographics. It features a rich API set that allows users to build interactive Flash or HTML5
charts. It provides hundreds of chart variations and many chart types, for example bar, scatter,
radar, piano, gauge, sparkline, mixed, rank flow and word cloud. The figure shows ZingChart [22].
5.3. Polymaps
Polymaps is a free JavaScript library for image- and vector-tiled maps using Scalable
Vector Graphics (SVG). It provides dynamic and interactive maps in web browsers. Complex
data sets can be visualized using Polymaps, which offers multi-zoom functionality. Polymaps
uses SVG, is styled with basic CSS rules and takes its imagery in spherical Mercator tile
format. Figure 4 shows the layout of Polymaps [22].
Figure.4. Polymaps
5.4. Timeline
Timeline is a tool that delivers an effective and interactive timeline that responds to
the user's mouse; it delivers a lot of information in a compressed space. Each element can be
clicked to reveal more in-depth information, so it gives a big-picture view with full detail.
Timeline is demonstrated in Figure 5 [22].
Figure.5. Timeline
5.5. Exhibit
Exhibit is an open-source data visualization tool developed by MIT. It makes it
easy to create interactive maps and other data-based visualizations that are oriented towards
teaching or towards static/historical data sets, such as the birthplaces of notable persons.
A sample is shown in Figure 6 [22].
5.7. Leaflet
Leaflet is an open-source JavaScript tool developed for interactive data visualization in
HTML5/CSS3. Leaflet is designed for clarity, performance and mobile use. Its
visualization features include zoom and pan animation, multi-touch and double-tap
zoom, hardware acceleration on iOS and use of CSS3 features. Figure 8 shows the Leaflet
structure [22].
5.8. Visual.ly
Visual.ly is a combined gallery and infographic generation tool. It provides a simple toolset for
building data representations and a platform for sharing creations. This goes beyond pure data
visualization. A representation of Visual.ly is displayed in Figure 9 [22].
Figure.9 Visual.ly
Figure.12 jqPlot
Figure.14 jqPlot
5.14. JpGraph
JpGraph is an object-oriented graph-creation library for PHP-based data visualization. It
generates drill-down graphs and a large range of charts such as pie, bar, line, scatter point and
impulse. JpGraph is web friendly and automatically generates client-side image maps. It
supports alpha blending, flexible scales (integer, linear and logarithmic) and multiple Y-axes.
Figure 15 shows the JpGraph representation [22].
Figure.17 Layout of R
5.17. WEKA
WEKA is open-source software and a collection of machine-learning algorithms designed for
data mining. WEKA is an excellent tool for classifying and clustering data using many attributes.
It explores data and generates simple plots in a powerful way. Figure 18 shows a representation
of WEKA [22].
5.19. RAPHAEL
RAPHAEL is a tool that provides a wide range of data visualization options rendered using
SVG. It works with vector graphics on the web. RAPHAEL can be easily integrated with one's own
website and code. The web browsers supported by RAPHAEL are Internet Explorer 6.0+,
Firefox 3.0+, Safari 3.0+, Chrome 5.0+ and Opera 9.5. A model of RAPHAEL is shown in Figure 20
[22].
5.20. Crossfilter
Crossfilter is an interactive GUI tool for massive volumes of data in which the input range on
any one chart can be narrowed. It is a powerful tool for dashboards or other interactive tools
with large volumes of data. When the range of the data is restricted on one chart, the other
linked charts are updated to display the filtered data. A representation of Crossfilter is shown
in Figure 21 [22].
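The linked-chart behaviour described above can be sketched in Python. This is not the Crossfilter
API, which is a JavaScript library; it is a minimal illustration of the idea, and the records,
field names and the function linked_histograms() are hypothetical. The sketch assumes that a range
selected on one chart restricts the data shown in every other chart but not in the chart on which
it was selected.

```python
from collections import Counter

# Hypothetical records; the field names are illustrative only.
records = [
    {"hour": 8, "delay": 5, "distance": 300},
    {"hour": 9, "delay": 40, "distance": 1200},
    {"hour": 17, "delay": 15, "distance": 450},
    {"hour": 18, "delay": 60, "distance": 2000},
]


def linked_histograms(rows, filters, bin_sizes):
    """For each dimension, count rows per bin after applying the range
    filters that are set on all the other dimensions."""
    charts = {}
    for dim, size in bin_sizes.items():
        others = {d: rng for d, rng in filters.items() if d != dim}
        visible = [r for r in rows
                   if all(lo <= r[d] < hi for d, (lo, hi) in others.items())]
        # Bin each visible value by rounding it down to a multiple of the bin size.
        charts[dim] = Counter((r[dim] // size) * size for r in visible)
    return charts


# Selecting hours 16 to 20 on the "hour" chart changes the other two charts.
charts = linked_histograms(
    records,
    filters={"hour": (16, 20)},
    bin_sizes={"hour": 12, "delay": 30, "distance": 1000},
)
```

Here the "delay" and "distance" histograms are computed from the two late-afternoon records only,
while the "hour" histogram still shows all four records so that the selection can be adjusted.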
9. jQuery Visualize | Focus on ARIA support; user friendly to screen readers | Developers can completely separate JavaScript code from HTML | It uses only HTML5 for designing
10. Leaflet | Eliminates tap delay on mobile devices | Works on all major desktop and mobile browsers | Difficult for new users
11. Many Eyes | Multiple ways to display data | Upload data sets for public use | It is difficult to use in large dataset
12. Modest Maps | Used with several extensions, such as MapBox.js, HTMAPL, and Easey | Designed to provide basic controls and building mapping tools | Support only limited applications
13. Polymaps | Display complex data sets | Uses Scalable Vector Graphic form | It is ideal only for zooming in and out of levels
14. R | R is a general statistical analysis platform | R also results in graphs, charts and plots | It is difficult to use in large dataset
15. RAPHAEL | Multi-chart capabilities | Create a variety of charts, graphs and other data visualizations | It is not easy to customize
16. Timeline | Display events as sequential time lines | Embed audio and video in timelines from 3rd-party apps | Build timelines using only Google Spreadsheet data
17. Visual.ly | Infographic generation tools | It is specially designed to develop simple toolset representation | Difficult for new users
18. Visualize Free | Upload data in Excel or CSV formats | Drag-and-drop components to build visualizations; uses Sandboxes for data analysis | Not applicable for large dataset
19. WEKA | WEKA is a collection of tools for data pre-processing, classification, regression, clustering, association, and visualization | Free availability; portability; comprehensive collection of data preprocessing and modeling techniques | Sequence modeling is not covered by the algorithms included in the Weka distribution; not capable of multi-relational data mining; memory bound
6. VISUALIZATION ALGORITHMS
Digital data are visualized in digital form with the help of visualization. Various data
sources, for example antennas, supply the digital data to be displayed.
Visualization faces issues with digital signals; to overcome them, algorithms are applied to raw
data and digital 3D data, and various digital devices produce digital datasets [19]. Data should be
represented in discrete form. Data objects are divided into two categories: organizing structure
and data attributes. The organizing structure determines the spatial location of the data and
describes the topology and geometry on which the data is defined; these are specified as cells
and points. A cell is defined as an ordered sequence of points. The cell type, namely vertex,
poly-vertex, triangle, line, poly-line, pixel, voxel or tetrahedron, characterizes the sequence
of points, and the number of points specifies the size of the cell. Data attributes determine the
format of the data [20]. Data attributes are commonly described as scalars, vectors, texture
coordinates, etc. The visualization algorithm has two sets of transformations, as explained in
Figure 3. The first set of transformations converts the data or a subset of the data into a
virtual scene. These transformations are applied to the structure and data type of the data. The
second set is applied once the virtual scene has been created; the scene consists of geometrical
objects, textures and computer graphics, and the transformations are applied to it to form
images [21].
The main objectives of visualization are understanding the recorded data clearly, graphical
representation, placing search queries to find the location of data, discovering hidden
patterns, and the perceptibility of data items [22]. The characteristics of transformation
algorithms are explained in Figure 4; transformation algorithms are characterized by structure and
type. Structure transformations take two forms: topological and geometric. Topological
transformations change the topology, for example the conversion of polygon data into an
unstructured format. Geometric transformations change the coordinates and geometry, for example
through scaling, rotation and translation [22]. A few examples of transformations are scalar
algorithms, vector algorithms and modeling algorithms.
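The geometric transformations named above (scaling, rotation and translation) can be illustrated
on a single triangle cell. The sketch below is a minimal 2D example in plain Python; the function
names and the sample triangle are assumptions made for illustration and are not taken from any
visualization library.

```python
import math


def scale(points, sx, sy):
    """Scale each point about the origin."""
    return [(x * sx, y * sy) for x, y in points]


def rotate(points, angle_deg):
    """Rotate each point counter-clockwise about the origin."""
    a = math.radians(angle_deg)
    c, s = math.cos(a), math.sin(a)
    return [(x * c - y * s, x * s + y * c) for x, y in points]


def translate(points, dx, dy):
    """Shift each point by (dx, dy)."""
    return [(x + dx, y + dy) for x, y in points]


# A triangle cell: an ordered sequence of three points.
triangle = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

# Scale by 2, rotate by 90 degrees, then shift one unit along x.
moved = translate(rotate(scale(triangle, 2, 2), 90), 1, 0)
```

All three functions change only the coordinates. The number and order of the points stay the
same, so the topology of the cell is preserved, which is what distinguishes a geometric
transformation from a topological one.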
Big data visualization has overcome the five massive challenges of Big Data:
REFERENCES
1. Neelam Singh, Neha Garg, Varsha Mittal, Data – insights, motivation and challenges, Volume 4,
Issue 12, December 2013, 2172, ISSN 2229-5518.
2. Karthik Kambatla, Giorgos Kollias, Vipin Kumar, Ananth Grama, Trends in big data
analytics, 74 (2014) 2561–2573.
3. Francis X., “On the Origin(s) and Development of the Term ‘Big Data’”, 2012.
4. Venkata Narasimha Inukollu, Sailaja Arsi and Srinivasa Rao Ravuri, Security issues associated
with big data in cloud computing, Vol.6, No.3, May 2014.
5. Matzat, Ulf-Dietrich Reips, “Big Data”, 2012, 7 (1), 1–5, ISSN 1662-5544.
6. Hong Kong, Park Shatin, Mining Big Data: Current Status, and Forecast to the Future.
7. Anil K. Jain, Clustering Big Data, 2012.
8. Daniel Keim, Big-Data Visualization.
9. Hsinchun Chen, Business Intelligence and Analytics: From Big Data to Big Impact.
10. Ibrahim Abaker Targio Hashem, Ibrar Yaqoob, Nor Badrul Anuar, Salimah Mokhtar,
Abdullah Gani, Samee Ullah Khan, The rise of “big data” on cloud computing: Review and
open research issues, 2014.
11. Edd Dumbill, Making Sense of Big Data.
12. Silva Robak, prof. Z. Szafrana, Zielona Góra, Uniwersytet Zielonogórski, Research Problems
Associated with Big Data Utilization in Logistics and Supply Chains Design and Management,
2014, 249, DOI: 10.15439/2014F472.
13. C.L. Philip Chen, Chun-Yang Zhang, Data-intensive applications, challenges, techniques and
technologies: A survey on Big Data, 275 (2014) 314–347.
14. Chaitanya Baru, Milind Bhandarkar, Raghunath Nambiar, Meikel Poess and Tilmann Rabl,
Survey of Recent Research Progress and Issues in Big Data, 2013.
15. Tackling the Challenges of Big Data, 2014.
16. Stephen Kaisler, Alberto Espinosa, Big Data: Issues and Challenges Moving Forward,
2013 46th Hawaii International Conference on System Sciences.
17. Challenges and Opportunities with Big Data, a community white paper developed by leading
researchers across the United States.
18. Danyang Du, Aihua Li, Survey on the Applications of Big Data in Chinese Real Estate
Enterprise, 1st International Conference on Data Science, 2014.
19. Shilpa, Manjit Kaur, Challenges and issues during visualization of big data, International
Journal for Technological Research in Engineering, Volume 1, Issue 4, December 2013,
ISSN (Online): 2347-4718.
20. http://fellinlovewithdata.com/research/the-role-of-algorithms-in-data-visualization
21. http://prosjekt.ffi.no/unik-4660/lectures04/chapters/Algorithms2.html
22. http://www.creativebloq.com/design-tools/data-visualization-712402