Big Data Platforms and Techniques: January 2016
Shafaatunnur Hasan
Universiti Teknologi Malaysia
Abstract
Data is growing at an unprecedented rate, generating huge volumes; the data sources
include mobile devices, the internet and sensors. This voluminous data is generated and updated at high velocity by
batch and streaming platforms, and it also varies across structured and unstructured types. This
volume, velocity and variety of data led to the term big data. Big data has been premised to contain
untapped knowledge; its exploration and exploitation is termed big data analytics. This literature review covers
platforms used in big data environments, such as batch processing, real-time processing and interactive analytics.
Techniques used for big data include machine learning, data mining, neural networks and deep
learning. There are big data architecture offerings from Microsoft, IBM and the National Institute of Standards
and Technology. Big data has the potential to transform economies and reduce the running costs of institutions. Big
data also faces challenges such as storage, computation, security and privacy.
Keywords: Big data, big data architecture, big data techniques, big data platform
Copyright © 2016 Institute of Advanced Engineering and Science. All rights reserved.
1. Introduction
The concept of big data came into being through the explosion of data from the Internet,
the cloud, data centers, mobile devices, the Internet of Things, sensors and other domains that possess and process
huge datasets. Volume, velocity and variety are the main features of big data (1). These
features make traditional computing models ineffective. The premise of tremendous value in the
huge datasets is the motive for big data exploration and exploitation. Big data has been
identified as having the potential to revolutionize many aspects of life (2); applications of big data in some
domains have practically changed their practices (3). Data has penetrated every industry and all
business functions; it is now considered a major factor of production (4).
Big data is expected to inspire many innovative models, companies, products and services,
owing to the strategic insight it holds for the IT industry and businesses. The IT industry
would create new products and target untapped markets, while businesses would optimize existing
operations and develop new business models. A great deal of academic research is being conducted in
the field of big data, ranging across applications, tools, techniques and architecture. The research
is interdisciplinary and is generally called data science. In view of the divergent interest in big data
from distinct domains, a clear and intuitive appreciation of its definition, advancement, constituent
technologies and challenges becomes paramount (5). In this regard, this paper gives a
literature review of some aspects of big data: definitions, features, opportunities,
challenges, platforms, techniques and architectures. The review is distinct from existing ones
in its wide coverage of big data architecture offerings from companies and the unified
reference architecture. These architectures are the most important blueprints for navigating the
big data terrain. The rest of this paper is outlined as follows: section two discusses the big data
context; section three examines big data platforms, while section four analyzes big data
techniques; section five elucidates big data architectures. Section six is the conclusion of the
literature review.
Received August 21, 2015; Revised November 24, 2015; Accepted December 17, 2015
192 ISSN: 2502-4752
The opportunities brought by big data, such as the transformation of economies and cost savings for
government institutions, are equally presented. Challenges confronting big data, such as storage,
privacy, security and computation, are covered as well.
Recent big data types are unstructured and semi-structured; these data types either lack a data schema
entirely or do not enforce the schema strictly. Examples of varied data sources are social media
logs, email and internet logs. Not all data sources meet acceptable quality criteria; this
makes the veracity feature of big data important.
Veracity concerns the reliability of the data; this is challenging because of the numerous data
sources of big data. Reliability affects the quality and predictability of the data through inconsistency,
incompleteness, ambiguities, latency and model approximations. Criteria need to be in
place to gauge the quality of the data; once the acceptability criteria are met, the data is then
used for preprocessing and analytics. The last outstanding feature of big data is value.
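The acceptability gate for veracity described above can be sketched as a simple predicate that admits records to preprocessing only when quality criteria are met. The field names and thresholds below are illustrative assumptions, not drawn from the text:

```python
# Hypothetical veracity gate: records pass to preprocessing only if they
# meet simple quality criteria (completeness and a valid value range).
# Field names and rules are illustrative assumptions.

def meets_quality_criteria(record, required_fields=("user_id", "timestamp")):
    """Return True if the record is complete and internally consistent."""
    # Completeness: every required field must be present and non-empty.
    if any(record.get(f) in (None, "") for f in required_fields):
        return False
    # Consistency: timestamps must be non-negative epoch seconds.
    return record["timestamp"] >= 0

raw = [
    {"user_id": "u1", "timestamp": 1451606400},
    {"user_id": "", "timestamp": 1451606401},   # incomplete -> rejected
    {"user_id": "u3", "timestamp": -5},          # inconsistent -> rejected
]
clean = [r for r in raw if meets_quality_criteria(r)]
print(len(clean))  # 1
```

In practice such gates would be composed per data source, since each source carries its own inconsistency and latency characteristics.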
The value in big data comes from use cases in companies, governments and research.
Companies such as Facebook and Amazon have taken the big data leap. An in-depth study by
McKinsey (4) showed that big data can boost productivity, increase competitive advantage and
increase the economic surplus of consumers. When big data is deployed creatively and effectively,
the study projects a value of $300 billion for US healthcare, a 60% increase in profit for US
retail and savings of $149 billion for governments in Europe. Big data would herald new
innovation, competition, productivity, growth and value capture, while the demand side comes from
consumers, companies and different economic sectors (4). Epidemic outbreak forecasting and
discounted ticket sales are other big data values targeted by Google (16) and Microsoft (9)
respectively.
augment their weaknesses to meet big data challenges (2). Neural networks are among the
most used machine learning techniques.
A Neural Network (NN) is a family of mature techniques that mimic the function of the human brain.
The techniques have been applied successfully to clustering and classification. An NN operates
with nodes and layers; with more layers and nodes, better accuracies are
obtained, but at a higher computational cost. Big data, with its complexity, imposes a high
computational cost on neural networks (30). The second challenge faced by NN in big data is
that the algorithms perform poorly at scale. Two approaches are used to overcome these challenges: the first is
to reduce the data size by sampling techniques; the second is to scale neural networks for
distributed computing. One study trained neural networks on large-scale data, incorporating a
hash-based implementation of a maximum entropy model; the method achieved lower
computational cost and a 10% reduction in word error rate (31). Another study implemented a
parallel and pipelined neural network on FPGA hardware; the resulting prototype had the
advantage of no limitation on the size or topology of the network (32). Deep learning is
a newer technique that leverages neural networks and is able to produce compelling results
based on multi-layer feature extraction. Deep convolutional neural networks were used to
classify images of the MNIST dataset, yielding near-human recognition by reducing the error rate by up
to 30% (33). Deep learning was able to achieve 15.8% accuracy on a 10-million-image
recognition problem (34).
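The first scaling approach mentioned above, reducing data size by sampling, can be sketched with reservoir sampling, which draws a uniform sample of fixed size from a stream whose length need not fit in memory. This is an illustrative technique; the cited works (31, 32) use different methods:

```python
import random

# Reservoir sampling: keep a uniform sample of k items from a stream of
# unknown (or very large) length, using O(k) memory. Illustrative sketch
# of reducing data size before neural network training.

def reservoir_sample(stream, k, seed=0):
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)          # fill the reservoir first
        else:
            j = rng.randint(0, i)        # replacement probability k/(i+1)
            if j < k:
                sample[j] = item
    return sample

# A "big" stream of a million training instances, sampled down to 1,000.
sample = reservoir_sample(range(1_000_000), k=1_000)
print(len(sample))  # 1000
```

The reduced sample can then be fed to a conventional training loop; the second approach, distributing the network itself, requires framework support rather than a preprocessing step.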
Figure 1. IBM Big Data & Analytics Reference Architecture – source: (38)
Streaming Events. The third vertical layer is called Organize; this layer prepares data
(e.g., with MapReduce) for the analytical operations and ensures data quality assurance. The fourth
vertical layer is Analyze; this layer carries out analytical operations such as in-place
analytics, faceted analytics and SQL analytics. The fifth layer is the Decide layer, where
the processed data is presented in formats such as visualizations, recommendations, alerts and
dashboards. The last vertical layer is Management, Security and Governance. Next are the
horizontal layers: the top layer is the specialized hardware layer, which comprises technology
platforms. The bottom horizontal layer provides an Industry Integration layer that integrates
different technology offerings, making the operation of diverse technologies seamless (40).
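As a rough illustration of the MapReduce pattern that the Organize layer relies on, the following is a minimal in-process word count; real deployments distribute the map, shuffle and reduce phases across a cluster:

```python
from collections import defaultdict
from itertools import chain

# Minimal in-process sketch of MapReduce: map each record to (key, value)
# pairs, shuffle (group) by key, then reduce each group to a result.

def map_phase(line):
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data platforms", "big data techniques"]
pairs = chain.from_iterable(map_phase(line) for line in lines)
counts = reduce_phase(shuffle(pairs))
print(counts["big"], counts["data"])  # 2 2
```

The same three-phase contract holds at cluster scale; only the shuffle becomes a networked sort-and-group across machines.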
The first component is the data provider; this is an entity or organization that provides
information for collection and processing by the big data ecosystem. The second component is
the big data application provider; this is where the core functionality of the big data system is
implemented, involving data collection, multiple transformations and diverse usage of
data. The specific functions are collection, preparation, analytics, visualization and access. The
application provider ensures compliance with all management, security and privacy
requirements. The big data framework provider is the third component; this layer provides the IT
value chain components, which include diverse technologies in a certain hierarchy. These
components are used by the application provider to achieve the big data goal. The components
in this layer are infrastructure, platforms and processing frameworks. The fourth component is
the data consumer, which can be a user or another system. The data consumer accesses
services provided by the application provider component through interfaces. The services
involved are initiation, data transfer and termination. The last functional component is the system
orchestrator. This role defines and administers the data application activities into an
operational system. The system orchestrator is either a human or a software component, or a
combination of both. In an enterprise system the system orchestrator can be
mapped to the role of a system governor; this role stipulates the requirements and constraints within
which the system must operate. The last two components are management layers: system
management and data life cycle management. System management involves the
processes of setting up, overseeing and maintaining the big data infrastructure. This management
component is challenging due to the inherent complexity of the big data ecosystem. Managing
the numerous processing frameworks is the most challenging part because they are
expected to scale and to be safe and resilient in their operations. The last component of the NBDRA is
data life cycle management. Due to the diverse data requirements of big data systems, there
is a need for data life cycle management. This component is similar to the data provider in
functionality, but has a broader scope across all the other four architecture components.
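As a toy sketch (not an implementation of the reference architecture), the five functional components described above can be pictured as cooperating pieces of a pipeline; the class names follow the text, while the data and the averaging analytic are illustrative:

```python
# Toy sketch of the five NBDRA functional components wired as a pipeline.
# The components mirror the text; the records and analytic are illustrative.

class DataProvider:
    """Supplies records for collection (component 1)."""
    def records(self):
        yield from ({"value": v} for v in (3, 1, 4))

class FrameworkProvider:
    """Stands in for the processing framework layer (component 3)."""
    def compute(self, values):
        return sum(values) / len(values)

class ApplicationProvider:
    """Collection, preparation and analytics (component 2)."""
    def run(self, provider, framework):
        collected = list(provider.records())           # collection
        prepared = [r["value"] for r in collected]     # preparation
        return framework.compute(prepared)             # analytics

class DataConsumer:
    """Accesses results through an interface (component 4)."""
    def access(self, result):
        return f"mean={result:.2f}"

# The system orchestrator (component 5) composes the other components
# into an operational system; here it is simply the top-level code.
result = ApplicationProvider().run(DataProvider(), FrameworkProvider())
print(DataConsumer().access(result))  # mean=2.67
```

The point of the sketch is the separation of roles: the application provider owns the workflow, while the framework provider owns the compute, matching the division described in the text.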
6. Conclusion
Big data platforms and tools are the cornerstone of the data revolution taking place.
Volume, velocity and variety are the main features that distinguish big data from
traditional computing. Big data has been premised to yield profound insights in all domains.
Innovative models, products, services and huge cost savings are some of the opportunities big
data presents. Because of its huge potential and user base, big data has multiple definitions reflecting the
wide perspectives of its stakeholders. Platforms for big data serve diverse purposes with regard to the
speed at which data needs to be produced and processed. The platforms are batch processing,
stream processing and interactive analytics. The techniques involved in big data are many and
come from different domains, but each technique exhibits characteristics that can be mapped to a
problem pattern. The multidisciplinary nature of big data warrants collaboration among diverse fields;
this can lead to hybrid techniques for solving a particular problem.
The need for a standardized reference architecture, given the diverse offerings from
numerous industry players, led to the National Big Data Reference Architecture. The reference
architecture is a working document that continues to evolve, reflecting the dynamic nature of
the big data ecosystem. Big data faces many challenges at this infancy stage of its
growth; some of them are privacy, security, storage and processing.
Privacy gaps can be resolved by extending the capacity of encryption schemes. Novel
encryption techniques that are industry-ready can equally help, as can incorporating and
operationalizing institutional governance. Security can be improved by
strengthening the reliability of current security architectures; this will be an ongoing effort due to the
dynamic nature of the big data ecosystem. Storage of data is a challenge due to the huge volumes
generated. Hadoop can fill the storage gap due to its scalability, and cloud computing can
provide virtually limitless storage capacity as well. Processing involves the techniques used
for analyzing data; the challenges faced by machine learning techniques are high sample
dimensionality, huge data size and problem complexity. Possible solution strategies are
dimensionality reduction, instance selection, incremental learning and adapting techniques for
distributed and parallel computing.
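One of the solution strategies named above, incremental learning, can be sketched as a linear model updated by stochastic gradient descent one mini-batch at a time, so the full dataset never needs to reside in memory. The toy data, learning rate and step count are illustrative assumptions:

```python
# Incremental (online) learning sketch: fit y = w*x + b by stochastic
# gradient descent on mean squared error, one mini-batch at a time.
# Toy data and hyperparameters are illustrative.

def sgd_step(w, b, batch, lr=0.05):
    """One gradient step on mean squared error for the model y = w*x + b."""
    n = len(batch)
    gw = sum(2 * (w * x + b - y) * x for x, y in batch) / n
    gb = sum(2 * (w * x + b - y) for x, y in batch) / n
    return w - lr * gw, b - lr * gb

# Stream of mini-batches drawn from the target function y = 2x; here the
# same toy batch recurs, standing in for an unbounded data stream.
batch = [(x, 2 * x) for x in range(4)]
w, b = 0.0, 0.0
for _ in range(500):
    w, b = sgd_step(w, b, batch)
print(round(w, 1))  # 2.0
```

Because each step touches only one mini-batch, memory use stays constant regardless of total data size, which is exactly the property that makes incremental learning attractive for big data processing.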
References
[1] Laney D. 3D Data Management: Controlling Data Volume, Velocity and Variety. Application Delivery
Strategies. 2001.
[2] Philip Chen CL, Zhang CY. Data-intensive applications, challenges, techniques and technologies: A
survey on Big Data. Inf Sci. 2014; 275: 314–347.
[3] Huang T, Lan L, Fang X, An P, Min J, Wang F. Promises and Challenges of Big Data Computing in
Health Sciences. Big Data Res. 2015; 2(1): 2–11.
[4] James M, Michael C, Brad B, Jacques B, Richard D, Charles R. Big data: The next frontier for
innovation, competition, and productivity. McKinsey Glob Inst. 2011.
[5] Hu H, Wen Y, Chua T-S, Li X. Toward Scalable Systems for Big Data Analytics: A Technology
Tutorial. IEEE Access. 2014; 2: 652–687.
[6] McKerlich R, Ives C, McGreal R. Measuring use and creation of open educational resources in higher
education. Int Rev Res Open Distance Learn. 2013; 14(4): 90–103.
[7] Gantz BJ, Reinsel D. Extracting Value from Chaos State of the Universe: An Executive Summary.
IDC iView. 2011: 1–12.
[8] Wu X, Zhu X, Wu G-Q, Ding W. Data mining with big data. IEEE Trans Knowl Data Eng. 2014; 26(1):
97–107.
[9] Chen M, Mao S, Liu Y. Big data: A survey. Mob Networks Appl. 2014; 19(2): 171–209.
[10] Kraska T. Finding the needle in the big data systems haystack. IEEE Internet Comput. 2013; 17(1):
84–6.
[11] Jin X, Wah BW, Cheng X, Wang Y. Significance and Challenges of Big Data Research. Big Data
Res. 2015; 1: 3–8.
[12] Steering the future of computing. Nature. 2006; 440(7083): 383.
[13] Collins J. The fourth paradigm. XVIII ISA World Congress of Sociology. Yokohama. 2014.
[14] Assunção MD, Calheiros RN, Bianchi S, Netto MAS, Buyya R. Big Data computing and clouds:
Trends and future directions. J Parallel Distrib Comput. 2014; 3–15.
[15] Chardonnens T, Cudre-Mauroux P, Grund M, Perroud B. Big data analytics on high Velocity streams:
A case study. 2013 IEEE Int Conf Big Data. 2013: 784–787.
[16] Ginsberg J, Mohebbi MH, Patel RS, Brammer L, Smolinski MS, Brilliant L. Detecting influenza
epidemics using search engine query data. Nature. 2009; 457(7232): 1012–1024.
[17] Katal A, Wazid M, Goudar RH. Big data: Issues, challenges, tools and Good practices. 6th Int Conf
Contemp Comput IC3 2013. Noida. 2013; 404–409.
[18] Michael K, Miller K. Big data: New opportunities and new challenges. IEEE Computer Society Long
Beach. 2013; XX(X): 1–20.
[19] Carter P. Top Ten Big Data Security and Privacy Challenges. Int Data Corp. 2011.
[20] Dean J, Ghemawat S. MapReduce. Commun ACM. 2008; 51(1): 107–113.
[21] Kamburugamuve S, Fox G, Leake D, Qiu J. Survey of Distributed Stream Processing for Large
Stream Sources. Grids Ucs Indiana Edu. 2013.
[22] Rossi RA, Ahmed NK. Interactive Data Repositories: From Data Sharing to Interactive Data
Exploration & Visualization. Purdue University. 2015; 78–82.
[23] Hausenblas M, Nadeau J. Apache Drill: Interactive Ad-Hoc Analysis at Scale. Big Data. 2013; 1(2):
100–104.
[24] Steve O. Machine Learning, Cognition, and Big Data. CA Technologies. 2012.
[25] Chen C, Liu Z, Lin WH, Li S, Wang K. Distributed modeling in a mapreduce framework for data-
driven traffic flow forecasting. IEEE Trans Intell Transp Syst. 2013; 14(1): 22–33.
[26] Mitra P, Murthy CA, Pal SK. A Probabilistic Active Support Vector Learning Algorithm. IEEE Trans
Pattern Anal Mach Intell. 2004; 26(3): 413–418.
[27] Jiang W, Zavesky E, Chang SF, Loui A. Cross-domain learning methods for high-level visual concept
classification. Proc - Int Conf Image Process ICIP. San Diego. 2008; 161–164.
[28] Raykar VC, Duraiswami R, Krishnapuram B. A fast algorithm for learning a ranking function from
large-scale data sets. IEEE Trans Pattern Anal Mach Intell. 2008; 30(7): 1158–1170.
[29] Sun P, Yao X. Sparse approximation through boosting for learning large scale kernel machines. IEEE
Trans Neural Netw. 2010; 21(6): 883–894.
[30] Ranger C, Raghuraman R, Penmetsa A, Bradski G, Kozyrakis C. Evaluating MapReduce for multi-
core and multiprocessor systems. Proc - Int Symp High-Performance Comput Archit. 2007; 13–24.
[31] Deoras A, Povey D. Strategies for Training Large Scale Neural Network Language Models. Asru
2011.
[32] Ahn JB. Neuron Machine : Parallel and Pipelined Digital Neurocomputing Architecture. Cybernetics
Com. Bali. 2012; 143–147.
[33] Ciresan D, Meier U, Schmidhuber J. Multi-column Deep Neural Networks for Image Classification.
Comput Vis Pattern Recognit (CVPR), 2012 IEEE Conf. 2012; 3642–3649.
[34] Le QV, Ranzato M, Monga R, Devin M, Chen K, Corrado GS, et al. Building high-level features using
large scale unsupervised learning. 29th International Conference on Machine Learning. Edinburgh.
2011
[35] NIST Special Publication XXX-XXX. DRAFT NIST Big Data Interoperability Framework: Reference
Architecture. National Institute of Standards and Technology. Maryland. 2015.
[36] Levin BO. Big Data Ecosystem Reference Architecture. Microsoft Corporation. 2013.
[37] Cloud Analytics Playbook. Booz Allen Hamilton. 2012.
[38] Ralph B. IBM Big Data Platform. IBM Corporation. 2013.
[39] We H, Hadoop D. SAP and Hortonworks Reference Architecture. SAP AG. 2014.
[40] An Enterprise Architect’s Guide to Big Data. Oracle Corporation. 2015.
[41] Lopes N, Ribeiro B. GPUMLib: An Efficient Open-source GPU Machine Learning Library. J Comput
Inf Sys. 2011; 3: 355–362.
[42] NIST Special Publication 1500-3. DRAFT NIST Big Data Interoperability Framework: Use Cases
and General Requirements. National Institute of Standards and Technology. Maryland. 2015.
[43] NIST Special Publication XXX-XXX. DRAFT NIST Big Data Interoperability Framework: Architectures
White Paper Survey. National Institute of Standards and Technology. Maryland. 2015.
[44] NIST Special Publication 1500-2. DRAFT NIST Big Data Interoperability Framework: Big Data
Taxonomies. National Institute of Standards and Technology. Maryland. 2015.