Big Data Methods
Devottam Gaurav, Research Scholar, Bennett University, Gr. Noida, Uttar Pradesh ([email protected])
Jay Kant Pratap Singh Yadav, Assistant Professor, NIET, Gr. Noida, Uttar Pradesh ([email protected])
Rohit Kumar Kaliyar, Research Scholar, Bennett University, Gr. Noida, Uttar Pradesh ([email protected])
Dr. Ayush Goyal, Assistant Professor, Texas A&M University-Kingsville, USA ([email protected])
Abstract—Rapid advancements in technological applications have led to a flood of data from various sources such as the web, social networks, business records, and medical records over the preceding years. Compared to traditional data, big data reveals unique characteristics through its three V's, and much of it is unstructured. The emerging trend therefore requires advanced data analysis, acquisition, and management techniques to mine and collect the appropriate data in a structured way. In this paper, we describe the definitions and challenges of big data systems. Next, a systematic framework decomposes the architecture of big data systems into four stages: data generation, data acquisition, data storage, and data analytics. These stages form the basis of the big data value chain. Finally, some solutions to the challenges of big data are discussed, along with the future attention that big data systems require.

Keywords—big data, big data analytics, data acquisition

I. INTRODUCTION

With the growth in technologies and services over the past twenty years, datasets have been growing at a rapid rate. At the start of the computer era, the size of data was measured in KB (kilobytes). Later the scale extended to MB (megabytes), GB (gigabytes), TB (terabytes), PB (petabytes), EB (exabytes), and ZB (zettabytes) [1]. The size of databases in today's enterprises has been rising at exponential rates, and the need to process and examine these large volumes of data for business decision making has increased along with it.

In addition, petabytes of data produced by numerous business and scientific applications are processed in structured ways. The interactions of millions of people using mobile devices and the Internet every day also add to the flood of data. The rising volume of enterprise information, medical records, mobile sensing data, multimedia, social media, and so on will fuel exponential data growth in the future. This huge amount of data is called "Big Data" [2]. The scale of such data is illustrated in Table I [3].

TABLE I. TERMS USED FOR BIG DATA STORAGE CAPACITY [3]

Term          | Size                                              | Example of Capacity
Gigabyte (GB) | 1,000,000,000 (one billion bytes)                 | 1 GB = 2 hours of CD-quality audio or 7 minutes of HDTV
Terabyte (TB) | 1,000,000,000,000 (one trillion bytes)            | 1 TB = 2,000 hours of CD-quality audio or 5 days of HDTV
Petabyte (PB) | 1,000,000,000,000,000 (one quadrillion bytes)     | 1 PB = 2 million hours of CD-quality audio
Exabyte (EB)  | 1,000,000,000,000,000,000 (one quintillion bytes) | 1 EB = 2 billion hours of CD-quality audio
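To make these scales concrete, the following minimal Python sketch (our illustration, not part of the original paper) converts a raw byte count into the decimal units of Table I:

```python
# Illustrative only: map a raw byte count onto the decimal units of Table I.
UNITS = ["bytes", "KB", "MB", "GB", "TB", "PB", "EB", "ZB"]

def human_readable(num_bytes: float) -> str:
    """Divide by 1000 until the count fits the next unit name."""
    for unit in UNITS:
        if num_bytes < 1000:
            return f"{num_bytes:.1f} {unit}"
        num_bytes /= 1000
    return f"{num_bytes:.1f} YB"

print(human_readable(2.5e18))  # 2.5 quintillion bytes -> "2.5 EB"
```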
Big Data is a term that arose from the need of big companies such as Yahoo, Google, and Facebook to analyze the large amounts of unstructured data generated every second. Some illustrative facts follow:

• An International Data Corporation (IDC) study forecasts that the world's data will grow 50-fold by 2020 [4], driven in large part by embedded systems such as sensors in clothing and medical devices. The study also finds that unstructured information such as files, email, and video will account for 90% of all data produced over the next decade [5].

• There are 277,000 tweets per minute, 2 million queries searched on Google per minute, and 72 hours of new video uploaded to YouTube every minute [6]. Beyond these, more than 100 million emails are sent, 350 GB of data is processed on Facebook, and more than 570 websites are created every minute [7]. In 2012, 2.5 quintillion bytes of data were produced every day [8].

In addition, many issues of big data are covered in the public media, such as The Economist [9], the New York Times [10], and National Public Radio [11]. Among scientific journals, two, namely Nature and Science, opened special columns to deliberate the challenges and impacts of big data [12]. The era of big data has arrived beyond all doubt. The rapid rise in the volume of global data is shown in Fig. 1 [13]. Because this volume is increasing far faster than computing resources, such datasets are becoming very difficult to handle.

Fig. 1: The continuous rise of global data [13]

Compared with traditional datasets, big data contains amorphous data that require real-time analysis. Big data also provides new opportunities for discovering new values, offers a thorough understanding of hidden values, and raises new challenges, i.e., how to effectively manage and organize such datasets. These challenges include analysis, data privacy, security, data management, search, sharing, capture, transfer, and visualization.
These challenging problems demand prompt solutions:

• The latest developments in information technology (IT) make it ever easier to generate data; for example, 72 hours of new video are uploaded to YouTube every minute. The main challenge we face is therefore to collect and integrate vast data from distributed data sources.

• With the growth of large datasets, a major challenge for researchers and practitioners is how to manage and store petabytes of data with the latest technologies. Such enormous data will far surpass the capacities of existing IT architectures and enterprise infrastructures, and its real-time requirements will greatly strain the available computing capacity. For instance, Wal-Mart estimates that it collects more than 2.5 petabytes of data every hour from its customer transactions. We therefore need to design cloud computing platforms that can keep pace with such intensive workloads.

• With such heterogeneous data, it becomes very difficult for IT companies to identify the right data and determine how best to use it. To utilize it well, datasets are mined at different levels of analysis, forecasting, visualization, and modeling to reveal their intrinsic properties and improve decision making.
• The technology landscape in the data world is growing at a fast rate. Leveraging data with an innovative technology partner helps create the IT architecture in the right way and adapt efficiently to changes in that landscape.

• Finally, security concerns about data protection are a major obstacle preventing companies from taking full advantage of their data.

To tackle these challenges, researchers have proposed various solutions for big data systems, mostly in an ad-hoc manner. Cloud computing serves as one infrastructure for big data systems, meeting requirements such as reliability, composability, ubiquitous access, and scalability. For efficient storage and management of vast datasets, NoSQL databases and distributed file systems are used. MapReduce is a software framework used to process group-aggregation tasks such as website ranking. Hadoop has also gained importance in processing distributed data, along with the storage and management of systems, in order to build powerful system-level solutions to big data challenges. Many big data applications can be explored with the help of these innovative technologies, and their explosion calls for a systematic framework that captures modern big data research and development efforts and applies those advances across different subject areas.

The remainder of the paper is organized as follows. Section II provides definitions and characteristics of big data. Section III describes big data analytics. Section IV presents the architectural view of a big data system. Section V concludes the paper.

II. DEFINITIONS AND CHARACTERISTICS OF BIG DATA

The term "Big Data" was first coined by Roger Magoulas of O'Reilly Media in 2005 to describe a great amount of data that traditional data management techniques are unable to process and manage due to its complexity and size. There are three aspects to big data: (a) numerous data are available; (b) the data cannot be categorized in ordinary relational databases; and (c) the data are generated, captured, and processed rapidly. In addition, many fields are being transformed with the help of big data, including science, healthcare, finance, engineering, business, and society at large. Madden characterizes big data as data that is "too big, too fast, or too hard" for existing tools to handle [14]. "Too big" means organizations must deal with zettabyte-scale collections of data originating from web sources, sensors, audio, pictures, videos, and so on. "Too fast" means data must be processed quickly regardless of its size, for example when carrying out fraud detection or choosing an ad to display. "Too hard" means such data cannot be processed by existing tools.

In 2001, industry analyst Doug Laney articulated the three Vs of Big Data that became its mainstream definition. The features of Big Data are shown in Fig. 2 [15].

• Volume: Day by day, data grows everywhere, from MB and GB to TB, PB, EB, and ZB. According to IBM, 2.5 quintillion bytes of data are generated every day, which implies that 90% of today's data was created in the previous two years alone. This data comes from everywhere: climate information gathered by sensors, posts to social media sites, images, videos, purchase transaction records, and so on. To obtain the desired results, the data needs to be manipulated and analyzed so that unstructured data becomes structured. Manipulating and analyzing such a large volume of data poses a great challenge, however, because considerable resources are required before the desired results materialize.
[…] is so large that it is very difficult to process using traditional database and software methods.

IV. BIG DATA SYSTEM ARCHITECTURE

Broadly, big data involves the gathering, processing, and management of data to produce new information for the end user. With the rise of such large datasets, the key challenges are often related to their storage, transportation, and processing. To obtain appropriate results, data needs to be mined or cleaned, tagged, classified, or formatted to separate one kind of data from another. In other words, a big data system is a complex system with multiple distinct phases that deal with different applications in the digital data life cycle. In industry, a systems-engineering approach is widely adopted that decomposes a typical big data system into four consecutive phases: data generation, data acquisition, data storage, and data analytics. The data value chain provides a framework for examining how different data can be brought together in an organized fashion and turned, through decisions, into valuable information, as shown in Fig. 3 [18].

• […] data has been expected to double every two years, reaching 40 zettabytes (ZB) by 2020 or upward of 44 ZB (i.e., equivalent to 44 trillion gigabytes) [19]. The number of sensors has been growing by more than 30 percent per year.

• Business Data: In 2012, the amount of new data generated each day was 2.2 million TB, and 90% of the world's current data originated during the last two years [20]. The market value of big data rose from $3.2 billion in 2010 to $16.9 billion in 2015 [21]. Facebook alone accesses, analyzes, and stores 30+ PB of user data [22], which is still a small share of the world's digital data, i.e., 2.7 ZB in 2012 [23]. Google's processing reached 20,000 TB of data daily by 2008 [24], and Wal-Mart processed over 1 million customer transactions per hour, generating more than 2.5 PB of data. More than 5 billion people were messaging, calling, and browsing on mobile and social devices [25]. By one calculation, an e-mail is sent every 3.5 × 10^-7 seconds, i.e., roughly 2.9 million e-mails per second. According to IDC, business data is expected to reach a total of 40 ZB by the end of 2020 [26].

(iii) Data Acquisition: In this step, data are acquired from different sources in digital form for storage and analysis. It is divided into three sub-steps: data collection, data transmission, and data preprocessing [27]. These steps are shown in Fig. 4 [28].

Fig. 4: Data acquisition [28]
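As a rough sketch of how these three sub-steps compose, the following Python illustration chains collect/transmit/preprocess helpers over toy records (the function names and record layout are our assumptions, not the paper's):

```python
# Hypothetical pipeline for the three acquisition sub-steps named above.
from typing import Iterable, List

Record = dict

def collect(source: Iterable[Record]) -> List[Record]:
    """Data collection: pull raw records from a source (here, any iterable)."""
    return list(source)

def transmit(records: List[Record]) -> List[Record]:
    """Data transmission: stand-in for shipping records to a data center."""
    return records  # in practice: serialize and send over the network

def preprocess(records: List[Record]) -> List[Record]:
    """Data preprocessing: drop records missing required fields."""
    return [r for r in records if "id" in r and r.get("value") is not None]

raw = [{"id": 1, "value": 10}, {"value": None}, {"id": 2, "value": 20}]
print(preprocess(transmit(collect(raw))))
# -> [{'id': 1, 'value': 10}, {'id': 2, 'value': 20}]
```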
• Sensors: Sensors are often used to capture physical quantities, which are converted into digital signals for storage and processing. They may be further categorized by what they measure: sound wave, voice, vibration, automobile, chemical, pressure, weather, current, and temperature.

(b) Data Centre Transmission: After raw data are collected, they are transferred to a data storage infrastructure, commonly a data center, for subsequent processing and analysis. A high-capacity trunk medium channels big data from its source to the data center, with the ability to re-route traffic in case of link failure.

(c) Data Pre-Processing: Since data are collected from various sources, they differ in quality with respect to noise, consistency, redundancy, and so on. Data pre-processing techniques are designed to improve data quality, which improves the accuracy of analysis and reduces storage expenses. The following three techniques are used to process the data:
• Integration: This technique combines data from various sources to provide a unified view of the data. Two commonly used traditional methods are data warehousing and data federation. Data warehousing is also called ETL, which consists of three steps: extract, transform, and load.

- Extraction: Data are selected, collected, processed, and analyzed after connecting to the source systems.

- Transformation: The extracted data are converted into a standard format.

- Loading: The extracted and transformed data are imported into a target storage infrastructure.

The second method is data federation, in which data are queried and aggregated from various sources through a virtual database, making integration more dynamic. The virtual database does not contain the data itself; it holds only information about the original data and its location.
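A minimal ETL sketch of the extract, transform, and load steps just described might look as follows; the CSV source and the in-memory "warehouse" are hypothetical stand-ins for real source systems and a target storage infrastructure:

```python
# Illustrative ETL: extract rows from a source, standardize them, load them.
import csv
import io

SOURCE = "name,age\nalice,30\nbob,forty\n"  # raw export from a source system

def extract(text: str) -> list:
    """Extraction: read records after connecting to the source system."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows: list) -> list:
    """Transformation: coerce each record into a standard format."""
    out = []
    for row in rows:
        try:
            out.append({"name": row["name"].title(), "age": int(row["age"])})
        except ValueError:
            continue  # "forty" cannot be parsed as an integer age
    return out

def load(rows: list, target: list) -> None:
    """Loading: import transformed records into the target store."""
    target.extend(rows)

warehouse: list = []
load(transform(extract(SOURCE)), warehouse)
print(warehouse)  # [{'name': 'Alice', 'age': 30}]
```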
• Data Cleaning: This technique removes irrelevant or inaccurate data to improve data quality with respect to accuracy, completeness, and consistency. It is also called data scrubbing or data cleansing. It is carried out through five sub-processes:

- Types of errors are defined and determined.
- Errors are identified in the data.
- Errors are corrected.
- Error types and instances are documented.
- Data-entry procedures are modified to avoid future errors.
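The five sub-processes can be read as a small loop over the data; the sketch below applies them to one field, with an out-of-range rule (ages must lie in 0-120) as a hypothetical error definition:

```python
# Illustrative cleaning pass following the five sub-processes above.
records = [{"age": 34}, {"age": -5}, {"age": 200}, {"age": 61}]
error_log = []  # (4) documentation of error types and instances

def is_error(record: dict) -> bool:
    """(1)-(2) Define the error type and identify it in the data."""
    return not (0 <= record["age"] <= 120)

for record in records:
    if is_error(record):
        error_log.append(("age out of range", dict(record)))  # (4) document
        record["age"] = None  # (3) correct: null out the invalid value

# (5) Modifying data-entry procedures happens upstream, outside this code.
print(records)
print(error_log)
```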
• Redundancy Elimination: Many datasets contain repeated or duplicated data, called redundant data. Redundancy unnecessarily increases the data transmission overhead and degrades storage systems through wasted storage space, inconsistent data, reduced reliability, and corrupted data. To overcome this problem, researchers have proposed techniques such as data compression [29], redundancy detection [30], and data de-duplication [31]. Data de-duplication removes duplicate copies of repeating data: a unique segment of data is allocated and stored once during storage, and for each subsequent duplicate only an identification is added to the identification list. In this way, it greatly helps to reduce the storage space needed for big data.
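In the de-duplication scheme just described, a fingerprint can stand in for each data segment; the sketch below uses SHA-256 hashes as fingerprints (an assumption on our part, not the paper's design):

```python
# Illustrative hash-based de-duplication: unique segments are stored once;
# repeats only append an entry to the identification list.
import hashlib

store = {}            # fingerprint -> unique data segment
identifications = []  # order in which segments arrived

for segment in [b"block-A", b"block-B", b"block-A"]:
    fingerprint = hashlib.sha256(segment).hexdigest()
    if fingerprint not in store:
        store[fingerprint] = segment      # store the unique segment once
    identifications.append(fingerprint)   # duplicates extend only this list

print(len(store), "unique segments stored for", len(identifications), "writes")
```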
(iv) Data Storage: This step is concerned with coordinating and organizing the collected information in storage systems for analysis and value extraction. To carry out this process, the data storage scheme must possess two features:

• The storage infrastructure must hold the information in a persistent and reliable manner.
• The storage system must provide scalable access for analyzing and processing the large amount of data.

The functional components of a data storage system can be divided into two parts: hardware infrastructure and data management.

- Hardware infrastructure physically stores the collected information in storage devices based on the chosen technology.

o Random Access Memory (RAM): RAM holds volatile information, meaning the data are lost when power is turned off. Modern RAM includes dynamic RAM (DRAM), static RAM (SRAM), and phase-change RAM (PRAM).

- Data Management Framework: This concerns how data can be organized in a convenient manner for efficient processing. Its layered structure is classified into three layers:

o File Systems: File systems provide the base for big data storage and are widely used in both industry and academia; some are open source, while others are designed for enterprise purposes.

o Database Technology: Database technologies have been used for decades to store datasets in storage systems across diverse applications. Nowadays, NoSQL databases are becoming popular for big data because of characteristics such as schema-free design, flexible consistency, and the ability to handle large amounts of data; traditional database systems cannot cope with all of big data's variety and scale challenges. Examples of modern database systems include the following.

Key-value stores: This is the simplest model, in which data are stored as key-value pairs and a unique key identifies each client's request. The best-known example is Amazon's Dynamo [32]. In Dynamo, data are collected, partitioned across a cluster of servers, and finally replicated into multiple copies.
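The key-value model reduces to two operations, put and get; the toy class below illustrates that interface with an in-memory dictionary (Dynamo itself partitions and replicates the data across a server cluster, which this sketch does not attempt):

```python
# Minimal key-value interface in the spirit of the model described above.
from typing import Optional

class KVStore:
    def __init__(self) -> None:
        self._data: dict = {}

    def put(self, key: str, value: bytes) -> None:
        """Store the value under its unique key."""
        self._data[key] = value

    def get(self, key: str) -> Optional[bytes]:
        """Look the value up by key, or return None if absent."""
        return self._data.get(key)

kv = KVStore()
kv.put("user:42", b'{"name": "alice"}')
print(kv.get("user:42"))  # b'{"name": "alice"}'
```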
MapReduce: This model was developed by Google for processing and storing large datasets on commodity hardware; the processed datasets are stored across clusters. It has two component functions, map() and reduce(). The map function groups intermediate pairs (i.e., <key, value> pairs) that share the same intermediate key and passes them to the reduce function. The reduce function receives each intermediate key and merges its values to produce a smaller set of values [33]. In other words, a MapReduce framework works on a master-slave architecture, in which numerous slave nodes are handled by one master node. To provide equal load distribution, the input datasets are divided into evenly sized data blocks, and each block is assigned to one slave node. To generate the result, the data blocks are processed with map and merged together with reduce. The framework is scalable, handles heterogeneous datasets, provides distributed and parallel I/O scheduling to address the volume aspect, is fault tolerant, and retrieves data efficiently. It was designed to overcome the problems of large-scale data processing.
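The classic word-count example makes the map/reduce division of labor concrete: map() emits intermediate <key, value> pairs, the pairs are grouped by key, and reduce() merges each group into a smaller set of values (a single-process sketch, not a distributed implementation):

```python
# Illustrative single-process word count in the MapReduce style.
from itertools import groupby
from operator import itemgetter

def map_fn(document: str):
    """Map: emit an intermediate <word, 1> pair for every word."""
    for word in document.split():
        yield (word, 1)

def reduce_fn(key: str, values: list):
    """Reduce: merge all values that share one intermediate key."""
    return (key, sum(values))

docs = ["big data big systems", "data systems"]
pairs = sorted((p for d in docs for p in map_fn(d)), key=itemgetter(0))
result = [reduce_fn(k, [v for _, v in grp])
          for k, grp in groupby(pairs, key=itemgetter(0))]
print(result)  # [('big', 2), ('data', 2), ('systems', 2)]
```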
(v) Data Analysis: This is the last and most important step of the big data value chain. It uses analytical tools and methods to inspect, transform, and model data in order to extract value. The main emphasis of this classification is to highlight the importance of data characteristics in each area. The purpose of an analytical tool is to extract as much relevant information as possible from the subject under consideration.
V. CONCLUSION

In this paper, the concept of big data has been presented. Big data is the collection of large and complex datasets generated from various sources such as email attachments, social media comments, video playback, and so on. We discussed the three V's of big data, i.e., volume, velocity, and variety, which play a major role in big data analytics. We also illustrated the concept of the big data value chain with its four main stages: data generation, data acquisition, data storage, and data analysis. During generation and storage the data may vary in form, i.e., video, audio, images, etc. For easier understanding of big data analytics, researchers have divided the field into various big data applications such as web analytics and text analytics. Finally, to improve the efficiency of government sectors and industries, there is an urgent need for advanced data management, analysis, and acquisition mechanisms for typical big data applications to generate profit.
REFERENCES

[1] Hari Kumar and P. UmaMaheswari, "Literature survey on big data and preserving privacy for the big data in cloud," IJTRA, vol. 2, no. 6, pp. 128-133, Nov.-Dec. 2014, ISSN 2320-8163.
[2] Apache Hive. Available at: http://hive.apache.org
[3] https://support.rackspace.com/white-paper/turning-big-data-into-big-dollars/
[4] R. Devakunchari, "Analysis on big data over the years," IJSRP, vol. 4, no. 1, Jan. 2014, ISSN 2250-3153.
[5] "World's data will grow by 50X in next decade, IDC study predicts," http://www.computerworld.com/s/article/9217988/World_s_data_will_grow_by_50X_in_next_decade_IDC_study_predicts
[6] M. Chen, S. Mao, and Y. Liu, "Big data: a survey," Mobile Networks and Applications, vol. 19, pp. 171-209, 2014, DOI 10.1007/s11036-013-0489-0, Springer Science+Business Media New York.
[7] http://www.how-much-data-is-on-the-internet-and-generated-online-every-minute/
[8] Shilpa and Manjit Kaur, "Big data and methodology - a review," IJARCSSE, vol. 3, no. 10, Oct. 2013, ISSN 2277-128X.
[9] "Drowning in numbers - digital data will flood the planet - and help us understand it better," 2011, http://www.economist.com/blogs/dailychart/2011/11/bigdata-0
[10] S. Lohr, "The age of big data," New York Times, p. 11, 2012.
[11] N. Yuki, "The search for analysts to make sense of big data," 2011, http://www.npr.org/2011/11/30/142893065/the-searchforanalyststo-make-sense-of-big-data
[12] "Special online collection: dealing with big data," 2011, http://www.sciencemag.org/site/special/data/
[13] http://bigdatamsritise.blogspot.in/2015/12/introduction-abstract-term-of-big-data.html
[14] H. J. Hadi, A. H. Shnain, S. H. Shaheed, and A. H. Ahmad, "Big data and five V's characteristics," in Proc. IRF International Conference, Tirupati, India, Nov. 2014, ISBN 978-93-84209-61-2.
[15] M. Grobelnik, "Big-data tutorial," Jozef Stefan Institute, Ljubljana, Slovenia, May 2012.
[16] S. Kaisler, F. Armour, J. A. Espinosa, and W. Money, "Big data: issues and challenges moving forward," in Proc. 46th Hawaii International Conference on System Sciences, IEEE, 2013.
[17] A. Katal, M. Wazid, and R. H. Goudar, "Big data: issues, challenges, tools and good practices," IEEE, 2013.
[18] H. Hu, Y. Wen, T.-S. Chua, and X. Li, "Toward scalable systems for big data analytics: a technology tutorial," IEEE Access, vol. 2, pp. 652-687, 2014, DOI 10.1109/ACCESS.2014.2332453.
[19] http://www.zdnet.com/article/the-internet-of-things-and-big-data-unlocking-the-power/
[20] Marcia, "Data on big data," 2012, http://marciaconner.com/blog/data-on-big-data/
[21] Facebook statistics, 2014, http://www.statisticbrain.com/facebook-statistics/
[22] K. Douglas, "Infographic: big data brings marketing big numbers," 2012, http://www.marketingtechblog.com/ibm-big-data-marketing/
[23] B. Buxton, V. Hayward, I. Pearson et al., "Big data: the next Google. Interview by Duncan Graham-Rowe," Nature, vol. 455, no. 7209, pp. 8-9, 2008.
[24] P. Russom, "Big data analytics," TDWI Best Practices Report, Fourth Quarter, 2011.
[25] S. Sagiroglu and D. Sinanc, "Big data: a review," in Proc. International Conference on Collaboration Technologies and Systems (CTS '13), IEEE, San Diego, CA, USA, May 2013, pp. 42-47.
[26] H. V. Jagadish, D. Agrawal, P. Bernstein, E. Bertino et al., "Challenges and opportunities with big data," The Community Research Association, 2015.
[27] C. K. Emani, N. Cullot, and C. Nicolle, "Understandable big data: a survey," Computer Science Review, vol. 17, pp. 70-81, 2015.
[28] Y. Zhang, J. Callan, and T. Minka, "Novelty and redundancy detection in adaptive filtering," in Proc. 25th Annu. Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2002, pp. 81-88.
[29] D. Salomon, Data Compression. New York, NY, USA: Springer-Verlag, 2004.
[30] S. Sarawagi and A. Bhamidipaty, "Interactive deduplication using active learning," in Proc. 8th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, 2002, pp. 269-278.
[31] G. DeCandia et al., "Dynamo: Amazon's highly available key-value store," in Proc. ACM Symposium on Operating Systems Principles (SOSP), 2007.
[32] C. Ranger, R. Raghuraman, A. Penmetsa, G. Bradski, and C. Kozyrakis, "Evaluating MapReduce for multi-core and multiprocessor systems," in Proc. IEEE 13th Int. Symposium on High Performance Computer Architecture (HPCA '07), Washington, DC, USA, 2007, pp. 13-24.
[33] J. H. Howard et al., "Scale and performance in a distributed file system," ACM Trans. Comput. Syst., vol. 6, no. 1, pp. 51-81, 1988.