


International Conference on Advances in Computing, Communication Control and Networking (ICACCCN2018)

An Outline on Big Data and Big Data Analytics

Devottam Gaurav, Research Scholar, Bennett University, Gr. Noida, Uttar Pradesh ([email protected])
Jay Kant Pratap Singh Yadav, Assistant Professor, NIET, Gr. Noida, Uttar Pradesh ([email protected])
Rohit Kumar Kaliyar, Research Scholar, Bennett University, Gr. Noida, Uttar Pradesh ([email protected])
Dr. Ayush Goyal, Assistant Professor, Texas A&M University-Kingsville, USA ([email protected])

Abstract—Rapid advances in technological applications have led to a flood of data from sources such as the web, social networks, business records and medical records over the preceding years. Compared to traditional data, big data reveals unique characteristics, captured by its three V's, and much of it is unstructured. The emerging trend therefore requires advanced data analysis, acquisition and management techniques to mine and collect the appropriate data in a structured way. In this paper, we describe the definitions and the challenges of big data systems. Next, a systematic framework decomposes the architecture of a big data system into four stages: data generation, data acquisition, data storage and data analytics. These stages form the basis of the big data value chain. Finally, some solutions to the challenges of big data are discussed, along with the future attention that big data systems require.

Keywords—big data, big data analytics, data acquisition

I. INTRODUCTION

With the growth of technologies and services over the past twenty years, datasets have been expanding at a rapid speed. At the start of the computer era, the size of data was measured in KB (kilobytes). Later the scale extended to MB (megabytes), GB (gigabytes), TB (terabytes), PB (petabytes), EB (exabytes) and ZB (zettabytes) [1]. The size of databases in today's enterprises has been rising at exponential rates, and the need to process and examine these large volumes of data for business decision making has increased along with it.

In addition, petabytes of structured data are produced by numerous business and scientific applications, and the daily interactions of millions of people using mobile devices and the Internet add to the flood. The rising volume of enterprise information, medical records, mobile sensing data, multimedia, social media and so on will fuel exponential data growth in the future. This huge amount of data is called "Big Data" [2]. The scale of such data is illustrated in Table I [3].

TABLE I. TERMS USED FOR BIG DATA STORAGE CAPACITY [3]

Term          | Size                                              | Example of Capacity
Gigabyte (GB) | 1,000,000,000 (one billion bytes)                 | 1 GB = 2 hours of CD-quality audio or 7 minutes of HDTV
Terabyte (TB) | 1,000,000,000,000 (one trillion bytes)            | 1 TB = 2,000 hours of CD-quality audio or 5 days of HDTV
Petabyte (PB) | 1,000,000,000,000,000 (one quadrillion bytes)     | 1 PB = 2 million hours of CD-quality audio
Exabyte (EB)  | 1,000,000,000,000,000,000 (one quintillion bytes) | 1 EB = 2 billion hours of CD-quality audio

Big Data is a term coined from the need of big companies such as Yahoo, Google and Facebook to analyze the large amounts of unstructured data generated every second. Some facts follow:

• An International Data Corporation (IDC) study forecasts that the world's data will grow by 50 times by 2020 [4], driven in large part by embedded systems such as sensors in clothing and medical devices. The same study determines that unstructured information such as files, email and video will account for 90% of all data produced over the next decade [5].

• Every minute, 277,000 tweets are posted, 2 million queries are searched on Google and 72 hours of new video are uploaded to YouTube [6]. On top of that, more than 100 million emails are sent, 350 GB of data is processed on Facebook and more than 570 websites are created per minute [7]. In the year 2012, 2.5 quintillion bytes of data were produced every day [8].

Beyond this, many issues of big data are covered in the public media, such as The Economist [9], the New York Times [10] and National Public Radio [11]. Among scientific journals, two, namely Nature and Science, have opened special columns to deliberate the challenges and impacts of big data [12]. The era of big data has arrived beyond all doubt. The rapid rise in the volume of big data is shown in Fig. 1 [13]. Such datasets are becoming very difficult to handle, because their volume is increasing at a rapid speed in comparison to computing resources.

Compared with traditional datasets, big data contains amorphous data that requires real-time analysis. In addition, big data provides new opportunities for finding new values, offers a thorough understanding of hidden values, and brings new challenges, i.e., how to effectively manage and organize such datasets.




These include many challenging problems of analysis, data privacy, security, data management, search, sharing, capture, transfer and visualization.

Fig. 1. The continuous rise of global data [13]

These challenging problems demand prompt solutions:

• The latest developments in the information technology (IT) field make it ever easier to generate data; for example, every minute, 72 hours of new video are uploaded to YouTube. The main challenge we face is therefore to collect and integrate such vast data from many distributed sources.

• As datasets grow, a major challenge for researchers and practitioners is how to manage and store petabytes of data with the latest technologies. Such enormous data will far surpass the capacities of existing enterprise IT architectures and infrastructure, and its real-time requirements will greatly stress the available computing capacity. For instance, by its own estimate, Wal-Mart collects more than 2.5 petabytes of data every hour from its customer transactions. We therefore need to design appropriate cloud computing platforms that can handle such intensive workloads.

• With such heterogeneous data, it becomes very difficult for IT companies to identify the right data and determine how best to use it. To utilize the data well, datasets are mined at different levels of analysis, forecasting, visualization and modeling in order to reveal their intrinsic properties and improve decision making.

• The technology landscape in the data world is evolving at a fast rate. Leveraging data with an innovative technology partner helps to build the IT architecture in the right way and to adapt efficiently to changes in that landscape.

• Finally, security concerns about data protection are a major obstacle preventing companies from taking full advantage of their data.

To tackle these challenges, various solutions have been proposed by eminent researchers for big data systems, often in an ad-hoc manner. Cloud computing is considered one such infrastructure for big data systems, meeting requirements of reliability, composability, ubiquitous access and scalability. For efficient storage and management of vast datasets, NoSQL and distributed file systems are used. MapReduce is a software framework used for processing group-aggregation tasks such as website ranking. Hadoop has also gained importance in processing distributed data, along with the storage and management of systems, in order to build powerful system-level solutions to big data challenges. Many big data applications can be explored with the help of these innovative technologies, and their explosion requires a systematic framework that captures modern big data research and development efforts and applies those advancements across subject areas.

The remainder of the paper is organized as follows. Section II provides definitions and characteristics of big data. Section III describes big data analytics. Section IV presents the architectural view of big data systems, including the stages of the big data value chain. Finally, Section V concludes the paper.

II. DEFINITIONS AND CHARACTERISTICS OF BIG DATA

The term "Big Data" was first coined by Roger Magoulas of O'Reilly Media in 2005 to describe amounts of data so large and complex that traditional data management techniques are unable to process and manage them. There are three aspects to big data: (a) numerous data are available; (b) the data cannot be categorized into relational databases; and (c) the data is generated, captured and processed rapidly. In addition, many fields are being transformed with the help of big data: science, healthcare, finance, engineering, business and society itself. Madden characterizes big data as "data that is too big, too fast or too hard for existing tools to process" [14]. "Too big" means that organizations must deal with zettabyte-scale collections of data that originate from web sources, sensors, audio, images, video and so on. "Too fast" means that data must be processed quickly regardless of its size, for example when carrying out fraud detection or choosing an ad to display. "Too hard" means that such data cannot be processed by existing tools.

In 2001, industry analyst Doug Laney articulated the three Vs of Big Data that became its mainstream definition. The features of Big Data are shown in Fig. 2 [15].

• Volume: Day by day, data everywhere grows from MB through GB, TB, PB and EB to ZB. According to IBM, 2.5 quintillion bytes of data are generated every day, which itself implies that 90% of the world's data was created in the previous two years. This data comes from everywhere: climate information gathered by sensors, posts to social media sites, images, videos, purchase transaction records and so on. To obtain the desired results, the data needs to be manipulated and analyzed so that structured data can be derived from unstructured data. Manipulating and analyzing such a large volume of data poses a great challenge, however, because it requires substantial resources before the required results eventually materialize.


As the size of data is unbounded, it is very difficult to process, because our computer systems are limited by current technology in their processing speed. To overcome this limitation, infrastructure with higher processing speeds must be developed, but the cost of developing it is correspondingly higher. We conclude that data of this kind can be treated as "high volume big data".

Fig. 2. Features of Big Data [15]

• Velocity: The growth of information and the social networking explosion have changed how we look at data. There was a time when we believed that yesterday's data was recent, and this still holds for newspapers. However, news and radio channels have upgraded their systems with the latest technology so that their news reaches people quickly. Beyond that, people mostly rely on social media such as Facebook and Twitter to keep themselves updated with the latest news; they discard old messages and pay attention to the latest updates. Data movement is now almost real time, and the update window has been reduced to fractions of a second. This high velocity represents "Big Data".

In other words, speed plays an important role in processing huge amounts of data. Many e-commerce websites depend entirely on this factor: Google has noted that page speed is essential for good ranking, and speed also makes it comfortable for a customer or visitor to carry out tasks such as shopping and surfing. The same holds when processing other information. The term "velocity" refers to the speed at which data is created, stored, analyzed and visualized. In earlier times, with batch processing, it was normal to get an update from the database every week or every night, and servers and computers required significant time to update databases or to process data. With big data, the creation of data takes place in real time or near real time: when devices are connected to the Internet, whether wireless or wired, they pass their data on the moment it is created.

The rapid increase of data is beyond our imagination. On YouTube, 72 hours of video are uploaded every minute. In addition, every minute more than 200 million emails are sent, about 20 million photos are viewed, 30,000 photos are uploaded on Flickr, nearly 300,000 tweets are sent and about 2.5 million queries are performed on Google. Another source of high-velocity big data is social media: Twitter users are estimated to generate nearly 100,000 tweets every 60 seconds, and 700,000 posts are made on Facebook. The challenge organizations have to cope with is the speed at which data is created and must be put to use in real time. Velocity thus pertains not only to how quickly data is generated but also to how quickly it is interpreted and acted upon. Data of this kind can be treated as "high velocity big data".

• Variety: Data is available in numerous formats: databases, Word documents, CSV files, text files, CDs, pen drives, etc. Sometimes data is not available in a traditional format at all; it may come as video, SMS, PDF or in forms we might not have thought of. The biggest challenge we face is how to arrange all this data so that users can read and access it in a meaningful way. Social networking information is also a good example of the variety of data that characterizes big data: web-based social networking data, like about 80-90% of all information today, is unstructured.

To conclude, dealing with big data in an effective manner requires creating value against these features, or characteristics, of big data [16, 17].

III. BIG DATA ANALYTICS

Big data analytics refers to the process of collecting, organizing and analyzing vast data sets (big data) to discover patterns and other useful insights. Analytics helps organizations to better understand the information contained within their data and to identify the data most relevant to the business; it also supports better business decisions in the future. Collectively, these processes are highly integrated functions of high-performance analytics. In other words, they are a specific form of analytics that can be regarded as advanced analytics. To analyze such large datasets, analysts usually use specialized software tools and applications based on SQL queries, data mining, text mining, predictive analytics, forecasting, data optimization and fast clustering. The list can be extended to cover information visualization, artificial intelligence, natural language processing, and database capabilities that support analytics (for example, MapReduce, in-database analytics, in-memory databases and columnar data stores). Big data analytics could well be renamed discovery analytics rather than advanced analytics, because business analysts use it to discover new business facts that nobody knew before. For that, the analyst needs huge volumes of data with plenty of detail, often data that the enterprise has not yet tapped for analysis. Big data analysis is nonetheless a challenge for most organizations. First, there is the task of breaking down information silos to access all the data an organization stores in different places and often in different systems. Second, there is the challenge of building a platform that can pull in unstructured data as readily as structured data.


The volume of such datasets is so large that it is very difficult to process them using traditional database and software methods.

IV. BIG DATA SYSTEM ARCHITECTURE

Broadly, big data is seen as the gathering, processing and management of data to produce new information for the end user. With such large datasets, the key challenges that arise are often related to the storage, transportation and processing of the data. To get appropriate results, data needs to be mined or cleaned, tagged, classified or formatted so that one kind of data can be separated from another. In other words, a big data system is a complex system with multiple distinct phases that deal with different applications across the digital data life cycle. In industry, a systems-engineering approach is widely adopted to decompose a typical big data system into four consecutive phases: data generation, data acquisition, data storage and data analytics. The data value chain provides a framework for examining the ways in which different data can be brought together in an organized fashion, with decisions along the way that create valuable information, as shown in Fig. 3 [18].

Fig. 3. Big Data Value Chain [18]
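To make the flow of these four phases concrete, the following minimal sketch (ours, not from the paper; the toy click-stream records and all function names are invented) chains a generation, acquisition, storage and analytics step in Python:

    # Minimal sketch of the four-stage big data value chain described above.
    # All names and the toy click-stream data are illustrative only.

    def generate():
        # Data generation: emit raw records (simulated click-stream events).
        return [{"user": "u1", "page": "/home"},
                {"user": "u2", "page": "/cart"},
                {"user": "u1", "page": "/home"}]

    def acquire(records):
        # Data acquisition: collect and pre-process (here: drop duplicates).
        seen, cleaned = set(), []
        for r in records:
            key = (r["user"], r["page"])
            if key not in seen:
                seen.add(key)
                cleaned.append(r)
        return cleaned

    def store(records, db):
        # Data storage: persist records into a store (an in-memory list here).
        db.extend(records)
        return db

    def analyze(db):
        # Data analytics: extract value (here: page-visit counts).
        counts = {}
        for r in db:
            counts[r["page"]] = counts.get(r["page"], 0) + 1
        return counts

    db = []
    print(analyze(store(acquire(generate()), db)))  # {'/home': 1, '/cart': 1}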
(i) Data Discovery: Before performing any analysis, it is necessary to know which data sources are available and needed for decision making. The next task of data discovery is to generate and prepare the data.

(ii) Data Generation: The way big data is generated can be characterized by the data generation rate, which keeps rising because of growing technological investment. As IBM has stated, 2.5 quintillion bytes of data are generated every day, which itself implies that 90% of the world's data has been created in the last two years.

• Networking Data: Networking is present in every walk of human life, whether the Internet, mobile networks or the Internet of Things. Many network applications can be regarded as big data sources, including but not limited to search, click streams and websites. For example, since 2007, EMC and IDC have continuously tracked the size of the "Digital Universe" (DU), the digital data created, replicated and consumed in a single year. In 2012, EMC and IDC estimated that the DU would double every two years to reach 40 zettabytes (ZB) by 2020, or even rise to 44 ZB (equivalent to 44 trillion gigabytes) [19]. The number of sensors, meanwhile, was increasing by more than 30 percent per year.

• Business Data: In 2012, the amount of new data generated each day was 2.2 million TB, and 90% of the world's data at the time had originated during the previous two years [20]. The market value of big data grew from $3.2 billion in 2010 to $16.9 billion in 2015 [21]. Facebook alone accesses, analyzes and stores more than 30 PB of user data [22], still far less than the world's total digital data of about 2.7 ZB in 2012 [23]. Google's data processing had reached 20,000 TB daily by 2008 [24], and Wal-Mart has processed over 1 million customer transactions, generating more than 2.5 PB of data. More than 5 billion people were messaging, calling and browsing on mobile and social devices [25]. By one calculation, an email is sent every 3.5 × 10^-7 seconds. According to IDC, business data is expected to reach a total of 40 ZB by the end of 2020 [26].

(iii) Data Acquisition: In this step, data is acquired from different sources in digital form for storage and analysis. It is divided into three sub-steps: data collection, data transmission and data pre-processing [27], as shown in Fig. 4 [28].

Fig. 4. Data Acquisition Steps [28]: data collection (log files, sensors), data transmission (data centre transmission) and data pre-processing (integration, cleaning, redundancy elimination).

(a) Data Collection: In this phase, raw data is collected from real-world objects in a systematic manner; collecting inaccurate data would lead to invalid results in the analysis phase. Data is collected from a variety of sources: websites, click streams, images, videos, etc. As a result, this step depends not only on the physical characteristics of the data sources but also on the objective of the data analysis. Many techniques are used to collect accurate data; only a few are discussed in this subsection.

• Log Files: The web log file is one of the most commonly used data collection methods. Its purpose is to record the activities of web users, as a data source in a specified file format, for subsequent analysis. Log files are common to almost all applications running on digital devices, and the specified file format, referred to as the log file, is also used for debugging. The recorded activities include, for example, hits, click streams, accesses to other websites and other attributes of a web user's behavior.
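As a small illustration of log-file collection, the sketch below parses web-server entries in the Common Log Format (a widely used format; the sample lines here are invented) and extracts the host, request and status of each hit:

    import re

    # Sketch of log-file data collection: parse Common Log Format lines.
    LOG_PATTERN = re.compile(
        r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
        r'(?P<status>\d{3}) (?P<size>\S+)'
    )

    sample = [  # invented sample entries
        '203.0.113.7 - - [10/Oct/2018:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
        '203.0.113.9 - - [10/Oct/2018:13:55:40 +0000] "GET /cart HTTP/1.1" 404 152',
    ]

    for line in sample:
        m = LOG_PATTERN.match(line)
        if m:
            # Each parsed record is one "hit" usable for later analysis.
            print(m.group("host"), m.group("request"), m.group("status"))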


• Sensors: Sensors are often used to capture physical quantities, which are converted into digital signals for storage and processing. They may be categorized by what they measure: sound waves, voice, vibration, automotive signals, chemicals, pressure, weather, current and temperature.

(b) Data Centre Transmission: After raw data is collected, it is transferred to a data storage infrastructure, commonly a data center, for subsequent processing and analysis. High-capacity trunk media channel the big data from its source to the data center and can re-route traffic in case of link failure.

(c) Data Pre-Processing: Since data is collected from various sources, it differs in quality, exhibiting noise, inconsistency, redundancy and so on. Data pre-processing techniques are designed to improve data quality, which improves the accuracy of analysis and reduces storage expenses. The following three techniques are used to process the data:

• Integration: This technique combines data from various sources to provide a unified view of the data. Two commonly used traditional methods are data warehousing and data federation. Data warehousing is also called ETL and consists of three steps:
- Extraction: Data is selected, collected, processed and analyzed after connecting to the source systems.
- Transformation: The extracted data is converted into a standard format.
- Loading: The extracted and transformed data is imported into a target storage infrastructure.
The second technique is data federation, in which data is queried and aggregated from various sources, making integration more dynamic through a virtual database. The virtual database contains no data itself, only information about the original data and its location.
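A minimal sketch of the three ETL steps follows, assuming a toy CSV source and a SQLite target (both invented for illustration; real warehouses use dedicated ETL tooling):

    import csv, io, sqlite3

    # Sketch of the extract-transform-load (ETL) steps described above.
    raw_csv = "name,amount\nalice,10\nbob,twenty\ncarol,30\n"

    def extract(source):
        # Extraction: read records from the connected source (CSV text here).
        return list(csv.DictReader(io.StringIO(source)))

    def transform(rows):
        # Transformation: convert records to a standard format (amount as int);
        # rows that cannot be converted are skipped.
        out = []
        for r in rows:
            try:
                out.append((r["name"], int(r["amount"])))
            except ValueError:
                pass
        return out

    def load(rows, conn):
        # Loading: import the transformed data into the target store.
        conn.execute("CREATE TABLE sales (name TEXT, amount INTEGER)")
        conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)

    conn = sqlite3.connect(":memory:")
    load(transform(extract(raw_csv)), conn)
    print(conn.execute("SELECT * FROM sales").fetchall())  # [('alice', 10), ('carol', 30)]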
• Data Cleaning: This technique removes irrelevant or inaccurate data to improve data quality in terms of accuracy, completeness and consistency. It is also known as data scrubbing or data cleansing, and it can be carried out through five sub-processes, as sketched below:
- defining and determining the types of errors;
- identifying errors in the data;
- correcting the errors;
- documenting the error types and instances; and
- modifying data entry procedures to avoid future errors.
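The following sketch (ours; the records and error rules are invented) walks through the identification, correction and documentation sub-processes on a toy dataset:

    # Sketch of the data-cleaning sub-processes listed above.
    records = [{"age": 34}, {"age": -5}, {"age": None}, {"age": 200}]

    # 1. Define the error types to look for (rules are illustrative).
    def error_type(rec):
        if rec["age"] is None:
            return "missing"
        if not (0 <= rec["age"] <= 130):
            return "out_of_range"
        return None

    error_log, cleaned = [], []
    for i, rec in enumerate(records):
        err = error_type(rec)           # 2. Identify errors in the data.
        if err:
            error_log.append((i, err))  # 4. Document error types and instances.
            continue                    # 3. "Correct" here by dropping bad rows.
        cleaned.append(rec)

    print(cleaned)    # [{'age': 34}]
    print(error_log)  # [(1, 'out_of_range'), (2, 'missing'), (3, 'out_of_range')]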
• Redundancy Elimination: Many datasets contain repeated or duplicated data, called redundant data. Redundancy unnecessarily increases the data transmission overhead and degrades storage systems through wasted storage space, inconsistent data, reduced reliability and corrupted data. To overcome this problem, eminent researchers have proposed techniques such as data compression [29], redundancy detection [30] and data de-duplication [31]. Data de-duplication removes duplicate copies of repeating data: a unique segment of data is allocated and stored once during the storage process, and its identification is added to an identification list. This greatly helps in reducing the storage space consumed by big data.
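A toy illustration of de-duplication with an identification list, assuming SHA-256 fingerprints stand in for the segment identifiers (production systems add chunking and collision handling):

    import hashlib

    # Sketch of data de-duplication as described above: each incoming segment
    # is hashed; only segments whose fingerprint is not yet in the
    # identification list are actually stored.

    identification_list = set()  # fingerprints of segments already stored
    store = []                   # unique segments actually kept

    def write_segment(segment: bytes):
        fingerprint = hashlib.sha256(segment).hexdigest()
        if fingerprint not in identification_list:
            identification_list.add(fingerprint)
            store.append(segment)

    for seg in [b"block-A", b"block-B", b"block-A", b"block-A"]:
        write_segment(seg)

    print(len(store))  # 2 -- duplicate copies of "block-A" were eliminated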
(iv) Data Storage: This step is concerned with coordinating and organizing the collected information in storage systems for analysis and value extraction. To carry this out, the data storage scheme must possess two features:
• The information must be persisted by the storage infrastructure in a reliable manner.
• The storage system must provide scalable access for analyzing and processing the large amount of data.
The functional components of a data storage system can be divided into two parts: hardware infrastructure and data management.
- Hardware Infrastructure: This provides the physical storage of the collected information on storage devices, based on the chosen technology.
o Random Access Memory (RAM): RAM stores volatile information, meaning that the data is lost when power is turned off. Modern RAM includes dynamic RAM (DRAM), static RAM (SRAM) and phase-change RAM (PRAM).
- Data Management Framework: This concerns how the data is organized in a convenient manner for efficient processing. Its layered structure can be classified into three layers:
o File Systems: File systems provide the base for big data storage and are widely used in both industry and academia, with examples that are either open source or designed for enterprise purposes.
o Database Technology: Database technologies have been used for decades to store datasets in storage systems for diverse applications. Nowadays, NoSQL databases are becoming popular for big data, with characteristics such as schema-free design, relaxed consistency and the ability to handle large amounts of data. These are examples of modern database systems; traditional database systems cannot overcome all the variety and scale challenges of big data.
▪ Key-value stores: This is the simplest model: data is stored as key-value pairs, with a unique key generated for each client request. The best-known example is Amazon's Dynamo [32], in which data is partitioned across a cluster of servers and then replicated to form multiple copies.
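As a rough sketch of the partition-and-replicate idea behind key-value stores such as Dynamo (this toy version hashes keys onto one of four in-memory "servers"; real systems use consistent hashing, vector clocks and quorum reads):

    import hashlib

    # Toy sketch of a partitioned, replicated key-value store.
    SERVERS = [{} for _ in range(4)]  # four "servers", each a local dict

    def put(key, value, replicas=2):
        home = int(hashlib.md5(key.encode()).hexdigest(), 16) % len(SERVERS)
        for i in range(replicas):  # write to the home node and its successor
            SERVERS[(home + i) % len(SERVERS)][key] = value

    def get(key):
        home = int(hashlib.md5(key.encode()).hexdigest(), 16) % len(SERVERS)
        return SERVERS[home].get(key)

    put("user:42", {"name": "alice"})
    print(get("user:42"))  # {'name': 'alice'}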
▪ MapReduce: This model was developed by Google for processing and storing large datasets on commodity hardware, with the processed datasets stored across clusters. It has two component functions, map() and reduce(). The map function gathers the intermediate pairs that share the same intermediate key (i.e., <key, value> pairs) and passes them to the reduce() function; the reduce() function receives each intermediate key and merges its values to produce a smaller set of values [33]. In other words, a MapReduce framework works on a master-slave architecture in which numerous slave nodes are handled by one master node.


To provide equal load distribution, the input datasets are divided into evenly sized data blocks, and each data block is assigned to one slave node. To generate the result, the data blocks are processed by map tasks and merged together using the reduce() function. This framework is highly scalable, accommodates heterogeneous datasets, provides distributed and parallel I/O scheduling to overcome the volume problem, is fault tolerant, and is efficient, requiring less time to retrieve data. It was designed to overcome the issues of large-scale data processing.
(v) Data Analysis: This is the last and most important step of the big data value chain. It uses analytical tools and methods to inspect, transform and model data for the purpose of extracting value, and its classification emphasizes the importance of data characteristics in each application area. The purpose of using an analytical tool is to extract as much relevant data as possible from the subject under consideration.

V. CONCLUSION

In this paper, the concept of big data has been presented. Big data is the collection of large and complex datasets generated from various sources such as email attachments, social media comments, video playback, etc. We discussed the three V's of big data, i.e., volume, velocity and variety, which play a major role in big data analytics. We also illustrated the concept of the big data value chain with its four main stages: data generation, data acquisition, data storage and data analysis. During the generation and storage of data, variations are possible, i.e., the data may be video, audio, images, etc. For easier understanding of big data analytics, researchers have divided it into various big data applications such as web analytics and text analytics. Finally, to improve the efficiency of government sectors and industries, there is an urgent need for advanced data management, analysis and acquisition mechanisms for typical big data applications to generate profit.

REFERENCES

[1] Hari Kumar, P. UmaMaheswari, "Literature Survey on Big Data and Preserving Privacy for the Big Data in Cloud," IJTRA, vol. 2, issue 6, pp. 128-133, Nov.-Dec. 2014, ISSN 2320-8163.
[2] Apache Hive. Available at: http://hive.apache.org
[3] https://support.rackspace.com/white-paper/turning-big-data-into-big-dollars/
[4] R. Devakunchari, "Analysis on big data over the years," IJSRP, vol. 4, issue 1, Jan. 2014, ISSN 2250-3153.
[5] "World's data will grow by 50X in next decade, IDC study predicts," http://www.computerworld.com/s/article/9217988/World_s_data_will_grow_by_50X_in_next_decade_IDC_study_predicts
[6] Min Chen, Shiwen Mao, Yunhao Liu, "Big Data: A Survey," Mobile Networks and Applications, vol. 19, pp. 171-209, Springer, 2014, DOI 10.1007/s11036-013-0489-0.
[7] http://www.how-much-data-is-on-the-internet-and-generated-online-every-minute/
[8] Shilpa, Manjit Kaur, "Big Data and Methodology - A Review," IJARCSSE, vol. 3, issue 10, Oct. 2013, ISSN 2277-128X.
[9] "Drowning in numbers - digital data will flood the planet - and help us understand it better," 2011, http://www.economist.com/blogs/dailychart/2011/11/bigdata-0
[10] S. Lohr, "The age of big data," New York Times, p. 11, 2012.
[11] N. Yuki, "The search for analysts to make sense of big data," 2011, http://www.npr.org/2011/11/30/142893065/the-searchforanalyststo-make-sense-of-big-data
[12] "Special online collection: dealing with big data," 2011, http://www.sciencemag.org/site/special/data/
[13] http://bigdatamsritise.blogspot.in/2015/12/introduction-abstract-term-of-big-data.html
[14] Hiba Jasim Hadi, Ammar Hameed Shnain, Sarah Hadi Shaheed, Azizahbt Haji Ahmad, "Big Data and Five V's Characteristics," Proceedings of IRF International Conference, Tirupati, India, 1 Nov. 2014, ISBN 978-93-84209-61-2.
[15] Marko Grobelnik, "Big-Data Tutorial," Jozef Stefan Institute, Ljubljana, Slovenia, 8 May 2012.
[16] Stephen Kaisler, Frank Armour, J. Alberto Espinosa, William Money, "Big Data: Issues and Challenges Moving Forward," 46th Hawaii International Conference on System Sciences, IEEE, 2013.
[17] Avita Katal, Mohammad Wazid, R. H. Goudar, "Big Data: Issues, Challenges, Tools and Good Practices," IEEE, 2013.
[18] Han Hu, Yonggang Wen, Tat-Seng Chua, Xuelong Li, "Toward Scalable Systems for Big Data Analytics: A Technology Tutorial," IEEE Access, vol. 2, pp. 652-687, 8 July 2014, DOI 10.1109/ACCESS.2014.2332453.
[19] http://www.zdnet.com/article/the-internet-of-things-and-big-data-unlocking-the-power/
[20] Marcia, "Data on Big Data," 2012, http://marciaconner.com/blog/data-on-big-data/
[21] Facebook Statistics, 2014, http://www.statisticbrain.com/facebook-statistics/
[22] K. Douglas, "Infographic: big data brings marketing big numbers," 2012, http://www.marketingtechblog.com/ibm-big-data-marketing/
[23] B. Buxton, V. Hayward, I. Pearson et al., "Big data: the next Google. Interview by Duncan Graham-Rowe," Nature, vol. 455, no. 7209, pp. 8-9, 2008.
[24] P. Russom, "Big data analytics," TDWI Best Practices Report, Fourth Quarter, 2011.
[25] S. Sagiroglu, D. Sinanc, "Big data: a review," in Proceedings of the International Conference on Collaboration Technologies and Systems (CTS '13), pp. 42-47, IEEE, San Diego, Calif., USA, May 2013.
[26] H. V. Jagadish, D. Agrawal, P. Bernstein, E. Bertino et al., "Challenges and Opportunities with Big Data," The Community Research Association, 2015.
[27] Cheikh Kacfah Emani, Nadine Cullot, Christophe Nicolle, "Understandable Big Data: A survey," Computer Science Review, vol. 17, pp. 70-81, 2015.
[28] Y. Zhang, J. Callan, T. Minka, "Novelty and redundancy detection in adaptive filtering," in Proc. 25th Annu. Int. ACM SIGIR Conf. Res. Develop. Inform. Retr., pp. 81-88, 2002.
[29] D. Salomon, "Data Compression," Springer-Verlag, New York, NY, USA, 2004.
[30] S. Sarawagi, A. Bhamidipaty, "Interactive deduplication using active learning," in Proc. 8th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, pp. 269-278, 2002.
[31] G. DeCandia et al., "Dynamo: Amazon's highly available key-value store," 2007.
[32] C. Ranger, R. Raghuraman, A. Penmetsa, G. Bradski, C. Kozyrakis, "Evaluating MapReduce for multi-core and multiprocessor systems," in Proc. 2007 IEEE 13th Int. Symp. on High Performance Computer Architecture (HPCA '07), IEEE Computer Society, Washington, DC, USA, pp. 13-24, 2007.
[33] J. H. Howard et al., "Scale and performance in a distributed file system," ACM Trans. Comput. Syst., vol. 6, no. 1, pp. 51-81, 1988.
