Big Data Methods
Devottam Gaurav, Research Scholar, Bennett University, Gr. Noida, Uttar Pradesh ([email protected])
Jay Kant Pratap Singh Yadav, Assistant Professor, NIET, Gr. Noida, Uttar Pradesh ([email protected])
Rohit Kumar Kaliyar, Research Scholar, Bennett University, Gr. Noida, Uttar Pradesh ([email protected])
Dr. Ayush Goyal, Assistant Professor, Texas A&M University-Kingsville, USA ([email protected])
Abstract—Rapid advancements in technological applications have led to a flood of data from various sources such as the web, social networks, business records, and medical records over the preceding years. Compared to traditional data, big data reveals unique characteristics through its three V's, and much of it is unstructured. The emerging trend therefore requires advanced data analysis, acquisition, and management techniques to mine and collect the appropriate data in a structured way. In this paper, we describe the definitions and challenges of big data systems. Next, a systematic framework decomposes the architecture of big data systems into four stages: data generation, data acquisition, data storage, and data analytics. These stages form the basis of the big data value chain. Finally, some solutions to the challenges of big data are discussed, along with the future attention that big data systems require.

Keywords—big data, big data analytics, data acquisition

I. INTRODUCTION

With the growth in technologies and services over the past twenty years, datasets have been growing at a rapid rate. At the start of the computer era, the size of data was measured in KB (kilobytes). Later the scale extended to MB (megabytes), GB (gigabytes), TB (terabytes), PB (petabytes), EB (exabytes), and ZB (zettabytes) [1]. The size of databases in today's enterprises has been rising at exponential rates, and the need to process and examine these large volumes of data for business decision making has increased along with it.

In addition, petabytes of data produced by numerous business and scientific applications are processed in structured ways. The interactions of millions of people using mobile devices and the Internet every day also add to the flood of data. The rising volume of enterprise information, medical records, mobile sensing data, multimedia, social media, and so on will fuel exponential data growth in the future. This huge amount of data is called "Big Data" [2]. The scale of such data is illustrated in Table I [3].

TABLE I. TERMS USED FOR BIG DATA STORAGE CAPACITY [3]

Term          | Size                                              | Example of Capacity
Gigabyte (GB) | 1,000,000,000 (one billion bytes)                 | 1 GB = 2 hours of CD-quality audio or 7 minutes of HDTV
Terabyte (TB) | 1,000,000,000,000 (one trillion bytes)            | 1 TB = 2,000 hours of CD-quality audio or 5 days of HDTV
Petabyte (PB) | 1,000,000,000,000,000 (one quadrillion bytes)     | 1 PB = 2 million hours of CD-quality audio
Exabyte (EB)  | 1,000,000,000,000,000,000 (one quintillion bytes) | 1 EB = 2 billion hours of CD-quality audio
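To make these scales concrete, the following minimal Python sketch (our illustration, not part of the original paper) converts a raw byte count into the decimal units of Table I:

```python
# Illustrative only: map a raw byte count onto the decimal units of Table I.
UNITS = ["bytes", "KB", "MB", "GB", "TB", "PB", "EB", "ZB"]

def human_readable(num_bytes: float) -> str:
    """Divide by 1000 until the count fits the next unit name."""
    for unit in UNITS:
        if num_bytes < 1000:
            return f"{num_bytes:.1f} {unit}"
        num_bytes /= 1000
    return f"{num_bytes:.1f} YB"

print(human_readable(2.5e18))  # 2.5 quintillion bytes -> "2.5 EB"
```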
Big Data is a term that arose from the need of big companies such as Yahoo, Google, and Facebook to analyze the large amounts of unstructured data generated every second. Some illustrative facts follow:

• An International Data Corporation (IDC) study forecasts that the world's data will grow 50-fold by 2020 [4], driven in large part by embedded systems such as sensors in clothing and medical devices. The study also finds that unstructured information such as files, email, and video will account for 90% of all data produced over the next decade [5].

• There are 277,000 tweets per minute, 2 million queries searched on Google per minute, and 72 hours of new video uploaded to YouTube every minute [6]. Beyond these, more than 100 million emails are sent, 350 GB of data is processed on Facebook, and more than 570 websites are created every minute [7]. In 2012, 2.5 quintillion bytes of data were produced every day [8].

In addition, many issues of big data are covered in the public media, such as The Economist [9], the New York Times [10], and National Public Radio [11]. Among scientific journals, two, namely Nature and Science, opened special columns to deliberate the challenges and impacts of big data [12]. The era of big data has arrived beyond all doubt. The rapid rise in the volume of global data is shown in Fig. 1 [13]. Because this volume is increasing far faster than computing resources, such datasets are becoming very difficult to handle.

Fig. 1: The continuous rise of global data [13]

Compared with traditional datasets, big data contains amorphous data that require real-time analysis. Big data also provides new opportunities for discovering new values, offers a thorough understanding of hidden values, and raises new challenges, i.e., how to effectively manage and organize such datasets. These challenges include analysis, data privacy, security, data management, search, sharing, capture, transfer, and visualization.
These challenging problems demand prompt solutions:

• The latest developments in information technology (IT) make it ever easier to generate data; for example, 72 hours of new video are uploaded to YouTube every minute. The main challenge we face is therefore to collect and integrate vast data from distributed data sources.

• With the growth of large datasets, a major challenge for researchers and practitioners is how to manage and store petabytes of data with the latest technologies. Such enormous data will far surpass the capacities of existing IT architectures and enterprise infrastructures, and its real-time requirements will greatly strain the available computing capacity. For instance, Wal-Mart estimates that it collects more than 2.5 petabytes of data every hour from its customer transactions. We therefore need to design cloud computing platforms that can keep pace with such intensive workloads.

• With such heterogeneous data, it becomes very difficult for IT companies to identify the right data and determine how best to use it. To utilize it well, datasets are mined at different levels of analysis, forecasting, visualization, and modeling to reveal their intrinsic properties and improve decision making.
• The technology landscape in the data world is growing at a fast rate. Leveraging data with an innovative technology partner helps create the IT architecture in the right way and adapt efficiently to changes in that landscape.

• Finally, security concerns about data protection are a major obstacle preventing companies from taking full advantage of their data.

To tackle these challenges, researchers have proposed various solutions for big data systems, mostly in an ad-hoc manner. Cloud computing serves as one infrastructure for big data systems, meeting requirements such as reliability, composability, ubiquitous access, and scalability. For efficient storage and management of vast datasets, NoSQL databases and distributed file systems are used. MapReduce is a software framework used to process group-aggregation tasks such as website ranking. Hadoop has also gained importance in processing distributed data, along with the storage and management of systems, in order to build powerful system-level solutions to big data challenges. Many big data applications can be explored with the help of these innovative technologies, and their explosion calls for a systematic framework that captures modern big data research and development efforts and applies those advances across different subject areas.

The remainder of the paper is organized as follows. Section II provides definitions and characteristics of big data. Section III describes big data analytics. Section IV presents the architectural view of a big data system. Section V concludes the paper.

II. DEFINITIONS AND CHARACTERISTICS OF BIG DATA

The term "Big Data" was first coined by Roger Magoulas of O'Reilly Media in 2005 to describe a great amount of data that traditional data management techniques are unable to process and manage due to its complexity and size. There are three aspects to big data: (a) numerous data are available; (b) the data cannot be categorized in ordinary relational databases; and (c) the data are generated, captured, and processed rapidly. In addition, many fields are being transformed with the help of big data, including science, healthcare, finance, engineering, business, and society at large. Madden characterizes big data as data that is "too big, too fast, or too hard" for existing tools to handle [14]. "Too big" means organizations must deal with zettabyte-scale collections of data originating from web sources, sensors, audio, pictures, videos, and so on. "Too fast" means data must be processed quickly regardless of its size, for example when carrying out fraud detection or choosing an ad to display. "Too hard" means such data cannot be processed by existing tools.

In 2001, industry analyst Doug Laney articulated the three Vs of Big Data that became its mainstream definition. The features of Big Data are shown in Fig. 2 [15].

• Volume: Day by day, data grows everywhere, from MB and GB to TB, PB, EB, and ZB. According to IBM, 2.5 quintillion bytes of data are generated every day, which implies that 90% of today's data was created in the previous two years alone. This data comes from everywhere: climate information gathered by sensors, posts to social media sites, images, videos, purchase transaction records, and so on. To obtain the desired results, the data needs to be manipulated and analyzed so that unstructured data becomes structured. Manipulating and analyzing such a large volume of data poses a great challenge, however, because considerable resources are required before the desired results materialize.
[…] is so large that it is very difficult to process using traditional database and software methods.

IV. BIG DATA SYSTEM ARCHITECTURE

Broadly, big data involves the gathering, processing, and management of data to produce new information for the end user. With the rise of such large datasets, the key challenges are often related to their storage, transportation, and processing. To obtain appropriate results, data needs to be mined or cleaned, tagged, classified, or formatted to separate one kind of data from another. In other words, a big data system is a complex system with multiple distinct phases that deal with different applications in the digital data life cycle. In industry, a systems-engineering approach is widely adopted that decomposes a typical big data system into four consecutive phases: data generation, data acquisition, data storage, and data analytics. The data value chain provides a framework for examining how different data can be brought together in an organized fashion and turned, through decisions, into valuable information, as shown in Fig. 3 [18].

• […] data has been expected to double every two years, reaching 40 zettabytes (ZB) by 2020 or upward of 44 ZB (i.e., equivalent to 44 trillion gigabytes) [19]. The number of sensors has been growing by more than 30 percent per year.

• Business Data: In 2012, the amount of new data generated each day was 2.2 million TB, and 90% of the world's current data originated during the last two years [20]. The market value of big data rose from $3.2 billion in 2010 to $16.9 billion in 2015 [21]. Facebook alone accesses, analyzes, and stores 30+ PB of user data [22], which is still a small share of the world's digital data, i.e., 2.7 ZB in 2012 [23]. Google's processing reached 20,000 TB of data daily by 2008 [24], and Wal-Mart processed over 1 million customer transactions per hour, generating more than 2.5 PB of data. More than 5 billion people were messaging, calling, and browsing on mobile and social devices [25]. By one calculation, an e-mail is sent every 3.5 × 10^-7 seconds, i.e., roughly 2.9 million e-mails per second. According to IDC, business data is expected to reach a total of 40 ZB by the end of 2020 [26].

(iii) Data Acquisition: In this step, data are acquired from different sources in digital form for storage and analysis. It is divided into three sub-steps: data collection, data transmission, and data preprocessing [27]. These steps are shown in Fig. 4 [28].

Fig. 4: Data acquisition [28]
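As a rough sketch of how these three sub-steps compose, the following Python illustration chains collect/transmit/preprocess helpers over toy records (the function names and record layout are our assumptions, not the paper's):

```python
# Hypothetical pipeline for the three acquisition sub-steps named above.
from typing import Iterable, List

Record = dict

def collect(source: Iterable[Record]) -> List[Record]:
    """Data collection: pull raw records from a source (here, any iterable)."""
    return list(source)

def transmit(records: List[Record]) -> List[Record]:
    """Data transmission: stand-in for shipping records to a data center."""
    return records  # in practice: serialize and send over the network

def preprocess(records: List[Record]) -> List[Record]:
    """Data preprocessing: drop records missing required fields."""
    return [r for r in records if "id" in r and r.get("value") is not None]

raw = [{"id": 1, "value": 10}, {"value": None}, {"id": 2, "value": 20}]
print(preprocess(transmit(collect(raw))))
# -> [{'id': 1, 'value': 10}, {'id': 2, 'value': 20}]
```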
• Sensors: Sensors are often used to capture physical quantities, which are converted into digital signals for storage and processing. They may be further categorized by what they measure: sound wave, voice, vibration, automobile, chemical, pressure, weather, current, and temperature.

(b) Data Centre Transmission: After raw data are collected, they are transferred to a data storage infrastructure, commonly a data center, for subsequent processing and analysis. A high-capacity trunk medium channels big data from its source to the data center, with the ability to re-route traffic in case of link failure.

(c) Data Pre-Processing: Since data are collected from various sources, they differ in quality with respect to noise, consistency, redundancy, and so on. Data pre-processing techniques are designed to improve data quality, which improves the accuracy of analysis and reduces storage expenses. The following three techniques are used to process the data:
• Integration: This technique combines data from various sources to provide a unified view of the data. Two commonly used traditional methods are data warehousing and data federation. Data warehousing is also called ETL, which consists of three steps: extract, transform, and load.

- Extraction: Data are selected, collected, processed, and analyzed after connecting to the source systems.

- Transformation: The extracted data are converted into a standard format.

- Loading: The extracted and transformed data are imported into a target storage infrastructure.

The second method is data federation, in which data are queried and aggregated from various sources through a virtual database, making integration more dynamic. The virtual database does not contain the data itself; it holds only information about the original data and its location.
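A minimal ETL sketch of the extract, transform, and load steps just described might look as follows; the CSV source and the in-memory "warehouse" are hypothetical stand-ins for real source systems and a target storage infrastructure:

```python
# Illustrative ETL: extract rows from a source, standardize them, load them.
import csv
import io

SOURCE = "name,age\nalice,30\nbob,forty\n"  # raw export from a source system

def extract(text: str) -> list:
    """Extraction: read records after connecting to the source system."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows: list) -> list:
    """Transformation: coerce each record into a standard format."""
    out = []
    for row in rows:
        try:
            out.append({"name": row["name"].title(), "age": int(row["age"])})
        except ValueError:
            continue  # "forty" cannot be parsed as an integer age
    return out

def load(rows: list, target: list) -> None:
    """Loading: import transformed records into the target store."""
    target.extend(rows)

warehouse: list = []
load(transform(extract(SOURCE)), warehouse)
print(warehouse)  # [{'name': 'Alice', 'age': 30}]
```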
• Data Cleaning: This technique removes irrelevant or inaccurate data to improve data quality with respect to accuracy, completeness, and consistency. It is also called data scrubbing or data cleansing. It is carried out through five sub-processes:

- Types of errors are defined and determined.
- Errors are identified in the data.
- Errors are corrected.
- Error types and instances are documented.
- Data-entry procedures are modified to avoid future errors.
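The five sub-processes can be read as a small loop over the data; the sketch below applies them to one field, with an out-of-range rule (ages must lie in 0-120) as a hypothetical error definition:

```python
# Illustrative cleaning pass following the five sub-processes above.
records = [{"age": 34}, {"age": -5}, {"age": 200}, {"age": 61}]
error_log = []  # (4) documentation of error types and instances

def is_error(record: dict) -> bool:
    """(1)-(2) Define the error type and identify it in the data."""
    return not (0 <= record["age"] <= 120)

for record in records:
    if is_error(record):
        error_log.append(("age out of range", dict(record)))  # (4) document
        record["age"] = None  # (3) correct: null out the invalid value

# (5) Modifying data-entry procedures happens upstream, outside this code.
print(records)
print(error_log)
```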
• Redundancy Elimination: Many datasets contain repeated or duplicated data, called redundant data. Redundancy unnecessarily increases the data transmission overhead and degrades storage systems through wasted storage space, inconsistent data, reduced reliability, and corrupted data. To overcome this problem, researchers have proposed techniques such as data compression [29], redundancy detection [30], and data de-duplication [31]. Data de-duplication removes duplicate copies of repeating data: a unique segment of data is allocated and stored once during storage, and for each subsequent duplicate only an identification is added to the identification list. In this way, it greatly helps to reduce the storage space needed for big data.
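In the de-duplication scheme just described, a fingerprint can stand in for each data segment; the sketch below uses SHA-256 hashes as fingerprints (an assumption on our part, not the paper's design):

```python
# Illustrative hash-based de-duplication: unique segments are stored once;
# repeats only append an entry to the identification list.
import hashlib

store = {}            # fingerprint -> unique data segment
identifications = []  # order in which segments arrived

for segment in [b"block-A", b"block-B", b"block-A"]:
    fingerprint = hashlib.sha256(segment).hexdigest()
    if fingerprint not in store:
        store[fingerprint] = segment      # store the unique segment once
    identifications.append(fingerprint)   # duplicates extend only this list

print(len(store), "unique segments stored for", len(identifications), "writes")
```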
(iv) Data Storage: This step is concerned with coordinating and organizing the collected information in storage systems for analysis and value extraction. To carry out this process, the data storage scheme must possess two features:

• The storage infrastructure must hold the information in a persistent and reliable manner.
• The storage system must provide scalable access for analyzing and processing the large amount of data.

The functional components of a data storage system can be divided into two parts: hardware infrastructure and data management.

- Hardware infrastructure physically stores the collected information in storage devices based on the chosen technology.

o Random Access Memory (RAM): RAM holds volatile information, meaning the data are lost when power is turned off. Modern RAM includes dynamic RAM (DRAM), static RAM (SRAM), and phase-change RAM (PRAM).

- Data Management Framework: This concerns how data can be organized in a convenient manner for efficient processing. Its layered structure is classified into three layers:

o File Systems: File systems provide the base for big data storage and are widely used in both industry and academia; some are open source, while others are designed for enterprise purposes.

o Database Technology: Database technologies have been used for decades to store datasets in storage systems across diverse applications. Nowadays, NoSQL databases are becoming popular for big data because of characteristics such as schema-free design, flexible consistency, and the ability to handle large amounts of data; traditional database systems cannot cope with all of big data's variety and scale challenges. Examples of modern database systems include the following.

Key-value stores: This is the simplest model, in which data are stored as key-value pairs and a unique key identifies each client's request. The best-known example is Amazon's Dynamo [32]. In Dynamo, data are collected, partitioned across a cluster of servers, and finally replicated into multiple copies.
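The key-value model reduces to two operations, put and get; the toy class below illustrates that interface with an in-memory dictionary (Dynamo itself partitions and replicates the data across a server cluster, which this sketch does not attempt):

```python
# Minimal key-value interface in the spirit of the model described above.
from typing import Optional

class KVStore:
    def __init__(self) -> None:
        self._data: dict = {}

    def put(self, key: str, value: bytes) -> None:
        """Store the value under its unique key."""
        self._data[key] = value

    def get(self, key: str) -> Optional[bytes]:
        """Look the value up by key, or return None if absent."""
        return self._data.get(key)

kv = KVStore()
kv.put("user:42", b'{"name": "alice"}')
print(kv.get("user:42"))  # b'{"name": "alice"}'
```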
MapReduce: This model was developed by Google for processing and storing large datasets on commodity hardware; the processed datasets are stored across clusters. It has two component functions, map() and reduce(). The map function groups intermediate pairs (i.e., <key, value> pairs) that share the same intermediate key and passes them to the reduce function. The reduce function receives each intermediate key and merges its values to produce a smaller set of values [33]. In other words, a MapReduce framework works on a master-slave architecture, in which numerous slave nodes are handled by one master node. To provide equal load distribution, the input datasets are divided into evenly sized data blocks, and each block is assigned to one slave node. To generate the result, the data blocks are processed with map and merged together with reduce. The framework is scalable, handles heterogeneous datasets, provides distributed and parallel I/O scheduling to address the volume aspect, is fault tolerant, and retrieves data efficiently. It was designed to overcome the problems of large-scale data processing.
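The classic word-count example makes the map/reduce division of labor concrete: map() emits intermediate <key, value> pairs, the pairs are grouped by key, and reduce() merges each group into a smaller set of values (a single-process sketch, not a distributed implementation):

```python
# Illustrative single-process word count in the MapReduce style.
from itertools import groupby
from operator import itemgetter

def map_fn(document: str):
    """Map: emit an intermediate <word, 1> pair for every word."""
    for word in document.split():
        yield (word, 1)

def reduce_fn(key: str, values: list):
    """Reduce: merge all values that share one intermediate key."""
    return (key, sum(values))

docs = ["big data big systems", "data systems"]
pairs = sorted((p for d in docs for p in map_fn(d)), key=itemgetter(0))
result = [reduce_fn(k, [v for _, v in grp])
          for k, grp in groupby(pairs, key=itemgetter(0))]
print(result)  # [('big', 2), ('data', 2), ('systems', 2)]
```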
(v) Data Analysis: This is the last and most important step of the big data value chain. It uses analytical tools and methods to inspect, transform, and model data in order to extract value. The main emphasis of this classification is to highlight the importance of data characteristics in each area. The purpose of an analytical tool is to extract as much relevant information as possible from the subject under consideration.
V. CONCLUSION

In this paper, the concept of big data has been presented. Big data is the collection of large and complex datasets generated from various sources such as email attachments, social media comments, video playback, and so on. We discussed the three V's of big data, i.e., volume, velocity, and variety, which play a major role in big data analytics. We also illustrated the concept of the big data value chain with its four main stages: data generation, data acquisition, data storage, and data analysis. During generation and storage the data may vary in form, i.e., video, audio, images, etc. For easier understanding of big data analytics, researchers have divided the field into various big data applications such as web analytics and text analytics. Finally, to improve the efficiency of government sectors and industries, there is an urgent need for advanced data management, analysis, and acquisition mechanisms for typical big data applications to generate profit.
REFERENCES

[1] Hari Kumar and P. UmaMaheswari, "Literature survey on big data and preserving privacy for the big data in cloud," IJTRA, vol. 2, no. 6, pp. 128-133, Nov.-Dec. 2014, ISSN 2320-8163.
[2] Apache Hive. Available at: http://hive.apache.org
[3] https://support.rackspace.com/white-paper/turning-big-data-into-big-dollars/
[4] R. Devakunchari, "Analysis on big data over the years," IJSRP, vol. 4, no. 1, Jan. 2014, ISSN 2250-3153.
[5] "World's data will grow by 50X in next decade, IDC study predicts," http://www.computerworld.com/s/article/9217988/World_s_data_will_grow_by_50X_in_next_decade_IDC_study_predicts
[6] M. Chen, S. Mao, and Y. Liu, "Big data: a survey," Mobile Networks and Applications, vol. 19, pp. 171-209, 2014, DOI 10.1007/s11036-013-0489-0, Springer Science+Business Media New York.
[7] http://www.how-much-data-is-on-the-internet-and-generated-online-every-minute/
[8] Shilpa and Manjit Kaur, "Big data and methodology - a review," IJARCSSE, vol. 3, no. 10, Oct. 2013, ISSN 2277-128X.
[9] "Drowning in numbers - digital data will flood the planet - and help us understand it better," 2011, http://www.economist.com/blogs/dailychart/2011/11/bigdata-0
[10] S. Lohr, "The age of big data," New York Times, p. 11, 2012.
[11] N. Yuki, "The search for analysts to make sense of big data," 2011, http://www.npr.org/2011/11/30/142893065/the-searchforanalyststo-make-sense-of-big-data
[12] "Special online collection: dealing with big data," 2011, http://www.sciencemag.org/site/special/data/
[13] http://bigdatamsritise.blogspot.in/2015/12/introduction-abstract-term-of-big-data.html
[14] H. J. Hadi, A. H. Shnain, S. H. Shaheed, and A. H. Ahmad, "Big data and five V's characteristics," in Proc. IRF International Conference, Tirupati, India, Nov. 2014, ISBN 978-93-84209-61-2.
[15] M. Grobelnik, "Big-data tutorial," Jozef Stefan Institute, Ljubljana, Slovenia, May 2012.
[16] S. Kaisler, F. Armour, J. A. Espinosa, and W. Money, "Big data: issues and challenges moving forward," in Proc. 46th Hawaii International Conference on System Sciences, IEEE, 2013.
[17] A. Katal, M. Wazid, and R. H. Goudar, "Big data: issues, challenges, tools and good practices," IEEE, 2013.
[18] H. Hu, Y. Wen, T.-S. Chua, and X. Li, "Toward scalable systems for big data analytics: a technology tutorial," IEEE Access, vol. 2, pp. 652-687, 2014, DOI 10.1109/ACCESS.2014.2332453.
[19] http://www.zdnet.com/article/the-internet-of-things-and-big-data-unlocking-the-power/
[20] Marcia, "Data on big data," 2012, http://marciaconner.com/blog/data-on-big-data/
[21] Facebook statistics, 2014, http://www.statisticbrain.com/facebook-statistics/
[22] K. Douglas, "Infographic: big data brings marketing big numbers," 2012, http://www.marketingtechblog.com/ibm-big-data-marketing/
[23] B. Buxton, V. Hayward, I. Pearson et al., "Big data: the next Google. Interview by Duncan Graham-Rowe," Nature, vol. 455, no. 7209, pp. 8-9, 2008.
[24] P. Russom, "Big data analytics," TDWI Best Practices Report, Fourth Quarter, 2011.
[25] S. Sagiroglu and D. Sinanc, "Big data: a review," in Proc. International Conference on Collaboration Technologies and Systems (CTS '13), IEEE, San Diego, CA, USA, May 2013, pp. 42-47.
[26] H. V. Jagadish, D. Agrawal, P. Bernstein, E. Bertino et al., "Challenges and opportunities with big data," The Community Research Association, 2015.
[27] C. K. Emani, N. Cullot, and C. Nicolle, "Understandable big data: a survey," Computer Science Review, vol. 17, pp. 70-81, 2015.
[28] Y. Zhang, J. Callan, and T. Minka, "Novelty and redundancy detection in adaptive filtering," in Proc. 25th Annu. Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2002, pp. 81-88.
[29] D. Salomon, Data Compression. New York, NY, USA: Springer-Verlag, 2004.
[30] S. Sarawagi and A. Bhamidipaty, "Interactive deduplication using active learning," in Proc. 8th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, 2002, pp. 269-278.
[31] G. DeCandia et al., "Dynamo: Amazon's highly available key-value store," in Proc. ACM Symposium on Operating Systems Principles (SOSP), 2007.
[32] C. Ranger, R. Raghuraman, A. Penmetsa, G. Bradski, and C. Kozyrakis, "Evaluating MapReduce for multi-core and multiprocessor systems," in Proc. IEEE 13th Int. Symposium on High Performance Computer Architecture (HPCA '07), Washington, DC, USA, 2007, pp. 13-24.
[33] J. H. Howard et al., "Scale and performance in a distributed file system," ACM Trans. Comput. Syst., vol. 6, no. 1, pp. 51-81, 1988.