
Big Data: Concept, Handling and Challenges: An Overview
Article in International Journal of Computer Applications · March 2015
DOI: 10.5120/20020-1537
Available at: https://www.researchgate.net/publication/274099681
All content following this page was uploaded by Soumya Shukla on 05 December 2018.


International Journal of Computer Applications (0975 – 8887)
Volume 114 – No. 11, March 2015

Big Data: Concept, Handling and Challenges: An Overview

Soumya Shukla, Vaishnavi Kukade, Sofiya Mujawar
B.E. (final year), Computer Engineering, MKSSS' Cummins College of Engineering, Pune

ABSTRACT
In this paper, we present an overview of the concept of big data and its characterization, as well as the various methods of handling big data. We also discuss the various challenges faced while handling big data.

Keywords
Big data, Data analytics, Business intelligence, Data mining,
Challenges, Techniques.

1. INTRODUCTION
In today's world, every tiny gadget is a potential data source, adding to the huge data bank. Moreover, every bit of data generated is practically valuable, be it enterprise or personal data, historical or transactional data. This data, generated through large customer transactions and social networking sites, is varied, voluminous and rapidly growing. All of it poses a storage and processing challenge for enterprises. The data generated by massive web logs, healthcare data sources, point-of-sale systems and satellite imagery needs to be stored and handled well, since this huge amount of data proves to be a very useful knowledge bank if handled carefully. Hence, big companies are investing heavily in researching and harnessing this data. Given all the predilections today for Big Data, one can easily call Big Data technology the next best thing to learn. All the attention it has received over the past decade is due to its overwhelming need in the industry.

Thus, this paper gives an overview of the key concepts in Big Data, some practiced Big Data handling techniques, and the challenges posed by the technology itself.

2. CONCEPT
2.1 Big Data Characteristics
The "three V's" (volume, velocity and variety) definition of big data, originally coined by Doug Laney in 2001 to refer to the challenge of data management, was adequate to define big data for a few years. It basically interpreted big data as a large amount of data in scattered form that needs to be processed quickly for proper interpretation. [1]

In August 2013 the definition was further enhanced to include "veracity, variability, visualization, and value", which gave a newer perspective to it. [2] With this new definition, big data is now described not only by its amount, but also by its interpretation and usability.

Fig 1: Big data Definition

Big Data mainly involves six aspects as per the above definition:

Volume – Volume defines the quantity of Big Data. The size of this data ranges from terabytes and petabytes to even exabytes.

Variety – Variety defines the data types of Big Data, which include structured and unstructured data such as text, audio, video, sensor data, posts, log files and many more.

Velocity – As data is generated rapidly, acquiring, processing and analyzing it requires fast mechanisms. Velocity emphasizes the real-time processing power of big data for enterprise needs.

Veracity – Refers to the requirement of correctness of data, as it is relied upon for all further analysis.

Variability – Data can be in the same form but have different semantics.

Visualization – Data should be easy to process and interpret in order to derive intelligence from it.

2.2 Unique Features of Big Data
Data is expanding at an astonishing rate; experts predicted a 4300% increase in annual data generation by 2020. Hence it is not only the size of big data that makes it unique, but also its unstructured form, which can cause serious handling issues. As data expands every minute, new techniques and analysis tools have been developed to handle it. These tools analyze large data sets in parallel, and storage in secure cloud data centers has made their analysis easy and available on the go. Hence, big data is unique not only in its size and form but also in its processing and knowledge discovery: big data in petabytes is analyzed quickly and gives more accurate answers to the respective queries than ever before.


Fig 2: Big data Interpretation Insights

2.3 Big Data as an Opportunity
Companies have worked on their data analytics and interpretation to open a new horizon of opportunities. Big data represents both significant information and the means by which it can be analyzed, providing an opportunity at every stage of knowledge discovery. Big data offers opportunities in many sectors, as mentioned below:
• Banking and securities
• Communication, media and services
• Education
• Government
• Healthcare providers
• Insurance
• Manufacturing and natural resources
• Transportation
• Trade: retailers and wholesalers
The government sector, the largest among them, offers a wide range of opportunities for big data analysts and researchers. [3]

2.4 Example of Big Data
Ranging from data generated in small enterprises to IT giants, from social networking sites to app data in the cloud, big data is generated in various forms every day. An example of big data might be petabytes (1,024 terabytes) or exabytes (1,024 petabytes) of data consisting of billions to trillions of records of millions of people, all from different sources such as the Web, sales, customer contact centers, social media, mobile data, etc. The data is typically loosely structured and often incomplete and inaccessible.

3. BIG DATA HANDLING
3.1 Need for Handling Big Data
The huge amount of data collected has the potential to reveal useful trends and patterns, so it needs to be preserved and processed. This data is stored in the cloud or in huge secondary storage such as the Hadoop Distributed File System, and processed using Hadoop or Spark. The major applications of analyzing this data include business intelligence, fraud detection, weather forecasting, personalized advertising, etc. Various machine learning and data mining tools are used for analysis of this kind.

3.2 Big Data Handling Techniques
Handling Big Data is another major concern. Below are some emerging technologies that help users cope with and handle Big Data in a cost-effective manner. Big data handling can be considered with respect to the following aspects:
• Processing big data: MapReduce; Hadoop, an integrated framework for processing and storing big data
• Analysis and querying of data: WibiData, PLATFORA, PIG
• Business intelligence: Hive
• Storage: cloud storage, column-oriented databases, schema-less databases
• Machine learning: Apache Mahout, SkyTree
Some of the big data handling techniques listed above are illustrated below. [5]

3.2.1 MapReduce
MapReduce is the key algorithm that the Hadoop MapReduce engine uses to distribute work around a cluster.

Mapper function: a map transform function transforms an input data row of key and value to an output key/value list:
map(key1, value1) -> list<key2, value2>
That is, for an input it returns a list containing zero or more (key, value) pairs. The output key can differ from the input key, and the output can have multiple entries with the same key.

Reducer function: a reduce transform takes all values for a specific key and generates a new list of the reduced output:
reduce(key2, list<value2>) -> list<value3>

3.2.2 Hadoop
Apache Hadoop is an open source framework for distributed storage and processing of large data sets on commodity hardware. Hadoop enables businesses to quickly gain insight from massive amounts of structured and unstructured data, and is used to maintain, scale and analyze data at large scale.

3.2.3 PIG
Apache PIG is a platform for analyzing large data sets. PIG's language, Pig Latin, lets one specify a sequence of transformation functions such as merge, filter and grouping. Apart from built-in functions, it also provides facilities for user-defined functions to do special-purpose processing. PIG's language allows query execution over data stored on a Hadoop cluster through such sequences of transformations, instead of a "SQL-like" language.
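As a concrete illustration of the map and reduce signatures from Section 3.2.1, the following is a minimal single-machine word-count sketch in plain Python. It mimics only the MapReduce pattern (map, group-by-key "shuffle", reduce); it does not use Hadoop itself, and all function and variable names are illustrative, not part of any Hadoop API.

```python
from collections import defaultdict

# map(key1, value1) -> list<key2, value2>
# Here key1 is a line number (unused by this mapper) and value1 is the
# line's text; each emitted pair is (word, 1).
def map_fn(line_no, line):
    return [(word, 1) for word in line.split()]

# reduce(key2, list<value2>) -> list<value3>
# All counts collected for one word are summed into a single total.
def reduce_fn(word, counts):
    return [sum(counts)]

def map_reduce(lines):
    # Map phase: apply map_fn to every input record,
    # grouping intermediate pairs by key (the "shuffle" step).
    intermediate = defaultdict(list)
    for line_no, line in enumerate(lines):
        for word, count in map_fn(line_no, line):
            intermediate[word].append(count)
    # Reduce phase: apply reduce_fn to each key's group of values.
    return {word: reduce_fn(word, counts)[0]
            for word, counts in intermediate.items()}

print(map_reduce(["big data big cluster", "data cluster data"]))
# -> {'big': 2, 'data': 3, 'cluster': 2}
```

On a real cluster, the grouping step and the distribution of map and reduce tasks across machines are exactly what the Hadoop engine provides; the per-record logic stays as small as the two functions above.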

3.2.4 Hive
Hive enables traditional BI applications to run queries against a Hadoop cluster. It was developed originally by Facebook, but has been open source for some time now. It is a higher-level abstraction of the Hadoop framework that allows anyone to make queries against data stored in a Hadoop cluster just as if they were manipulating a conventional data store, making Hadoop more useful for BI users.

3.2.5 Column-Oriented Databases
Conventional, row-oriented databases are the best fit for online transaction processing with high update speeds, but they fall short on query performance as data volumes grow and as data becomes more unstructured. Column-oriented databases store data with a focus on columns instead of rows, allowing for huge data compression and very fast querying.

3.2.6 Schema-Less Databases, or NoSQL Databases
Several database types fit into this category, such as key-value stores and document stores, which focus on the storage and retrieval of large volumes of unstructured, semi-structured, or even structured data. They achieve performance gains by doing away with some (or all) of the restrictions traditionally associated with conventional databases, such as read-write consistency, in exchange for scalability and distributed processing.

3.2.7 Using the Cloud for Big Data Handling
Most of the above technologies assume the cloud, i.e., many of the products and platforms mentioned are either entirely cloud-based or have cloud versions themselves. Most cloud vendors already offer hosted Hadoop clusters that can be scaled on demand according to their users' needs.

Big Data and cloud computing go hand in hand. Cloud computing enables companies of all sizes to get more value from their data by enabling faster analytics at minimal cost. This, in turn, improves the company's productivity and returns the costs invested.

4. BIG DATA CHALLENGES
Big data, which is typically of petabyte or terabyte size, is bound to be confronted with many theoretical, technical, technological and practical challenges. Serious research efforts are being invested in order to improve the efficiency of storage, processing and analysis of big data. The following are the various challenges faced while handling big data.

4.1 Data Acquisition and Recording
It is important to capture the context in which data has been generated, to be able to filter out noise while pre-processing the data, and to compress the data. Pre-processing of data is complex and time consuming; the real challenge is handling big volumes of unstructured and structured data continuously arriving from a large number of sources. Hence a solution would require innovation of new technologies and architectures designed to efficiently extract value from very large volumes of a wide variety of data by enabling high-velocity capture, discovery and/or analysis.

4.2 Information Extraction and Cleaning
Often data needs to be transformed in order to extract information from it and to express this information in a form that is suitable for analysis. Data may also be of poor quality and/or uncertain. Extracting meaningful information from such huge amounts of poor-quality data is one of the major challenges faced in big data. The accuracy of the results depends monumentally on data cleaning and data quality verification; thus cleaning the data and verifying its quality are critical. [6]

4.3 Data Integration, Aggregation and Representation
Data might not be homogeneous and may have different metadata, so data integration requires huge human effort. Manual approaches fail to scale to what is required for big data, hence the need for newer and better approaches. Also, different data aggregation and representation strategies may be needed for different data analysis tasks. [6]

4.4 Query Processing and Analysis
Methods suitable for big data need to be discovered and evaluated for efficiency, so that they are able to deal with noisy, dynamic, heterogeneous and untrustworthy data. Despite these difficulties, big data, even if noisy and uncertain, can be more valuable for identifying reliable hidden patterns and knowledge than tiny samples of good data. [6]

5. CONCLUSION AND FUTURE WORK
Due to the gargantuan increase in the amount of data in various fields, it has become a major challenge to handle data efficiently. To come up with plausible solutions to these challenges, one needs to understand the concept of big data and its handling methodologies, and furthermore to improve the approaches for analyzing big data. With the advent of social media, the need for handling big data has increased monumentally. If Facebook, WhatsApp and Twitter produce data which keeps increasing exponentially every year (or every few years), then handling such huge data is something to be dealt with efficiently, without compromising the quality of the results. Hence we attempt to showcase the basic concepts of big data, which can be used as an easy referral for a literature survey of the topic.

6. REFERENCES
[1] Douglas, Laney. "The Importance of 'Big Data': A Definition". Gartner. Retrieved 21 June 2012.
[2] Blog post: Mark van Rijmenam, "Why the 3V's Are Not Sufficient to Describe Big Data".
[3] Jean Yan, April 9, 2013, "Big Data, Bigger Opportunities". Bowman, M., Debray, S. K., and Peterson, L. L. 1993. Reasoning about naming systems.
[4] "Research in Big Data and Analytics: An Overview", International Journal of Computer Applications (0975 – 8887), Volume 108 – No. 14, December 2014.
[5] Blog post: Thoran Rodrigues in Big Data Analytics, "10 Emerging Technologies for Big Data", December 4, 2012.
[6] Elisa Bertino, Cyber Center, CERIAS and CS Department, Purdue University, West Lafayette, Indiana (USA), "Big Data - Opportunities and Challenges: Panel Position Paper", 2013 IEEE 37th Annual Computer Software and Applications Conference.
[7] Wie, Jiang, Ravi V. T., and Agrawal G., "A Map-Reduce System with an Alternate API for Multi-core Environments", Melbourne, VIC: 2010, pp. 84-93, 17-20 May 2010.
[8] Stephen Kaisler, Frank Armour, Alberto Espinosa, William Money, "Big Data: Issues and Challenges Moving Forward", 2013 46th Hawaii International Conference on System Sciences.
[9] Min Chen, Shiwen Mao, Yunhao Liu, "Big Data: A Survey", Mobile Netw Appl (2014) 19:171–209, DOI 10.1007/s11036-013-0489-0.
[10] ChengXiang ("Cheng") Zhai, "Basic Concepts in Big Data".
[11] Uzma Shafaque, Parag D. Thakare, Mangesh Ghonge, Milindkumar Sarode, "Algorithm and Approaches to Handle Big Data", International Journal of Computer Applications (0975 – 8887), National Level Technical Conference X-PLORE 14.

IJCATM : www.ijcaonline.org
