Big Data Research Paper
INTRODUCTION:
Data is a collection of values and variables that are related in some respects and differ in others. In recent years the sizes of databases have increased rapidly, which has led to a growing interest in the development of tools capable of automatically extracting knowledge from data [1]. Data are gathered and analyzed to produce information suitable for decision making; data thus provide a rich resource for knowledge discovery and decision support. A database is an organised collection of data arranged so that it can easily be accessed, managed and updated. Data mining is the process of discovering interesting knowledge, for example associations, patterns, changes, anomalies and significant structures, from large amounts of data stored in databases, data warehouses or other data stores.
LITERATURE REVIEW
Puneet Singh Duggal, Sanchita Paul, "Big Data Analysis: Challenges and Solutions", International Conference on Cloud, Big Data and Trust 2013, Nov 13-15, RGPV.
This paper presents various methods for handling the problems of big data analysis through the MapReduce framework over the Hadoop Distributed File System (HDFS). The MapReduce techniques implemented for Big Data analysis using HDFS are studied in this paper.
This paper presents a review of various algorithms from 1994-2014 that are important for handling Big Data sets. It gives an overview of the architectures and algorithms used on large data sets. These algorithms define various structures and methods implemented to handle Big Data, and the paper lists various tools that have been developed for analyzing them. It also explains the security issues, applications and trends followed by large data sets [9].
Wei Fan, Albert Bifet, "Mining Big Data: Current Status, and Forecast to the Future", SIGKDD Explorations, Volume 14, Issue 2
The paper presents a broad overview of the topic of Big Data mining, its current status, controversy, and a forecast of the future. It also covers various interesting and state-of-the-art topics on Big Data mining.
Priya P. Sharma, Chandrakant P. Navdeti, "Securing Big Data Hadoop: A Review of Security Issues, Threats and Solution", IJCSIT, Vol 5(2), 2014, 2126-2131
This paper discusses big data security at the environment level along with a probing of the built-in protections. It also presents some security issues that we are dealing with today and proposes security solutions and commercially available techniques to address them. The paper also covers the security solutions for securing the Hadoop ecosystem.
Richa Gupta, Sunny Gupta, Anuradha Singhal, "Big Data: Overview", IJCTT, Vol 9, Number 5, March 2014
This paper gives an overview of Big Data, its importance in our lives and some technologies for handling Big Data. It also states how Big Data can be applied to self-organizing websites, which can be extended to the field of marketing in companies.
Big data analysis is the process of applying advanced analytics and visualization techniques to large data sets to uncover hidden patterns and unknown correlations for effective decision making. The analysis of Big Data involves multiple distinct phases, which include data acquisition and recording, information extraction and cleaning, data integration, aggregation and representation, query processing, data modeling and analysis, and interpretation. Each of these phases introduces challenges. Heterogeneity, scale, timeliness, complexity and privacy are among the main challenges of Big Data mining.
Timeliness
As the size of the data sets to be processed increases, it takes more time to analyze them. In some situations the results of the analysis are required immediately. For instance, if a fraudulent credit card transaction is suspected, it should ideally be flagged before the transaction completes, by preventing the transaction from taking place at all. Obviously a full analysis of a customer's purchase history is not likely to be feasible in real time, so partial results need to be computed in advance so that a small amount of incremental computation with new data can be used to arrive at a quick determination, as sketched below.
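As an illustration of this idea, the minimal Java sketch below keeps a small running summary per card that is updated incrementally as transactions arrive, so a new transaction can be checked in constant time instead of re-scanning the full purchase history. The class names, the running-mean summary and the flagging rule are invented for illustration and are not taken from the paper.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: maintain a tiny per-card summary ahead of time so that each
// new transaction needs O(1) incremental work rather than a scan of all history.
public class IncrementalFraudCheck {

    // Hypothetical per-card summary kept up to date incrementally.
    static class CardProfile {
        long count = 0;
        double mean = 0.0;                      // running mean of transaction amounts

        void update(double amount) {            // incremental update with new data
            count++;
            mean += (amount - mean) / count;    // running-mean update, no history needed
        }

        boolean looksSuspicious(double amount) {
            // Toy rule: flag amounts far above the historical mean.
            return count >= 10 && amount > 5 * mean;
        }
    }

    private final Map<String, CardProfile> profiles = new HashMap<>();

    // Called for every incoming transaction before it is allowed to complete.
    public boolean checkAndRecord(String cardId, double amount) {
        CardProfile p = profiles.computeIfAbsent(cardId, id -> new CardProfile());
        boolean suspicious = p.looksSuspicious(amount);
        p.update(amount);
        return suspicious;
    }

    public static void main(String[] args) {
        IncrementalFraudCheck checker = new IncrementalFraudCheck();
        for (int i = 0; i < 20; i++) checker.checkAndRecord("card-1", 40 + i);
        System.out.println(checker.checkAndRecord("card-1", 5000)); // true: flagged
    }
}
```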
Given a large data set, it is often necessary to find elements in it that meet a specified criterion, and in the course of data analysis this kind of search is likely to occur repeatedly. Scanning the entire data set to find suitable elements is clearly impractical. In such cases index structures are created in advance to permit finding qualifying elements quickly, as in the sketch below. The problem is that each index structure is designed to support only certain classes of criteria.
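The following minimal Java sketch illustrates an index built ahead of time: a hash index over one attribute turns repeated equality lookups into constant-time probes instead of full scans, but, as noted above, it supports only that one class of criterion. The record type, field names and sample data are invented for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a hash index built once over the "category" attribute so that
// repeated equality lookups avoid scanning the whole data set. A range query on
// "amount" would need a different structure (e.g. a sorted index such as a TreeMap).
public class PrebuiltIndexDemo {

    record RecordRow(String id, String category, double amount) {}  // hypothetical record

    public static void main(String[] args) {
        List<RecordRow> dataset = List.of(
                new RecordRow("r1", "electronics", 199.0),
                new RecordRow("r2", "groceries", 23.5),
                new RecordRow("r3", "electronics", 549.0));

        // Build the index ahead of time: one pass over the data.
        Map<String, List<RecordRow>> byCategory = new HashMap<>();
        for (RecordRow r : dataset) {
            byCategory.computeIfAbsent(r.category(), c -> new ArrayList<>()).add(r);
        }

        // Each later query is now a direct lookup instead of a full scan.
        List<RecordRow> hits = byCategory.getOrDefault("electronics", List.of());
        hits.forEach(r -> System.out.println(r.id() + " " + r.amount()));
    }
}
```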
Big data refers to collections of data sets whose size is beyond the ability of commonly used software tools, such as database management tools or traditional data processing applications, to capture, manage and analyze within an acceptable elapsed time. Big data sizes are constantly increasing, ranging from a few dozen terabytes in 2012 to many petabytes in a single data set today.
Big data creates tremendous opportunities for the world economy, both in the field of national security and in areas ranging from marketing and credit risk analysis to medical research and urban planning. The extraordinary benefits of big data are lessened by concerns over privacy and data protection. As big data expands the sources of data it can use, the trustworthiness of each data source needs to be verified, and techniques should be explored to identify maliciously inserted data. Information security is becoming a big data analytics problem in which massive amounts of data will be correlated, analyzed and mined for meaningful patterns. Any security control used for big data must meet the following requirements:
It must not compromise the basic functionality of the cluster.
It should scale in the same manner as the cluster.
It should not compromise essential big data characteristics.
It should address a security threat to big data environments or to data stored within the cluster.
Unauthorized release of information, unauthorized modification of information and denial of resources are the three categories of security violation. The following are some of the security threats:
An unauthorized user may access files and could execute arbitrary code or carry out further attacks.
An unauthorized user may eavesdrop on or sniff data packets being sent to a client.
An unauthorized client may read or write a data block of a file.
An unauthorized client may gain access privileges and may submit a job to a queue, or delete or change the priority of a job.
Security of big data can be enhanced by using the techniques of authentication, authorization, encryption and audit trails. There is always a possibility of security violations through unintended, unauthorized access or through inappropriate access by privileged users, so these protections should be applied together across the cluster; a minimal client-side sketch is given below.
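As an illustration of the authentication and authorization techniques mentioned above, the sketch below shows a Hadoop client configured for Kerberos authentication and service-level authorization before accessing a secured cluster. The principal name and keytab path are placeholders that a real cluster administrator would supply; this is a sketch, not a complete security setup.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.security.UserGroupInformation;

// Illustrative client-side sketch: enable Kerberos authentication and service-level
// authorization before talking to a secured Hadoop cluster.
public class SecureHadoopClient {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        conf.set("hadoop.security.authentication", "kerberos"); // authentication mode
        conf.set("hadoop.security.authorization", "true");      // service-level authorization

        UserGroupInformation.setConfiguration(conf);
        // Authenticate this process using a keytab instead of an interactive password.
        // Principal and keytab path below are placeholders.
        UserGroupInformation.loginUserFromKeytab("analyst@EXAMPLE.COM",
                "/etc/security/keytabs/analyst.keytab");

        // All subsequent HDFS access happens as the authenticated principal.
        try (FileSystem fs = FileSystem.get(conf)) {
            System.out.println("Home directory: " + fs.getHomeDirectory());
        }
    }
}
```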
Big data has great potential to produce useful information for companies, which can benefit the way they manage their problems. Big data analysis is becoming indispensable for automatically discovering the intelligence contained in frequently occurring patterns and hidden rules.
These massive data sets are too large and complex for humans to extract useful information from effectively without the aid of computational tools. Emerging technologies such as the Hadoop framework and MapReduce offer new and exciting ways to process and transform big data, defined as complex, unstructured, or large amounts of data, into meaningful knowledge.
Hadoop
Hadoop is a scalable, open source, fault-tolerant Virtual Grid operating system architecture for data storage and processing. It runs on commodity hardware and uses HDFS, a fault-tolerant, high-bandwidth clustered storage architecture. It runs MapReduce for distributed data processing and works with both structured and unstructured data [11]. For handling the velocity and heterogeneity of data, tools like Hive, Pig and Mahout are used, which are parts of the Hadoop and HDFS framework. Hadoop and HDFS (Hadoop Distributed File System) by Apache are widely used for storing and managing big data; a minimal client example follows.
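The following minimal client sketch, assuming the standard Hadoop FileSystem API and a placeholder NameNode address, shows how an application writes a file into HDFS and reads it back through the same unified file-system view.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustrative sketch: write and read a file through the HDFS client API.
// The fs.defaultFS address and the /user/demo path are placeholders; in practice
// the cluster address comes from the site configuration.
public class HdfsReadWriteDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020"); // assumed address

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/user/demo/hello.txt");

            // Write: the file is split into blocks and replicated across data nodes.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("hello from HDFS".getBytes(StandardCharsets.UTF_8));
            }

            // Read it back through the same unified file-system view.
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
                System.out.println(in.readLine());
            }
        }
    }
}
```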
Hadoop consists of a distributed file system, data storage and analytics platforms, and a layer that handles parallel computation, workflow and configuration administration [6].
HDFS runs across the nodes in a Hadoop cluster and connects the file systems on many input and output data nodes together into one big file system. The present Hadoop ecosystem, as shown in Figure 1, consists of the Hadoop kernel, MapReduce, the Hadoop Distributed File System (HDFS) and a number of related components such as Apache Hive, HBase, Oozie, Pig and Zookeeper. These components are explained below:
HDFS: A highly fault-tolerant distributed file system that is responsible for storing data on the clusters.
MapReduce: A powerful parallel programming technique for distributed processing of vast amounts of data on clusters.
HBase: A column-oriented distributed NoSQL database for random read/write access.
Pig: A high-level data-flow programming language for analyzing data in Hadoop computations.
Hive: A data warehousing application that provides SQL-like access and a relational model (see the sketch after this list).
Sqoop: A project for transferring/importing data between relational databases and Hadoop.
Oozie: An orchestration and workflow management service for dependent Hadoop jobs.
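As a sketch of the SQL-like access that Hive provides, the example below issues a query through the standard HiveServer2 JDBC driver. The server address, credentials and the sales table are placeholders, and the hive-jdbc driver is assumed to be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Illustrative sketch: a SQL-like query against Hive via the HiveServer2 JDBC driver.
// Host, port, database, user and table name are placeholders.
public class HiveQueryDemo {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");       // load the Hive JDBC driver
        String url = "jdbc:hive2://hiveserver.example.com:10000/default"; // assumed endpoint
        try (Connection con = DriverManager.getConnection(url, "analyst", "");
             Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT category, COUNT(*) FROM sales GROUP BY category")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}
```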
MapReduce:
MapReduce is a programming model for processing large data sets with a
parallel, distributed algorithm on a cluster. Hadoop MapReduce is a
programming model and software framework for writing applications that
rapidly process vast amounts of data in parallel on large clusters of compute
nodes [11].
MapReduce consists of two functions, map() and reduce(). The mapper performs the tasks of filtering and sorting, and the reducer performs the task of summarizing the results; there may be multiple reducers to parallelize the aggregations [7]. Users can implement their own processing logic by specifying customized map() and reduce() functions. The map() function takes an input key/value pair and produces a list of intermediate key/value pairs. The MapReduce runtime system groups together all intermediate pairs based on the intermediate keys and passes them to the reduce() function for producing the final results. MapReduce is widely used for the analysis of big data; the classic word-count job below illustrates the model.
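The word-count job below is a compact sketch of this model using the standard Hadoop MapReduce API: the mapper emits an intermediate (word, 1) pair for every token, the runtime groups the pairs by word, and the reducer sums the counts for each word. Input and output paths are taken from the command line.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// The classic word-count example: map() emits (word, 1) pairs, the framework groups
// them by word, and reduce() sums the counts for each word.
public class WordCount {

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (token.isEmpty()) continue;
                word.set(token);
                context.write(word, ONE);          // intermediate (word, 1) pair
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) sum += val.get();  // aggregate counts per word
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation on each mapper
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```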
Large-scale data processing is a difficult task; managing hundreds or thousands of processors, and handling parallelization and distributed environments, makes it more difficult. MapReduce provides a solution to these issues, since it supports distributed and parallel I/O scheduling, is fault tolerant, supports scalability, and has built-in facilities for status reporting and monitoring of heterogeneous and large data sets as in Big Data [11].
CONCLUSION
The amount of data worldwide is growing exponentially due to the explosion of social networking sites, search and retrieval engines, media sharing sites, stock trading sites, news sources and so on. Big Data is becoming the new area for scientific data research and for business applications. Big data analysis is becoming indispensable for automatically discovering the intelligence contained in frequently occurring patterns and hidden rules. Big data analysis helps companies to take better decisions, to predict and identify changes and to identify new opportunities. In this paper we discussed the issues and challenges related to big data mining, as well as Big Data analysis tools such as MapReduce over Hadoop and HDFS, which help organizations to better understand their customers and the marketplace, to take better decisions, and which help researchers and scientists to extract useful knowledge from Big Data. In addition, we introduced some big data mining tools and how to extract significant knowledge from Big Data, which will help research scholars to choose the best mining tool for their work.
REFERENCES