
International Journal of Engineering Research & Technology (IJERT)

ISSN: 2278-0181
Vol. 3 Issue 4, April - 2014

Cloud Computing Techniques for Big Data and Hadoop Implementation

Nikhil Gupta (Author), Research Scholar, AIIT, Amity University, NOIDA, UP (India)
Ms. Komal Saxena (Guide), Assistant Professor, AIIT, Amity University, NOIDA, UP (India)

Abstract- Big data is a comparatively new and much-discussed topic in today's scenario. Big data is a set of data which is so large in size that a conventional database cannot, or does not have the ability to, capture, store, manage and analyze it. Big data is implemented using Hadoop, and Hadoop is in demand in the cloud nowadays.

The big data business ecosystem and the trends that provide the basis for big data are explained. There is a need for an effective solution to the issue of data volume, in order to enable feasible, cost-effective and scalable storage and processing of enormous quantities of data; thus big data and the cloud go hand in hand, and Hadoop is a very hot and enormously growing technology for organizations. The steps required for setting up a distributed, single-node Hadoop cluster backed by HDFS running on Ubuntu (steps only) are given.

Keywords: Big data, Hadoop, Map Reduce, HDFS, Big data business ecosystem, Scalable database management system.

I. INTRODUCTION

Every day 2.5 quintillion bytes [3] of data are created. Data is generated by smartphones, posts to social media, sensors, blogs etc. Nowadays we are looking at big data from a business perspective. Big data and the cloud go hand in hand. Cloud computing has enabled companies to get more value from their data than ever before, by enabling fast analytics at low cost, so companies can store even more data. This data is increasing day by day, and we need new technology and paradigms to collect, store and analyze it as and when needed. The problem with conventional database architecture is that it is not able to handle data of petabyte size. So, ultimately, there is a requirement for new technology, and hence big data was introduced with new enhancements.

In section I the introduction is given about big data and scalable database management systems. Section II is about aspects of big data and its challenges. Limitations and issues are in section III. Section IV helps us to choose between Hadoop and the data warehouse. Section V is Hadoop for big data. Big data management, scalability and performance are outlined in section VI. The big data business ecosystem is described in section VII. Running Hadoop on Ubuntu Linux (single-node cluster) is in section VIII. Finally, the conclusion is in section IX.

BIG DATA definition- A set of data which is so large in size that a traditional database cannot, or does not have the ability to, capture, store, manage and analyze it.

1 petabyte = 1000 TB (terabytes)

Two breakthroughs have helped in adopting solutions for handling big data:

 Availability of cloud-based solutions.
 Distribution of data over many servers.

Scalable database management systems [10]

These have been the vision of the database community for three decades:

 Distributed databases - used for update-intensive workloads.
 Parallel databases - used for analytical workloads.

Parallel database systems have grown into very large commercial systems, but distributed databases were not very successful. Changes in data access patterns led to the birth of new systems referred to as key-value stores, and to the adoption of the MapReduce paradigm and its open-source Hadoop implementation.

IJERTV3IS040952 www.ijert.org 722



II. Aspects of big data and its challenges

1. Volume- A large volume of data is generated; by the year 2020 it is expected that 35 zettabytes of data will be stored. On a daily basis Facebook generates 10 TB of data and Twitter generates 7 TB.
Challenge- Virtualization of storage in data centers; we can make use of NoSQL databases to store and query huge volumes of data.
2. Velocity- The speed of data, i.e. more and more data is produced and must be collected in shorter time frames.
Example- Show all people currently affected by a flood, updated from GPS data in real time.
Challenge- Real-time data processing is needed.
3. Variety- Handling multiple sources and formats, as data can be raw data, structured data, unstructured data etc.
Challenge- This goes against the traditional relational data model and the way we collect data, and has created new data stores that are able to support flexible data models.
4. Value- The main concern for every organization is how to make data useful. The main point is to convert raw data into valuable data.
5. Data Quality- How good is the data? All decisions will be made according to the quality of the data. A good process will make good decisions if it is based on good data quality.

III. LIMITATIONS OF BIG DATA

There are very few people who have the knowledge or skill to take advantage of big data.

Issues

 Data Policies: For example, the storage, computing and analytical software all require new policies for big data.
 Technology and techniques: Privacy and security are required for data.
 Access to Data: When we have to access the data, we need to integrate multiple data sources together.

IV. Which one to use? Hadoop or the data warehouse

Requirement                                                    Data warehouse   Hadoop
Low latency, OLAP, and interactive reports                           
SQL compliance is required (ANSI 2003)                               
Preprocessing or exploration of raw unstructured data                                 
Online archives alternative to tape                                                   
High-quality purified and persistent data                            
100s to 1000s of concurrent users                                                
Discover unexplored relationships in the data                                    
Parallel complex process logic                                                        
CPU-intense analysis                                                             
System, user, and data governance                                    
A number of flexible programming languages running in parallel                        
Unrestricted, ungoverned sandbox explorations                                         
Analysis of provisional data                                                          
Extensive security and regulatory compliance                         
Real-time data loading and 1-second tactical queries                             

Table 1: Which one to use, Hadoop or the data warehouse [2]

V. HADOOP FOR BIG DATA

Overview: In 2002, Doug Cutting developed an open-source web crawler project. Google published the MapReduce paper in 2004, and Doug Cutting then developed open-source implementations of MapReduce and HDFS.


What is Hadoop?

Hadoop is a framework that allows distributed processing of large data sets across clusters of computers using a simple programming model. It is an open-source library and application programs written in the Java language. Hadoop implements HDFS (the Hadoop Distributed File System).

Hadoop clusters running the same software can range in size from a single server to many thousands of servers.

Fig 1: Hadoop supports different types of unstructured data with clustering.

Components of Hadoop

1. HDFS (Hadoop Distributed File System)

It is a file system designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware. HDFS is highly fault tolerant. HDFS automatically distributes files across the cluster and retrieves data by file name. HDFS does not change a file once it is written; thus, if any change has to be made, the entire file must be rewritten.

Fig 2: HDFS Architecture [16]

2. MapReduce

It is a programming paradigm. It has two phases for solving a query in HDFS:

 Map
 Reduce

Map is responsible for reading data from the input location and, based on the input, generating key-value pairs, i.e. an intermediate output, on the local machine. Reduce is responsible for processing the intermediate output received from the mapper and generating the final output.

MapReduce for data processing enables Hadoop to process large data sets in parallel across all nodes in the cluster.

Fig 3: Working of MapReduce for a single node [17].

In many ways, MapReduce can be seen as a complement to an RDBMS. MapReduce is a good fit for problems that need to analyze the complete dataset in a batch fashion, especially for ad hoc analysis. The relational database is good for point queries or updates, where the dataset has been indexed to deliver low-latency retrieval and update times for a relatively small amount of data.

Considerations   Traditional RDBMS            MapReduce
Data size        Gigabytes                    Petabytes
Access           Interactive and batch        Batch
Updates          Read and write many times    Write once, read many times
Structure        Static schema                Dynamic schema
Integrity        High                         Low
Scaling          Nonlinear                    Linear

Table 2: Difference between an RDBMS and MapReduce [17]
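The map, shuffle and reduce phases described above can be imitated with ordinary Unix tools (the mental model used by Hadoop Streaming); this pipeline is only an illustration of the data flow, not Hadoop itself:

```shell
# Word count as the three MapReduce phases:
#   map:     split each record into words, one key per line
#   shuffle: sort groups identical keys together
#   reduce:  aggregate each group (here: count occurrences per word)
printf 'big data\nbig hadoop\n' \
  | tr ' ' '\n' \
  | sort \
  | uniq -c \
  | awk '{print $2, $1}'
```

In real Hadoop the mapper would emit (word, 1) pairs on each node and the framework's shuffle would route all pairs with the same key to one reducer; here `sort` plays that role on a single machine.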


Advantages of Hadoop:

 No software license is required
 Low system acquisition cost
 Automatic system maintenance

Disadvantages of Hadoop:

Hadoop has a centralized metadata store, the namenode, which represents a single point of failure; without high availability, when the namenode fails it can take a long time to get the Hadoop cluster running again.

Fig 4: Big data Hadoop ecosystem

VI. Big data management, scalability and performance with respect to Hadoop

There is a need for an effective solution to the issue of data volume, in order to enable feasible, cost-effective and scalable storage and processing of enormous quantities of data. There are quality constraints on both:

 Storage of big data: to what degree data has to be replicated, and the latency requirements.
 Processing: which parallel computing resources are required, and where.

VII. Big data business ecosystem [6]

Trends that provide the basis for the big data ecosystem:

1. Data science and the associated skills are in high demand, e.g. data scientists.
2. Generalization of the big data platform: we need more responsive applications, and we need to write models that discover patterns in near real time.
3. Commoditization of the big data platform. Two approaches:
   (a) Make big data accessible to developers by making it easy to create applications.
   (b) Find use cases for big data, like face recognition, fingerprint readers, voice recognition etc.
4. Increased cross-enterprise collaboration: there is a requirement for sharing, exchanging and managing data across platforms.

VIII. Steps required for setting up a distributed, single-node Hadoop cluster, backed by HDFS [16]

Step 1- Install Ubuntu
Step 2- Install Java
Step 3- Add a dedicated Hadoop system user
Step 4- Configure SSH
Step 5- Test SSH by connecting to your server
Step 6- Disable IPv6
Step 7- Install Hadoop
Step 7.1- Configure HDFS
Step 8- Configure the directory where Hadoop will store its files
Step 9- Format the HDFS file system via the namenode
Step 10- Start your single-node cluster
Step 11- Run a MapReduce job
Step 12- Copy local data to HDFS
Step 13- Run the MapReduce job on the data

IX. CONCLUSION

Big data is the next frontier for innovation, competition and productivity. In Horizon 2020, big data finds its place in industrial leadership. There is a need for structuring data in all sectors of the economy.

Hadoop and cloud computing are in great demand in several organizations. In the coming time, Hadoop will become one of the most required technologies for cloud computing. Proof of this is given by the total number of


Hadoop clusters offered by cloud vendors in many businesses.

Organizations are looking to expand Hadoop use cases to include business-critical, secure applications that integrate easily with file-based applications.

There is a need for tools that do not require specialized skills and programmers. New Hadoop developments must be easier for users to operate and to get data into and out of; this includes direct access with standard protocols using existing tools and techniques.
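As a rough illustration of the section VIII steps, the commands below sketch the flow on Ubuntu. The Hadoop version, package names, paths and the `hduser` account are illustrative assumptions in the style of tutorial [16], not exact prescriptions:

```shell
# Illustrative single-node Hadoop setup flow on Ubuntu (versions/paths assumed).

sudo apt-get install openjdk-6-jdk                  # Step 2: install Java
sudo addgroup hadoop                                # Step 3: dedicated group...
sudo adduser --ingroup hadoop hduser                # ...and user for Hadoop
ssh-keygen -t rsa -P ""                             # Step 4: passwordless SSH key
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
ssh localhost                                       # Step 5: test SSH to this host
# Step 6: disable IPv6, e.g. net.ipv6.conf.all.disable_ipv6 = 1 in /etc/sysctl.conf
sudo tar xzf hadoop-1.0.3.tar.gz -C /usr/local      # Step 7: unpack Hadoop
# Steps 7.1/8: set hadoop.tmp.dir and fs.default.name in conf/core-site.xml
# so HDFS knows where to store its files before first use.
hadoop namenode -format                             # Step 9: format HDFS
start-all.sh                                        # Step 10: start the daemons
hadoop dfs -copyFromLocal /tmp/books /user/hduser/books    # Step 12: load data
hadoop jar hadoop-examples-1.0.3.jar wordcount \
    /user/hduser/books /user/hduser/books-output    # Steps 11/13: run MapReduce
```

Each command corresponds to one numbered step above; on a real machine they must be run in order, and the Hadoop commands assume the unpacked distribution's `bin` directory is on `PATH`.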

REFERENCES

1. Big Data Now, O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
2. Dr. Amr Awadallah (Founder and CTO, Cloudera, Inc.) and Colin White, "Hadoop and the Data Warehouse: When to Use Which" / "MapReduce and the Data Scientist", 2012.
3. Sam B. Siewert (Assistant Professor, University of Alaska Anchorage), "Big data in the cloud", 9th July 2013.
4. Dr. Satwant Kaur, keynote at the CES show: "Many trends and new technology developments for big data", IEEE International Conference on Consumer Electronics (ICCE 2013).
5. James Manyika, Michael Chui, Brad Brown, Jacques Bughin, Richard Dobbs, Charles Roxburgh, and Angela Hung Byers, "Big data: The next frontier for innovation, competition, and productivity", McKinsey Global Institute, May 2011.
6. NESSI white paper, "Big Data: A New World of Opportunities", December 2012.
7. http://www.baselinemag.com/cloud-computing/managing-big-data-in-the-cloud
8. Divyakant Agrawal, Sudipto Das, and Amr El Abbadi, Department of Computer Science, University of California, Santa Barbara.
9. https://cloudsecurityalliance.org/research/big-data/#_news
10. The Apache Hadoop Project, http://Hadoop.apache.org/core/, 2009.
11. http://www.ibm.com/developerworks/library/bd-bigdatacloud/
12. D. Agrawal, S. Das, and A. E. Abbadi, "Big data and cloud computing: New wine or just new bottles?", PVLDB, 3(2):1647-1648, 2010.
13. http://www.edupristine.com/courses/big-data-Hadoop-program/?jscfct=1
14. http://www.edureka.in/blog/category/big-data-and-Hadoop/
15. www.forbes.com/big-data
16. www.michael-noll.com/tutorials/running-Hadoop-on-ubuntu-linux-single-node-cluster/
17. Tom White, Hadoop: The Definitive Guide, Second Edition, O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
