Unit1 Remainingtopics 6feb

 Hadoop is an open-source framework, overseen by the Apache Software Foundation and written in Java, for storing and processing huge datasets on clusters of commodity hardware.
 Hadoop was created by Doug Cutting and Mike Cafarella in 2005.
 It was originally developed to support distributed storage and processing for the Nutch search engine project.
 The Apache Nutch project aimed to build a search engine system that could index 1 billion pages. After a lot of research, Cutting and Cafarella concluded that such a system would cost around half a million dollars in hardware, plus a monthly running cost of approximately $30,000, which was very expensive. They realized that their project architecture would not be capable of working with billions of pages on the web, so they began looking for a feasible solution that could reduce the implementation cost and also solve the problem of storing and processing large datasets.
 In 2003, they came across a paper published by Google that described the architecture of Google’s distributed file system, GFS (Google File System), for storing large datasets. They realized that this paper could solve their problem of storing the very large files being generated by the web crawling and indexing processes. But it was only half of the solution.
 In 2004, Google published another paper, on the MapReduce technique, which was the solution for processing those large datasets. This was the other half of the solution for Doug Cutting and Mike Cafarella’s Nutch project. Google had described both techniques (GFS & MapReduce) only on paper and had not released its implementations. Doug Cutting knew from his work on Apache Lucene (a free and open-source information retrieval software library, originally written in Java by Doug Cutting in 1999) that open source is a great way to spread technology to more people. So, together with Mike Cafarella, he started implementing Google’s techniques (GFS & MapReduce) as open source in the Apache Nutch project.
 In 2005, Cutting found that Nutch was limited to clusters of only 20 to 40 nodes. He soon realized two problems:
(a) Nutch wouldn’t achieve its potential until it ran reliably on larger clusters.
(b) That looked impossible with just two people (Doug Cutting & Mike Cafarella).
The engineering task in the Nutch project was much bigger than he had realized, so he started looking for a company interested in investing in their efforts. He found Yahoo!, which had a large team of engineers eager to work on the project.
 So in 2006, Doug Cutting joined Yahoo along with the Nutch project. He wanted to provide the world with an open-source, reliable, scalable computing framework, with the help of Yahoo. At Yahoo he first separated the distributed computing parts from Nutch and formed a new project, Hadoop.
 In 2007, Yahoo successfully tested Hadoop on a 1000-node cluster and started using it.
 In January 2008, Yahoo released Hadoop as an open-source project to the ASF (Apache Software Foundation). In July 2008, the Apache Software Foundation successfully tested Hadoop on a 4000-node cluster.
 In 2009, Hadoop was successfully tested to sort a petabyte (PB) of data in less than 17 hours, handling billions of searches and indexing millions of web pages. Doug Cutting then left Yahoo and joined Cloudera to take on the challenge of spreading Hadoop to other industries.
 In December 2011, the Apache Software Foundation released Apache Hadoop version 1.0.
 In August 2013, version 2.0.6 became available.
 Apache Hadoop version 3.0 was released in December 2017.
 Currently we have Apache Hadoop version 3.3.4, released in August 2022.

Hadoop Modules in Detail

All the modules in Hadoop are designed with a fundamental assumption that hardware
failures (of individual machines, or racks of machines) are common and thus should be
automatically handled in software by the framework. Apache Hadoop's MapReduce and
HDFS components originally derived respectively from Google's MapReduce and Google
File System (GFS) papers.

Beyond HDFS, YARN and MapReduce, the entire Apache Hadoop "platform" is now
commonly considered to consist of a number of related projects as well: Apache Pig, Apache
Hive, Apache HBase, and others.

For end users, though MapReduce Java code is common, any programming language can be used with "Hadoop Streaming" to implement the "map" and "reduce" parts of the user's program.
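As an illustration, a Hadoop Streaming "map" program is simply any executable that reads input lines from standard input and writes tab-separated key-value pairs to standard output. The following is a minimal sketch of such a mapper, written in Java only to match the rest of these notes; the class name and the word-count logic are illustrative, not taken from any particular distribution.

import java.io.BufferedReader;
import java.io.InputStreamReader;

public class StreamingWordCountMapper {
    public static void main(String[] args) throws Exception {
        // Hadoop Streaming feeds each input line on stdin and expects
        // tab-separated key/value pairs on stdout.
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    System.out.println(word + "\t1");   // emit (word, 1)
                }
            }
        }
    }
}

A reducer written in the same style would read the sorted key-value lines from standard input and aggregate the counts; both programs are passed to the hadoop-streaming jar via its -mapper and -reducer options.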

The Hadoop framework itself is mostly written in the Java programming language, with some
native code in C and command line utilities written as shell-scripts.

Analysing Data with Unix

The general syntax of the awk command is:

awk options 'selection_criteria {action}' input-file > output-file

Consider the following text file, employee.txt, as the input for the examples below:

ajay manager account 45000
sunil clerk account 25000
varun manager sales 50000
amit manager account 47000
tarun peon sales 15000
deepak clerk sales 23000
sunil peon sales 13000
satvik director purchase 80000

Print every line (the default behavior of awk):

$ awk '{print}' employee.txt

Output:

ajay manager account 45000
sunil clerk account 25000
varun manager sales 50000
amit manager account 47000
tarun peon sales 15000
deepak clerk sales 23000
sunil peon sales 13000
satvik director purchase 80000

Print only the lines that match the pattern "manager":

$ awk '/manager/ {print}' employee.txt

Output:

ajay manager account 45000
varun manager sales 50000
amit manager account 47000

Splitting a line into fields: for each record (i.e., line), awk splits the record on whitespace by default and stores the fields in the $n variables. If the line has 4 words, they are stored in $1, $2, $3 and $4 respectively; $0 represents the whole line.
$ awk '{print $1,$4}' employee.txt
Output:  
ajay 45000
sunil 25000
varun 50000
amit 47000
tarun 15000
deepak 23000
sunil 13000
satvik 80000

https://www.geeksforgeeks.org/awk-command-unixlinux-examples/

Hadoop: The Definitive Guide, Chapter 2


substr(s, a, b): returns b characters from string s, starting at position a. The parameter b is optional; if omitted, the substring runs to the end of the string.
For instance, consider a file with the following content:
every good
awk '{print substr($1,1,1)}' data.txt #returns e
awk '{print substr($1,3) }' data.txt #returns ery
awk '{print substr($2,3) }' data.txt #returns od
awk '{print substr($0,7,2) }' data.txt #returns go

Problem:

The complete run of the awk program over a century's worth of weather records took 42 minutes on a single EC2 High-CPU Extra Large instance.

To speed up the processing, we need to run parts of the program in parallel. In theory, this is
straightforward: we could process different years in different processes, using all the
available hardware threads on a machine. There are a few problems with this, however.

First, the file size for different years varies widely, so some processes will finish much earlier than others. Even if they pick up further work, the whole run is dominated by the longest file. A better approach, although one that requires more work, is to split the input into fixed-size chunks and assign each chunk to a process.

Second, combining the results from independent processes may require further
processing.
In this case, the result for each year is independent of other years, and they may
be combined by concatenating all the results and sorting by year. If using the fixed-size
chunk approach, the combination is more delicate. For this example, data for a particular
year will typically be split into several chunks, each processed independently. We’ll end up
with the maximum temperature for each chunk, so the final step is to look for the
highest of these maximums for each year.
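To make that last step concrete, the sketch below (plain Java, with illustrative names and made-up per-chunk values) merges per-chunk (year, maximum temperature) results into a single maximum per year:

import java.util.HashMap;
import java.util.Map;

public class CombineChunkMaxima {
    // Merge one chunk's (year -> max temperature) results into the global map,
    // keeping the higher value for each year.
    static void merge(Map<Integer, Integer> global, Map<Integer, Integer> chunk) {
        for (Map.Entry<Integer, Integer> e : chunk.entrySet()) {
            global.merge(e.getKey(), e.getValue(), Math::max);
        }
    }

    public static void main(String[] args) {
        Map<Integer, Integer> global = new HashMap<>();
        // Per-chunk maxima for two years (illustrative values only).
        merge(global, Map.of(1949, 111, 1950, 22));
        merge(global, Map.of(1949, 78, 1950, -11));
        System.out.println(global);   // per-year maxima, e.g. {1949=111, 1950=22}
    }
}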

Third, you are still limited by the processing capacity of a single machine. If the best
time you can achieve is 20 minutes with the number of processors you have, then that’s it.
You can’t make it go faster. Also, some datasets grow beyond the capacity of a single
machine. When we start using multiple machines, a whole host of other factors come into
play, mainly falling into the categories of coordination and reliability. Who runs the overall
job? How do we deal with failed processes?

Analysing Data with Hadoop

MapReduce works by breaking the processing into two phases: the map phase and the
reduce phase. Each phase has key-value pairs as input and output, the types of which
may be chosen by the programmer. The programmer also specifies two functions: the
map function and the reduce function.

The input to our map phase is the raw NCDC data. We choose a text input format that
gives us each line in the dataset as a text value. The key is the offset of the beginning of the
line from the beginning of the file, but as we have no need for this, we ignore it. Our map
function is simple. We pull out the year and the air temperature, because these are the only
fields we are interested in. In this case, the map function is just a data preparation phase,
setting up the data in such a way that the reduce function can do its work on it: finding the
maximum temperature for each year. The map function is also a good place to drop bad
records: here we filter out temperatures that are missing, suspect, or erroneous.

1. Sample lines of input data are read from the raw NCDC file.

2. The input to the map function is key-value pairs (file offset, line of text).

3. The map function extracts the year and the air temperature and emits them as its output.

4. The output from the map function is processed by the MapReduce framework before being sent to the reduce function. This processing step sorts and groups the key-value pairs by key, so the reduce function receives each year together with the list of all its temperature readings.

5. The reduce function iterates through each list, extracts the maximum reading, and emits it as output.

Note: Refer to examples 2-3, 2-4, and 2-5 in Hadoop: The Definitive Guide for the Java implementation.
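For reference, a simplified sketch of the mapper and reducer described above is given below. This is not the book's exact code; the fixed character offsets assume the NCDC fixed-width record layout used in the book's examples, and the driver class is omitted.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperature {

    public static class MaxTemperatureMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final int MISSING = 9999;

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            String year = line.substring(15, 19);
            int airTemperature;
            if (line.charAt(87) == '+') {   // the temperature field carries a sign
                airTemperature = Integer.parseInt(line.substring(88, 92));
            } else {
                airTemperature = Integer.parseInt(line.substring(87, 92));
            }
            String quality = line.substring(92, 93);
            // Drop missing or suspect readings, as described above.
            if (airTemperature != MISSING && quality.matches("[01459]")) {
                context.write(new Text(year), new IntWritable(airTemperature));
            }
        }
    }

    public static class MaxTemperatureReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int maxValue = Integer.MIN_VALUE;
            for (IntWritable value : values) {
                maxValue = Math.max(maxValue, value.get());
            }
            context.write(key, new IntWritable(maxValue));   // (year, max temperature)
        }
    }
}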

Introduction to IBM Big Data Strategy

https://citizenchoice.in/course/big-data/Chapter%205/12-ibm-big-data-strategy

IBM Big Data Strategy :


IBM, a US-based computer hardware and software manufacturer, implemented a Big Data strategy under which the company offered solutions to store, manage, and analyze the huge amounts of data generated daily, equipping large and small companies to make informed business decisions.
The company believed that its Big Data and analytics products and services would help its clients become more competitive and drive growth.
Issues :
·       Understand the concept of Big Data and its importance to large, medium, and small companies in the current industry scenario.
·       Understand the need for implementing a Big Data strategy and the various issues and challenges associated with it.
·       Analyze the Big Data strategy of IBM.
·       Explore ways in which IBM’s Big Data strategy could be improved further.
 
Introduction to InfoSphere :
InfoSphere Information Server provides a single platform for data integration and
governance.
The components in the suite combine to create a unified foundation for enterprise
information architectures, capable of scaling to meet any information volume
requirements.
You can use the suite to deliver business results faster while maintaining data quality
and integrity throughout your information landscape.
InfoSphere Information Server helps your business and IT personnel collaborate to
understand the meaning, structure, and content of information across a wide variety of
sources.
By using InfoSphere Information Server, your business can access and use
information in new ways to drive innovation, increase operational efficiency, and
lower risk.
BigInsights :
 
BigInsights is a software platform for discovering, analyzing, and visualizing data
from disparate sources.
The flexible platform is built on the open-source Apache Hadoop framework, which runs in parallel on commonly available, low-cost hardware.
 
BigSheets :
 
BigSheets is a browser-based analytic tool, included in the InfoSphere BigInsights Console, that you use to break large amounts of unstructured data into consumable, situation-specific business contexts.
These deep insights help you to filter and manipulate data from sheets even further.
 
Intro to Big SQL :        
 
 
IBM Big SQL is a high-performance, massively parallel processing (MPP) SQL engine
for Hadoop that makes querying enterprise data from across the organization an easy
and secure experience.
A Big SQL query can quickly access a variety of data sources including HDFS,
RDBMS, NoSQL databases, object stores, and WebHDFS by using a single database
connection or single query for best-in-class analytic capabilities.
Big SQL provides tools to help you manage your system and your databases, and you
can use popular analytic tools to visualize your data.
Big SQL's robust engine executes complex queries for relational data and Hadoop
data.
Big SQL provides an advanced SQL compiler and a cost-based optimizer for efficient
query execution. Combining these with the massively parallel processing (MPP) engine
helps distribute query execution across the nodes in a cluster.
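As an illustration of the "single database connection or single query" point, Big SQL can be reached through a standard JDBC connection. The sketch below is an assumption-laden example, not taken from IBM documentation: the host name, port, credentials, and table name are placeholders, and the IBM Db2 JDBC driver is assumed to be on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class BigSqlQueryExample {
    public static void main(String[] args) throws Exception {
        // Assumed connection details; Big SQL is typically reached via the Db2 JDBC driver.
        String url = "jdbc:db2://bigsql-head-node:32051/bigsql";
        try (Connection conn = DriverManager.getConnection(url, "bigsql", "password");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT year, MAX(temperature) FROM weather GROUP BY year")) {
            while (rs.next()) {
                System.out.println(rs.getInt(1) + "\t" + rs.getInt(2));
            }
        }
    }
}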

Apache Hadoop Ecosystem


Apache Hadoop is an open source framework intended to make interaction with big data
easier.

Following are the components that collectively form a Hadoop ecosystem: 


 
 HDFS: Hadoop Distributed File System
 YARN: Yet Another Resource Negotiator
 MapReduce: Programming based Data Processing
 Spark: In-Memory data processing
 PIG, HIVE: Query based processing of data services
 HBase: NoSQL Database
 Mahout, Spark MLLib: Machine Learning algorithm libraries
 Solr, Lucene: Searching and Indexing
 Zookeeper: Managing cluster
 Oozie: Job Scheduling
1. HDFS

 HDFS is the primary or major component of the Hadoop ecosystem and is responsible for storing large datasets of structured or unstructured data across various nodes, and for maintaining the metadata in the form of log files.
 HDFS consists of two core components i.e. 
1. Name node
2. Data Node

 The Name Node is the prime node, which contains the metadata (data about data) and requires comparatively fewer resources than the Data Nodes, which store the actual data.
 The Data Nodes are commodity hardware in the distributed environment, which undoubtedly makes Hadoop cost-effective.
 HDFS maintains all the coordination between the clusters and hardware, thus working at the heart of the system.
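As a brief illustration of how a client talks to HDFS, the sketch below reads a file through the Hadoop FileSystem Java API; the NameNode address and the file path are assumptions made for the example.

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000");   // assumed NameNode address
        try (FileSystem fs = FileSystem.get(conf);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(fs.open(new Path("/data/employee.txt"))))) {
            String line;
            // The NameNode supplies block locations; the DataNodes supply the actual bytes.
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}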

2. YARN

 Yet Another Resource Negotiator: as the name implies, YARN helps to manage the resources across the clusters. In short, it performs scheduling and resource allocation for the Hadoop system.
 YARN consists of three major components, i.e.,
1. Resource Manager
2. Node Manager
3. Application Manager
 The Resource Manager has the privilege of allocating resources for the applications in the system.
 Node Managers work on the allocation of resources such as CPU, memory, and bandwidth per machine, and later acknowledge the Resource Manager.
 The Application Manager works as an interface between the Resource Manager and the Node Managers and performs negotiations as per the requirements of the two.

3. MAP REDUCE

 By making use of distributed and parallel algorithms, MapReduce makes it possible to carry over the processing logic and helps write applications that transform big datasets into manageable ones.
 MapReduce uses two functions, i.e., Map() and Reduce(), whose tasks are:
1. Map() performs sorting and filtering of the data, thereby organizing it into groups. Map generates key-value pair based results, which are later processed by the Reduce() method.
2. Reduce(), as the name suggests, does the summarization by aggregating the mapped data. In simple terms, Reduce() takes the output generated by Map() as input and combines those tuples into a smaller set of tuples.

4. PIG:

Pig was developed by Yahoo. It works on Pig Latin, a query-based language similar to SQL.
It is a platform for structuring the data flow and for processing and analyzing huge datasets.
Pig does the work of executing commands, and in the background all the activities of MapReduce are taken care of. After processing, Pig stores the result in HDFS.
The Pig Latin language is specially designed for this framework and runs on Pig Runtime, just the way Java runs on the JVM.
Pig helps to achieve ease of programming and optimization and hence is a major segment of the Hadoop ecosystem.

5. HIVE:

With the help of an SQL methodology and interface, HIVE performs reading and writing of large datasets. Its query language is called HQL (Hive Query Language).
It is highly scalable, as it allows both real-time and batch processing. Also, all the SQL data types are supported by Hive, making query processing easier.
Similar to other query-processing frameworks, HIVE comes with two components: JDBC drivers and the HIVE command line.
JDBC, along with ODBC drivers, establishes the data storage permissions and connection, whereas the HIVE command line helps in processing queries.
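A minimal sketch of querying Hive through its JDBC driver is shown below; the HiveServer2 address, credentials, and table name are assumptions made for illustration.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // Ensure the Hive JDBC driver is registered (its jar must be on the classpath).
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        String url = "jdbc:hive2://localhost:10000/default";   // assumed HiveServer2 address
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT designation, AVG(salary) FROM employee GROUP BY designation")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getDouble(2));
            }
        }
    }
}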

6. Mahout:
Mahout provides machine learning capability to a system or application. Machine learning, as the name suggests, helps a system to develop itself based on patterns, user/environmental interaction, or algorithms.
It provides various libraries and functionalities such as collaborative filtering, clustering, and classification, which are concepts of machine learning. It allows invoking algorithms as per our need with the help of its own libraries.

7. Apache Spark:

It is a platform that handles all the process-intensive tasks such as batch processing, interactive or iterative real-time processing, graph conversions, and visualization.
It consumes in-memory resources, and hence is faster than the earlier approaches in terms of optimization.
Spark is best suited for real-time data, whereas Hadoop (MapReduce) is best suited for structured data or batch processing; hence both are used interchangeably in most companies.
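A small sketch of Spark's in-memory style of processing, using the Java Dataset API; the input path, column names, and local master setting are assumptions made for illustration.

import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkMaxTemperature {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("max-temperature")
                .master("local[*]")                    // assumed: run locally on all cores
                .getOrCreate();

        // cache() keeps the parsed data in memory, so repeated (iterative)
        // queries avoid re-reading it from disk.
        Dataset<Row> readings = spark.read()
                .option("header", "true")
                .csv("hdfs://namenode:9000/data/readings.csv")   // assumed path
                .withColumn("temperature", col("temperature").cast("int"))
                .cache();

        readings.groupBy("year").max("temperature").show();

        spark.stop();
    }
}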

8. Apache HBase:

It is a NoSQL database which supports all kinds of data and is thus capable of handling almost anything within a Hadoop database. It provides the capabilities of Google's BigTable, and is therefore able to work on big datasets effectively.
At times when we need to search for or retrieve a few small occurrences in a huge database, the request must be processed within a very short span of time. At such times, HBase comes in handy, as it gives us a fault-tolerant way of storing and quickly looking up limited amounts of data.
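A brief sketch of a point write and read with the HBase Java client API; the ZooKeeper quorum, table name, and column family are assumptions, and the table is assumed to already exist.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "zk1,zk2,zk3");   // assumed quorum
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("employee"))) {   // assumed table
            // Write one cell: row key "ajay", column family "info", qualifier "salary".
            Put put = new Put(Bytes.toBytes("ajay"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("salary"), Bytes.toBytes("45000"));
            table.put(put);

            // Read it back by row key: a fast point lookup rather than a full scan.
            Result result = table.get(new Get(Bytes.toBytes("ajay")));
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("info"), Bytes.toBytes("salary"))));
        }
    }
}
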
9. Other Components: Apart from all of these, there are some other components too
that carry out a huge task in order to make Hadoop capable of processing large
datasets. They are as follows:

1. Solr, Lucene: These are two services that perform the task of searching and indexing with the help of Java libraries. Lucene is a Java library that also provides a spell-check mechanism; Solr is built on top of Lucene.

2. Zookeeper: There was a huge issue with managing coordination and synchronization among the resources or components of Hadoop, which often resulted in inconsistency. Zookeeper overcame these problems by providing synchronization, inter-component communication, grouping, and maintenance.
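As an illustration of the kind of coordination ZooKeeper provides, the sketch below uses the ZooKeeper Java client to publish and read back a small piece of shared state; the connection string and znode path are assumptions made for the example.

import java.nio.charset.StandardCharsets;
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher.Event.KeeperState;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZooKeeperExample {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        // 5-second session timeout; the watcher releases the latch once connected.
        ZooKeeper zk = new ZooKeeper("zk1:2181", 5000, event -> {
            if (event.getState() == KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // Publish a small piece of shared configuration that other components can read or watch.
        String path = zk.create("/demo-config",
                "replication=3".getBytes(StandardCharsets.UTF_8),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        byte[] data = zk.getData(path, false, null);
        System.out.println(new String(data, StandardCharsets.UTF_8));
        zk.close();
    }
}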

3. Oozie: Oozie simply performs the task of a scheduler, scheduling jobs and binding them together as a single unit. There are two kinds of jobs, i.e., Oozie workflow jobs and Oozie coordinator jobs.

Oozie workflow jobs are those that need to be executed in a sequential order, whereas Oozie coordinator jobs are those that are triggered when some data or an external stimulus is given to them.
