
Big Data

Assignment – 1

Submitted by: Aman Mishra


Roll No. 2003610100007
Question 1: Explain Big Data problems and the importance of Apache Hadoop.
Answer: Big Data problems arise from handling enormous volumes of data that must be stored, processed, and analyzed across various data stores. Several major challenges come up while dealing with such data, and they need to be addressed with agility.
Some of the Big Data Problems include -
1. Finding and fixing data quality issues
2. Long response time from systems
3. Lack of understanding
4. High cost of data solutions
5. Security Gaps

Importance of Hadoop
Hadoop is a valuable technology for big data analytics for the reasons mentioned below:

• Stores and processes huge volumes of data at high speed. The data may be structured, semi-structured, or unstructured.
• Protects applications and data processing against hardware failures. Whenever a node goes down, processing is automatically redirected to other nodes so that applications keep running.
• Organizations can store raw data and process or filter it for specific analytic uses as and when required.
• Because Hadoop is scalable, organizations can handle more data simply by adding more nodes to the system.
• Supports real-time analytics for better operational decision-making as well as batch workloads for historical analysis.

Question 2: Explain the MapReduce technique with the help of examples.


Answer: MapReduce is a programming model used to perform distributed, parallel processing in a Hadoop cluster, which is what makes Hadoop so fast. When you are dealing with Big Data, serial processing is no longer practical. MapReduce has two main tasks, which are divided phase-wise:
• Map Task
• Reduce Task

Let us understand it with a real-world example that explains the MapReduce programming model in a story-like manner:

Suppose the Indian government has assigned you the task of counting the population of India. You can demand all the resources you want, but you have to finish the task in four months. Calculating the population of such a large country is not an easy task for a single person (you). So what will be your approach?
One way to solve this problem is to divide the country by states and assign an individual in charge of each state to count the population of that state. Each in-charge reports a state-wise count (the map phase), and these counts are then combined into the national total (the reduce phase).
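
As an illustrative sketch only (not prescribed by the assignment), the same idea can be expressed as a Hadoop MapReduce job in Java. It assumes a hypothetical input file in which each line holds a state name and a head count separated by a comma; the class names are made up for illustration. The map task emits (state, count) pairs and the reduce task sums them per state.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map task: parse one census record per line ("state,count") and emit (state, count).
public class PopulationMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] fields = line.toString().split(",");
        if (fields.length == 2) {
            context.write(new Text(fields[0].trim()),
                          new LongWritable(Long.parseLong(fields[1].trim())));
        }
    }
}

// Reduce task: sum all the counts received for one state and emit the state total.
class PopulationReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text state, Iterable<LongWritable> counts, Context context)
            throws IOException, InterruptedException {
        long total = 0;
        for (LongWritable count : counts) {
            total += count.get();
        }
        context.write(state, new LongWritable(total));
    }
}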

Question 3: Write a short note on Hadoop components.


Answer: There are four basic or core components:

Hadoop Common:
It is a set of common utilities and libraries that support the other Hadoop modules. It ensures that hardware failures are managed automatically by the Hadoop cluster.
Hadoop HDFS:
It is the Hadoop Distributed File System, which stores data in the form of blocks and distributes them across the cluster. Each block is replicated multiple times to ensure data availability. It has two daemons: the NameNode for the master node and the DataNode for the slave nodes.

NameNode and DataNode


HDFS has a master-slave architecture. The NameNode runs on the master server. It manages the namespace and regulates file access by clients. The DataNodes run on the slave nodes and store the actual data. Internally, a file gets split into a number of data blocks that are stored on a group of slave machines. The NameNode manages the modifications made to the file system namespace.

The NameNode also tracks the mapping of blocks to DataNodes. The DataNodes create, delete, and replicate blocks on demand from the NameNode.
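
As a minimal sketch (assuming the client's configuration can reach the cluster and that the HDFS path passed as the first argument exists; the class name is hypothetical), the block-to-DataNode mapping can be inspected through the standard FileSystem API:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Ask the NameNode which DataNodes hold each block of a given file.
public class BlockLocationsExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path(args[0]);                       // an existing HDFS file path
        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}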

Block in HDFS
A block is the smallest unit of storage in HDFS. In Hadoop, the default block size is 128 MB, and it is commonly configured to larger values such as 256 MB.

Replication Management
Replication is used to provide fault tolerance in HDFS. HDFS makes copies of each block and stores them on different DataNodes. The number of copies of a block that get stored is decided by the replication factor. The default value is 3, but we can configure it to any value.
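
As a small sketch (assuming a reachable cluster and an existing HDFS file; the class name is hypothetical), the replication factor can also be changed for an individual file through the FileSystem API, in addition to the cluster-wide dfs.replication setting:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Raise or lower the replication factor of one HDFS file (here: 2 copies instead of the default 3).
public class SetReplicationExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path(args[0]);                 // existing HDFS file
        boolean scheduled = fs.setReplication(file, (short) 2);
        System.out.println("Re-replication scheduled: " + scheduled);
        fs.close();
    }
}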

Rack Awareness
A rack contains many DataNode machines, and there are many such racks in production. The rack awareness algorithm places the replicas of blocks across racks in a distributed fashion, which provides low latency and fault tolerance.

Hadoop YARN:
It allocates cluster resources, which in turn allows different users to execute various applications without worrying about increased workloads.

Hadoop MapReduce:
It is the processing layer of Hadoop, which processes data in parallel across the cluster (explained in Question 2).

Question 4: Explain the benefits of the MapReduce technique.


Answer: The main advantages are given below:

1. Scalability
Hadoop is a highly scalable platform, largely because of its ability to store and distribute large data sets across many servers. The servers used are quite inexpensive and can operate in parallel. The processing power of the system can be improved simply by adding more servers. Traditional relational database management systems (RDBMS) were not able to scale to process such huge data sets.

2. Flexibility
The Hadoop MapReduce programming model offers business organizations the flexibility to process structured or unstructured data and to operate on different types of data, so they can generate business value from meaningful and useful data for analysis, irrespective of the data source, whether it be social media, clickstream, email, etc. Hadoop offers support for many languages used for data processing. Along with all this, Hadoop MapReduce programming enables many applications such as marketing analysis, recommendation systems, data warehousing, and fraud detection.

3. Security and Authentication


If an outsider gains access to all the data of an organization and can manipulate multiple petabytes of it, it can do much harm to the organization's business operations. The MapReduce programming model addresses this risk by working with HDFS and HBase, which provide security controls that allow only approved users to operate on the data stored in the system.

4. Cost-effective Solution
Such a system is highly scalable and is a very cost-effective solution for a business model that needs to store data growing exponentially in line with current-day requirements. With old traditional relational database management systems, it was not as easy to scale data processing as it is with the Hadoop system. In such cases, the business was forced to downsize the data and classify it based on assumptions about which data could be valuable to the organization, removing the raw data in the process. Here the Hadoop scale-out architecture with MapReduce programming comes to the rescue.

5. Fast
The Hadoop Distributed File System (HDFS) is a key feature of Hadoop that basically implements a mapping system to locate data in the cluster. MapReduce programming, the tool used for data processing, runs on the same servers where the data resides, allowing faster processing. Hadoop MapReduce processes large volumes of unstructured or semi-structured data in less time.

6. Simple Model of Programming


MapReduce programming is based on a very simple programming model, which allows programmers to develop MapReduce programs that can handle many tasks with ease and efficiency. The MapReduce programming model is written using the Java language, which is very popular and very easy to learn, so it is easy for people to learn Java programming and design a data processing model that meets their business needs. A minimal job driver is sketched below.
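
As a hedged illustration of this simplicity, the short driver below would submit the mapper and reducer sketched in Question 2 as a complete distributed job; PopulationMapper, PopulationReducer, and PopulationDriver are the hypothetical class names used there, not part of any standard library.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver: a handful of configuration calls is all that is needed to launch the job on the cluster.
public class PopulationDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "state population count");
        job.setJarByClass(PopulationDriver.class);
        job.setMapperClass(PopulationMapper.class);
        job.setReducerClass(PopulationReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory (must not exist yet)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}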

7. Parallel Processing
The programming model divides a job into independent tasks that can execute in parallel. This parallel processing makes it easier to work on each of the tasks separately, which helps run the program in much less time.

8. Availability and Resilient Nature


The Hadoop MapReduce programming model processes data by sending it to an individual node while also forwarding the same set of data to other nodes in the network. As a result, in case of failure of a particular node, the same copy of the data is still available on other nodes and can be used whenever required, ensuring the availability of the data.
In this way, Hadoop is fault-tolerant: it is able to quickly recognize a fault and apply an automatic recovery solution.

Many companies across the globe, such as Facebook and Yahoo, use MapReduce.
Question 5: Write down the important features of HDFS.
Answer: Important features of HDFS are -
1. Hadoop is Open Source
Hadoop is an open-source project, which means its source code is available free of cost for inspection, modification, and analysis, allowing enterprises to modify the code as per their requirements.

2. Hadoop cluster is Highly Scalable


A Hadoop cluster is scalable, meaning we can add any number of nodes (horizontal scaling) or increase the hardware capacity of the nodes (vertical scaling) to achieve high computation power. This provides horizontal as well as vertical scalability to the Hadoop framework.

3. Hadoop provides Fault Tolerance


Fault tolerance is the most important feature of Hadoop. HDFS in Hadoop 2 uses a replication
mechanism to provide fault tolerance.

It creates a replica of each block on different machines, depending on the replication factor (by default, it is 3). So if any machine in the cluster goes down, data can be accessed from the other machines containing a replica of the same data.

Hadoop 3 additionally supports erasure coding as an alternative to replication. Erasure coding provides the same level of fault tolerance with less space: the storage overhead is not more than 50%.

4. Hadoop provides High Availability


This feature of Hadoop ensures the high availability of the data, even in unfavorable conditions.

Due to the fault tolerance feature of Hadoop, if any of the DataNodes goes down, the data is
available to the user from different DataNodes containing a copy of the same data.

Also, a high-availability Hadoop cluster consists of two or more NameNodes (active and passive) running in a hot standby configuration. The active node is the NameNode that serves client requests, while the passive node is the standby that reads the edit-log modifications of the active NameNode and applies them to its own namespace.

If the active node fails, the passive node takes over its responsibility. Thus, even if a NameNode goes down, files remain available and accessible to users.

5. Hadoop is very Cost-Effective


Since a Hadoop cluster consists of inexpensive commodity-hardware nodes, it provides a cost-effective solution for storing and processing big data. Being an open-source product, Hadoop doesn't need any license.

6. Hadoop is Faster in Data Processing


Hadoop stores data in a distributed fashion, which allows data to be processed in parallel on a cluster of nodes. This gives the Hadoop framework lightning-fast processing capability.

7. Hadoop is based on Data Locality concept


Hadoop is popularly known for its data locality feature, which means moving the computation logic to the data rather than moving the data to the computation logic. This feature of Hadoop reduces bandwidth utilization in the system.

8. Hadoop provides Feasibility


Unlike traditional systems, Hadoop can process unstructured data, thus giving users the feasibility to analyze data of any format and size.

9. Hadoop is Easy to use


Hadoop is easy to use, as clients don't have to worry about distributed computing; the processing is handled by the framework itself. A minimal client read is sketched below.
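
As a minimal sketch of this ease of use (assuming a configured HDFS client and an existing file path; the class name is hypothetical), a client can read an HDFS file with the same few lines it would use for a local stream, while the framework handles block lookup and DataNode communication behind the scenes:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Read a file stored in HDFS line by line, as if it were a local file.
public class ReadFromHdfsExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(new Path(args[0]))))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}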

10. Hadoop ensures Data Reliability


In Hadoop, due to the replication of data in the cluster, data is stored reliably on the cluster machines despite machine failures.

The framework itself provides mechanisms such as the Block Scanner, Volume Scanner, Disk Checker, and Directory Scanner to ensure data reliability. If a machine goes down or data gets corrupted, the data is still stored reliably in the cluster and remains accessible from another machine containing a copy of it.
