Big Data Assignment 1
Importance of Hadoop
Hadoop is a valuable technology for big data analytics for the reasons mentioned below:
It stores and processes huge volumes of data at a fast rate, whether the data is structured, semi-structured, or unstructured.
It protects applications and data processing against hardware failures: whenever a node goes down, processing is automatically redirected to other nodes, so applications keep running.
Organizations can store raw data and process or filter it for specific analytic uses as and when required.
Because Hadoop is scalable, organizations can handle more data simply by adding more nodes to the system.
It supports real-time analytics to drive better operational decision-making, as well as batch workloads for historical analysis.
Let us understand this with a real-world example that explains the MapReduce programming model in a story-like manner:
Suppose the Indian government has assigned you the task of counting the population of India. You can demand all the resources you want, but you have to finish the task in 4 months. Calculating the population of such a large country is not an easy task for a single person (you). So what will be your approach?
One way to solve this problem is to divide the country by states and assign an individual in-charge to each state to count the population of that state. Each in-charge counts their state in parallel (the map step), and the state totals are then combined into the national figure (the reduce step), as the sketch below shows.
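Here is a minimal Python sketch of that data flow (the states, towns, and numbers are hypothetical, chosen only for illustration):

```python
from collections import defaultdict

# Hypothetical census records: (state, town, population).
records = [
    ("Kerala", "Kochi", 600_000),
    ("Kerala", "Thrissur", 320_000),
    ("Punjab", "Amritsar", 1_130_000),
    ("Punjab", "Ludhiana", 1_620_000),
]

# Map step: each state in-charge independently emits (state, count) pairs.
def map_phase(record):
    state, _town, population = record
    yield (state, population)

# Shuffle step: group the intermediate pairs by key (the state).
grouped = defaultdict(list)
for record in records:
    for state, count in map_phase(record):
        grouped[state].append(count)

# Reduce step: sum the counts per state, then combine the state totals.
state_totals = {state: sum(counts) for state, counts in grouped.items()}
print(state_totals)
print("National total:", sum(state_totals.values()))
```

In a real Hadoop job the map tasks would run on different nodes in parallel; this single-process sketch only shows the flow of data through the map, shuffle, and reduce steps.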
Hadoop Common:
It is a set of common utilities and libraries that support the other Hadoop modules. It ensures that hardware failures are managed automatically by the Hadoop cluster.
Hadoop HDFS:
It is the Hadoop Distributed File System, which stores data in the form of small blocks and distributes them across the cluster. Each block is replicated multiple times to ensure data availability. HDFS has two daemons: the NameNode on the master node and the DataNode on the slave nodes. The NameNode tracks the mapping of blocks to DataNodes, and the DataNodes create, delete, and replicate blocks on demand from the NameNode.
Block in HDFS
A block is the smallest unit of storage in HDFS. In Hadoop, the default block size is 128 MB (configurable, for example, to 256 MB).
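As a quick illustration of how a file maps to blocks (a sketch assuming the 128 MB default and an example 1 GB file):

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024      # default HDFS block size: 128 MB
file_size = 1 * 1024 * 1024 * 1024  # an example 1 GB file

# HDFS splits the file into fixed-size blocks; the last block of a file
# may be smaller than the block size.
num_blocks = math.ceil(file_size / BLOCK_SIZE)
print(num_blocks)  # -> 8
```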
Replication Management
The replication technique is used to provide fault tolerance in HDFS. HDFS makes copies of each block and stores them on different DataNodes. The number of copies stored is decided by the replication factor; the default value is 3, but it can be configured to any value.
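The replication factor directly multiplies the raw storage a file consumes. A small sketch of that cost, continuing the 1 GB example above and assuming the default factor of 3:

```python
import math

BLOCK_SIZE_MB = 128
file_size_mb = 1024          # the 1 GB example file from above
replication_factor = 3       # HDFS default

blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
total_block_copies = blocks * replication_factor
raw_storage_mb = file_size_mb * replication_factor

print(blocks, "blocks,", total_block_copies, "stored block copies")
print("Raw cluster storage used:", raw_storage_mb, "MB for", file_size_mb, "MB of data")
```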
Rack Awareness
A rack contains many DataNode machines, and there are many such racks in a production cluster. The rack awareness algorithm places the replicas of a block across racks in a distributed fashion, which provides low latency and fault tolerance.
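Here is a toy sketch of rack-aware replica placement. The topology and function are hypothetical; they only mimic the default HDFS policy of one replica on the writer's rack and two replicas on different nodes of one other rack:

```python
import random

# Hypothetical cluster topology: rack -> DataNodes.
topology = {
    "rack1": ["dn1", "dn2", "dn3"],
    "rack2": ["dn4", "dn5", "dn6"],
    "rack3": ["dn7", "dn8", "dn9"],
}

def place_replicas(writer_rack):
    # First replica on the writer's rack; the other two on two different
    # nodes of a single other rack, so one rack failure cannot lose a block.
    first = random.choice(topology[writer_rack])
    other_rack = random.choice([r for r in topology if r != writer_rack])
    second, third = random.sample(topology[other_rack], 2)
    return [first, second, third]

print(place_replicas("rack1"))
```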
Hadoop YARN:
It allocates cluster resources, which in turn allows different users to execute various applications without worrying about increased workloads.
Advantages of Hadoop MapReduce:
1. Scalability
Hadoop is a highly scalable platform, largely because of its ability to store and distribute large data sets across many servers. The servers used here are quite inexpensive and can operate in parallel, and the processing power of the system can be improved by adding more servers. Traditional relational database management systems (RDBMS) could not scale to process such huge data sets.
2. Flexibility
The Hadoop MapReduce programming model offers business organizations the flexibility to process structured or unstructured data and to operate on different types of data, generating business value from meaningful and useful data for analysis, irrespective of the data source, whether it be social media, clickstream, email, etc. Hadoop also offers support for many languages used for data processing. Along with all this, Hadoop MapReduce programming enables many applications such as marketing analysis, recommendation systems, data warehousing, and fraud detection.
3. Cost-effective Solution
Such a system is highly scalable and is a very cost-effective solution for a business model that needs to store data growing exponentially in line with current-day requirements. With older, traditional relational database management systems, it was not as easy to process data at scale as it is with Hadoop. In such cases, businesses were forced to downsize their data and classify it based on assumptions about which data would be valuable to the organization, discarding the raw data. Here the Hadoop scale-out architecture with MapReduce programming comes to the rescue.
4. Fast
The Hadoop Distributed File System (HDFS) is a key feature of Hadoop; it essentially implements a mapping system to locate data in the cluster. MapReduce, the tool used for data processing, runs on the same servers where the data is located, allowing faster processing. As a result, Hadoop MapReduce processes large volumes of unstructured or semi-structured data in less time.
5. Parallel Processing
The programming model divides a job into independent tasks that can execute in parallel. This parallel processing lets the workers take on the tasks simultaneously, which helps run the program in much less time, as the sketch below shows.
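A minimal sketch of the same idea on a single machine, using Python's multiprocessing module to run independent tasks in parallel:

```python
from multiprocessing import Pool

def square(n):
    # An independent task: it shares no state with the other tasks.
    return n * n

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        # map() splits the inputs across worker processes, much as
        # MapReduce runs independent map tasks on different nodes.
        print(pool.map(square, range(10)))
```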
Many companies across the globe, such as Facebook and Yahoo, use MapReduce.
Question 5: Write down the important features of HDFS.
Answer: Important features of HDFS are:
1. Hadoop is Open Source
Hadoop is an open-source project, which means its source code is available free of cost for inspection, modification, and analysis, allowing enterprises to modify the code as per their requirements.
2. Fault Tolerance
HDFS creates a replica of each block on different machines, depending on the replication factor (by default, it is 3). So if any machine in a cluster goes down, data can be accessed from the other machines containing a replica of the same data.
Hadoop 3 introduced erasure coding as an alternative to this replication mechanism. Erasure coding provides the same level of fault tolerance with less space: the storage overhead is not more than 50%.
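To see where the "not more than 50%" figure comes from, here is a small overhead comparison, assuming a Reed-Solomon RS(6,3) scheme (one of the schemes Hadoop 3 supports):

```python
data_blocks = 6

# 3x replication: every block is stored three times -> 200% overhead.
replication_total = data_blocks * 3
replication_overhead = (replication_total - data_blocks) / data_blocks

# RS(6,3): 3 parity blocks protect 6 data blocks -> 50% overhead,
# while still tolerating the loss of any 3 of the 9 blocks.
ec_total = data_blocks + 3
ec_overhead = (ec_total - data_blocks) / data_blocks

print(f"Replication: {replication_overhead:.0%}, Erasure coding: {ec_overhead:.0%}")
```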
Due to the fault tolerance feature of Hadoop, if any DataNode goes down, the data remains available to the user from other DataNodes containing a copy of the same data.
3. High Availability
A high-availability Hadoop cluster consists of two or more NameNodes (active and passive) running in a hot standby configuration. The active node serves client requests; the passive node is a standby that reads the edit-log modifications of the active NameNode and applies them to its own namespace. If the active node fails, the passive node takes over its responsibilities. Thus, even if a NameNode goes down, files remain available and accessible to users.
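A minimal sketch of the active/standby idea (the class and names are hypothetical, not Hadoop's actual API):

```python
class NameNode:
    def __init__(self, name):
        self.name = name
        self.namespace = {}              # file path -> list of block IDs

    def apply_edit(self, edit):
        path, blocks = edit
        self.namespace[path] = blocks

active, standby = NameNode("nn1"), NameNode("nn2")

# The standby continuously tails the active node's edit log and applies
# each modification to its own namespace.
edit_log = [("/data/a.txt", ["blk_1", "blk_2"]), ("/data/b.txt", ["blk_3"])]
for edit in edit_log:
    active.apply_edit(edit)
    standby.apply_edit(edit)

# On failure of the active node, the standby takes over with an
# up-to-date namespace, so no file metadata is lost.
active = standby
print(active.name, active.namespace)
```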
4. Data Reliability
The framework itself provides mechanisms to ensure data reliability, including the Block Scanner, Volume Scanner, Disk Checker, and Directory Scanner. If a machine goes down or data gets corrupted, the data is still stored reliably in the cluster and remains accessible from another machine containing a copy of it.