Big Data Assignment 1
Importance of Hadoop
Hadoop is a valuable technology for big data analytics for the reasons mentioned below:
It stores and processes huge volumes of data at a fast rate, whether the data is structured, semi-structured, or unstructured.
It protects applications and data processing against hardware failures: whenever a node goes down, processing is automatically redirected to other nodes, so applications keep running.
Organizations can store raw data and process or filter it for specific analytic uses as and when required.
Because Hadoop is scalable, organizations can handle more data simply by adding more nodes to the system.
It supports real-time analytics to drive better operational decision-making, as well as batch workloads for historical analysis.
Let us understand this with a real-world example that explains the MapReduce programming model in a story-like manner:
Suppose the Indian government has assigned you the task of counting the population of India. You can demand all the resources you want, but you have to finish the task in 4 months. Calculating the population of such a large country is not an easy task for a single person (you). So what will be your approach?
One way to solve this problem is to divide the country by states and assign an individual in-charge to each state to count the population of that state. Each in-charge counts their state in parallel (the map step), and the state totals are then combined into the national figure (the reduce step), as the sketch below shows.
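Here is a minimal Python sketch of that data flow (the states, towns, and numbers are hypothetical, chosen only for illustration):

```python
from collections import defaultdict

# Hypothetical census records: (state, town, population).
records = [
    ("Kerala", "Kochi", 600_000),
    ("Kerala", "Thrissur", 320_000),
    ("Punjab", "Amritsar", 1_130_000),
    ("Punjab", "Ludhiana", 1_620_000),
]

# Map step: each state in-charge independently emits (state, count) pairs.
def map_phase(record):
    state, _town, population = record
    yield (state, population)

# Shuffle step: group the intermediate pairs by key (the state).
grouped = defaultdict(list)
for record in records:
    for state, count in map_phase(record):
        grouped[state].append(count)

# Reduce step: sum the counts per state, then combine the state totals.
state_totals = {state: sum(counts) for state, counts in grouped.items()}
print(state_totals)
print("National total:", sum(state_totals.values()))
```

In a real Hadoop job the map tasks would run on different nodes in parallel; this single-process sketch only shows the flow of data through the map, shuffle, and reduce steps.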
Hadoop Common:
It is a set of common utilities and libraries that support the other Hadoop modules. It ensures that hardware failures are managed automatically by the Hadoop cluster.
Hadoop HDFS:
It is the Hadoop Distributed File System, which stores data in the form of small blocks and distributes them across the cluster. Each block is replicated multiple times to ensure data availability. HDFS has two daemons: the NameNode on the master node and the DataNode on the slave nodes. The NameNode tracks the mapping of blocks to DataNodes, and the DataNodes create, delete, and replicate blocks on demand from the NameNode.
Block in HDFS
A block is the smallest unit of storage in HDFS. In Hadoop, the default block size is 128 MB (configurable, for example, to 256 MB).
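As a quick illustration of how a file maps to blocks (a sketch assuming the 128 MB default and an example 1 GB file):

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024      # default HDFS block size: 128 MB
file_size = 1 * 1024 * 1024 * 1024  # an example 1 GB file

# HDFS splits the file into fixed-size blocks; the last block of a file
# may be smaller than the block size.
num_blocks = math.ceil(file_size / BLOCK_SIZE)
print(num_blocks)  # -> 8
```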
Replication Management
The replication technique is used to provide fault tolerance in HDFS. HDFS makes copies of each block and stores them on different DataNodes. The number of copies stored is decided by the replication factor; the default value is 3, but it can be configured to any value.
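The replication factor directly multiplies the raw storage a file consumes. A small sketch of that cost, continuing the 1 GB example above and assuming the default factor of 3:

```python
import math

BLOCK_SIZE_MB = 128
file_size_mb = 1024          # the 1 GB example file from above
replication_factor = 3       # HDFS default

blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
total_block_copies = blocks * replication_factor
raw_storage_mb = file_size_mb * replication_factor

print(blocks, "blocks,", total_block_copies, "stored block copies")
print("Raw cluster storage used:", raw_storage_mb, "MB for", file_size_mb, "MB of data")
```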
Rack Awareness
A rack contains many DataNode machines, and there are many such racks in a production cluster. The rack awareness algorithm places the replicas of a block across racks in a distributed fashion, which provides low latency and fault tolerance.
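Here is a toy sketch of rack-aware replica placement. The topology and function are hypothetical; they only mimic the default HDFS policy of one replica on the writer's rack and two replicas on different nodes of one other rack:

```python
import random

# Hypothetical cluster topology: rack -> DataNodes.
topology = {
    "rack1": ["dn1", "dn2", "dn3"],
    "rack2": ["dn4", "dn5", "dn6"],
    "rack3": ["dn7", "dn8", "dn9"],
}

def place_replicas(writer_rack):
    # First replica on the writer's rack; the other two on two different
    # nodes of a single other rack, so one rack failure cannot lose a block.
    first = random.choice(topology[writer_rack])
    other_rack = random.choice([r for r in topology if r != writer_rack])
    second, third = random.sample(topology[other_rack], 2)
    return [first, second, third]

print(place_replicas("rack1"))
```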
Hadoop YARN:
It allocates cluster resources, which in turn allows different users to execute various applications without worrying about increased workloads.
Advantages of Hadoop MapReduce:
1. Scalability
Hadoop is a highly scalable platform, largely because of its ability to store and distribute large data sets across many servers. The servers used here are quite inexpensive and can operate in parallel, and the processing power of the system can be improved by adding more servers. Traditional relational database management systems (RDBMS) could not scale to process such huge data sets.
2. Flexibility
The Hadoop MapReduce programming model offers business organizations the flexibility to process structured or unstructured data and to operate on different types of data, generating business value from meaningful and useful data for analysis, irrespective of the data source, whether it be social media, clickstream, email, etc. Hadoop also offers support for many languages used for data processing. Along with all this, Hadoop MapReduce programming enables many applications such as marketing analysis, recommendation systems, data warehousing, and fraud detection.
3. Cost-effective Solution
Such a system is highly scalable and is a very cost-effective solution for a business model that needs to store data growing exponentially in line with current-day requirements. With older, traditional relational database management systems, it was not as easy to process data at scale as it is with Hadoop. In such cases, businesses were forced to downsize their data and classify it based on assumptions about which data would be valuable to the organization, discarding the raw data. Here the Hadoop scale-out architecture with MapReduce programming comes to the rescue.
4. Fast
The Hadoop Distributed File System (HDFS) is a key feature of Hadoop; it essentially implements a mapping system to locate data in the cluster. MapReduce, the tool used for data processing, runs on the same servers where the data is located, allowing faster processing. As a result, Hadoop MapReduce processes large volumes of unstructured or semi-structured data in less time.
5. Parallel Processing
The programming model divides a job into independent tasks that can execute in parallel. This parallel processing lets the workers take on the tasks simultaneously, which helps run the program in much less time, as the sketch below shows.
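A minimal sketch of the same idea on a single machine, using Python's multiprocessing module to run independent tasks in parallel:

```python
from multiprocessing import Pool

def square(n):
    # An independent task: it shares no state with the other tasks.
    return n * n

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        # map() splits the inputs across worker processes, much as
        # MapReduce runs independent map tasks on different nodes.
        print(pool.map(square, range(10)))
```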
Many companies across the globe, such as Facebook and Yahoo, use MapReduce.
Question 5: Write down the important features of HDFS.
Answer: Important features of HDFS are:
1. Hadoop is Open Source
Hadoop is an open-source project, which means its source code is available free of cost for inspection, modification, and analysis, allowing enterprises to modify the code as per their requirements.
2. Fault Tolerance
HDFS creates a replica of each block on different machines, depending on the replication factor (by default, it is 3). So if any machine in a cluster goes down, data can be accessed from the other machines containing a replica of the same data.
Hadoop 3 introduced erasure coding as an alternative to this replication mechanism. Erasure coding provides the same level of fault tolerance with less space: the storage overhead is not more than 50%.
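To see where the "not more than 50%" figure comes from, here is a small overhead comparison, assuming a Reed-Solomon RS(6,3) scheme (one of the schemes Hadoop 3 supports):

```python
data_blocks = 6

# 3x replication: every block is stored three times -> 200% overhead.
replication_total = data_blocks * 3
replication_overhead = (replication_total - data_blocks) / data_blocks

# RS(6,3): 3 parity blocks protect 6 data blocks -> 50% overhead,
# while still tolerating the loss of any 3 of the 9 blocks.
ec_total = data_blocks + 3
ec_overhead = (ec_total - data_blocks) / data_blocks

print(f"Replication: {replication_overhead:.0%}, Erasure coding: {ec_overhead:.0%}")
```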
Due to the fault tolerance feature of Hadoop, if any DataNode goes down, the data remains available to the user from other DataNodes containing a copy of the same data.
3. High Availability
A high-availability Hadoop cluster consists of two or more NameNodes (active and passive) running in a hot standby configuration. The active node serves client requests; the passive node is a standby that reads the edit-log modifications of the active NameNode and applies them to its own namespace. If the active node fails, the passive node takes over its responsibilities. Thus, even if a NameNode goes down, files remain available and accessible to users.
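A minimal sketch of the active/standby idea (the class and names are hypothetical, not Hadoop's actual API):

```python
class NameNode:
    def __init__(self, name):
        self.name = name
        self.namespace = {}              # file path -> list of block IDs

    def apply_edit(self, edit):
        path, blocks = edit
        self.namespace[path] = blocks

active, standby = NameNode("nn1"), NameNode("nn2")

# The standby continuously tails the active node's edit log and applies
# each modification to its own namespace.
edit_log = [("/data/a.txt", ["blk_1", "blk_2"]), ("/data/b.txt", ["blk_3"])]
for edit in edit_log:
    active.apply_edit(edit)
    standby.apply_edit(edit)

# On failure of the active node, the standby takes over with an
# up-to-date namespace, so no file metadata is lost.
active = standby
print(active.name, active.namespace)
```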
4. Data Reliability
The framework itself provides mechanisms to ensure data reliability, including the Block Scanner, Volume Scanner, Disk Checker, and Directory Scanner. If a machine goes down or data gets corrupted, the data is still stored reliably in the cluster and remains accessible from another machine containing a copy of it.