BDA Experiment 1
Experiment No: 1
Theory:
Hadoop –
Hadoop is an open-source, Java-based framework that manages the storage and processing of large
amounts of data for applications. Hadoop uses distributed storage and parallel processing to handle big data
and analytics jobs, breaking workloads down into smaller tasks that can be run at the same time.
Four modules comprise the primary Hadoop framework and work collectively to form the Hadoop
ecosystem:
1. Hadoop Distributed File System (HDFS): As the primary storage component of the Hadoop ecosystem, HDFS
is a distributed file system in which individual Hadoop nodes operate on data that resides in their local
storage. This reduces network latency and provides high-throughput access to application data. In addition,
administrators do not need to define schemas up front.
2. Yet Another Resource Negotiator (YARN): YARN is a resource-management platform responsible for
managing compute resources in clusters and using them to schedule users’ applications. It performs
scheduling and resource allocation across the Hadoop system.
3. MapReduce: MapReduce is a programming model for large-scale data processing. In the MapReduce
model, subsets of a larger dataset, together with the instructions for processing them, are dispatched to
multiple nodes, where each subset is processed in parallel with the other subsets. After processing, the
results from the individual subsets are combined into a smaller, more manageable dataset (a sample run is
shown after this list).
4. Hadoop Common: Hadoop Common includes the libraries and utilities used and shared by other
Hadoop modules.
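
As an illustration of how HDFS, YARN, and MapReduce work together, the commands below stage a file in HDFS and submit the bundled word count example job, which YARN then schedules on the cluster. This is only a sketch: the examples jar path shown is the usual Cloudera/CDH location, and the directory and file names (words.txt, /user/cloudera/wordcount) are illustrative.

# Stage input data in HDFS
hdfs dfs -mkdir -p /user/cloudera/wordcount/input
hdfs dfs -put words.txt /user/cloudera/wordcount/input

# Submit the bundled word count MapReduce job (jar location varies by distribution)
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount \
  /user/cloudera/wordcount/input /user/cloudera/wordcount/output

# List the applications currently scheduled by YARN
yarn application -list

# Display the combined output written by the reducers
hdfs dfs -cat /user/cloudera/wordcount/output/part-r-00000
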
Beyond HDFS, YARN, and MapReduce, the entire Hadoop open-source ecosystem continues to grow and
includes many tools and applications that help collect, store, process, analyse, and manage big data. These
include Apache Pig, Apache Hive, Apache HBase, Apache Spark, Presto, and Apache Zeppelin.
5. Pig: A high-level scripting platform whose language, Pig Latin, expresses data transformations that are
compiled into MapReduce jobs.
6. Hive: A data warehouse layer that provides an SQL-like query language (HiveQL) over data stored in HDFS.
7. HBase: A distributed, column-oriented NoSQL database built on top of HDFS for low-latency random
reads and writes on large tables.
8. Flume: A service designed to efficiently collect, aggregate, and move large volumes of log and event
data into Hadoop. It is widely used for streaming data ingestion.
9. Oozie: A workflow scheduling system that orchestrates Hadoop jobs and ensures they are executed in
the correct sequence. It supports time-based scheduling and chaining of tasks.
10. Zookeeper: A distributed coordination service that provides centralized management for configuration,
synchronization, and group services in distributed systems. It ensures high availability.
11. Mahout: A library of scalable machine learning algorithms built to run on Hadoop using MapReduce.
It supports tasks like clustering, classification, and collaborative filtering.
12. Spark: A fast, in-memory data processing engine for both batch and real-time data. It supports a wide
range of tasks, from data processing to machine learning, with APIs in multiple languages.
13. Kafka: A distributed messaging platform designed for handling real-time data streams. It is highly
scalable and used for building real-time pipelines and streaming applications.
14. Flink: A stream processing framework for real-time and batch data analytics. It offers low-latency
processing and fault-tolerant distributed computation.
15. Ambari: A web-based management tool for provisioning, managing, and monitoring Hadoop clusters.
It provides an intuitive interface for configuring and monitoring Hadoop services and cluster health.
Hadoop Installation:
Step 1: Open Oracle VirtualBox.
Step 4: Allocate at least 4096 MB of base memory and a minimum of 2 CPUs, then click Next.
Step 5: Select the option "Use an Existing Virtual Hard Disk File", click the browse icon, select the
downloaded .vmdk file, click Open, and then click Create.
Select the virtual machine and click Start. Wait for the configuration to complete.
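
Once the virtual machine has booted, the Hadoop installation can be verified from the terminal. The commands below are a minimal check, assuming a Cloudera QuickStart-style VM in which the Hadoop daemons start automatically:

# Print the installed Hadoop version
hadoop version

# List the running Hadoop daemon processes (NameNode, DataNode, ResourceManager, NodeManager, ...)
sudo jps

# Report HDFS capacity, usage, and the status of each DataNode
hdfs dfsadmin -report
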
Hadoop Commands:
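
The following is a representative set of HDFS shell commands covering the basic file operations summarised in the conclusion; the paths and file names (/user/cloudera/input, sample.txt) are illustrative.

# List the contents of an HDFS directory
hdfs dfs -ls /user/cloudera

# Create a directory in HDFS
hdfs dfs -mkdir /user/cloudera/input

# Copy a file from the local file system into HDFS
hdfs dfs -put sample.txt /user/cloudera/input

# Display the contents of a file stored in HDFS
hdfs dfs -cat /user/cloudera/input/sample.txt

# Copy a file within HDFS
hdfs dfs -cp /user/cloudera/input/sample.txt /user/cloudera/input/sample_copy.txt

# Move (rename) a file within HDFS
hdfs dfs -mv /user/cloudera/input/sample.txt /user/cloudera/input/data.txt

# Copy a file from HDFS back to the local file system
hdfs dfs -get /user/cloudera/input/data.txt .

# Delete a file from HDFS
hdfs dfs -rm /user/cloudera/input/data.txt

# Delete a directory and its contents
hdfs dfs -rm -r /user/cloudera/input
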
Conclusion:
In this experiment, we successfully installed the Cloudera Hadoop virtual machine, worked with HDFS, and
performed basic file operations such as copying, moving, displaying, and deleting files. HDFS is designed to
handle large datasets across multiple machines, providing fault tolerance and high availability through
replication and its distributed architecture.