HADOOP ECOSYSTEM, COMPONENTS, Loading, Getting Data From Hadoop


UNIT 2

What is Hadoop?
Apache Hadoop is an open source software framework used to develop data processing applications which
are executed in a distributed computing environment.

Applications built using Hadoop run on large data sets distributed across clusters of commodity computers. Commodity computers are cheap and widely available, and clustering many of them provides greater aggregate computational power at low cost.

Similar to data residing in the local file system of a personal computer, data in Hadoop resides in a distributed file system called the Hadoop Distributed File System (HDFS). The processing model is based on the ‘Data Locality’ concept, wherein computational logic is sent to the cluster nodes (servers) that hold the data. This computational logic is nothing but a compiled version of a program written in a high-level language such as Java, and such a program processes the data stored in HDFS.

Hadoop Ecosystem and Components


The Hadoop ecosystem comprises the Hadoop core together with a family of related components, which are described below.

Apache Hadoop consists of two sub-projects –

1. Hadoop MapReduce: MapReduce is a computational model and software framework for writing applications that run on Hadoop. These MapReduce programs are capable of processing enormous volumes of data in parallel on large clusters of compute nodes.
2. HDFS (Hadoop Distributed File System): HDFS takes care of the storage part of Hadoop
applications. MapReduce applications consume data from HDFS. HDFS creates multiple replicas
of data blocks and distributes them on compute nodes in a cluster. This distribution enables
reliable and extremely rapid computations.

Although Hadoop is best known for MapReduce and its distributed file system (HDFS), the term is also used for a family of related projects that fall under the umbrella of distributed computing and large-scale data processing. Other Hadoop-related projects at Apache include Hive, HBase, Mahout, Sqoop, Flume, and ZooKeeper.

HADOOP ECOSYSTEM:

The Hadoop ecosystem is a collection of open-source projects and tools that complement the Hadoop core components, extending Hadoop's capabilities and addressing various big data processing and
storage needs. The ecosystem components offer a wide range of functionalities, including data
processing, querying, data warehousing, real-time streaming, and more. Here are some key
components of the Hadoop ecosystem:

1. **Apache Hive:** A data warehousing and SQL-like query language tool that allows users to
perform ad-hoc querying and data analysis on large datasets stored in HDFS. Hive translates SQL-like
queries into MapReduce or Apache Tez jobs to process data efficiently.

2. **Apache Pig:** A high-level platform for creating MapReduce programs using a language called
Pig Latin. Pig simplifies the development of complex data transformations, making it easier for
developers to process large datasets.

3. **Apache HBase:** A distributed, scalable NoSQL database that provides real-time read/write
access to big data. HBase is built on top of HDFS and is suitable for applications that require low-
latency, random access to large datasets.

4. **Apache Spark:** An in-memory data processing engine that provides faster data processing
compared to traditional MapReduce. Spark supports batch processing, stream processing, machine
learning, and graph processing, making it a versatile tool for big data applications.

5. **Apache Sqoop:** A tool for efficiently transferring data between Hadoop and relational
databases like MySQL, Oracle, and others. Sqoop simplifies the process of importing and exporting
data to and from Hadoop.

6. **Apache Flume:** A distributed, reliable, and scalable service for efficiently collecting,
aggregating, and moving large amounts of log data and events into Hadoop for processing.

7. **Apache Kafka:** A distributed streaming platform that handles real-time data feeds and
provides a messaging system for building real-time data pipelines and streaming applications.

8. **Apache Oozie:** A workflow scheduler for managing and coordinating Hadoop jobs. Oozie
allows users to define complex workflows with dependencies between various Hadoop jobs and
other actions.

9. **Apache Mahout:** A library of machine learning algorithms optimized for scalable processing
on Hadoop. Mahout enables the development of recommendation systems, clustering, and
classification tasks.

10. **Apache ZooKeeper:** A centralized service for maintaining configuration information, synchronization, and coordination in distributed systems. ZooKeeper is crucial for ensuring the consistency and coordination of Hadoop cluster components.

11. **Apache Atlas:** A metadata management and governance framework for Hadoop. It provides
data lineage, security, and data classification features, helping organizations manage and secure
their data assets.

12. **Apache Ambari:** A management and monitoring platform for Apache Hadoop clusters.
Ambari simplifies cluster administration tasks and provides a web-based user interface for managing
Hadoop services.

These are just a few examples of the various components that make up the Hadoop ecosystem.
There are many other projects and tools, and the Hadoop ecosystem is continually evolving as new
technologies and innovations emerge to address different big data challenges. Organizations can
choose and combine these components to build comprehensive big data solutions tailored to their
specific needs.

Hadoop Architecture
High Level Hadoop Architecture
Hadoop has a master-slave architecture for data storage and distributed data processing: HDFS handles storage and MapReduce handles processing.

NameNode:

The NameNode maintains the metadata for every file and directory in the HDFS namespace.

DataNode:

A DataNode manages the storage attached to the node on which it runs and handles read and write requests for the data blocks stored there.

MasterNode:

The master node allows you to conduct parallel processing of data using Hadoop MapReduce.

Slave node:

The slave nodes are the additional machines in the Hadoop cluster that store the data and carry out the computations. Each slave node runs a DataNode and a Task Tracker, which synchronize with the NameNode and the Job Tracker respectively.

In Hadoop, the master and slave systems can be set up in the cloud or on-premises.

Features Of ‘Hadoop’
• Suitable for Big Data Analysis

As Big Data tends to be distributed and unstructured in nature, Hadoop clusters are well suited for its analysis. Since it is the processing logic (not the actual data) that flows to the computing nodes, less network bandwidth is consumed. This concept is called data locality, and it helps increase the efficiency of Hadoop-based applications.

• Scalability
Hadoop clusters can easily be scaled out by adding additional cluster nodes, which allows them to keep pace with the growth of Big Data. Also, scaling does not require modifications to application logic.

• Fault Tolerance

Hadoop replicates the input data onto other cluster nodes. That way, in the event of a cluster node failure, data processing can still proceed using the copy of the data stored on another cluster node.
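
For example, the replication factor of data that is already in HDFS can be changed per path from the command line; a minimal sketch (the path below is hypothetical, and `-w` waits until re-replication completes):

```

hdfs dfs -setrep -w 3 /data/important

```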

Network Topology In Hadoop


The topology (arrangement) of the network affects the performance of a Hadoop cluster as the cluster grows. In addition to performance, one also needs to care about high availability and failure handling. To achieve this, Hadoop cluster formation makes use of network topology.

Typically, network bandwidth is an important factor to consider when forming any network. However, as measuring bandwidth is difficult, Hadoop instead represents the network as a tree, and the distance between nodes of this tree (the number of hops) is treated as an important factor in forming the Hadoop cluster. Here, the distance between two nodes is equal to the sum of their distances to their closest common ancestor.
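
For example, under this tree model the distance between two processes on the same node is typically taken as 0, between nodes on the same rack as 2, between nodes on different racks of the same data center as 4, and between nodes in different data centers as 6.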

A Hadoop cluster consists of data centers, racks, and the nodes that actually execute jobs: a data center consists of racks, and a rack consists of nodes. The network bandwidth available to processes varies depending on where those processes run. That is, the available bandwidth decreases progressively in the following order:

• Processes on the same node
• Different nodes on the same rack
• Nodes on different racks of the same data center
• Nodes in different data centers

HOW TO LOAD DATA INTO HADOOP:

Loading data into Hadoop typically involves ingesting data from various sources and storing it in the
Hadoop Distributed File System (HDFS) or other Hadoop-compatible storage systems. Here are
several common methods to load data into Hadoop:

1. **Hadoop Command-Line Interface (CLI):**

You can use the Hadoop command-line interface to interact with HDFS and load data manually. Use
the `hdfs dfs` command to create directories, copy files, and upload data to HDFS. For example, to
copy a local file to HDFS, you can use the following command:

```

hdfs dfs -put /path/to/local/file /path/in/hdfs/

```

2. **Hadoop File Browser:**

Some Hadoop distributions offer web-based file browsers that allow you to upload data to HDFS
directly through a graphical user interface (GUI). You can use these tools to drag and drop files from
your local machine into HDFS.

3. **Hadoop Data Ingestion Tools:**

Various Hadoop data ingestion tools like Apache Flume and Apache Sqoop are specifically designed
to ingest data from external sources into Hadoop. Flume is used for streaming data, while Sqoop is
used for transferring data between Hadoop and relational databases.
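
For example, a minimal Sqoop import might look like the sketch below; the connection string, credentials, table name, and target directory are placeholders:

```

sqoop import \
  --connect jdbc:mysql://dbhost:3306/sales \
  --username etl_user -P \
  --table orders \
  --target-dir /data/raw/orders \
  --num-mappers 4

```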

4. **Hive External Tables:**

If your data already resides in a structured storage system, such as HBase or an external relational
database, you can create external tables in Apache Hive that point to that data. Hive allows you to
query this data directly without physically moving it into HDFS.
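
As a sketch, when the external data has already been exported as delimited files into a known HDFS directory, an external table can be declared over it from the command line; the table name, columns, and location below are hypothetical:

```

hive -e "CREATE EXTERNAL TABLE web_logs (ts STRING, url STRING, status INT)
         ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
         LOCATION '/data/raw/web_logs';"

```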

5. **Apache Spark and Apache Pig:**

Apache Spark and Apache Pig are high-level data processing languages that allow you to write data
transformation scripts. You can use these languages to load data from different sources, process it,
and save the results back to HDFS.
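
For example, a Spark batch job packaged as a JAR could be submitted to the cluster as sketched below; the class name, JAR file, and HDFS paths are placeholders:

```

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.etl.LoadEvents \
  load-events.jar \
  hdfs:///data/raw/events hdfs:///data/curated/events

```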

6. **Apache Kafka Connect:**

If you are dealing with real-time streaming data, Apache Kafka Connect can be used to ingest data
from Kafka topics and load it into Hadoop for further processing and analysis.

7. **Custom MapReduce or Spark Jobs:**

For more complex data loading scenarios, you can write custom MapReduce or Spark jobs that read
data from external sources and write it to HDFS or other storage systems compatible with Hadoop.
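
A custom MapReduce job, for instance, is usually packaged as a JAR and launched with `hadoop jar`; in the sketch below the JAR name, driver class, and input/output paths are hypothetical:

```

hadoop jar my-ingest-job.jar com.example.ingest.IngestDriver \
  /data/raw/input /data/processed/output

```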

8. **Third-Party Integration:**

Some commercial data integration tools also provide connectors or plugins for Hadoop, allowing you
to easily move data from various sources into Hadoop clusters.

The choice of the data loading method depends on the data sources, data formats, and the overall
data ingestion requirements of your specific use case. Whether you are dealing with batch
processing or real-time streaming data, Hadoop provides a flexible and scalable framework for
loading and processing data efficiently.

Point number 1 above refers to using the Hadoop Command-Line Interface (CLI) to interact with HDFS and load data into it. The Hadoop CLI provides a set of commands that allow users to manage and manipulate data stored in HDFS.

Here's a more detailed explanation of how to use the Hadoop CLI to load data into HDFS:

1. **Access Hadoop CLI:**

To use the Hadoop CLI, you need to have Hadoop installed on your machine. Once installed, open a
terminal or command prompt and navigate to the Hadoop installation directory.

2. **Check HDFS Status:**

Before loading data into HDFS, it's a good idea to check the status of HDFS to ensure it is running and accessible. The following command reports the cluster's capacity, the live and dead DataNodes, and the overall HDFS status:

```

hdfs dfsadmin -report

```

3. **Create HDFS Directories:**

In HDFS, data is typically organized in directories. You can create directories in HDFS using the `hdfs
dfs -mkdir` command. For example, to create a directory named "data" in the root of HDFS, use the
following command:
```

hdfs dfs -mkdir /data

```

4. **Copy Data to HDFS:**

The `hdfs dfs -put` command is used to copy data from the local file system to HDFS. For example, if
you have a file named "sample.txt" on your local machine and you want to copy it to the "data"
directory in HDFS, use the following command:

```

hdfs dfs -put /path/to/local/sample.txt /data/

```

5. **View HDFS Contents:**

You can use the `hdfs dfs -ls` command to list the contents of a directory in HDFS. For example, to
see the contents of the "data" directory, use the following command:

```

hdfs dfs -ls /data

```

6. **Copy Data from HDFS to Local File System:**

If you want to copy data from HDFS back to your local file system, you can use the `hdfs dfs -get`
command. For example, to copy the "sample.txt" file from the "data" directory in HDFS to your local
machine, use the following command:

```

hdfs dfs -get /data/sample.txt /path/to/local/

```

7. **Delete Data in HDFS:**

To remove data from HDFS, you can use the `hdfs dfs -rm` command. For example, to delete the
"sample.txt" file from the "data" directory in HDFS, use the following command:

```

hdfs dfs -rm /data/sample.txt

```
8. **Delete HDFS Directories:**

Similarly, you can delete directories in HDFS using the `hdfs dfs -rm` command with the `-r` option to
remove directories recursively. For example, to delete the "data" directory and all its contents, use
the following command:

```

hdfs dfs -rm -r /data

```

These are some of the basic commands you can use with the Hadoop CLI to load data into HDFS,
manage files and directories, and perform other file system operations. The Hadoop CLI provides a
straightforward way to interact with HDFS for simple data loading and management tasks, especially
during early stages of learning or experimentation with Hadoop.

Getting data from Hadoop:

To retrieve or access data from Hadoop, you can use various tools and methods depending on your
use case and data processing requirements. Here are some common approaches for getting data
from Hadoop:

1. **Hadoop Command-Line Interface (CLI):**

The Hadoop CLI provides commands to interact with Hadoop Distributed File System (HDFS) and
retrieve data from it. You can use the `hdfs dfs -get` command to copy files from HDFS to your local
file system. For example, to retrieve a file named "output.txt" from HDFS and save it to your local
directory "/path/to/local/," use the following command:

```

hdfs dfs -get /user/myuser/output.txt /path/to/local/

```

2. **Hive Query Language:**

If your data is stored in HDFS and managed using Apache Hive, you can use Hive's SQL-like query
language (HiveQL) to retrieve data. HiveQL allows you to perform ad-hoc querying and analysis on
the data stored in HDFS. For instance, to retrieve data from a Hive table, you can use a SELECT
query:

```

SELECT * FROM my_table;

```
3. **Apache Pig:**

Apache Pig is another data processing tool that simplifies the process of working with Hadoop data.
You can use Pig Latin, Pig's high-level language, to write data transformation scripts. Pig abstracts
the complexities of MapReduce jobs and allows you to focus on the data manipulation logic. To get
data from HDFS using Pig, you can load the data from HDFS into a Pig relation and then process it
using Pig Latin scripts.
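
A minimal sketch of that flow, assuming the records live under a hypothetical /data/events directory in HDFS and Pig runs in MapReduce mode:

```

cat > show_errors.pig <<'EOF'
-- Load tab-separated records from HDFS into a relation
events = LOAD '/data/events' USING PigStorage('\t')
         AS (ts:chararray, level:chararray, msg:chararray);
-- Keep only the error records and print them to the console
errors = FILTER events BY level == 'ERROR';
DUMP errors;
EOF
pig -x mapreduce show_errors.pig

```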

4. **Apache Spark:**

Apache Spark is a powerful data processing engine that can work with data stored in HDFS. You can
use Spark's APIs (e.g., DataFrame API, RDD API) or SQL-like queries (using Spark SQL) to read data
from HDFS, perform various data transformations, and analyze the data in a distributed manner.
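
One lightweight way to try this from a terminal is the spark-sql shell, which can query files in HDFS directly; the Parquet directory below is hypothetical:

```

spark-sql -e "SELECT COUNT(*) FROM parquet.\`hdfs:///data/events_parquet\`"

```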

5. **Third-Party Tools and Libraries:**

Many third-party tools and libraries support data retrieval from Hadoop. For example, if you want to
access HDFS data from a Java program, you can use the Hadoop Java API. Similarly, other
programming languages have Hadoop libraries and connectors that facilitate data access.

6. **Web-based File Browsers:**

Some Hadoop distributions come with web-based file browsers that allow users to browse and
download files from HDFS through a graphical user interface (GUI).

7. **Apache Drill:**

Apache Drill is a distributed SQL query engine that can directly query data stored in Hadoop,
including HDFS, without requiring a schema definition. It enables users to run SQL queries on
Hadoop data without the need for data transformation or pre-defined schema.

Remember that the specific method you choose to retrieve data from Hadoop depends on your use case, your familiarity with the tools, and the complexity of the data processing tasks. Each approach has its strengths and is suitable for different scenarios.
