Unit 2 - Linux & Hadoop
Contents
I. Fundamentals of Linux
   1. File Management in Linux
   2. Memory Management in Linux
   3. Process Management in Linux
   4. Networking & Setting Up Clusters
      1. Networking Commands
      2. Cluster Setup in Linux
II. Introduction to Apache Hadoop: Definitions, History, and Versions
   1. Definitions of Apache Hadoop
   2. History of Apache Hadoop
   3. Hadoop Versions
   4. Hadoop Characteristics/Features
   5. Hadoop Ecosystem
   6. Basic Hadoop Commands
I. Fundamentals of Linux
1. File Management in Linux
Linux organizes files in a hierarchical directory structure, starting from the root directory (`/`).
Here are the key concepts and commands used for file management:
- Directory Structure:
- `/` : Root directory
- `/home` : User directories
- `/etc` : System configuration files
- `/usr` : User programs and data
- `/var` : Variable data files (logs, mail spools, etc.)
- File Permissions:
- `chmod <permissions> <file>` : Changes file permissions (read, write, execute).
- `chown <user>:<group> <file>` : Changes file owner and group.
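As a quick sketch of these commands in use (the file name, user, and group below are hypothetical):
```bash
# Show permissions and ownership of files in the current directory
ls -l

# Give the owner read/write/execute and everyone else read/execute (755)
chmod 755 script.sh

# Transfer ownership to user "hadoop" and group "hadoop" (usually needs root)
sudo chown hadoop:hadoop script.sh
```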
2. Memory Management in Linux
Memory management in Linux handles the allocation, swapping, and release of memory for
running processes. Here are essential commands for managing memory:
- `free`: Displays memory usage, including total, used, and available memory.
```bash
free -h
```
The `-h` option shows the output in a human-readable format (e.g., GB, MB).
- `top`: Displays real-time information about system processes, memory usage, and CPU load.
```bash
top
```
The `top` command updates every few seconds and provides an interactive interface for
managing processes.
- `vmstat`: Shows detailed statistics about processes, memory, paging, block I/O, traps, and
CPU activity.
```bash
vmstat 1
```
The argument `1` tells `vmstat` to print a new report every second until interrupted (e.g., with Ctrl+C).
3. Process Management in Linux
Processes are the core units of execution in an operating system. Linux provides several
commands to manage these processes.
- `ps`: Lists the running processes.
```bash
ps -e
```
- `top`: Similar to `ps`, but in real-time, showing CPU usage and memory consumption of active
processes.
- `kill <pid>`: Terminates a process by its process ID (PID). By default, `kill` sends the
`SIGTERM` signal for a graceful shutdown; `kill -9` sends `SIGKILL` for forced termination.
```bash
kill <pid>
kill -9 <pid>
```
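In practice, you first look up the PID before sending a signal. A small sketch, assuming a
hypothetical process named `myapp`:
```bash
# Find the PID(s) of a process by name
pgrep myapp

# Send SIGTERM to every matching PID; add -9 only if the process refuses to exit
kill $(pgrep myapp)
```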
Additional Resources
1. File System: Learn more about Linux file systems and structure in articles from sources like
*Linux.com* or *HowToGeek*.
2. Process Management: For an in-depth understanding of process management and
optimization, check the official Linux documentation.
3. Memory and Resource Management: *Linux Journal* and *Ubuntu’s official wiki* provide
detailed guides for memory and resource management in Linux.
4. Networking & Setting Up Clusters
Linux provides a robust set of tools for managing networking and setting up clusters, useful for
everything from simple local networking to complex distributed systems. Below is an overview of
key commands for networking and cluster management in Linux.
1. Networking Commands
- `ifconfig`: Displays or configures network interfaces.
```bash
ifconfig
```
This shows the current configuration of all network interfaces on your system. It's often
replaced by `ip` commands in newer Linux distributions.
- `ip`: A powerful utility for network configuration (modern replacement for `ifconfig`).
- To view network interfaces:
```bash
ip addr
```
- To bring an interface up or down:
```bash
ip link set eth0 up     # Bring the interface up
ip link set eth0 down   # Bring the interface down
```
- `ping`: Sends ICMP (Internet Control Message Protocol) Echo Requests to test connectivity.
```bash
ping 8.8.8.8
```
This can help troubleshoot network connectivity to external servers.
- `ss`: A utility to investigate sockets, a faster and more modern alternative to `netstat`.
```bash
ss -tuln
```
- `iptables`: A command line tool for configuring the Linux kernel firewall.
```bash
iptables -L
```
This lists the current firewall rules.
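Cluster nodes usually need stable, known addresses. A minimal sketch, assuming an interface
named `eth0` and an example private subnet (both are placeholders for your environment):
```bash
# Temporarily assign a static address (lost on reboot; make it permanent
# through your distribution's network configuration)
sudo ip addr add 192.168.1.10/24 dev eth0

# Verify the assignment and test reachability
ip addr show eth0
ping -c 3 192.168.1.10
```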
2. Cluster Setup in Linux
Setting up clusters in Linux involves various tools and commands to manage multiple machines
as a unified system. Below are some common approaches:
- `ssh` (Secure Shell): Essential for connecting to remote systems and the standard way to
access nodes in a cluster setup (a key-based login sketch follows this list).
```bash
ssh user@remote_host
```
- `scp`: Securely copies files between systems over SSH, useful for sharing configuration files
across nodes in a cluster.
```bash
scp file.txt user@remote_host:/path/to/destination
```
- `rsync`: A utility for syncing files and directories between local and remote systems.
```bash
rsync -avz /source/ user@remote_host:/destination/
```
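Hadoop-style clusters generally expect passwordless SSH from the control node to every worker,
so the tools above can run without interactive prompts. A minimal sketch, assuming a `hadoop`
user and hypothetical hostnames `node1` and `node2`:
```bash
# Generate a key pair once on the control node (default path, empty passphrase)
ssh-keygen -t rsa -b 4096

# Install the public key on each worker node
ssh-copy-id hadoop@node1
ssh-copy-id hadoop@node2

# Confirm that login now works without a password prompt
ssh hadoop@node1 hostname
```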
II. Introduction to Apache Hadoop: Definitions, History, and Versions
1. Definitions of Apache Hadoop
Apache Hadoop is a widely used open-source framework designed to process and store vast
amounts of data across distributed computing environments. It is built to scale up from a single
server to thousands of machines, each offering local computation and storage. Here's a deeper
look at Hadoop's definitions, history, and versions:
Key Features:
- Scalability: Hadoop scales to accommodate growing data and can handle petabytes of data.
- Fault tolerance: Data is replicated across nodes in HDFS, ensuring the system continues to
function even if one or more nodes fail.
- Cost-effective: Built to run on commodity hardware, Hadoop reduces the need for expensive
infrastructure.
- Parallel processing: By dividing the tasks among multiple machines, Hadoop allows efficient
parallel computation, significantly speeding up data processing.
2. History of Apache Hadoop
1. Origins:
Apache Hadoop was inspired by Google's papers on the Google File System (GFS) and MapReduce.
The core idea was to replicate the functionality of these systems in an open-source framework.
Hadoop was created by Doug Cutting and Mike Cafarella in 2005. The project was named after
Doug Cutting's child's toy elephant, symbolizing the system's large-scale data processing
capabilities.
2. Key Milestones:
- 2006: Hadoop became part of the Apache Software Foundation (ASF) as an open-source
project.
- 2008: The project grew in popularity as companies such as Yahoo! and Facebook began
using Hadoop for large-scale data analysis.
- 2011: The Hadoop ecosystem expanded with additional projects like HBase, Hive, and Pig.
- 2013: Hadoop’s adoption reached new heights as organizations saw its potential for big data
analytics and data warehousing.
3. Hadoop Versions
Over time, Apache Hadoop has gone through several major versions, each bringing
improvements and new features. Below is a summary of the key version lines:
- Hadoop 1.x: The original architecture, pairing HDFS with a MapReduce engine in which a single
JobTracker handled both resource management and job scheduling.
- Hadoop 2.x: Introduced YARN (Yet Another Resource Negotiator), separating resource
management from data processing so that engines other than MapReduce can run on the cluster;
it also added HDFS high-availability and federation features.
- Hadoop 3.x: Added HDFS erasure coding for more storage-efficient fault tolerance, support for
more than two NameNodes, and further YARN and resource-management improvements.
References:
- Shvets, A., & Shvets, S. (2021). *Apache Hadoop: The Essential Guide for Big Data
Processing*. Springer.
- White, T. (2012). *Hadoop: The Definitive Guide*. O'Reilly Media.
4. Hadoop Characteristics/Features
Apache Hadoop is a framework designed for processing and storing large datasets in a
distributed computing environment. It enables organizations to harness the power of "big data"
by providing a reliable, scalable, and cost-effective way to manage massive volumes of data.
Below are the key features and characteristics of Hadoop:
1. Scalability:
- Hadoop is designed to scale out by adding more commodity hardware nodes. The system
can efficiently manage petabytes of data across thousands of machines without significant
changes to the underlying infrastructure.
2. Fault Tolerance:
- HDFS (Hadoop Distributed File System) automatically replicates data across multiple nodes,
ensuring that if a node fails, the data can be retrieved from another node.
3. Cost-Effectiveness:
- Hadoop can run on commodity hardware, which significantly reduces the cost of data
storage and computation. It is an open-source solution, avoiding licensing fees associated with
traditional data management systems.
4. Flexibility:
- Hadoop is capable of processing both structured and unstructured data, including data from
diverse sources like logs, social media feeds, videos, and sensor data.
5. Parallel Processing:
- Hadoop leverages MapReduce, a programming model that divides tasks into smaller sub-
tasks and processes them in parallel across a distributed environment, significantly speeding up
the computation process.
6. Data Locality:
- Hadoop processes data where it is stored (locally on the cluster nodes), minimizing data
transfer time and enhancing performance (Shvets & Shvets, 2021).
7. High Availability:
- Through HDFS, Hadoop ensures data replication across multiple nodes, offering high
availability and reducing the risk of data loss.
8. Ecosystem Support:
- Hadoop is supported by a growing ecosystem of tools and frameworks that help with various
aspects of big data processing, such as storage, analysis, and real-time streaming.
5. Hadoop Ecosystem
The Hadoop ecosystem comprises a suite of tools and technologies that enhance the core
functionalities of Hadoop, enabling more efficient data management, processing, and analysis.
Some of the prominent components of the Hadoop ecosystem include:
1. HDFS (Hadoop Distributed File System):
- The storage layer of Hadoop. It splits files into large blocks, distributes them across the nodes
of the cluster, and replicates each block for fault tolerance.
2. MapReduce:
- A programming model for large-scale data processing. It divides tasks into smaller sub-tasks
and processes them in parallel across the cluster.
3. YARN (Yet Another Resource Negotiator):
- The resource management layer of Hadoop. It schedules jobs and allocates cluster resources
(CPU and memory) to running applications.
4. Hive:
- A data warehouse system built on top of Hadoop that allows for querying data using SQL-like
syntax. It abstracts the complexities of MapReduce.
5. Pig:
- A high-level platform for creating programs that run on Hadoop. Pig's language, Pig Latin,
simplifies the process of writing MapReduce programs.
6. HBase:
- A distributed and scalable database built on top of HDFS, which is designed to store
structured data. It is often used for real-time querying and random access to large datasets.
7. Spark:
- A fast, in-memory, distributed processing engine that is often used alongside Hadoop for
real-time analytics and machine learning tasks.
8. Oozie:
- A workflow scheduler system to manage Hadoop jobs, including complex job dependencies
and sequences.
9. ZooKeeper:
- A distributed coordination service that helps manage distributed applications and systems.
10. Flume:
- A service for collecting, aggregating, and moving large amounts of log data from multiple
sources to HDFS.
11. Sqoop:
- A tool designed for efficiently transferring bulk data between Hadoop and relational
databases.
12. Mahout:
- A machine learning library that runs on top of Hadoop for scalable machine learning
algorithms.
References:
- Gandomi, A., & Haider, M. (2015). *Beyond the Hype: Big Data Concepts, Methods, and
Analytics*. International Journal of Information Management, 35(2), 137-144.
- Shvets, A., & Shvets, S. (2021). *Apache Hadoop: The Essential Guide for Big Data
Processing*. Springer.
- White, T. (2012). *Hadoop: The Definitive Guide*. O'Reilly Media.
6. Basic Hadoop Commands
Hadoop commands are primarily used for managing and interacting with HDFS (Hadoop
Distributed File System) and for running MapReduce jobs. These commands are typically
executed through Hadoop's command-line interface (CLI) and help with tasks such as
managing files and directories and running jobs across the Hadoop cluster.
1. HDFS Commands
- `hdfs dfs -get`: Download a file from HDFS to the local file system.
```bash
hdfs dfs -get /user/hadoop/file.txt /localpath/
```
This downloads the file `file.txt` from HDFS to the local machine.
- `hdfs dfs -du`: Display the disk usage of a file or directory in HDFS.
```bash
hdfs dfs -du /user/hadoop/
```
This shows the space used by a directory and its contents.
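A few other frequently used HDFS commands, shown here with illustrative paths:
```bash
# Create a directory in HDFS (with parent directories, like mkdir -p)
hdfs dfs -mkdir -p /user/hadoop/input

# Upload a local file into HDFS
hdfs dfs -put localfile.txt /user/hadoop/input/

# List the contents of an HDFS directory
hdfs dfs -ls /user/hadoop/input
```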
2. MapReduce Commands
- `yarn jar`: Run a MapReduce job.
```bash
yarn jar /path/to/mapreduce/job.jar ClassName /input /output
```
This command submits a MapReduce job to the YARN ResourceManager. The `/input` is the
input directory in HDFS, and `/output` is where the results will be stored.
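As a concrete sketch, most Hadoop installations ship an examples JAR (its exact path and version
depend on your setup); the classic word-count job can be submitted roughly like this:
```bash
# Run the bundled WordCount example; adjust the JAR path to your installation
yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
    wordcount /user/hadoop/input /user/hadoop/output

# Read the results written to HDFS
hdfs dfs -cat /user/hadoop/output/part-r-00000
```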
3. Cluster Management Commands
- `hadoop namenode -format`: Format the HDFS NameNode (used when setting up a new
Hadoop cluster). On recent Hadoop versions this is typically invoked as `hdfs namenode -format`.
```bash
hadoop namenode -format
```
This command is used to format the HDFS namenode before starting the Hadoop cluster.
Warning: This will delete any existing data on the HDFS.
- `start-dfs.sh`: Start the Hadoop Distributed File System services (NameNode, DataNodes, and
the Secondary NameNode).
```bash
start-dfs.sh
```
This script starts the HDFS services on the cluster.
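Once HDFS is up, the YARN services are typically started the same way, and `jps` can confirm
which daemons are running:
```bash
# Start the YARN ResourceManager and NodeManager services
start-yarn.sh

# List the running Java daemons; a healthy single-node setup typically shows
# NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager
jps
```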
4. Admin Commands
- `hadoop job -status <job_id>`: Get the status of a specific MapReduce job.
```bash
hadoop job -status job_123456789
```
This command gives detailed information about the progress and status of a particular
MapReduce job.
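On YARN-based clusters (Hadoop 2 and later), the same information is usually retrieved through
the `yarn` CLI; the application ID below is hypothetical:
```bash
# List all running YARN applications
yarn application -list

# Show the status of a specific application
yarn application -status application_1234567890_0001
```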
References:
- Shvets, A., & Shvets, S. (2021). *Apache Hadoop: The Essential Guide for Big Data
Processing*. Springer.
- White, T. (2012). *Hadoop: The Definitive Guide*. O'Reilly Media.