
Unit 2: Linux & Hadoop Fundamentals

Contents
I. Fundamentals of Linux
   1. File Management in Linux
   2. Memory Management in Linux
   3. Process Management in Linux
   4. Networking & Setting Up Clusters
      - Networking Commands
      - Cluster Setup in Linux
II. Introduction to Apache Hadoop: Definitions, History, and Versions
   1. Definitions of Apache Hadoop
   2. History of Apache Hadoop
   3. Hadoop Versions
   4. Hadoop Characteristics/Features
   5. Hadoop Ecosystem
   6. Basic Hadoop Commands

I. Fundamentals of Linux

Basics of Linux: Files, Memory Management, and Process Management Commands


Linux is a powerful, open-source operating system widely used for its flexibility, efficiency, and
ease of management, especially in server environments. Below are the basics of file handling,
memory management, and process management commands.

1. File Management in Linux

Linux organizes files in a hierarchical directory structure, starting from the root directory (`/`).
Here are key concepts and commands used for file management:

- Directory Structure:
- `/` : Root directory
- `/home` : User directories
- `/etc` : System configuration files
- `/usr` : User programs and data
- `/var` : Variable data files (logs, mail spools, etc.)

- Basic File Operations:


- `ls` : Lists files and directories in a directory.
- `cd <dir>` : Changes the current directory to `<dir>`.
- `cp <source> <destination>` : Copies a file or directory.
- `mv <source> <destination>` : Moves or renames a file or directory.
- `rm <file>` : Removes a file.
- `touch <file>` : Creates an empty file or updates the file's timestamp.
- `cat <file>` : Displays the contents of a file.
- `find <dir> -name <file>` : Searches for a file by name in a directory and its subdirectories.

- File Permissions:
- `chmod <permissions> <file>` : Changes file permissions (read, write, execute).
- `chown <user>:<group> <file>` : Changes file owner and group.
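
For example, a minimal sketch of changing permissions and ownership (the file names and the `hadoop` user/group are hypothetical):
```bash
chmod 644 report.txt            # owner: read/write; group and others: read-only
chmod u+x backup.sh             # add execute permission for the owner
chown hadoop:hadoop report.txt  # change owner and group (usually requires root)
```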

2. Memory Management in Linux

Memory management in Linux handles the allocation, swapping, and release of memory to
running processes. Here are essential commands for managing memory:
- `free`: Displays memory usage, including total, used, and available memory.
```bash
free -h
```
The `-h` option shows the output in a human-readable format (e.g., GB, MB).

- `top`: Displays real-time information about system processes, memory usage, and CPU load.
```bash
top
```
The `top` command updates every few seconds and provides an interactive interface for
managing processes.
- `vmstat`: Shows detailed statistics about processes, memory, paging, block IO, traps, and
CPU activity.
```bash
vmstat 1
```
The trailing `1` refreshes the statistics every second; press `Ctrl+C` to stop.

- `ps`: Displays information about running processes.


```bash
ps aux
```
The `ps aux` command shows a detailed list of processes, including their memory and CPU
usage.
- Swapping: Linux uses swap space when physical memory is full. Swap can be configured as a
separate partition or file.
- `swapon -s` : Displays active swap devices.
- `swapoff <device>` : Disables swapping on a device.
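
A minimal sketch of adding swap as a file (assuming a 1 GB swap file at `/swapfile`; run as root):
```bash
fallocate -l 1G /swapfile   # reserve 1 GB for the swap file
chmod 600 /swapfile         # restrict access to root only
mkswap /swapfile            # format the file as swap space
swapon /swapfile            # enable it; verify with `swapon -s` or `free -h`
```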

3. Process Management in Linux

Processes are the core units of execution in an operating system. Linux provides several
commands to manage these processes.
- `ps`: Lists the running processes.
```bash
ps -e
```
- `top`: Similar to `ps`, but in real-time, showing CPU usage and memory consumption of active
processes.
- `kill <pid>`: Terminates a process by specifying its process ID (PID). For graceful shutdown,
you can use the `SIGTERM` signal, or `SIGKILL` for forced termination.
```bash
kill <pid>
kill -9 <pid>
```

- `killall <process_name>`: Kills all processes with a specific name.


```bash
killall firefox
```
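
In practice a process is often located by name before being signalled; a short sketch (the process name `myapp` is hypothetical):
```bash
pgrep -f myapp      # look up the PID(s) of the process by name
pkill -f myapp      # send SIGTERM to all matching processes
sleep 5
pkill -9 -f myapp   # force-kill (SIGKILL) anything still running
```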

Additional Resources

1. File System: Learn more about Linux file systems and structure in articles from sources like
*Linux.com* or *HowToGeek*.
2. Process Management: For an in-depth understanding of process management and
optimization, check the official Linux documentation.
3. Memory and Resource Management: *Linux Journal* and *Ubuntu’s official wiki* provide
detailed guides for memory and resource management in Linux.

4. Networking & Setting Up Clusters

Linux provides a robust set of tools for managing networking and setting up clusters, useful for
everything from simple local networking to complex distributed systems. Below is an overview of
key commands for networking and cluster management in Linux.
1. Networking Commands
- `ifconfig`: Displays or configures network interfaces.
```bash
ifconfig
```
This shows the current configuration of all network interfaces on your system. It's often
replaced by `ip` commands in newer Linux distributions.

- `ip`: A powerful utility for network configuration (modern replacement for `ifconfig`).
- To view network interfaces:
```bash
ip addr
```
- To bring an interface up or down:
```bash
ip link set eth0 up     # bring the interface up
ip link set eth0 down   # bring the interface down
```

- `ping`: Sends ICMP (Internet Control Message Protocol) Echo Requests to test connectivity.
```bash
ping 8.8.8.8
```
This can help troubleshoot network connectivity to external servers.

- `netstat`: Displays network connections, routing tables, and interface statistics.


```bash
netstat -tuln
```
This command shows active network connections, listening ports, and related information.

- `ss`: A utility to investigate sockets, a faster and more modern alternative to `netstat`.
```bash
ss -tuln
```

- `traceroute`: Traces the route packets take to a network host.


```bash
traceroute google.com
```

- `curl` and `wget`: Tools for transferring data over HTTP/HTTPS.


```bash
curl http://example.com
wget http://example.com
```

- `iptables`: A command line tool for configuring the Linux kernel firewall.
```bash
iptables -L
```
This lists the current firewall rules.
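
As an illustrative sketch (the exact rules depend on your firewall policy), a rule allowing inbound SSH could be added like this:
```bash
iptables -A INPUT -p tcp --dport 22 -j ACCEPT   # allow incoming SSH traffic
iptables -L -n --line-numbers                   # review the rules with line numbers
```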

2. Cluster Setup in Linux

Setting up clusters in Linux involves various tools and commands to manage multiple machines
as a unified system. Below are some common approaches:
- `ssh` (Secure Shell): Essential for connecting to remote systems. Often used in cluster setups
to access nodes.
```bash
ssh user@remote_host
```

- `scp`: Securely copies files between systems over SSH, useful for sharing configuration files
across nodes in a cluster.
```bash
scp file.txt user@remote_host:/path/to/destination
```

- `rsync`: A utility for syncing files and directories between local and remote systems.
```bash
rsync -avz /source/ user@remote_host:/destination/
```
The `-a` flag preserves permissions and timestamps (archive mode), `-v` is verbose, and `-z` compresses data during transfer.

- Passwordless SSH: most cluster setups require configuring passwordless SSH login between the nodes:


- Generate SSH keys:
```bash
ssh-keygen
```
- Copy the public key to each node:
```bash
ssh-copy-id user@remote_node
```
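
A minimal sketch for distributing the key to several nodes (the hostnames `node1`-`node3` and the `hadoop` user are hypothetical):
```bash
for node in node1 node2 node3; do
    ssh-copy-id hadoop@"$node"   # copy the public key to each node
done
ssh hadoop@node1 hostname        # verify that passwordless login works
```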
Additional Resources
- Linux Networking: [DigitalOcean Linux Networking Overview](https://www.digitalocean.com/community/tutorials)
- Linux Clustering: [Linux Clustering with Pacemaker](https://clusterlabs.org/pacemaker/)
- Kubernetes: [Kubernetes Documentation](https://kubernetes.io/docs/)

II. Introduction to Apache Hadoop: Definitions, History, and Versions

Apache Hadoop is a widely used open-source framework designed to process and store vast
amounts of data across distributed computing environments. It is built to scale up from a single
server to thousands of machines, each offering local computation and storage. Here's a deeper
look into Hadoop's definitions, history, and versions:

1. Definitions of Apache Hadoop

1. What is Apache Hadoop?


Apache Hadoop is a software framework for storing and processing large datasets in a
distributed manner. It is designed to work with massive amounts of data by splitting it across
many nodes and processing them in parallel. Hadoop's core components include:
- HDFS (Hadoop Distributed File System): A distributed file system that stores data across
multiple machines.
- MapReduce: A programming model used for processing large datasets by dividing the task
into smaller sub-tasks and executing them in parallel.
- YARN (Yet Another Resource Negotiator): A resource management layer that manages the
allocation of resources across the Hadoop cluster.
- Hadoop Common: A set of utilities that supports the other Hadoop modules.

2. Key Features:
- Scalability: Hadoop scales to accommodate growing data and can handle petabytes of data.
- Fault tolerance: Data is replicated across nodes in HDFS, ensuring the system continues to
function even if one or more nodes fail.
- Cost-effective: Built to run on commodity hardware, Hadoop reduces the need for expensive
infrastructure.
- Parallel processing: By dividing the tasks among multiple machines, Hadoop allows efficient
parallel computation, significantly speeding up data processing.

2. History of Apache Hadoop

1. Origins:
Apache Hadoop grew out of Google's published papers on the Google File System (GFS) and MapReduce; the core idea was to replicate the functionality of these systems in an open-source framework. Hadoop was created by Doug Cutting and Mike Cafarella in 2005, originally as part of the Apache Nutch web-crawler project, and was named after Doug Cutting's son's toy elephant.

2. Key Milestones:
- 2006: Hadoop was split out of Nutch and became an Apache Software Foundation (ASF) subproject; Yahoo! hired Doug Cutting and invested heavily in its development.
- 2008: Hadoop became a top-level Apache project, and companies such as Yahoo! and Facebook began using it for large-scale data analysis.
- 2011: Hadoop 1.0 was released; by this time the ecosystem had expanded with projects like HBase, Hive, and Pig.
- 2013: Hadoop 2 (introducing YARN) reached general availability, and adoption reached new heights as organizations saw Hadoop's potential for big data analytics and data warehousing.

3. Hadoop Versions

Over time, Apache Hadoop has gone through several major versions, each bringing
improvements and new features. Below is a summary of key Hadoop versions:

1. Hadoop 1.x (Initial releases):


- Release: Hadoop 1.0 was released in 2011, consolidating the earlier 0.x releases that date back to 2006.
- Key Features:
- Basic functionality of HDFS and MapReduce.
- YARN did not exist yet, so the MapReduce JobTracker handled both resource management and job scheduling.
- Limitations: It lacked the scalability and flexibility needed to manage resources effectively in large clusters.

2. Hadoop 2.x (Introduction of YARN):


- Release: 2013.
- Key Features:
- YARN: Introduced as a resource management layer to separate job scheduling from
resource management.
- Improved scalability and flexibility, allowing for a wider range of workloads beyond
MapReduce, including interactive querying and stream processing.
- Support for additional programming models beyond MapReduce, such as Apache Tez and
Apache Spark.
- Hadoop 2.0 included several critical improvements that addressed the limitations of Hadoop
1.x, enabling it to scale more efficiently and run diverse workloads.

3. Hadoop 3.x (Latest Stable Version):


- Release: 2017.
- Key Features:
- HDFS Erasure Coding: A more space-efficient way of storing redundant data than the traditional replication method (see the short `hdfs ec` sketch after this list).
- YARN Improvements: Better resource management and support for multiple resource
managers.
- Support for GPUs: Enabling accelerated data processing through GPU support.
- HDFS Federation enhancements: Router-based federation builds on the namespace federation introduced in Hadoop 2, enhancing scalability by allowing multiple namespaces in a single cluster.
- Key Changes:
- Significant improvements in performance, scalability, and storage efficiency.
- Features aimed at improving the management of large clusters and large datasets.
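
As the erasure-coding sketch referenced above (assuming a Hadoop 3 installation; the path and policy name are examples):
```bash
hdfs ec -listPolicies                                                 # show the available erasure-coding policies
hdfs ec -setPolicy -path /user/hadoop/cold-data -policy RS-6-3-1024k  # store this directory as Reed-Solomon 6+3
hdfs ec -getPolicy -path /user/hadoop/cold-data                       # confirm the policy in effect
```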

References:
- Shvets, A., & Shvets, S. (2021). *Apache Hadoop: The Essential Guide for Big Data
Processing*. Springer.
- White, T. (2012). *Hadoop: The Definitive Guide*. O'Reilly Media.

4. Hadoop Characteristics/Features

Apache Hadoop is a framework designed for processing and storing large datasets in a
distributed computing environment. It enables organizations to harness the power of "big data"
by providing a reliable, scalable, and cost-effective way to manage massive volumes of data.
Below are the key features and characteristics of Hadoop:

1. Scalability:
- Hadoop is designed to scale out by adding more commodity hardware nodes. The system
can efficiently manage petabytes of data across thousands of machines without significant
changes to the underlying infrastructure.

2. Fault Tolerance:
- HDFS (Hadoop Distributed File System) automatically replicates data across multiple nodes, ensuring that if a node fails, the data can be retrieved from another node (a short replication sketch follows this list).

3. Cost-Effectiveness:
- Hadoop can run on commodity hardware, which significantly reduces the cost of data
storage and computation. It is an open-source solution, avoiding licensing fees associated with
traditional data management systems.

4. Flexibility:
- Hadoop is capable of processing both structured and unstructured data, including data from
diverse sources like logs, social media feeds, videos, and sensor data.

5. Parallel Processing:
- Hadoop leverages MapReduce, a programming model that divides tasks into smaller sub-
tasks and processes them in parallel across a distributed environment, significantly speeding up
the computation process.

6. Data Locality:
- Hadoop processes data where it is stored (locally on the cluster nodes), minimizing the data
transfer time and enhancing performance.
- Source: Shvets & Shvets (2021), *Apache Hadoop: The Essential Guide for Big Data
Processing*.

7. High Availability:
- Through HDFS, Hadoop ensures data replication across multiple nodes, offering high
availability and reducing the risk of data loss.

8. Ecosystem Support:
- Hadoop is supported by a growing ecosystem of tools and frameworks that help with various
aspects of big data processing, such as storage, analysis, and real-time streaming.
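
As the replication sketch referenced above (hypothetical paths; the default replication factor in HDFS is 3):
```bash
hdfs dfs -setrep -w 3 /user/hadoop/data               # set a replication factor of 3 and wait until it is applied
hdfs dfs -stat "%r %n" /user/hadoop/data/part-00000   # print the replication factor of a file
```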

5. Hadoop Ecosystem

The Hadoop ecosystem comprises a suite of tools and technologies that enhance the core
functionalities of Hadoop, enabling more efficient data management, processing, and analysis.
Some of the prominent components of the Hadoop ecosystem include:

1. HDFS (Hadoop Distributed File System):


- The foundational storage system of Hadoop, designed to store vast amounts of data across
multiple machines with fault tolerance.

2. MapReduce:
- A programming model for large-scale data processing. It divides tasks into smaller sub-tasks
and processes them in parallel across the cluster.

3. YARN (Yet Another Resource Negotiator):


- The resource management layer that allocates resources to different jobs and ensures job
scheduling in the Hadoop ecosystem.

4. Hive:
- A data warehouse system built on top of Hadoop that allows for querying data using SQL-like
syntax. It abstracts the complexities of MapReduce.

5. Pig:
- A high-level platform for creating programs that run on Hadoop. Pig's language, Pig Latin,
simplifies the process of writing MapReduce programs.

6. HBase:
- A distributed and scalable database built on top of HDFS, which is designed to store
structured data. It is often used for real-time querying and random access to large datasets.

7. Spark:
- A fast, in-memory, distributed processing engine that is often used alongside Hadoop for
real-time analytics and machine learning tasks.
8. Oozie:
- A workflow scheduler system to manage Hadoop jobs, including complex job dependencies
and sequences.

9. ZooKeeper:
- A distributed coordination service that helps manage distributed applications and systems.

10. Flume:
- A service for collecting, aggregating, and moving large amounts of log data from multiple
sources to HDFS.

11. Sqoop:
- A tool designed for efficiently transferring bulk data between Hadoop and relational
databases.

12. Mahout:
- A machine learning library that runs on top of Hadoop for scalable machine learning
algorithms.

References:
- Gandomi, A., & Haider, M. (2015). *Beyond the Hype: Big Data Concepts, Methods, and
Analytics*. International Journal of Information Management, 35(2), 137-144.
- Shvets, A., & Shvets, S. (2021). *Apache Hadoop: The Essential Guide for Big Data
Processing*. Springer.
- White, T. (2012). *Hadoop: The Definitive Guide*. O'Reilly Media.

6. Basic Hadoop Commands

Hadoop commands are primarily used for managing and interacting with HDFS (Hadoop
Distributed File System) and for running MapReduce jobs. These commands are typically
executed through the command line interface (CLI) of Hadoop, and they help with tasks such as
managing files, directories, and running jobs across the Hadoop cluster.

Here is a list of some commonly used basic Hadoop commands:

1. HDFS Commands

- `hdfs dfs -ls`: List files and directories in HDFS.


```bash
hdfs dfs -ls /user/hadoop/
```
This command lists the contents of the specified directory in HDFS, similar to the `ls`
command in Linux.
- `hdfs dfs -put`: Upload a file from the local file system to HDFS.
```bash
hdfs dfs -put localfile.txt /user/hadoop/
```
This copies the file `localfile.txt` from the local filesystem into the specified HDFS directory.

- `hdfs dfs -get`: Download a file from HDFS to the local file system.
```bash
hdfs dfs -get /user/hadoop/file.txt /localpath/
```
This downloads the file `file.txt` from HDFS to the local machine.

- `hdfs dfs -mkdir`: Create a directory in HDFS.


```bash
hdfs dfs -mkdir /user/hadoop/newdir
```
This creates a new directory named `newdir` inside `/user/hadoop/` on HDFS.

- `hdfs dfs -rm`: Remove a file or directory from HDFS.


```bash
hdfs dfs -rm /user/hadoop/file.txt
```
This deletes the specified file from HDFS. To delete a directory, use `-r` for recursive deletion.

- `hdfs dfs -cat`: Display the contents of a file in HDFS.


```bash
hdfs dfs -cat /user/hadoop/file.txt
```
This prints the contents of `file.txt` stored in HDFS to the console.

- `hdfs dfs -cp`: Copy a file or directory within HDFS.


```bash
hdfs dfs -cp /user/hadoop/file1.txt /user/hadoop/file2.txt
```
This command copies a file from one location in HDFS to another.

- `hdfs dfs -du`: Display the disk usage of a file or directory in HDFS.
```bash
hdfs dfs -du /user/hadoop/
```
This shows the space used by a directory and its contents.
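
In practice the `-s` (summary) and `-h` (human-readable) flags are often combined:
```bash
hdfs dfs -du -s -h /user/hadoop/   # total space used by the directory, in human-readable units
```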

2. MapReduce Commands
- `yarn jar`: Run a MapReduce job.
```bash
yarn jar /path/to/mapreduce/job.jar ClassName /input /output
```
This command submits a MapReduce job to the YARN ResourceManager. The `/input` is the
input directory in HDFS, and `/output` is where the results will be stored.

- `hadoop jar`: Run a MapReduce job using a jar file.


```bash
hadoop jar myMapReduce.jar input_dir output_dir
```
Similar to the `yarn jar` command; `hadoop jar` is the traditional entry point, and on a YARN-based cluster both submit the job to the ResourceManager.

- `yarn application -status`: Check the status of a running application.


```bash
yarn application -status <Application_ID>
```
This command provides details about the status of a MapReduce job or other YARN-based
applications.

- `mapred job -list`: List all running jobs.


```bash
mapred job -list
```
This lists the MapReduce jobs that are currently running; use `mapred job -list all` to include completed jobs.
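
MapReduce jobs do not have to be written in Java. As a sketch, the Hadoop Streaming jar (its exact path varies by installation) lets ordinary shell commands act as mapper and reducer:
```bash
hadoop jar "$HADOOP_HOME"/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input  /user/hadoop/input \
    -output /user/hadoop/streaming-output \
    -mapper /bin/cat \
    -reducer /usr/bin/wc
```
Here the identity mapper passes lines through unchanged and `wc` counts lines, words, and characters per reducer.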

3. Cluster Management Commands

- `hadoop fsck`: Perform a file system check on HDFS.


```bash
hadoop fsck /user/hadoop/
```
This checks the health of the HDFS filesystem and reports any issues such as missing blocks or corrupted files. On recent versions, `hdfs fsck` is the preferred form of the command.

- `hadoop namenode -format`: Format the HDFS namenode (used when setting up a new
Hadoop cluster).
```bash
hadoop namenode -format
```
This command is used to format the HDFS NameNode before starting a Hadoop cluster for the first time (on recent versions, `hdfs namenode -format` is the preferred form).
Warning: This will delete any existing data on HDFS.
- `start-dfs.sh`: Start the Hadoop Distributed File System services (NameNode, DataNodes, and
Secondary NameNode).
```bash
start-dfs.sh
```
This script starts the HDFS services on the cluster.

- `stop-dfs.sh`: Stop the Hadoop Distributed FileSystem services.


```bash
stop-dfs.sh
```
This script stops the HDFS services.

- `start-yarn.sh`: Start the YARN ResourceManager and NodeManager services.


```bash
start-yarn.sh
```
This starts the YARN services on the cluster.

- `stop-yarn.sh`: Stop the YARN services.


```bash
stop-yarn.sh
```
This stops the YARN services.

4. Admin Commands

- `hdfs dfsadmin -report`: Show HDFS cluster status.


```bash
hdfs dfsadmin -report
```
This command provides a report on the health of the HDFS cluster, including information about
the datanodes and the overall space usage.

- `hadoop version`: Display the current version of Hadoop.


```bash
hadoop version
```
This shows the version of Hadoop installed on your system.

5. Monitoring and Debugging

- `jps`: List all running Java processes in the Hadoop cluster.


```bash
jps
```
This command shows a list of all Java processes running on the current machine. It helps in
debugging and checking if Hadoop daemons (like `NameNode`, `DataNode`,
`ResourceManager`, etc.) are running.

- `hadoop job -status <job_id>`: Get the status of a specific MapReduce job.
```bash
hadoop job -status job_123456789
```
This command gives detailed information about the progress and status of a particular MapReduce job. Note that `hadoop job` is deprecated on recent versions; `mapred job -status <job_id>` is the current equivalent.

References:
- Shvets, A., & Shvets, S. (2021). *Apache Hadoop: The Essential Guide for Big Data
Processing*. Springer.
- White, T. (2012). *Hadoop: The Definitive Guide*. O'Reilly Media.
