Unit 2 - Linux & Hadoop
Contents
I. Fundamentals of Linux
   1. File Management in Linux
   2. Memory Management in Linux
   3. Process Management in Linux
   4. Networking & Setting Up Clusters
      1. Networking Commands
      2. Cluster Setup in Linux
II. Introduction to Apache Hadoop: Definitions, History, and Versions
   1. Definitions of Apache Hadoop
   2. History of Apache Hadoop
   3. Hadoop Versions
   4. Hadoop Characteristics/Features
   5. Hadoop Ecosystem
   6. Basic Hadoop Commands
I. Fundamentals of Linux
1. File Management in Linux
Linux organizes files in a hierarchical directory structure, starting from the root directory (`/`).
Here are the key concepts and commands used for file management:
- Directory Structure:
- `/` : Root directory
- `/home` : User directories
- `/etc` : System configuration files
- `/usr` : User programs and data
- `/var` : Variable data files (logs, mail spools, etc.)
- File Permissions:
- `chmod <permissions> <file>` : Changes file permissions (read, write, execute).
- `chown <user>:<group> <file>` : Changes file owner and group.
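As a quick sketch of these commands in use (the file name, user, and group below are hypothetical):
```bash
# Show permissions and ownership of files in the current directory
ls -l

# Give the owner read/write/execute and everyone else read/execute (755)
chmod 755 script.sh

# Transfer ownership to user "hadoop" and group "hadoop" (usually needs root)
sudo chown hadoop:hadoop script.sh
```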
2. Memory Management in Linux
Memory management in Linux handles the allocation, swapping, and release of memory for
running processes. Here are essential commands for managing memory:
- `free`: Displays memory usage, including total, used, and available memory.
```bash
free -h
```
The `-h` option shows the output in a human-readable format (e.g., GB, MB).
- `top`: Displays real-time information about system processes, memory usage, and CPU load.
```bash
top
```
The `top` command updates every few seconds and provides an interactive interface for
managing processes.
- `vmstat`: Shows detailed statistics about processes, memory, paging, block I/O, traps, and
CPU activity.
```bash
vmstat 1
```
The argument `1` tells `vmstat` to print a new report every second until interrupted (e.g., with Ctrl+C).
3. Process Management in Linux
Processes are the core units of execution in an operating system. Linux provides several
commands to manage these processes.
- `ps`: Lists the running processes.
```bash
ps -e
```
- `top`: Similar to `ps`, but in real-time, showing CPU usage and memory consumption of active
processes.
- `kill <pid>`: Terminates a process by its process ID (PID). By default, `kill` sends the
`SIGTERM` signal for a graceful shutdown; `kill -9` sends `SIGKILL` for forced termination.
```bash
kill <pid>
kill -9 <pid>
```
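In practice, you first look up the PID before sending a signal. A small sketch, assuming a
hypothetical process named `myapp`:
```bash
# Find the PID(s) of a process by name
pgrep myapp

# Send SIGTERM to every matching PID; add -9 only if the process refuses to exit
kill $(pgrep myapp)
```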
Additional Resources
1. File System: Learn more about Linux file systems and structure in articles from sources like
*Linux.com* or *HowToGeek*.
2. Process Management: For an in-depth understanding of process management and
optimization, check the official Linux documentation.
3. Memory and Resource Management: *Linux Journal* and *Ubuntu’s official wiki* provide
detailed guides for memory and resource management in Linux.
4. Networking & Setting Up Clusters
Linux provides a robust set of tools for managing networking and setting up clusters, useful for
everything from simple local networking to complex distributed systems. Below is an overview of
key commands for networking and cluster management in Linux.
1. Networking Commands
- `ifconfig`: Displays or configures network interfaces.
```bash
ifconfig
```
This shows the current configuration of all network interfaces on your system. It's often
replaced by `ip` commands in newer Linux distributions.
- `ip`: A powerful utility for network configuration (modern replacement for `ifconfig`).
- To view network interfaces:
```bash
ip addr
```
- To bring an interface up or down:
```bash
ip link set eth0 up     # Bring the interface up
ip link set eth0 down   # Bring the interface down
```
- `ping`: Sends ICMP (Internet Control Message Protocol) Echo Requests to test connectivity.
```bash
ping 8.8.8.8
```
This can help troubleshoot network connectivity to external servers.
- `ss`: A utility to investigate sockets, a faster and more modern alternative to `netstat`.
```bash
ss -tuln
```
- `iptables`: A command line tool for configuring the Linux kernel firewall.
```bash
iptables -L
```
This lists the current firewall rules.
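Cluster nodes usually need stable, known addresses. A minimal sketch, assuming an interface
named `eth0` and an example private subnet (both are placeholders for your environment):
```bash
# Temporarily assign a static address (lost on reboot; make it permanent
# through your distribution's network configuration)
sudo ip addr add 192.168.1.10/24 dev eth0

# Verify the assignment and test reachability
ip addr show eth0
ping -c 3 192.168.1.10
```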
2. Cluster Setup in Linux
Setting up clusters in Linux involves various tools and commands to manage multiple machines
as a unified system. Below are some common approaches:
- `ssh` (Secure Shell): Essential for connecting to remote systems and the standard way to
access nodes in a cluster setup (a key-based login sketch follows this list).
```bash
ssh user@remote_host
```
- `scp`: Securely copies files between systems over SSH, useful for sharing configuration files
across nodes in a cluster.
```bash
scp file.txt user@remote_host:/path/to/destination
```
- `rsync`: A utility for syncing files and directories between local and remote systems.
```bash
rsync -avz /source/ user@remote_host:/destination/
```
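Hadoop-style clusters generally expect passwordless SSH from the control node to every worker,
so the tools above can run without interactive prompts. A minimal sketch, assuming a `hadoop`
user and hypothetical hostnames `node1` and `node2`:
```bash
# Generate a key pair once on the control node (default path, empty passphrase)
ssh-keygen -t rsa -b 4096

# Install the public key on each worker node
ssh-copy-id hadoop@node1
ssh-copy-id hadoop@node2

# Confirm that login now works without a password prompt
ssh hadoop@node1 hostname
```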
II. Introduction to Apache Hadoop: Definitions, History, and Versions
1. Definitions of Apache Hadoop
Apache Hadoop is a widely used open-source framework designed to process and store vast
amounts of data across distributed computing environments. It is built to scale up from a single
server to thousands of machines, each offering local computation and storage. Here's a deeper
look at Hadoop's definitions, history, and versions:
Key Features:
- Scalability: Hadoop scales to accommodate growing data and can handle petabytes of data.
- Fault tolerance: Data is replicated across nodes in HDFS, ensuring the system continues to
function even if one or more nodes fail.
- Cost-effective: Built to run on commodity hardware, Hadoop reduces the need for expensive
infrastructure.
- Parallel processing: By dividing the tasks among multiple machines, Hadoop allows efficient
parallel computation, significantly speeding up data processing.
2. History of Apache Hadoop
1. Origins:
Apache Hadoop was inspired by Google's papers on the Google File System (GFS) and MapReduce.
The core idea was to replicate the functionality of these systems in an open-source framework.
Hadoop was created by Doug Cutting and Mike Cafarella in 2005. The project was named after
Doug Cutting's child's toy elephant, symbolizing the system's large-scale data processing
capabilities.
2. Key Milestones:
- 2006: Hadoop became part of the Apache Software Foundation (ASF) as an open-source
project.
- 2008: The project grew in popularity as companies such as Yahoo! and Facebook began
using Hadoop for large-scale data analysis.
- 2011: The Hadoop ecosystem expanded with additional projects like HBase, Hive, and Pig.
- 2013: Hadoop’s adoption reached new heights as organizations saw its potential for big data
analytics and data warehousing.
3. Hadoop Versions
Over time, Apache Hadoop has gone through several major versions, each bringing
improvements and new features. Below is a summary of the key version lines:
- Hadoop 1.x: The original architecture, pairing HDFS with a MapReduce engine in which a single
JobTracker handled both resource management and job scheduling.
- Hadoop 2.x: Introduced YARN (Yet Another Resource Negotiator), separating resource
management from data processing so that engines other than MapReduce can run on the cluster;
it also added HDFS high-availability and federation features.
- Hadoop 3.x: Added HDFS erasure coding for more storage-efficient fault tolerance, support for
more than two NameNodes, and further YARN and resource-management improvements.
References:
- Shvets, A., & Shvets, S. (2021). *Apache Hadoop: The Essential Guide for Big Data
Processing*. Springer.
- White, T. (2012). *Hadoop: The Definitive Guide*. O'Reilly Media.
4. Hadoop Characteristics/Features
Apache Hadoop is a framework designed for processing and storing large datasets in a
distributed computing environment. It enables organizations to harness the power of "big data"
by providing a reliable, scalable, and cost-effective way to manage massive volumes of data.
Below are the key features and characteristics of Hadoop:
1. Scalability:
- Hadoop is designed to scale out by adding more commodity hardware nodes. The system
can efficiently manage petabytes of data across thousands of machines without significant
changes to the underlying infrastructure.
2. Fault Tolerance:
- HDFS (Hadoop Distributed File System) automatically replicates data across multiple nodes,
ensuring that if a node fails, the data can be retrieved from another node.
3. Cost-Effectiveness:
- Hadoop can run on commodity hardware, which significantly reduces the cost of data
storage and computation. It is an open-source solution, avoiding licensing fees associated with
traditional data management systems.
4. Flexibility:
- Hadoop is capable of processing both structured and unstructured data, including data from
diverse sources like logs, social media feeds, videos, and sensor data.
5. Parallel Processing:
- Hadoop leverages MapReduce, a programming model that divides tasks into smaller sub-
tasks and processes them in parallel across a distributed environment, significantly speeding up
the computation process.
6. Data Locality:
- Hadoop processes data where it is stored (locally on the cluster nodes), minimizing data
transfer time and enhancing performance (Shvets & Shvets, 2021).
7. High Availability:
- Through HDFS, Hadoop ensures data replication across multiple nodes, offering high
availability and reducing the risk of data loss.
8. Ecosystem Support:
- Hadoop is supported by a growing ecosystem of tools and frameworks that help with various
aspects of big data processing, such as storage, analysis, and real-time streaming.
5. Hadoop Ecosystem
The Hadoop ecosystem comprises a suite of tools and technologies that enhance the core
functionalities of Hadoop, enabling more efficient data management, processing, and analysis.
Some of the prominent components of the Hadoop ecosystem include:
1. HDFS (Hadoop Distributed File System):
- The storage layer of Hadoop. It splits files into large blocks, distributes them across the nodes
of the cluster, and replicates each block for fault tolerance.
2. MapReduce:
- A programming model for large-scale data processing. It divides tasks into smaller sub-tasks
and processes them in parallel across the cluster.
3. YARN (Yet Another Resource Negotiator):
- The resource management layer of Hadoop. It schedules jobs and allocates cluster resources
(CPU and memory) to running applications.
4. Hive:
- A data warehouse system built on top of Hadoop that allows for querying data using SQL-like
syntax. It abstracts the complexities of MapReduce.
5. Pig:
- A high-level platform for creating programs that run on Hadoop. Pig's language, Pig Latin,
simplifies the process of writing MapReduce programs.
6. HBase:
- A distributed and scalable database built on top of HDFS, which is designed to store
structured data. It is often used for real-time querying and random access to large datasets.
7. Spark:
- A fast, in-memory, distributed processing engine that is often used alongside Hadoop for
real-time analytics and machine learning tasks.
8. Oozie:
- A workflow scheduler system to manage Hadoop jobs, including complex job dependencies
and sequences.
9. ZooKeeper:
- A distributed coordination service that helps manage distributed applications and systems.
10. Flume:
- A service for collecting, aggregating, and moving large amounts of log data from multiple
sources to HDFS.
11. Sqoop:
- A tool designed for efficiently transferring bulk data between Hadoop and relational
databases.
12. Mahout:
- A machine learning library that runs on top of Hadoop for scalable machine learning
algorithms.
References:
- Gandomi, A., & Haider, M. (2015). *Beyond the Hype: Big Data Concepts, Methods, and
Analytics*. International Journal of Information Management, 35(2), 137-144.
- Shvets, A., & Shvets, S. (2021). *Apache Hadoop: The Essential Guide for Big Data
Processing*. Springer.
- White, T. (2012). *Hadoop: The Definitive Guide*. O'Reilly Media.
6. Basic Hadoop Commands
Hadoop commands are primarily used for managing and interacting with HDFS (Hadoop
Distributed File System) and for running MapReduce jobs. These commands are typically
executed through Hadoop's command-line interface (CLI) and help with tasks such as
managing files and directories and running jobs across the Hadoop cluster.
1. HDFS Commands
- `hdfs dfs -get`: Download a file from HDFS to the local file system.
```bash
hdfs dfs -get /user/hadoop/file.txt /localpath/
```
This downloads the file `file.txt` from HDFS to the local machine.
- `hdfs dfs -du`: Display the disk usage of a file or directory in HDFS.
```bash
hdfs dfs -du /user/hadoop/
```
This shows the space used by a directory and its contents.
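A few other frequently used HDFS commands, shown here with illustrative paths:
```bash
# Create a directory in HDFS (with parent directories, like mkdir -p)
hdfs dfs -mkdir -p /user/hadoop/input

# Upload a local file into HDFS
hdfs dfs -put localfile.txt /user/hadoop/input/

# List the contents of an HDFS directory
hdfs dfs -ls /user/hadoop/input
```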
2. MapReduce Commands
- `yarn jar`: Run a MapReduce job.
```bash
yarn jar /path/to/mapreduce/job.jar ClassName /input /output
```
This command submits a MapReduce job to the YARN ResourceManager. The `/input` is the
input directory in HDFS, and `/output` is where the results will be stored.
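As a concrete sketch, most Hadoop installations ship an examples JAR (its exact path and version
depend on your setup); the classic word-count job can be submitted roughly like this:
```bash
# Run the bundled WordCount example; adjust the JAR path to your installation
yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
    wordcount /user/hadoop/input /user/hadoop/output

# Read the results written to HDFS
hdfs dfs -cat /user/hadoop/output/part-r-00000
```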
3. Cluster Management Commands
- `hadoop namenode -format`: Format the HDFS NameNode (used when setting up a new
Hadoop cluster). On recent Hadoop versions this is typically invoked as `hdfs namenode -format`.
```bash
hadoop namenode -format
```
This command is used to format the HDFS namenode before starting the Hadoop cluster.
Warning: This will delete any existing data on the HDFS.
- `start-dfs.sh`: Start the Hadoop Distributed File System services (NameNode, DataNodes, and
the Secondary NameNode).
```bash
start-dfs.sh
```
This script starts the HDFS services on the cluster.
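Once HDFS is up, the YARN services are typically started the same way, and `jps` can confirm
which daemons are running:
```bash
# Start the YARN ResourceManager and NodeManager services
start-yarn.sh

# List the running Java daemons; a healthy single-node setup typically shows
# NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager
jps
```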
4. Admin Commands
- `hadoop job -status <job_id>`: Get the status of a specific MapReduce job.
```bash
hadoop job -status job_123456789
```
This command gives detailed information about the progress and status of a particular
MapReduce job.
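On YARN-based clusters (Hadoop 2 and later), the same information is usually retrieved through
the `yarn` CLI; the application ID below is hypothetical:
```bash
# List all running YARN applications
yarn application -list

# Show the status of a specific application
yarn application -status application_1234567890_0001
```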
References:
- Shvets, A., & Shvets, S. (2021). *Apache Hadoop: The Essential Guide for Big Data
Processing*. Springer.
- White, T. (2012). *Hadoop: The Definitive Guide*. O'Reilly Media.