BDA Practical File
CMCSC18
Hadoop is an open source framework from Apache used to store, process and analyse very large volumes of data. Hadoop is written in Java and is not an OLAP (online analytical processing) system; it is used for batch/offline processing.
Step 1. Download and install Java: Hadoop is built on Java, so Java 8 was installed on PC.
Step 2. Download Hadoop: Hadoop was downloaded from the Apache Hadoop website.
Step 4. Set up Hadoop: In this phase, Hadoop was configured by modifying several configuration files. The following four files, located in the hadoop/etc folder, were edited:
In core-site.xml:
In hdfs-site.xml:
In mapred-site.xml:
In yarn-site.xml:
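The exact property values depend on the local setup; a minimal single-node sketch of the four files might look like the following (the file system URI and other values are typical defaults, not taken from this particular installation):

```xml
<!-- core-site.xml: default file system URI (typical single-node value) -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- hdfs-site.xml: replication factor of 1 for a single-node cluster -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

<!-- mapred-site.xml: run MapReduce jobs on YARN -->
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

<!-- yarn-site.xml: enable the shuffle service needed by MapReduce -->
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
```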
Step 6. Start Hadoop: To start Hadoop, the following command was run at the command prompt: start-all.cmd. This command starts all the required Hadoop services: the HDFS daemons (NameNode, DataNode) and the YARN daemons (ResourceManager, NodeManager).
Step 7. Verify Hadoop installation: The NameNode web interface was opened at http://localhost:9870 and the ResourceManager web interface at http://localhost:8088.
2. File management tasks in Hadoop
Adding files and directories: To add a directory to the Hadoop cluster, the following command was run:
hadoop fs -mkdir /directory_name/
A file was copied from HDFS to the local file system using the get command:
hadoop fs -get /user/output/ /home/hadoop_tp/
The input dataset was a simple text file with some words or sentences written in it.
Step 1. Create a directory in the Hadoop HDFS by using the following command:
hdfs dfs -mkdir /input_wordCount
Step 2. Copy the text file into the created directory from the local file system using the -put option.
For example:
hdfs dfs -put C:/Hello.txt /input_wordCount
Output
The number next to each word is the total number of occurrences of that word in the entire text file. By using MapReduce, the computing time is reduced considerably, since the counting is distributed across the cluster instead of running as a single sequential program.
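The MapReduce logic behind the word count can be sketched outside Hadoop. The following is a plain-Python simulation of the three phases — map (emit a (word, 1) pair per word), shuffle (group pairs by key) and reduce (sum the counts) — not the actual Java job submitted to the cluster:

```python
from collections import defaultdict

def mapper(line):
    # Map phase: emit a (word, 1) pair for every word in the line.
    for word in line.split():
        yield (word.lower(), 1)

def word_count(lines):
    # Shuffle phase: group all emitted values by key (the word).
    grouped = defaultdict(list)
    for line in lines:
        for word, one in mapper(line):
            grouped[word].append(one)
    # Reduce phase: sum the values collected for each word.
    return {word: sum(ones) for word, ones in grouped.items()}

counts = word_count(["Hello Hadoop", "hello world"])
print(counts)  # {'hello': 2, 'hadoop': 1, 'world': 1}
```

On a real cluster the shuffle is performed by the framework between the map and reduce tasks; here a dictionary plays that role.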
Step 2: Use the following command to run the matrix multiplication program and store the result in the output directory:
hadoop jar jar_file_location package_name/class_name /input_matrix/* /output_matrix
The output can be viewed using the cat command (hadoop fs -cat) or through the NameNode web interface at http://localhost:9870.
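The standard MapReduce algorithm for matrix multiplication C = A × B can likewise be simulated in plain Python: the mapper sends each element A(i, j) to every output key (i, k) and each element B(j, k) to every output key (i, k), and the reducer pairs the values by the shared index j and sums the products. This is a sketch of the general algorithm, not the contents of the jar file used above:

```python
from collections import defaultdict

def matmul_mapreduce(A, B):
    # A is m x n, B is n x p; C = A x B is m x p.
    m, n, p = len(A), len(B), len(B[0])
    # Shuffle structure: for each output cell (i, k), collect the
    # contributing A row entries and B column entries, keyed by j.
    grouped = defaultdict(lambda: {"A": {}, "B": {}})
    # Map phase: replicate A(i, j) to every key (i, k).
    for i in range(m):
        for j in range(n):
            for k in range(p):
                grouped[(i, k)]["A"][j] = A[i][j]
    # Map phase: replicate B(j, k) to every key (i, k).
    for j in range(n):
        for k in range(p):
            for i in range(m):
                grouped[(i, k)]["B"][j] = B[j][k]
    # Reduce phase: for each key (i, k), sum A(i, j) * B(j, k) over j.
    C = [[0] * p for _ in range(m)]
    for (i, k), vals in grouped.items():
        C[i][k] = sum(vals["A"][j] * vals["B"][j] for j in range(n))
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul_mapreduce(A, B))  # [[19, 22], [43, 50]]
```

The replication in the map phase is what makes the algorithm expressible as a single MapReduce job: every piece of data needed for one output cell ends up under the same key.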
5. Installing PIG and running PIG scripts to sort,
group, join and filter data.
Apache Pig is a data manipulation tool built on top of Hadoop’s MapReduce. Pig provides a scripting language, called Pig Latin, for easier and faster data manipulation.
Apache Pig scripts can be executed in 3 ways as follows:
● Using the Grunt Shell (Interactive Mode) – Write the commands in the Grunt shell and view the output immediately using the DUMP command.
● Using Pig Scripts (Batch Mode) – Write the Pig Latin commands in a single file with a .pig extension and execute the script from the prompt.
● Using User-Defined Functions (Embedded Mode) – Write your own functions in languages like Java and then use them in Pig scripts.
Step 1. Download the Pig version 0.17.0 tar file from the official Apache Pig site: download the file ‘pig-0.17.0.tar.gz’ from the website.
Then extract this tar file using the 7-Zip tool (First we extract the .tar.gz file by right-clicking
on it and clicking on ‘7-Zip → Extract Here’. Then we extract the .tar file in the same way).
Sorting:
Filtering:
Grouping:
Join operation:
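Assuming a sample relation loaded from a comma-separated file (the file names and schema below are illustrative, not from an actual run), the four operations can be written in Pig Latin as:

```pig
-- Load a sample dataset (file name and schema are assumed for illustration)
students = LOAD '/pig_data/students.txt' USING PigStorage(',')
           AS (id:int, name:chararray, marks:int, city:chararray);

-- Sorting: order records by marks in descending order
sorted = ORDER students BY marks DESC;
DUMP sorted;

-- Filtering: keep only records with marks above 60
passed = FILTER students BY marks > 60;
DUMP passed;

-- Grouping: group records by city
by_city = GROUP students BY city;
DUMP by_city;

-- Join: join with a second relation on the common id field
fees = LOAD '/pig_data/fees.txt' USING PigStorage(',') AS (id:int, fee:int);
joined = JOIN students BY id, fees BY id;
DUMP joined;
```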
WordCount using PIG Latin scripts:
Step 1: Load the input file on which word count needs to be performed.
Step 2:
Step 3:
Step 4:
Step 5:
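A typical Pig Latin word count matching the five steps above might look like the following (the input path is assumed for illustration):

```pig
-- Step 1: load the input file, one line per record
lines = LOAD '/input_wordCount/Hello.txt' AS (line:chararray);

-- Step 2: split each line into words, one word per record
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;

-- Step 3: group identical words together
grouped = GROUP words BY word;

-- Step 4: count the occurrences in each group
wordcount = FOREACH grouped GENERATE group AS word, COUNT(words) AS count;

-- Step 5: print the result
DUMP wordcount;
```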
6. Installing HIVE and using it to create, alter and
drop databases, tables and views.
Hive is a data warehouse system that runs on top of Hadoop and provides data warehousing and data analysis functionality. It offers an SQL-like interface (HiveQL) for interacting with the databases.
Step 1. Download Apache Derby Binaries: Hive requires a relational database like Apache
Derby to create a Metastore and store all metadata.
Step 2. Download Hive binaries: Download the Hive binaries from the following link: https://downloads.apache.org/hive/hive-3.1.2/apache-hive-3.1.2-bin.tar.gz
Step 5. Configuring Hive: Copy the Derby libraries: copy all the jar files stored in the Derby library directory C:\hadoop-3.3.6\db-derby-10.14.2.0-bin\lib\ and paste them into the Hive libraries directory C:\hadoop-3.3.6\apache-hive-3.1.2-bin\lib
Step 6. Configuring Hive-site.xml: Create a new file with the name hive-site.xml in
C:\hadoop-3.3.6\apache-hive-3.1.2-bin\conf
Step 7. Starting Services: Start Hadoop Services: Change the directory in terminal to the
location where Hadoop is stored and give the following command: start-all.cmd
Start Derby Network Server: Start the Derby Network Server with the following command:
StartNetworkServer -h 0.0.0.0
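With the services running, the create/alter/drop operations named in this practical can be issued from the Hive CLI. The database, table and view names below are illustrative:

```sql
-- Databases
CREATE DATABASE IF NOT EXISTS college;
ALTER DATABASE college SET DBPROPERTIES ('owner' = 'student');
USE college;

-- Tables
CREATE TABLE students (id INT, name STRING, marks INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
ALTER TABLE students ADD COLUMNS (city STRING);
ALTER TABLE students RENAME TO students_2024;

-- Views
CREATE VIEW toppers AS SELECT name, marks FROM students_2024 WHERE marks > 80;
DROP VIEW toppers;

-- Cleanup
DROP TABLE students_2024;
DROP DATABASE college;
```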