
BIG DATA ANALYTICS PRACTICAL FILE

CMCSC18

NAME : Adesh Sharma


ROLL NO. : 2021UCM2816
1. Installation and configuration of Hadoop

Hadoop is an open-source framework from Apache used to store, process and analyse very large volumes of data. Hadoop is written in Java and is not an OLAP (online analytical processing) system; it is used for batch/offline processing.

Step 1. Download and install Java: Hadoop is built on Java, so Java 8 was installed on the PC.

Step 2. Download Hadoop: Hadoop was downloaded from the Apache Hadoop website.

Step 3. Set Environment Variables: Environment variables were configured after downloading and unpacking Hadoop. HADOOP_HOME was set in the system variables, and the bin and sbin paths were added to the PATH variable.

Step 4. Setup Hadoop: In this phase Hadoop was configured by modifying several configuration files. The following four files, present in the hadoop/etc folder, were configured (typical single-node contents are sketched below):

In core-site.xml:

In hdfs-site.xml:

In mapred-site.xml:

In yarn-site.xml:
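
The original screenshots of these files are not reproduced here. As a rough sketch, a typical single-node (pseudo-distributed) setup uses entries along these lines; the exact ports, local paths and property values are assumptions and must match the local installation:

core-site.xml:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>C:\hadoop\data\namenode</value> <!-- assumed local path -->
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>C:\hadoop\data\datanode</value> <!-- assumed local path -->
  </property>
</configuration>

mapred-site.xml:
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

yarn-site.xml:
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>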

Changes in each file were saved.


Step 5. Format Hadoop NameNode: The NameNode was formatted by navigating to the Hadoop bin folder in a command prompt and executing the command: hadoop namenode -format

Step 6. Start Hadoop: To start Hadoop, the following command was run at the command prompt: start-all.cmd. This command starts all the required Hadoop services, including the NameNode, DataNode, ResourceManager and NodeManager.

Step 7. Verify Hadoop Installation: The NameNode web interface was opened at localhost:9870 and the ResourceManager web UI at localhost:8088.
2. File management tasks in Hadoop
Adding files and directories- To add a directory to the Hadoop cluster, the following command was run:
hadoop fs -mkdir /directory_name/

Files can be added to the directory using the command:

hadoop fs -put "file_location" /directory_name/

Retrieving files- Data in files was viewed using the cat command:

hadoop fs -cat /user/output/outfile

A file was copied from HDFS to the local file system using the get command:
hadoop fs -get /user/output/ /home/hadoop_tp/

Deleting files and directories- Deletion was done using the rm command, as illustrated below.
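
For example (the paths are illustrative): the -rm option removes a file, and -rm -r removes a directory recursively:

hadoop fs -rm /user/output/outfile
hadoop fs -rm -r /directory_name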


3. Word count using MapReduce
MapReduce splits the input data into independent chunks, processes them with map tasks, then sorts the map outputs and passes them as input to the reduce tasks. A file system stores the input and output of the jobs. Scheduling tasks, monitoring them and re-executing failed tasks is the responsibility of the framework.

The input dataset was a simple text file with some words or sentences written in it.

Step 1. Create a directory in the Hadoop HDFS by using the following command:
hdfs dfs -mkdir /input_wordCount

Step 2. Copy the text file from the local directory into the created directory using the -put option.
For example:
hdfs dfs -put C:/Hello.txt /input_wordCount

Step 3. Wordcount on the file was done using the command:

hadoop jar C:\Users\HP\OneDrive\Desktop\hadoop\share\hadoop\mapreduce\hadoop-mapreduce-examples-3.3.6.jar wordcount /input_wordCount/Hello.txt /out1
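
The job above uses the WordCount class shipped with the Hadoop examples jar. For reference, a minimal sketch that mirrors the standard Hadoop WordCount tutorial example follows (this is the well-known reference code, not source extracted from the jar):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    // Emit (word, 1) for every token in the input line.
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    // Sum all the counts emitted for each distinct word.
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // combiner pre-aggregates map output locally
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}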

Output
Each number in the output represents the total number of occurrences of that word in the text file. By using MapReduce, the computing time can be reduced by a considerable amount, since the work is split across parallel map and reduce tasks rather than run sequentially on a single machine.

The output can also be viewed through the NameNode web UI at localhost:9870.


4. Implementing matrix multiplication using
MapReduce.
MapReduce is a technique in which a large job is subdivided into small tasks that are run in parallel to make computation faster and save time; it is mostly used in distributed systems.
It has 2 important parts:
● Mapper: It takes raw input data and organises it into (key, value) pairs. For example, in a dictionary you search for the word "Data" and its associated meaning is "facts and statistics collected together for reference or analysis". Here the key is "Data" and the value associated with it is "facts and statistics collected together for reference or analysis".
● Reducer: It is responsible for processing the mapped data in parallel and producing the final output.

Input matrices used were as follows:


A jar file was created in the Eclipse IDE from the mapper and reducer classes (a sketch of such classes is shown below).
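
The original listing is not reproduced here. The following is a minimal sketch of a one-pass MapReduce matrix multiplication: it assumes each input line has the form matrixName,row,col,value with matrices named M and N, and that the dimensions are known in advance; the class name and the 2x2 dimensions are illustrative assumptions:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MatrixMultiply {

  // Assumed dimensions: M is ROWS_M x K, N is K x COLS_N (here 2x2).
  static final int ROWS_M = 2, COLS_N = 2;

  public static class MatMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    public void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      String[] t = value.toString().split(",");
      if (t[0].equals("M")) {
        // M[i][j] contributes to every output cell (i, k).
        for (int k = 0; k < COLS_N; k++)
          ctx.write(new Text(t[1] + "," + k), new Text("M," + t[2] + "," + t[3]));
      } else {
        // N[j][k] contributes to every output cell (i, k).
        for (int i = 0; i < ROWS_M; i++)
          ctx.write(new Text(i + "," + t[2]), new Text("N," + t[1] + "," + t[3]));
      }
    }
  }

  public static class MatReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text key, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      // Join the row of M and the column of N on the shared index j, then sum products.
      Map<Integer, Double> mRow = new HashMap<>();
      Map<Integer, Double> nCol = new HashMap<>();
      for (Text v : values) {
        String[] t = v.toString().split(",");
        (t[0].equals("M") ? mRow : nCol)
            .put(Integer.parseInt(t[1]), Double.parseDouble(t[2]));
      }
      double sum = 0;
      for (Map.Entry<Integer, Double> e : mRow.entrySet())
        sum += e.getValue() * nCol.getOrDefault(e.getKey(), 0.0);
      ctx.write(key, new Text(String.valueOf(sum)));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "matrix multiply");
    job.setJarByClass(MatrixMultiply.class);
    job.setMapperClass(MatMapper.class);
    job.setReducerClass(MatReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}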
Step 1. Upload the input files to the Hadoop cluster using the put command.

Step 2. Use the following command to run the matrix multiplication program and store the result in the output directory.
hadoop jar jar_file_location package_name/class_name /input_matrix/* /output_matrix

The output can be viewed using the cat command or through the NameNode web UI at localhost:9870.
5. Installing PIG and running PIG scripts to sort,
group, join and filter data.
Apache Pig is a data manipulation tool that is built over Hadoop’s MapReduce. Pig provides
us with a scripting language for easier and faster data manipulation. This scripting language
is called Pig Latin.
Apache Pig scripts can be executed in 3 ways as follows:
● Using Grunt Shell (Interactive Mode) – Write the commands in the grunt shell and get
the output there itself using the DUMP command.
● Using Pig Scripts (Batch Mode) – Write the pig latin commands in a single file with
.pig extension and execute the script on the prompt.
● Using User-Defined Functions (Embedded Mode) – Write your own Functions in
languages like Java and then use them in the scripts.

Step 1. Download the Pig version 0.17.0 tar file ('pig-0.17.0.tar.gz') from the official Apache Pig site.
Then extract this tar file using the 7-Zip tool (first extract the .tar.gz file by right-clicking on it and choosing '7-Zip → Extract Here', then extract the resulting .tar file in the same way).

Step 2: Add PIG_HOME to the environment variables and add PIG_HOME\bin to the path.

Step 3: To run Pig in local mode, run the command: pig -x local. The following operations were then performed (typical commands are sketched after this list):

To load the input file:

Input file:

Sorting:

Filtering:

Grouping:

Join operation:
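
The original screenshots for these operations are not reproduced here. As an illustrative sketch, assuming a comma-separated student file with fields (id, name, marks) and a second file with fields (id, city) — all file, alias and field names are assumptions — typical Grunt-shell commands look like this:

data = LOAD 'student.txt' USING PigStorage(',') AS (id:int, name:chararray, marks:int);
sorted = ORDER data BY marks DESC;        -- sorting
passed = FILTER data BY marks >= 40;      -- filtering
by_marks = GROUP data BY marks;           -- grouping
address = LOAD 'address.txt' USING PigStorage(',') AS (id:int, city:chararray);
joined = JOIN data BY id, address BY id;  -- join
DUMP joined;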
WordCount using PIG Latin scripts:

Step 1: Load the input file on which word count needs to be performed.

Step 2: Tokenize each line into individual words.

Step 3: Group the tokens by word.

Step 4: Count the number of occurrences in each group.

Step 5: Dump (or store) the result. A sketch of a typical script for these steps is shown below.
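
A minimal sketch, assuming the input is a plain text file named input.txt (the file and alias names are illustrative):

lines = LOAD 'input.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;  -- Step 2
grouped = GROUP words BY word;                                   -- Step 3
counts = FOREACH grouped GENERATE group AS word, COUNT(words);   -- Step 4
DUMP counts;                                                     -- Step 5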
6. Installing HIVE and using it to create, alter and
drop databases, tables and views.
Hive is a data warehouse system that runs on top of Hadoop and provides data warehousing and data analysis functionality. It provides an SQL-like interface to interact with the databases.

Step 1. Download Apache Derby Binaries: Hive requires a relational database like Apache
Derby to create a Metastore and store all metadata.

Step 2. Download Hive binaries: Download Hive binaries from the following link: https://downloads.apache.org/hive/hive-3.1.2/apache-hive-3.1.2-bin.tar.gz

Step 3: Add the following variables in the system environment:


HIVE_HOME: E:\hadoop-3.1.0\apache-hive-3.1.2-bin
DERBY_HOME: E:\hadoop-3.1.0\db-derby-10.14.2.0-bin
HIVE_LIB: E:\hadoop-3.1.0\apache-hive-3.1.2-bin\lib
HIVE_BIN: E:\hadoop-3.1.0\apache-hive-3.1.2-bin\bin
HADOOP_USER_CLASSPATH_FIRST: true

Step 5. Configuring Hive: Copy the Derby libraries: copy all the jar files from the Derby library directory C:\hadoop-3.3.6\db-derby-10.14.2.0-bin\lib\ and paste them into the Hive library directory C:\hadoop-3.3.6\apache-hive-3.1.2-bin\lib

Step 6. Configuring hive-site.xml: Create a new file with the name hive-site.xml in C:\hadoop-3.3.6\apache-hive-3.1.2-bin\conf (a sketch of typical contents is shown below).
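
A minimal sketch of hive-site.xml for a Metastore backed by the Derby network server; the connection URL, driver and port are typical values for this kind of setup and should be treated as assumptions:

<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby://localhost:1527/metastore_db;create=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.apache.derby.jdbc.ClientDriver</value>
  </property>
  <property>
    <name>hive.server2.enable.doAs</name>
    <value>false</value>
  </property>
</configuration>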
Step 7. Starting Services: Start Hadoop Services: Change the directory in terminal to the
location where Hadoop is stored and give the following command: start-all.cmd
Start Derby Network Server: Start the Derby Network Server with the following command:
StartNetworkServer -h 0.0.0.0

Creation of database: CREATE DATABASE [IF NOT EXISTS] userdb;

Dropping database: DROP DATABASE IF EXISTS userdb;

Creation of table: CREATE TABLE table_name


[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[ROW FORMAT row_format]
[STORED AS file_format]
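
For example, an illustrative table (the table name and columns are hypothetical):

CREATE TABLE IF NOT EXISTS employee (id INT, name STRING, salary FLOAT)
COMMENT 'Employee details'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;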

Altering a table: ALTER TABLE name CHANGE column_name new_name new_type
(Hive does not support dropping a single column directly; use ALTER TABLE name REPLACE COLUMNS (...) to remove columns.)

Dropping a table: DROP TABLE [IF EXISTS] table_name;


Creating a view: CREATE VIEW [IF NOT EXISTS] view_name [(column_name
[COMMENT column_comment], ...) ]
[COMMENT table_comment]
AS SELECT ...

Dropping a view: DROP VIEW view_name;

These commands with the respective outputs are shown below:
