BDA Practical File
CMCSC18
Hadoop is an open source framework from Apache used to store, process and analyse very large volumes of data. Hadoop is written in Java and is not an OLAP (online analytical processing) system; it is used for batch/offline processing.
Step 1. Download and install Java: Hadoop is built on Java, so Java 8 was installed on PC.
Step 2. Download Hadoop: Hadoop was downloaded from the Apache Hadoop website.
Step 4. Set up Hadoop: In this phase, Hadoop was configured by modifying several configuration files. The following four files, located in the hadoop/etc folder, were edited:
In core-site.xml:
In hdfs-site.xml:
In mapred-site.xml:
In yarn-site.xml:
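The exact property values depend on the local setup; a minimal single-node sketch of the four files might look like the following (the file system URI and other values are typical defaults, not taken from this particular installation):

```xml
<!-- core-site.xml: default file system URI (typical single-node value) -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- hdfs-site.xml: replication factor of 1 for a single-node cluster -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

<!-- mapred-site.xml: run MapReduce jobs on YARN -->
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

<!-- yarn-site.xml: enable the shuffle service needed by MapReduce -->
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
```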
Step 6. Start Hadoop: To start Hadoop, the following command was run at the command prompt: start-all.cmd. This command starts all the required Hadoop services: the HDFS daemons (NameNode, DataNode) and the YARN daemons (ResourceManager, NodeManager).
Step 7. Verify Hadoop installation: The NameNode web interface was opened at http://localhost:9870 and the ResourceManager web interface at http://localhost:8088.
2. File management tasks in Hadoop
Adding files and directories: To add a directory to the Hadoop cluster, the following command was run:
hadoop fs -mkdir /directory_name/
A file was copied from HDFS to the local file system using the get command:
hadoop fs -get /user/output/ /home/hadoop_tp/
The input dataset was a simple text file with some words or sentences written in it.
Step 1. Create a directory in the Hadoop HDFS by using the following command:
hdfs dfs -mkdir /input_wordCount
Step 2. Copy the text file into the created directory from the local file system using the -put option.
For example:
hdfs dfs -put C:/Hello.txt /input_wordCount
Output
The number next to each word is the total number of occurrences of that word in the entire text file. By using MapReduce, the computing time is reduced considerably, since the counting is distributed across the cluster instead of running as a single sequential program.
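The MapReduce logic behind the word count can be sketched outside Hadoop. The following is a plain-Python simulation of the three phases — map (emit a (word, 1) pair per word), shuffle (group pairs by key) and reduce (sum the counts) — not the actual Java job submitted to the cluster:

```python
from collections import defaultdict

def mapper(line):
    # Map phase: emit a (word, 1) pair for every word in the line.
    for word in line.split():
        yield (word.lower(), 1)

def word_count(lines):
    # Shuffle phase: group all emitted values by key (the word).
    grouped = defaultdict(list)
    for line in lines:
        for word, one in mapper(line):
            grouped[word].append(one)
    # Reduce phase: sum the values collected for each word.
    return {word: sum(ones) for word, ones in grouped.items()}

counts = word_count(["Hello Hadoop", "hello world"])
print(counts)  # {'hello': 2, 'hadoop': 1, 'world': 1}
```

On a real cluster the shuffle is performed by the framework between the map and reduce tasks; here a dictionary plays that role.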
Step 2: Use the following command to run the matrix multiplication program and store the result in the output directory:
hadoop jar jar_file_location package_name/class_name /input_matrix/* /output_matrix
The output can be viewed using the cat command (hadoop fs -cat) or through the NameNode web interface at http://localhost:9870.
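The standard MapReduce algorithm for matrix multiplication C = A × B can likewise be simulated in plain Python: the mapper sends each element A(i, j) to every output key (i, k) and each element B(j, k) to every output key (i, k), and the reducer pairs the values by the shared index j and sums the products. This is a sketch of the general algorithm, not the contents of the jar file used above:

```python
from collections import defaultdict

def matmul_mapreduce(A, B):
    # A is m x n, B is n x p; C = A x B is m x p.
    m, n, p = len(A), len(B), len(B[0])
    # Shuffle structure: for each output cell (i, k), collect the
    # contributing A row entries and B column entries, keyed by j.
    grouped = defaultdict(lambda: {"A": {}, "B": {}})
    # Map phase: replicate A(i, j) to every key (i, k).
    for i in range(m):
        for j in range(n):
            for k in range(p):
                grouped[(i, k)]["A"][j] = A[i][j]
    # Map phase: replicate B(j, k) to every key (i, k).
    for j in range(n):
        for k in range(p):
            for i in range(m):
                grouped[(i, k)]["B"][j] = B[j][k]
    # Reduce phase: for each key (i, k), sum A(i, j) * B(j, k) over j.
    C = [[0] * p for _ in range(m)]
    for (i, k), vals in grouped.items():
        C[i][k] = sum(vals["A"][j] * vals["B"][j] for j in range(n))
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul_mapreduce(A, B))  # [[19, 22], [43, 50]]
```

The replication in the map phase is what makes the algorithm expressible as a single MapReduce job: every piece of data needed for one output cell ends up under the same key.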
5. Installing PIG and running PIG scripts to sort,
group, join and filter data.
Apache Pig is a data manipulation tool built on top of Hadoop’s MapReduce. Pig provides a scripting language, called Pig Latin, for easier and faster data manipulation.
Apache Pig scripts can be executed in 3 ways as follows:
● Using the Grunt Shell (Interactive Mode) – Write the commands in the Grunt shell and view the output immediately using the DUMP command.
● Using Pig Scripts (Batch Mode) – Write the Pig Latin commands in a single file with a .pig extension and execute the script from the prompt.
● Using User-Defined Functions (Embedded Mode) – Write your own functions in languages like Java and then use them in Pig scripts.
Step 1. Download the Pig version 0.17.0 tar file from the official Apache Pig site: download the file ‘pig-0.17.0.tar.gz’ from the website.
Then extract this tar file using the 7-Zip tool (First we extract the .tar.gz file by right-clicking
on it and clicking on ‘7-Zip → Extract Here’. Then we extract the .tar file in the same way).
Sorting:
Filtering:
Grouping:
Join operation:
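Assuming a sample relation loaded from a comma-separated file (the file names and schema below are illustrative, not from an actual run), the four operations can be written in Pig Latin as:

```pig
-- Load a sample dataset (file name and schema are assumed for illustration)
students = LOAD '/pig_data/students.txt' USING PigStorage(',')
           AS (id:int, name:chararray, marks:int, city:chararray);

-- Sorting: order records by marks in descending order
sorted = ORDER students BY marks DESC;
DUMP sorted;

-- Filtering: keep only records with marks above 60
passed = FILTER students BY marks > 60;
DUMP passed;

-- Grouping: group records by city
by_city = GROUP students BY city;
DUMP by_city;

-- Join: join with a second relation on the common id field
fees = LOAD '/pig_data/fees.txt' USING PigStorage(',') AS (id:int, fee:int);
joined = JOIN students BY id, fees BY id;
DUMP joined;
```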
WordCount using PIG Latin scripts:
Step 1: Load the input file on which word count needs to be performed.
Step 2:
Step 3:
Step 4:
Step 5:
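A typical Pig Latin word count matching the five steps above might look like the following (the input path is assumed for illustration):

```pig
-- Step 1: load the input file, one line per record
lines = LOAD '/input_wordCount/Hello.txt' AS (line:chararray);

-- Step 2: split each line into words, one word per record
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;

-- Step 3: group identical words together
grouped = GROUP words BY word;

-- Step 4: count the occurrences in each group
wordcount = FOREACH grouped GENERATE group AS word, COUNT(words) AS count;

-- Step 5: print the result
DUMP wordcount;
```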
6. Installing HIVE and using it to create, alter and
drop databases, tables and views.
Hive is a data warehouse system that runs on top of Hadoop and provides data warehousing and data analysis functionality. It offers an SQL-like interface (HiveQL) for interacting with the databases.
Step 1. Download Apache Derby Binaries: Hive requires a relational database like Apache
Derby to create a Metastore and store all metadata.
Step 2. Download Hive binaries: Download the Hive binaries from the following link: https://downloads.apache.org/hive/hive-3.1.2/apache-hive-3.1.2-bin.tar.gz
Step 5. Configuring Hive: Copy the Derby libraries: copy all the jar files stored in the Derby library directory C:\hadoop-3.3.6\db-derby-10.14.2.0-bin\lib\ and paste them into the Hive libraries directory C:\hadoop-3.3.6\apache-hive-3.1.2-bin\lib
Step 6. Configuring Hive-site.xml: Create a new file with the name hive-site.xml in
C:\hadoop-3.3.6\apache-hive-3.1.2-bin\conf
Step 7. Starting Services: Start Hadoop Services: Change the directory in terminal to the
location where Hadoop is stored and give the following command: start-all.cmd
Start Derby Network Server: Start the Derby Network Server with the following command:
StartNetworkServer -h 0.0.0.0
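With the services running, the create/alter/drop operations named in this practical can be issued from the Hive CLI. The database, table and view names below are illustrative:

```sql
-- Databases
CREATE DATABASE IF NOT EXISTS college;
ALTER DATABASE college SET DBPROPERTIES ('owner' = 'student');
USE college;

-- Tables
CREATE TABLE students (id INT, name STRING, marks INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
ALTER TABLE students ADD COLUMNS (city STRING);
ALTER TABLE students RENAME TO students_2024;

-- Views
CREATE VIEW toppers AS SELECT name, marks FROM students_2024 WHERE marks > 80;
DROP VIEW toppers;

-- Cleanup
DROP TABLE students_2024;
DROP DATABASE college;
```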