
DEPARTMENT OF

COMPUTER SCIENCE AND BUSINESS SYSTEMS

LABORATORY RECORD

CCS334 - BIG DATA ANALYTICS LABORATORY


V-Semester (REGULATION 2021)

NAME : …………………………………………………….

REG NO : …………………………………………………….

YEAR/SEM : …………………………………………………….
BONAFIDE CERTIFICATE

This is to certify that this Big Data Analytics Laboratory record work

done by Mr./Ms ............................................…. for the course B.Tech Computer Science and

Business Systems during ...............................….. (year/sem) of Academic year 2024 – 2025 is

bonafide.

Staff - in - charge H.O.D

Reg. No …………………………………………………

This record is submitted for .................................. Semester B.Tech Practical Examination


Conducted on ………………………

______________________________________________________________________________________

Internal Examiner External Examiner


VISION

To emerge as a Premier Institute for developing industry ready engineers with competency,
initiative and character to meet the challenges in global environment.

MISSION

• To impart state-of-the-art engineering and professional education through strong theoretical
basics and hands-on training to students in their field of choice.
• To serve our students by teaching them leadership, entrepreneurship, teamwork, values, quality,
ethics and respect for others.
• To provide opportunities for long-term interaction with academia and industry.
• To create new knowledge through innovation and research.

DEPARTMENT OF
COMPUTER SCIENCE AND BUSINESS SYSTEMS

VISION:

To produce industry ready technologists with computer science and business system knowledge
and human values to contribute globally to the society at large.

MISSION:

• To develop students' ability to compete globally through excellence in education.
• To inculcate varied skill sets that meet industry standards and to provide a quality learning
environment.
• To educate students to lead and to serve society.
PROGRAM EDUCATIONAL OBJECTIVES (PEOs)

PEO1. To ensure graduates will be proficient in utilizing the fundamental knowledge of basic
sciences, mathematics, Computer Science and Business systems for the applications relevant to
various streams of Engineering and Technology.
PEO2. To enrich graduates with the core competencies necessary for applying knowledge of computer
science and data analytics tools to store, retrieve, implement and analyze data in the context of a
business enterprise.
PEO3. To enable graduates to gain employment in organizations and establish themselves as
professionals by applying their technical skills and leadership qualities to solve real world problems
and meet the diversified needs of industry, academia and research.
PEO4. To equip the graduates with entrepreneurial skills and qualities which help them to perceive the
functioning of business, diagnose business problems, explore the entrepreneurial opportunities and
prepare them to manage business efficiently.

PROGRAMME OUTCOMES (PO)

Engineering Graduates will be able to:

1. Engineering knowledge: Apply the knowledge of mathematics, science, engineering fundamentals,
and an engineering specialization to the solution of complex engineering problems.
2. Problem analysis: Identify, formulate, review research literature, and analyze complex engineering
problems reaching substantiated conclusions using first principles of mathematics, natural sciences,
and engineering sciences.
3. Design/development of solutions: Design solutions for complex engineering problems and design
system components or processes that meet the specified needs with appropriate consideration for the
public health and safety, and the cultural, societal, and environmental considerations.
4. Conduct investigations of complex problems: Use research-based knowledge and research methods
including design of experiments, analysis and interpretation of data, and synthesis of the information
to provide valid conclusions.
5. Modern tool usage: Create, select, and apply appropriate techniques, resources, and modern
engineering and IT tools including prediction and modeling to complex engineering activities with an
understanding of the limitations.
6. The engineer and society: Apply reasoning informed by the contextual knowledge to assess
societal, health, safety, legal and cultural issues and the consequent responsibilities relevant
to the professional engineering practice.
7. Environment and sustainability: Understand the impact of the professional engineering
solutions in societal and environmental contexts, and demonstrate the knowledge of, and
need for sustainable development.
8. Ethics: Apply ethical principles and commit to professional ethics and responsibilities and
norms of the engineering practice.
9. Individual and team work: Function effectively as an individual, and as a member or leader
in diverse teams, and in multidisciplinary settings.
10. Communication: Communicate effectively on complex engineering activities with the
engineering community and with society at large, such as, being able to comprehend and
write effective reports and design documentation, make effective presentations, and give
and receive clear instructions.
11. Project management and finance: Demonstrate knowledge and understanding of the
engineering and management principles and apply these to one's own work, as a member
and leader in a team, to manage projects and in multidisciplinary environments.
12. Life-long learning: Recognize the need for, and have the preparation and ability to engage
in independent and life-long learning in the broadest context of technological change.

PROGRAM SPECIFIC OUTCOMES (PSOs)

PSO1: To create, select, and apply appropriate techniques, resources, modern engineering
and business tools including prediction and data analytics to complex engineering
activities and business solutions.
PSO2: To evolve computer science domain specific methodologies for effective decision
making in several critical problem domains of the real world.
PSO3: To be able to apply entrepreneurial skills and management tools for identifying,
analyzing and creating business opportunities with smart business ideas.
CCS334 BIG DATA ANALYTICS LABORATORY L T P C
0 0 4 2
COURSE OBJECTIVES:
• To describe big data and use cases from selected business domains
• To understand NoSQL big data management
• To install, configure, and run Hadoop and HDFS
• To perform MapReduce analytics using Hadoop
• To use Hadoop-related tools such as HBase, Cassandra, Pig and Hive for big data analytics

LIST OF EXPERIMENTS:

1. Downloading and installing Hadoop; understanding different Hadoop modes. Startup scripts,
configuration files.
2. Hadoop implementation of the file management tasks, such as adding files and directories,
retrieving files and deleting files.
3. Installation of Hive along with practice examples.
4. Installation of HBase, installing Thrift along with practice examples.
5. Run a basic word count MapReduce program to understand the MapReduce paradigm.
6. Implementation of matrix multiplication with Hadoop MapReduce.
7. Practice importing and exporting data from various databases.

TOTAL: 60 PERIODS

COURSE OUTCOMES:
After the completion of this course, students will be able to:
CO1: Describe big data and use cases from selected business domains.
CO2: Explain NoSQL big data management.
CO3: Install, configure, and run Hadoop and HDFS.
CO4: Perform MapReduce analytics using Hadoop.
CO5: Use Hadoop-related tools such as HBase, Cassandra, Pig and Hive for big data analytics.
Branch and Year : CSBS/III Year Semester : V

Subject Code & Name : CCS334 & Big Data Analytics Laboratory

LIST OF EXPERIMENTS

Sl. No.  Name of the Experiment
1.  Downloading and installing Hadoop; understanding different Hadoop modes. Startup scripts.
2.  Hadoop implementation of the file management tasks, such as adding files and directories, retrieving files and deleting files.
3.  Installation of Hive along with practice examples.
4.  Installation of HBase, installing Thrift along with practice examples.
5.  Run a basic word count MapReduce program to understand the MapReduce paradigm.
6.  Implementation of matrix multiplication with Hadoop MapReduce.
7.  Practice importing and exporting data from various databases.
INDEX

Ex. No.  Date  List of Experiments  Page No.  Marks  Sign

1.  Downloading and installing Hadoop; understanding different Hadoop modes. Startup scripts.
2.  Hadoop implementation of the file management tasks, such as adding files and directories, retrieving files and deleting files.
3.  Installation of Hive along with practice examples.
4.  Installation of HBase, installing Thrift along with practice examples.
5.  Run a basic word count MapReduce program to understand the MapReduce paradigm.
6.  Implementation of matrix multiplication with Hadoop MapReduce.
7.  Practice importing and exporting data from various databases.

CONTENT BEYOND THE SYLLABUS

8.  Installation of Pig along with practice examples.

SOFTWARE AND HARDWARE REQUIREMENTS

S/W: Cassandra, Hadoop, Java, Pig, Hive and HBase.
H/W: The machine must have at least 4 GB RAM and a minimum of 60 GB hard disk for better performance.
Exp.no: 1
Downloading and installing Hadoop
Date: / / 20

Aim:

To download and install Hadoop, understand the different Hadoop modes, and examine the
startup scripts and configuration files.

Procedure:

1. Installation of Hadoop:
Hadoop software can be installed in three modes of operation:
• Stand Alone Mode: Hadoop is distributed software and is designed to run on a cluster of
commodity machines. However, we can install it on a single node in stand-alone mode. In this
mode, Hadoop runs as a single monolithic Java process. This mode is extremely useful for
debugging purposes. You can first test-run your MapReduce application in this mode on small
data, before actually executing it on a cluster with big data.
• Pseudo Distributed Mode: In this mode also, Hadoop is installed on a single node. The various
daemons of Hadoop run on the same machine as separate Java processes. Hence all the daemons,
namely NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker, run on a
single machine.
• Fully Distributed Mode: In fully distributed mode, the daemons NameNode, JobTracker and
SecondaryNameNode (optional, and can be run on a separate node) run on the master node,
while the daemons DataNode and TaskTracker run on the slave nodes.

Hadoop installation on the Ubuntu operating system in stand-alone mode:

Step 1: Install Java


a. Download Java and extract in Downloads folder
b. In Terminal
>gedit ~/.bashrc
> Paste the following lines (by replacing the java path alone)

#--insert JAVA_HOME
JAVA_HOME=/opt/jdk1.8.0_05
#--in PATH variable just append at the end of the line
PATH=$PATH:$JAVA_HOME/bin
#--Append JAVA_HOME at end of the export statement
export PATH JAVA_HOME

c. Save the file, then reload it and verify:


> source ~/.bashrc
>java -version

2. Install ssh for passwordless authentication


a. >sudo apt-get update

b. >sudo apt-get install openssh-server
(or) >sudo aptitude install openssh-server
c. >ssh localhost

d. >ssh-keygen

e. >ssh-copy-id -i localhost (Don't mention any path during key generation)


3. Install Hadoop
a. Extract hadoop at java location itself
b. >gedit ~/.bashrc
Paste the following lines below the Java path (change the path):
#--insert HADOOP_PREFIX
HADOOP_PREFIX=/opt/hadoop-2.7.0
#--in PATH variable just append at the end of the line
PATH=$PATH:$HADOOP_PREFIX/bin
#--Append HADOOP_PREFIX at end of the export statement
export PATH JAVA_HOME HADOOP_PREFIX

c. save it
>source ~/.bashrc
> echo $HADOOP_PREFIX (to check the path)

> cd $HADOOP_PREFIX
> bin/hadoop version
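
As an optional sanity check of the stand-alone installation, you can run one of the example jobs that ship with Hadoop. The sketch below assumes Hadoop 2.7.0 (adjust the jar name to your version) and uses illustrative directory names:

> mkdir ~/standalone-input
> cp $HADOOP_PREFIX/etc/hadoop/*.xml ~/standalone-input
> bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.0.jar grep ~/standalone-input ~/standalone-output 'dfs[a-z.]+'
> cat ~/standalone-output/*

In stand-alone mode this runs as a single local Java process and writes its result to the ~/standalone-output directory.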

4. Modifying the Hadoop configuration files


(i) cd $HADOOP_PREFIX/etc/hadoop
>gedit hadoop-env.sh
(paste the Java and Hadoop paths as the first two lines)
export JAVA_HOME=/usr/local/jdk1.8.0_05
export HADOOP_PREFIX=/opt/hadoop-2.7.0

(ii) Modify the core-site.xml


>gedit core-site.xml
Paste the line within <configuration></configuration>

<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>

(iii) Modify the hdfs-site.xml


>gedit hdfs-site.xml
Paste:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>

(iv) Modify the mapred-site.xml

> cp mapred-site.xml.template mapred-site.xml
>gedit mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>

(v) Modify the yarn-site.xml


>gedit yarn-site.xml
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>

5. Formatting the HDFS file system via the NameNode

> cd $HADOOP_PREFIX
> bin/hadoop namenode -format

After formatting is over, start the services.

6. Starting the services

> sbin/start-dfs.sh
> sbin/start-yarn.sh
(or)
> sbin/start-all.sh

Verify that the daemons are running:
>jps
7. Stopping the services
> sbin/stop-dfs.sh
> sbin/stop-yarn.sh

(or)
>sbin/stop-all.sh

8. To view in the browser

localhost:50070 (NameNode web UI)
localhost:8088 (YARN ResourceManager web UI)
Contents Marks
Aim and Algorithm 30
Program and Execution 30
Output and Result 30
Viva 10
Total 100

Result:
Thus, the downloading and installation of Hadoop and the study of its three modes of operation
have been completed successfully.
Exp.no: 2
File Management tasks in Hadoop
Date: / / 20

Aim:

To implement the file management tasks in Hadoop, such as adding files and directories,
retrieving files and deleting files.

Procedure:

Pre-requisites:
o Java installation - check whether Java is installed or not using the following command:
java -version
o Hadoop installation - check whether Hadoop is installed or not using the following command:
hadoop version

1. Listing Files/Directories
The `-ls` command allows you to list all the files and directories in HDFS. The syntax is
similar to the UNIX ls command.
hdfs dfs -ls /path

2. Creating Directories
You can create directories in HDFS using the `-mkdir` command.
hdfs dfs -mkdir /path/to/directory

3. Deleting Files/Directories
To remove a file or directory in HDFS, you can use the `-rm` command. To remove a
directory, you would need to use the `-r` (recursive) option.
hdfs dfs -rm /path/to/file
hdfs dfs -rm -r /path/to/directory

4. Moving Files/Directories
The `-mv` command allows you to move files or directories from one location to another
within HDFS.
hdfs dfs -mv /source/path /destination/path

5. Copying Files/Directories
To copy files or directories within HDFS, use the `-cp` command.
hdfs dfs -cp /source/path /destination/path

6. Displaying the Content of a File
You can display the contents of a file in HDFS using the `-cat` command.
hdfs dfs -cat /path/to/file

7. Copying Files to HDFS
To copy files from the local filesystem to HDFS, use the `-put` or `-copyFromLocal` command.
hdfs dfs -put localfile /path/in/hdfs
hdfs dfs -copyFromLocal localfile /path/in/hdfs

8. Copying Files from HDFS
To copy files from HDFS to the local filesystem, use the `-get` or `-copyToLocal` command.
hdfs dfs -get /path/in/hdfs localfile
hdfs dfs -copyToLocal /path/in/hdfs localfile

9. File/Directory Permissions
HDFS commands for file or directory permissions mirror the chmod, chown, and chgrp
commands in UNIX.
hdfs dfs -chmod 755 /path/to/file
hdfs dfs -chown user:group /path/to/file
hdfs dfs -chgrp group /path/to/file

10. Checking Disk Usage
The `-du` command displays the size of a directory or file, and `-dus` displays a summary of
the disk usage.
hdfs dfs -du /path/to/directory
hdfs dfs -dus /path/to/directory
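
A short end-to-end session tying these commands together is sketched below; the directory and file names are illustrative:

echo "hello hdfs" > sample.txt                # create a small local file
hdfs dfs -mkdir /demo                         # create a directory in HDFS
hdfs dfs -put sample.txt /demo/               # upload the file to HDFS
hdfs dfs -ls /demo                            # list the directory
hdfs dfs -cat /demo/sample.txt                # display the file contents
hdfs dfs -get /demo/sample.txt copy.txt       # retrieve the file back to the local filesystem
hdfs dfs -rm /demo/sample.txt                 # delete the file
hdfs dfs -rm -r /demo                         # delete the directory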

Contents Marks
Aim and Algorithm 30
Program and Execution 30
Output and Result 30
Viva 10
Total 100
Result:

Thus, the file management tasks, such as adding files and directories, retrieving files and
deleting files, were executed successfully.
Exp.no: 3
Installation of Hive
Date: / / 20

Aim:
To install Hive and work through practice examples.
Procedure:

1. Pre-requisites:
o Java installation - check whether Java is installed or not using the following command:
java -version
o Hadoop installation - check whether Hadoop is installed or not using the following command:
hadoop version

2. Starting all the services:
> sbin/start-dfs.sh
> sbin/start-yarn.sh
(or)
> sbin/start-all.sh
Verify with:
>jps

3. Install Apache Hive on Ubuntu


To configure Apache Hive, first you need to download and unzip Hive. Then you need to
customize the following files and settings:
• Edit .bashrc file
• Edit hive-config.sh file
• Create Hive directories in HDFS
• Configure hive-site.xml file
• Initiate Derby database
Step 1: Download and Untar Hive

Step 2: Configure Hive Environment Variables (bashrc)

a. >gedit ~/.bashrc
b. Paste the following lines at the end of the file (change the path to your Hive directory):

#--insert HIVE_PREFIX
HIVE_PREFIX=/home/ubuntu/Downloads/apache-hive-3.1.2-bin
#--in PATH variable just append at the end of the line
PATH=$PATH:$HIVE_PREFIX/bin
#--Append HIVE_PREFIX at end of the export statement
export PATH JAVA_HOME HADOOP_PREFIX HIVE_PREFIX

c. Save it:
>source ~/.bashrc
Step 3: Edit the hive-config.sh file
Change directory to bin:
ubuntu@ubuntu:~/Downloads/apache-hive-3.1.2-bin/bin$ gedit hive-config.sh
Add the HADOOP_PREFIX variable and the full path to your Hadoop directory at the end of the file:
export HADOOP_PREFIX=/home/ubuntu/Downloads/hadoop-2.7.0
Save the edits and exit the hive-config.sh file.
Step 4: Create Hive Directories in HDFS
Create two separate directories to store data in the HDFS layer:
• The temporary, tmp directory is going to store the intermediate results of Hive processes.
• The warehouse directory is going to store the Hive related tables.
a. Create the tmp directory:
hdfs dfs -mkdir /tmp
hdfs dfs -chmod g+w /tmp
hdfs dfs -ls /

b. Create the warehouse directory:
hdfs dfs -mkdir -p /user/hive/warehouse
hdfs dfs -chmod g+w /user/hive/warehouse

Step 5: Configure the hive-site.xml file
Use the following command to locate the correct file:
cd $HIVE_PREFIX/conf
Use the hive-default.xml.template to create the hive-site.xml file:
cp hive-default.xml.template hive-site.xml
Access the hive-site.xml file using the gedit text editor:
gedit hive-site.xml

Step 6: Initiate Derby Database
Apache Hive uses the Derby database to store metadata. Initiate the Derby database from the
Hive bin directory using the schematool command:

$HIVE_PREFIX/bin/schematool -dbType derby -initSchema

When you run this command, the following error may occur.

Change the directory to hive/conf, open hive-site.xml in the editor ($ gedit hive-site.xml), go to
line 3215, remove the invalid character "&#8", then save it and exit.

~/Downloads/apache-hive-3.1.2-bin$ cd ~/Downloads/apache-hive-3.1.2-bin
~/Downloads/apache-hive-3.1.2-bin$ rm -r metastore_db

Special step:
# Fix the guava incompatibility error in Hive. The guava version has to be the same as in Hadoop.

$ rm $HIVE_PREFIX/lib/guava-19.0.jar
$ cp $HADOOP_PREFIX/share/hadoop/hdfs/lib/guava-11.0.2.jar $HIVE_PREFIX/lib/

# Remember to use the schematool command once again to initiate the Derby database:
$HIVE_PREFIX/bin/schematool -dbType derby -initSchema

Step 7: Launch Hive Client Shell on Ubuntu

Start the Hive command-line interface using the following commands:
cd $HIVE_PREFIX/bin
hive
hive> show databases;
hive> create table emp(id int);
hive> show tables;
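
A slightly longer practice session is sketched below; the table name, columns and rows are illustrative:

hive> create database practice;
hive> use practice;
hive> create table student(id int, name string, dept string) row format delimited fields terminated by ',';
hive> insert into student values (1,'Anu','CSBS'), (2,'Babu','CSE');
hive> select * from student;
hive> select dept, count(*) from student group by dept;
hive> drop table student;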
Contents Marks
Aim and Algorithm 30
Program and Execution 30
Output and Result 30
Viva 10
Total 100

Result:
Thus, Hive was installed successfully and the practice examples were executed.
Exp.no: 4
Installation of HBase, installing Thrift along with practice examples
Date: / / 20

Aim:
To install HBase and Thrift, and work through practice examples.
Procedure:
1. Pre-requisites:
o Java installation - check whether Java is installed or not using the following command:
java -version
o Hadoop installation - check whether Hadoop is installed or not using the following command:
hadoop version

HBase Standalone Mode of Installation:


Step 1) Download the HBase tar file (stable version):
a) Go to the HBase download page. It will open a webpage listing the releases.
b) Select the stable version (here, hbase-2.2.2).
c) Click on hbase-2.2.2-bin.tar.gz. It will download the tar file. Copy the tar file into an
installation location.

How to Install HBase in Ubuntu in Standalone Mode:
Here is the step-by-step process of HBase standalone mode installation in Ubuntu:

d) Place hbase-2.2.2-bin.tar.gz in /home/vrsoopslab/Downloads.
Unzip it by executing the command:
$ tar -xvf hbase-2.2.2-bin.tar.gz
It will unzip the contents and create hbase-2.2.2 in the location /home/vrsoopslab/Downloads.

Step 2) Change directory:
cd /home/vrsoopslab/Downloads/hb/hbase/conf

Step 3) Open hbase-env.sh

Open hbase-env.sh with gedit and mention the JAVA_HOME path in it.

Step 4) Edit ~/.bashrc
Open the ~/.bashrc file with gedit and mention the HBASE_PREFIX path as shown below.

#--insert HBASE_PREFIX
HBASE_PREFIX=/home/vrsoopslab/Downloads/hb/hbase
#--in PATH variable just append at the end of the line
PATH=$PATH:$HBASE_PREFIX/bin
#--Append HBASE_PREFIX at end of the export statement
export PATH JAVA_HOME HADOOP_PREFIX HIVE_PREFIX HBASE_PREFIX

Step 5) Add properties in the file

Open hbase-site.xml with gedit and place the following properties inside the
<configuration></configuration> element of the file:
vrsoopslab@ubuntu:~/Downloads/hb/hbase/conf$ gedit hbase-site.xml

<property>
<name>hbase.rootdir</name>
<value>file:///home/vrsoopslab/Downloads/hb/hbase</value>
</property>
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>/home/vrsoopslab/Downloads/hb/hbase/zookeeper</value>
</property>

Here we are placing two properties:
• one for the HBase root directory, and
• a second one for the data directory corresponding to ZooKeeper.
All HMaster and ZooKeeper activities refer to this hbase-site.xml.

Step 6) Mention the IPs

Open the hosts file present in the /etc location and mention the IPs as shown below.

Step 7) Run Hadoop: sbin/start-all.sh

First run Hadoop, after that execute the below command:
vrsoopslab@ubuntu:~/Downloads/Hadoop$ jps

Step 8) Start HBase:
vrsoopslab@ubuntu:~/Downloads/hb/hbase/bin$ start-hbase.sh
vrsoopslab@ubuntu:~/Downloads/hb/hbase/bin$ hbase shell
http://localhost:16010

Step 9) Mention the following command to start the HBase Thrift server:
bin/hbase-daemon.sh start thrift
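
Once the HBase shell is up, a minimal practice session could look like the following; the table name, column family and values are illustrative:

hbase(main):001:0> create 'emp', 'personal'
hbase(main):002:0> put 'emp', '1', 'personal:name', 'Rohan'
hbase(main):003:0> put 'emp', '1', 'personal:city', 'Lucknow'
hbase(main):004:0> scan 'emp'
hbase(main):005:0> get 'emp', '1'
hbase(main):006:0> disable 'emp'
hbase(main):007:0> drop 'emp'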

Contents Marks
Aim and Algorithm 30
Program and Execution 30
Output and Result 30
Viva 10
Total 100

Result:

Thus, HBase and Thrift were installed successfully and the practice examples were executed.
Exp.no: 5
Run a basic word count MapReduce program
Date: / / 20

Aim:
To run a basic word count MapReduce program to understand the MapReduce paradigm.

Procedure:

Pre-requisite steps:
o Java Installation - Check whether the Java is installed or not using the following command.
java -version
o Hadoop Installation - Check whether the Hadoop is installed or not using the following
command.
hadoop version
o To install Python on Ubuntu, follow these steps:
1. Update Package Lists: $ sudo apt update
2. Install Python: Ubuntu usually comes with Python pre-installed, but you can install
the latest version using the following command
$ sudo apt install python3
3. Check Installation: After installation, you can verify if Python is installed by typing
the following in the terminal:
$ python3 --version
4. Install pip (Python Package Manager): Pip is a package manager for Python that
allows you to easily install and manage Python libraries.
$ sudo apt install python3-pip
5. Check pip Installation: After installing pip, you can check if it's installed by running:
$ pip3 --version

Hadoop Streaming is a feature that comes with Hadoop and allows users or developers to use
various different languages for writing MapReduce programs, like Python, C++, Ruby, etc. It
supports all the languages that can read from standard input and write to standard output. We
will be implementing Python with Hadoop Streaming and will observe how it works. We will
implement the word count problem in Python to understand Hadoop Streaming. We will be
creating mapper.py and reducer.py to perform the map and reduce tasks.

Let's create one file which contains multiple words that we can count.

Step 1: Create an input file with the name input-wc.txt and add some data to it.
ubuntu@ubuntu:~$ cd wordcount
ubuntu@ubuntu:~/wordcount$ ls -l
total 12
-rw-rw-r-- 1 ubuntu ubuntu 125 Aug 16 22:14 input-wc.txt
-rwxrwxrwx 1 ubuntu ubuntu 168 Aug 16 22:21 mapper.py
-rwxrwxrwx 1 ubuntu ubuntu 1067 Aug 16 22:22 reducer.py
ubuntu@ubuntu:~/wordcount$ nano input-wc.txt
ubuntu@ubuntu:~/wordcount$ cat input-wc.txt
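
The exact contents of input-wc.txt are up to you; a few lines such as the following (illustrative) are enough to exercise the job:

big data analytics with hadoop
hadoop streaming runs the python mapper and reducer
big data needs hadoop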

Step 2: Create a mapper.py file that implements the mapper logic. It will read the data from
STDIN, split the lines into words, and generate an output of each word with its individual count.

ubuntu@ubuntu:~/wordcount$ nano mapper.py
ubuntu@ubuntu:~/wordcount$ cat mapper.py

Mapper.py code:

#!/usr/bin/env python
import sys

# read each line from standard input
for line in sys.stdin:
    # remove leading/trailing whitespace and split the line into words
    line = line.strip()
    words = line.split()
    # emit each word with a count of 1, tab-separated
    for word in words:
        print('%s\t%s' % (word, 1))

Let's test our mapper.py locally to check whether it is working fine or not.

ubuntu@ubuntu:~/wordcount$ cat input-wc.txt | python3 mapper.py
Step 3: Create a reducer.py file that implements the reducer logic. It will read the output of
mapper.py from STDIN (standard input), aggregate the occurrences of each word, and
write the final output to STDOUT.
ubuntu@ubuntu:~/wordcount$ nano reducer.py
ubuntu@ubuntu:~/wordcount$ cat reducer.py

Reducer.py code:

#!/usr/bin/env python

from operator import itemgetter
import sys

current_word = None
current_count = 0
word = None

# read the entire line from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the data on the basis of the tab we have provided in mapper.py
    word, count = line.split('\t', 1)
    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently ignore/discard this line
        continue

    # this IF-switch only works because Hadoop sorts map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write result to STDOUT
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# do not forget to output the last word if needed!
if current_word == word:
    print('%s\t%s' % (current_word, current_count))

Now let's check whether our reducer code reducer.py is working properly together with mapper.py,
with the help of the below command.

ubuntu@ubuntu:~/wordcount$ cat input-wc.txt | python3 mapper.py | sort -k1,1 | python3 reducer.py

We can see that our reducer is also working fine in our local system.
Step 4: Now let's start all our Hadoop daemons with the below commands.

ubuntu@ubuntu:~/wordcount$ cd
ubuntu@ubuntu:~$ cd $HADOOP_PREFIX
ubuntu@ubuntu:~/Downloads/hadoop-2.7.0$ sbin/start-dfs.sh
ubuntu@ubuntu:~/Downloads/hadoop-2.7.0$ sbin/start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to /home/ubuntu/Downloads/hadoop-2.7.0/logs/yarn-ubuntu-resourcemanager-ubuntu.out
localhost: starting nodemanager, logging to /home/ubuntu/Downloads/hadoop-2.7.0/logs/yarn-ubuntu-nodemanager-ubuntu.out
ubuntu@ubuntu:~/Downloads/hadoop-2.7.0$ jps
4016 DataNode
4545 NodeManager
4579 Jps
4405 ResourceManager
3877 NameNode
4190 SecondaryNameNode
ubuntu@ubuntu:~/Downloads/hadoop-2.7.0$ cd
ubuntu@ubuntu:~$ cd wordcount
Let's give executable permission to our mapper.py and reducer.py with the help of the below command.

ubuntu@ubuntu:~/wordcount$ chmod 777 mapper.py reducer.py
ubuntu@ubuntu:~/wordcount$ hdfs dfs -ls /
Found 1 items
drwxr-xr-x - ubuntu supergroup 0 2023-08-17 10:20 /wordcount
ubuntu@ubuntu:~/wordcount$ ls -l
total 12
-rw-rw-r-- 1 ubuntu ubuntu 125 Aug 17 09:45 input-wc.txt
-rwxrwxrwx 1 ubuntu ubuntu 168 Aug 16 22:21 mapper.py
-rwxrwxrwx 1 ubuntu ubuntu 1067 Aug 16 22:22 reducer.py

Create an input directory in HDFS and upload your input data file:

hdfs dfs -mkdir /input
hdfs dfs -put /home/ubuntu/wordcount/input-wc.txt /input/

(Optional) You can also package the Python scripts for distribution; with Hadoop Streaming this is
not strictly required, since the -files option used below ships mapper.py and reducer.py to the cluster:
ubuntu@ubuntu:~/wordcount$ cat mapper.py | gzip > mapper.py.gz
ubuntu@ubuntu:~/wordcount$ cat reducer.py | gzip > reducer.py.gz
ubuntu@ubuntu:~/wordcount$ jar cvf pythonWordCount.jar mapper.py.gz reducer.py.gz
added manifest
adding: mapper.py.gz(in = 136) (out= 139)(deflated -2%)
adding: reducer.py.gz(in = 531) (out= 536)(deflated 0%)

Step 5: Now locate the hadoop-streaming jar file (it ships with Hadoop under
share/hadoop/tools/lib), or download the latest version and place it somewhere you can easily
access it. In my case, I use the jar bundled with Hadoop, as shown below.
Now let's run our Python files with the help of the Hadoop Streaming utility as shown below.

hadoop jar /home/ubuntu/Downloads/hadoop-2.7.0/share/hadoop/tools/lib/hadoop-streaming-2.7.0.jar \
> -files /home/ubuntu/wordcount/mapper.py,/home/ubuntu/wordcount/reducer.py \
> -mapper "python3 mapper.py" \
> -reducer "python3 reducer.py" \
> -input /input \
> -output /output

In the above command, -output specifies the location in HDFS where we want our output to be
stored. So let's check our output, which is at the location /output/* in my case. We can check the
results by manually visiting the location in HDFS or with the help of the cat command as shown
below.

To view output:
ubuntu@ubuntu:~/wordcount$ hdfs dfs -cat /output/part-00000

To view in the browser:
localhost:50070
localhost:8088
Contents Marks
Aim and Algorithm 30
Program and Execution 30
Output and Result 30
Viva 10
Total 100

Result:

Thus, the basic word count MapReduce program was executed successfully, demonstrating the
MapReduce paradigm.
Exp.no: 6
Implementation of matrix multiplication with MapReduce
Date: / / 20

Aim:
To implement matrix multiplication with a Hadoop MapReduce program.

Procedure:
Step 1: Pre-requisites:
o Java installation - check whether Java is installed or not using the following command:
java -version
o Hadoop installation - check whether Hadoop is installed or not using the following command:
hadoop version

Step 2. Prepare Input Data Files:

Depending on your use case, you might have input data in various formats. Hadoop Streaming
expects data to be in plain text format, where each line represents a record. For matrix
multiplication, you might have two text files, each representing a matrix, where each line of a
file contains the space-separated values of one row of the matrix (this is the format the mapper
below expects).
For example, if you're performing matrix multiplication and your matrices are:
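
For instance, with the dimensions assumed by the mapper and reducer scripts in Step 4 (a 2 x 3 matrix M and a 3 x 2 matrix N), the two input files could look like this (illustrative values, one space-separated row per line):

matrix1.txt (matrix M, 2 x 3):
1 2 3
4 5 6

matrix2.txt (matrix N, 3 x 2):
7 8
9 10
11 12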

Check that the input files were created, using the following command:

ubuntu@ubuntu:~/matrix$ ls
Step 3. Upload Input Data to HDFS (if applicable):
If you're using HDFS to store your input data, you need to upload your input data files to
HDFS using the hdfs dfs -put command. For example:
# create the destination directory if it doesn't exist
hdfs dfs -mkdir -p /matrix/
After creating the directory, put the input files into HDFS using the following commands:
ubuntu@ubuntu:~/Downloads/hadoop-2.7.0$ hdfs dfs -put /home/ubuntu/matrix/matrix1.txt /matrix/
ubuntu@ubuntu:~/Downloads/hadoop-2.7.0$ hdfs dfs -put /home/ubuntu/matrix/matrix2.txt /matrix/
Replace /home/ubuntu/matrix/matrix1.txt and /home/ubuntu/matrix/matrix2.txt with the paths to
your local input data files, and /matrix/ with the desired directory path on HDFS.

Step 4: Write Map and Reduce Functions

Create your Python scripts for the map and reduce functions.

Mapper.py
#!/usr/bin/python3
import sys

# dimensions of matrix M (m_r x m_c) and matrix N (n_r x n_c)
m_r = 2
m_c = 3
n_r = 3
n_c = 2

i = 0
for line in sys.stdin:
    el = list(map(int, line.split()))
    if i < m_r:
        # row i of matrix M: emit (output row, output column, element index, value)
        for j in range(len(el)):
            for k in range(n_c):
                print("{0}\t{1}\t{2}\t{3}".format(i, k, j, el[j]))
    else:
        # row (i - m_r) of matrix N: emit (output row, output column, element index, value)
        for j in range(len(el)):
            for k in range(m_r):
                print("{0}\t{1}\t{2}\t{3}".format(k, j, i - m_r, el[j]))
    i += 1

Let's test our mapper.py locally to check whether it is working fine or not,
using the following command:
$ cat *.txt | python3 mapper.py
Reducer.py
#!/usr/bin/python3
import sys

# dimensions of matrix M (m_r x m_c) and matrix N (n_r x n_c)
m_r = 2
m_c = 3
n_r = 3
n_c = 2

matrix = []
for row in range(m_r):
    r = []
    for col in range(n_c):
        s = 0
        for el in range(m_c):
            mul = 1
            # read the M value and the matching N value for this cell and multiply them
            for num in range(2):
                line = sys.stdin.readline()
                # the matrix element is the last tab-separated field of the mapper output
                n = list(map(int, line.split('\t')))[-1]
                mul *= n
            s += mul
        r.append(s)
    matrix.append(r)

print('\n'.join([str(x) for x in matrix]))
Now let's check whether our reducer code reducer.py is working properly together with mapper.py,
with the help of the below command. Note that the mapper output has to be sorted on its first three
key fields before it reaches the reducer, so that the M and N values of each output cell arrive as
adjacent pairs.

ubuntu@ubuntu:~/matrix$ cat *.txt | python3 mapper.py | sort -n -k1,1 -k2,2 -k3,3 | python3 reducer.py

We can see that our reducer is also working fine in our local system.

Use the following command to change the execution permission for mapper.py and reducer.py:
ubuntu@ubuntu:~/matrix$ chmod 777 mapper.py reducer.py
ubuntu@ubuntu:~/matrix$ ls -l

You can run the map reduce job and view the result by the following step (considering you
have already put input files in HDFS)

Step 5: Running the MapReduce Job on Hadoop

Now locate the hadoop-streaming jar file (it ships with Hadoop under share/hadoop/tools/lib), or
download the latest version and place it somewhere you can easily access it.
Now let's run our Python files with the help of the Hadoop Streaming utility as shown below.

ubuntu@ubuntu:~/Downloads/hadoop-2.7.0$ hadoop jar /home/ubuntu/Downloads/hadoop-2.7.0/share/hadoop/tools/lib/hadoop-streaming-2.7.0.jar \
> -files /home/ubuntu/matrix/mapper.py,/home/ubuntu/matrix/reducer.py \
> -mapper "python3 mapper.py" \
> -reducer "python3 reducer.py" \
> -input /matrix \
> -output /output_matrix
Here are the steps to view the output:
If you specified an HDFS output directory (e.g., -output /output_matrix), you can use the following
command to list the files in that directory:
$ hdfs dfs -ls /output_matrix
To view the contents of a specific output file, you can use:
$ hdfs dfs -cat /output_matrix/part-00000

To view in the browser:
localhost:50070
localhost:8088

Contents Marks
Aim and Algorithm 30
Program and Execution 30
Output and Result 30
Viva 10
Total 100

Result:
Thus, the implementation of matrix multiplication with Hadoop MapReduce was executed
successfully.
Exp.no: 7
Practice importing and exporting data from various databases
Date: / / 20

Aim:
To practice importingand exporting data from various database.
.

Procedure:

1. Pre-requisites:
o Java installation - check whether Java is installed or not using the following command:
java -version
o Hadoop installation - check whether Hadoop is installed or not using the following command:
hadoop version

Importing data from MySQL to HDFS

In order to store data into HDFS, we make use of Apache Hive, which provides an SQL-like
interface between the user and the Hadoop distributed file system (HDFS). We perform the
following steps:
Step 1: Log in to MySQL:
mysql -u root -pcloudera

Step 2: Create a database and table and insert data.
create database geeksforgeeeks;
create table geeksforgeeeks.geeksforgeeks(author_name varchar(65), total_no_of_articles int,
phone_no int, address varchar(65));
insert into geeksforgeeeks.geeksforgeeks values('Rohan',10,123456789,'Lucknow');
Database Name: geeksforgeeeks and Table Name: geeksforgeeks

Step 3: Create a database and table in Hive into which the data should be imported.
create table geeks_hive_table(name string, total_articles int, phone_no int, address string) row
format delimited fields terminated by ',';
Hive Database: geeks_hive and Hive Table: geeks_hive_table

Step 4: Run the below import command on Hadoop.

sqoop import --connect \
jdbc:mysql://127.0.0.1:3306/database_name_in_mysql \
--username root --password cloudera \
--table table_name_in_mysql \
--hive-import --hive-table database_name_in_hive.table_name_in_hive \
--m 1

SQOOP command to import data. In the above command, the following things should be noted:

• 127.0.0.1 is the localhost IP address.
• 3306 is the port number for MySQL.
• m is the number of mappers.

Step 5: Check in Hive whether the data has been imported successfully or not.
Data imported into Hive successfully.
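
A quick way to verify the import from the Hive shell is sketched below, using the database and table names from the example above:

hive> use geeks_hive;
hive> select * from geeks_hive_table;
hive> select count(*) from geeks_hive_table;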

Exporting data from HDFS to MySQL

To export data into MySQL from HDFS, perform the following steps:
Step 1: Create a database and table in Hive.
create table hive_table_export(name string, company string, phone int, age int) row format delimited
fields terminated by ',';

Hive Database: hive_export and Hive Table: hive_table_export

Step 2: Insert data into the Hive table.
insert into hive_table_export values('Ritik','Amazon',234567891,35);

Data in Hive table


Step 3: Create a database and table in MySQL in which data should be exported.

MySQL Database : mysql_export and MySQL Table : mysql_table_export

Step 4: Run the following command on Hadoop.

sqoop export --connect \
jdbc:mysql://127.0.0.1:3306/database_name_in_mysql \
--table table_name_in_mysql \
--username root --password cloudera \
--export-dir /user/hive/warehouse/hive_database_name.db/table_name_in_hive \
--m 1 \
--driver com.mysql.jdbc.Driver \
--input-fields-terminated-by ','

SQOOP command to export data. In the above command, the following things should be noted:

• 127.0.0.1 is the localhost IP address.
• 3306 is the port number for MySQL.
• In the case of exporting data, the entire path to the table should be specified.
• m is the number of mappers.

Step 5: Check in MySQL whether the data has been exported successfully or not.
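
From the MySQL prompt, a quick verification could look like this, using the database and table names from the example above:

mysql> use mysql_export;
mysql> select * from mysql_table_export;
mysql> select count(*) from mysql_table_export;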

Contents Marks
Aim and Algorithm 30
Program and Execution 30
Output and Result 30
Viva 10
Total 100

Result:
Thus, the practice of importing and exporting data from various databases was completed successfully.
CONTENT BEYOND SYLLABUS
Exp.no: 8
Installation of Pig
Date: / / 20

Aim:

To install Pig and run a simple practice example.

Procedure:

1. Pre-requisites:
o Java installation - check whether Java is installed or not using the following command:
java -version
o Hadoop installation - check whether Hadoop is installed or not using the following command:
hadoop version

In order to install Apache Pig, you must have Hadoop and Java installed on your system.
Step 1: Download the new release of Apache Pig from this link. In my case I have downloaded the
pig-0.17.0.tar.gz version of Pig, which is the latest and about 220 MB in size.
Step 2: Now move the downloaded Pig tar file to your desired location. In my case I am moving it to
my /Documents folder.

Step 3: Now we extract this tar file with the help of below command (make sure to check your tar
filename):
tar -xvf pig-0.17.0.tar.gz

Step 4: Once it is extracted, it's time for us to switch to our Hadoop user. In my case it is hadoopusr. If
you have not created a separate dedicated user for Hadoop, then there is no need to move the file; just set
the Pig path in the .bashrc file accordingly. To switch user you can use the below command, or you can
also switch manually through the switch-user settings.
su - hadoopusr
Step 5: Now we need to move this extracted folder to the hadoopusr user. For that, use the below
command (make sure the name of your extracted folder is pig-0.17.0, otherwise change it accordingly):
sudo mv pig-0.17.0 /usr/local/
sudo mv pig-0.17.0 /usr/local/

Step 6: Now, once we have moved it, we need to set the environment variable for Pig's location. For that,
open the .bashrc file with the below command.
sudo gedit ~/.bashrc

Once the file is open, save the below lines inside this .bashrc file.

#Pig location
export PIG_INSTALL=/usr/local/pig-0.17.0
export PATH=$PATH:/usr/local/pig-0.17.0/bin

Step 7: Then reload the file so the changes take effect:
source ~/.bashrc

Step 8: Once that is done, we have successfully installed Pig on our Hadoop single-node setup;
now we can start Pig with the below command.
pig
Step 9: You can check your Pig version with the below command.
pig -version
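
As a small practice example, the classic word count can be written in Pig Latin from the Grunt shell; the input path below is illustrative (it reuses the word-count input file from Experiment 5):

grunt> lines = LOAD '/input/input-wc.txt' AS (line:chararray);
grunt> words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grunt> grouped = GROUP words BY word;
grunt> counts = FOREACH grouped GENERATE group, COUNT(words);
grunt> DUMP counts;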

Contents Marks
Aim and Algorithm 30
Program and Execution 30
Output and Result 30
Viva 10
Total 100

Result:

Thus the installation procedure of Pig was successfully completed.
