BigData_Lab_Manual
DEPARTMENT OF ARTIFICIAL
INTELLIGENCE AND DATA SCIENCE
LAB MANUAL
for
CCS334 – BIG DATA ANALYTICS
(Regulation 2021, V Semester, Theory cum Practical
Course)
Output:
Step 5 : Switch user
Switch to the newly created hadoop user:
$ su - hadoop
Step 6 : Configure SSH
Now configure password-less SSH access for the newly created hadoop user. When prompted for the file in which to save the key and for a passphrase, simply press Enter to accept the defaults. First, generate an SSH key pair (public and private keys):
$ssh-keygen -t rsa
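Then append the public key to the authorized keys and restrict its permissions:
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 640 ~/.ssh/authorized_keys
Verify the password-less login:
$ ssh localhost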
Next, you will need to configure the Hadoop and Java environment variables on your system. Open the ~/.bashrc file in your favorite text editor. If you use the nano editor, paste with Ctrl+Shift+V, then save with Ctrl+X followed by Y and Enter:
$ nano ~/.bashrc
Append the following lines to the file (adjust the paths to match your installation).
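For example, if Hadoop is extracted to /home/hadoop/hadoop, a typical set of entries is:
export HADOOP_HOME=/home/hadoop/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"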
Save and close the file. Then, activate the environment variables with the following
command:
$ source ~/.bashrc
Next, open the Hadoop environment variable file:
$ nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh
Search for the "export JAVA_HOME" line and configure it:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
Next, edit the core-site.xml file and update with your system hostname:
$nano $HADOOP_HOME/etc/hadoop/core-site.xml
Change the following name as per your system hostname:
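A minimal core-site.xml example (replace your-hostname with your system hostname):
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://your-hostname:9000</value>
</property>
</configuration>
Edit hdfs-site.xml, mapred-site.xml, and yarn-site.xml in the same way if your setup requires it, then format the NameNode (first run only) and start the Hadoop services:
$ hdfs namenode -format
$ start-all.sh
Verify the running Hadoop daemons: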
$ jps
Open the YARN ResourceManager web interface in your browser (replace the IP address with your server's address):
http://192.168.16:8088
Let's create some directories in the HDFS filesystem using the following commands:
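For example:
$ hdfs dfs -mkdir /test1
$ hdfs dfs -mkdir /logs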
Also, put some files into the Hadoop file system. In this example, we copy log files from the host machine into HDFS.
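For example:
$ hdfs dfs -put /var/log/*.log /logs/
List the files to verify:
$ hdfs dfs -ls /logs/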
You can also verify the above files and directories in the Hadoop NameNode web interface (typically http://<your-server-ip>:9870 in Hadoop 3.x). Go to the web interface and click Utilities => Browse the file system. You should see the directories you created earlier on the following screen:
To stop the Hadoop services, run the following command:
$ stop-all.sh
Result:
The step-by-step installation and configuration of Hadoop on an Ubuntu Linux system has been successfully completed.
Exp 2: Hadoop implementation of file management tasks, such as adding files
and directories, retrieving files, and deleting files
AIM:
To implement file management tasks such as adding files and directories, retrieving files,
and deleting files.
DESCRIPTION:
HDFS is a scalable distributed filesystem designed to scale to petabytes of data while running
on top of the underlying filesystem of the operating system. HDFS keeps track of where the
data resides in a network by associating the name of its rack (or network switch) with the
dataset. This allows Hadoop to efficiently schedule tasks to those nodes that contain data, or
which are nearest to it, optimizing bandwidth utilization. Hadoop provides a set of command
line utilities that work similarly to the Linux file commands and serve as your primary
interface with HDFS.
PROCEDURE:
Step-1: Adding Files and Directories to HDFS. Before you can run Hadoop programs on data stored in HDFS, you'll need to put the data into HDFS first. Let's create a directory and put a file in it. HDFS has a default working directory of /user/$USER, where $USER is your login username. This directory isn't automatically created for you, though, so let's create it with the mkdir command. Log in as your hadoop user and start the Hadoop services:
start-all.sh
For the purpose of illustration, we use chuck. You should substitute your user name in
the example commands.
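A minimal sequence (assuming a local file named example.txt):
hadoop fs -mkdir -p /user/chuck
hadoop fs -put example.txt /user/chuck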
Step-2: Retrieving Files from HDFS. The Hadoop get command copies files from HDFS back to the local filesystem, while cat prints a file's contents to standard output. To view example.txt, we can run the following command.
hadoop fs -cat /user/chuck/example.txt
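Step-3: Deleting Files from HDFS. The Hadoop rm command removes files from HDFS. To delete the file added above, we can run the following command.
hadoop fs -rm /user/chuck/example.txt
Result:
Thus, the file management tasks of adding, retrieving, and deleting files in HDFS have been executed successfully.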
Exp 3: Implement Matrix Multiplication with Hadoop MapReduce
AIM:
To implement the Matrix Multiplication with Hadoop Map Reduce
Procedure:
Step 1: Before writing the code, let's first create the matrices and put them in HDFS.
● Create two files, m1 and m2, and enter the matrix values (separate columns with spaces and rows with new lines):
$nano m1
1 2 3
4 5 6
$ nano m2
78
9 10
11 12
● Put the above files into HDFS at the location /user/path/to/matrices/:
hdfs dfs -mkdir -p /user/path/to/matrices
hdfs dfs -put /path/to/m1 /user/path/to/matrices/
hdfs dfs -put /path/to/m2 /user/path/to/matrices/
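Step 2: Mapper Logic - Mapper.py:
The script below is an illustrative sketch of one possible mapper (its file name and logic are assumptions, not a definitive listing). For every output cell (i, j) and every k it emits the required element of m1 and of m2 under a four-field tab-separated key (i, j, k, matrix tag), so that with single-digit indices and a single reducer Hadoop's sort delivers the values in exactly the order the reducer expects (the Step 4 command below sets the matching sort options).
$ nano Mapper.py
#!/usr/bin/env python3
# Mapper.py - illustrative sketch for multiplying a 2x3 matrix (m1) by a 3x2 matrix (m2)
import os
import sys

# Dimensions of m1 (m_r x m_c) and m2 (n_r x n_c); these must match Reducer.py.
m_r, m_c = 2, 3
n_r, n_c = 3, 2

# Hadoop Streaming exposes the current input file name through this
# environment variable; it is used to tell matrix m1 apart from matrix m2.
input_file = os.environ.get('mapreduce_map_input_file',
                            os.environ.get('map_input_file', ''))
is_m1 = 'm1' in input_file

row = 0
for line in sys.stdin:
    values = line.split()
    if not values:
        continue
    if is_m1:
        # m1[row][k] is needed for every output column j.
        for k, v in enumerate(values):
            for j in range(n_c):
                print('%d\t%d\t%d\t0\t%s' % (row, j, k, v))
    else:
        # m2[row][j] is needed for every output row i (here row plays the role of k).
        for j, v in enumerate(values):
            for i in range(m_r):
                print('%d\t%d\t%d\t1\t%s' % (i, j, row, v))
    row += 1
Step 3: Reducer Logic - Reducer.py:
Create Reducer.py with the code below. For each cell of the result it reads the element pairs in order, multiplies each pair, and sums the products.
$ nano Reducer.py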
#!/usr/bin/env python3
import sys
m_r = 2
m_c = 3
n_r = 3
n_c = 2
matrix = []
for row in range(m_r):
    r = []
    for col in range(n_c):
        s = 0
        for el in range(m_c):
            mul = 1
            for num in range(2):
                line = input()  # Read input from standard input
                n = list(map(int, line.split('\t')))[-1]
                mul *= n
            s += mul
        r.append(s)
    matrix.append(r)
# Print the matrix
for row in matrix:
    print('\t'.join([str(x) for x in row]))
Step 4: Running the Map-Reduce Job on Hadoop
You can run the MapReduce job and view the result with the commands below (assuming you have already put the input files in HDFS).
The Hadoop streaming jar normally ships with Hadoop under $HADOOP_HOME/share/hadoop/tools/lib; if you need to download the jar separately, use the following command:
$ wget https://jar-download.com/download-handling.php
$ chmod +x /path/to/Mapper.py
$ chmod +x /path/to/Reducer.py
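Then submit the streaming job and view the output (a typical invocation; the exact jar file name depends on your Hadoop version, and the -D options request a single reducer and a four-field sort key as required by the Mapper.py/Reducer.py scripts above):
$ hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
-D mapreduce.job.reduces=1 \
-D stream.num.map.output.key.fields=4 \
-files /path/to/Mapper.py,/path/to/Reducer.py \
-mapper Mapper.py \
-reducer Reducer.py \
-input /user/path/to/matrices \
-output /user/path/to/matrix_output
$ hdfs dfs -cat /user/path/to/matrix_output/part-*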
Result:
Thus, the matrix multiplication using map reduce program has been successfully completed.
EXP 4: Run a basic Word Count Map Reduce program to understand Map
Reduce Paradigm.
AIM:
To run a basic Word Count MapReduce program.
Procedure:
Step 1: Create Data File:
Create a file named "word_count.txt" and populate it with the text data that you wish to
analyse.
Log in as your hadoop user.
nano word_count.txt
Output:
Step 2: Mapper Logic - mapper.py:
Create a file named "mapper.py" to implement the logic for the mapper.
nano mapper.py
# Copy and paste the mapper.py code
#!/usr/bin/env python3
# import sys because we need to read and write data to STDIN and STDOUT
import sys
# reading entire line from STDIN (standard input)
for line in sys.stdin:
    # to remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # we are looping over the words array and printing the word
    # with the count of 1 to the STDOUT
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        print('%s\t%s' % (word, 1))
Here, in the above program, #! is known as the shebang line; it tells the system which interpreter should be used to run the script. The file will be run using the command we specify when submitting the streaming job.
Step 3: Reducer Logic - reducer.py:
Create a file named "reducer.py" to implement the logic for the reducer. The reducer
will aggregate the occurrences of each word and generate the final output.
nano reducer.py
# Copy and paste the reducer.py code
#!/usr/bin/env python3
from operator import itemgetter
import sys

current_word = None
current_count = 0
word = None

# read the mapper output (word \t count) from STDIN
for line in sys.stdin:
    line = line.strip()
    # parse the word and its count
    word, count = line.split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        # silently ignore lines whose count is not a number
        continue
    # Hadoop sorts the mapper output by key, so all counts for the
    # same word arrive together and can be summed
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # emit the total for the previous word
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# emit the total for the last word
if current_word == word:
    print('%s\t%s' % (current_word, current_count))
Step 4: Prepare Hadoop Environment:
Start the Hadoop daemons and create a directory in HDFS to store your data.
start-all.sh
hdfs dfs -mkdir /word_count_in_python
hdfs dfs -copyFromLocal /path/to/word_count.txt /word_count_in_python
Step 6: Make Python Files Executable:
Give executable permissions to your mapper.py and reducer.py files, as shown below.
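chmod +x /path/to/mapper.py
chmod +x /path/to/reducer.py
Finally, run the streaming job and view the word counts (a typical invocation; adjust the hadoop-streaming jar path and the file locations to your installation):
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
-files /path/to/mapper.py,/path/to/reducer.py \
-mapper mapper.py \
-reducer reducer.py \
-input /word_count_in_python/word_count.txt \
-output /word_count_out
hdfs dfs -cat /word_count_out/part-*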
Result:
Thus, the program for basic Word Count Map Reduce has been executed successfully.
Exp5: Installation of Hive on Ubuntu
Aim:
To download and install Hive, and to understand its startup scripts and configuration files.
Procedure:
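1. Download and extract Apache Hive (version 3.1.2 is used throughout this experiment; the archive URL below is one possible source):
$ wget https://archive.apache.org/dist/hive/hive-3.1.2/apache-hive-3.1.2-bin.tar.gz
$ tar -xzf apache-hive-3.1.2-bin.tar.gz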
2. Exporting the Hadoop path in hive-env.sh (to communicate with the Hadoop ecosystem, we define the Hadoop home path in Hive's environment configuration). Open hive-env.sh as shown below:
$ cd apache-hive-3.1.2-bin/conf
$ cp hive-env.sh.template hive-env.sh
$ nano hive-env.sh
Append the following lines to it:
export HADOOP_HOME=/home/hadoop/hadoop
export HIVE_CONF_DIR=/home/hadoop/apache-hive-3.1.2-bin/conf
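3. Configure the Hive metastore by editing hive-site.xml in the same conf directory ($ nano hive-site.xml) and adding the properties below inside its <configuration> element. A JDBC connection URL is also required; a typical value, assuming a local MySQL metastore database named metastore, is:
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost:3306/metastore?createDatabaseIfNotExist=true&amp;useSSL=false</value>
</property>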
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.cj.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>your_new_password</value>
</property>
<property>
<name>datanucleus.autoCreateSchema</name>
<value>true</value>
</property>
<property>
<name>datanucleus.fixedDatastore</name>
<value>true</value>
</property>
<property>
<name>datanucleus.autoCreateTables</name>
<value>true</value>
</property>
</configuration>
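Finally, initialize the metastore schema and start the Hive shell to verify the installation (run these from the Hive bin directory, or add it to your PATH):
$ cd ../bin
$ ./schematool -dbType mysql -initSchema
$ ./hive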
Result:
Thus, the Apache Hive installation is completed successfully on Ubuntu.
Exp 6: Design and test various schema models to optimize data storage and retrieval
Using Hive.
Aim:
To design and test various schema models to optimize data storage and retrieval using
Hive.
Procedure:
Step 1: Start Hive
Open a terminal and start Hive by running:
$hive
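Step 2: Create and test sample schema models. The statements below are an illustrative sketch (the database, table, and column names are examples only); they compare a plain table with a partitioned table, since partitioning by a frequently filtered column lets Hive scan only the relevant partition and so speeds up retrieval.
hive> CREATE DATABASE IF NOT EXISTS college;
hive> USE college;
hive> CREATE TABLE student_raw (roll_no INT, name STRING, dept STRING, marks INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
hive> CREATE TABLE student_part (roll_no INT, name STRING, marks INT) PARTITIONED BY (dept STRING);
hive> INSERT INTO student_raw VALUES (1, 'Asha', 'ADS', 85), (2, 'Bala', 'CSE', 78);
hive> INSERT OVERWRITE TABLE student_part PARTITION (dept='ADS') SELECT roll_no, name, marks FROM student_raw WHERE dept='ADS';
hive> SELECT * FROM student_part WHERE dept='ADS';
Step 3: Compare the query plans with and without partition pruning (for example, EXPLAIN SELECT * FROM student_raw WHERE dept='ADS' versus the same query on student_part) to see how the schema affects data retrieval.
Result:
Thus, various schema models were designed and tested using Hive to optimize data storage and retrieval.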
Exp 7: Installation of HBase on Ubuntu
Aim:
To download and install HBase, and to understand the different HBase modes, startup
scripts, and configuration files.
Procedure:
$ wget https://downloads.apache.org/hbase/2.5.5/hbase-2.5.5-bin.tar.gz
Step 4. Let us extract the tar file using the command below and rename the folder to hbase
to make it meaningful.
$tar -xzf hbase-2.5.5-bin.tar.gz
$ mv hbase-2.5.5 hbase
Step 5. Now edit the hbase-env.sh configuration file, which is present under conf in the hbase directory, and add the JAVA path as mentioned below.
/hbase/conf$ nano hbase-env.sh
Now set the JAVA path:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/
Step 6. After this, edit the .bashrc file to update the environment variables of Apache HBase so
that it can be accessed from any directory.
nano .bashrc
Add the below path.
export HBASE_HOME=/home/hadoop/hbase
export PATH=$PATH:$HBASE_HOME/bin
Reload the file so that the changes take effect:
$ source .bashrc
Step 7. Now configure hbase-site.xml by appending the XML below. First create the data directories:
cd ..
mkdir hbasfile
mkdir zookeeper
chmod 777 hbasfile zookeeper
cd hbase/conf
nano hbase-site.xml
<property>
<name>hbase.rootdir</name>
<value>file:///home/hadoop/hbasfile</value>
</property>
<property>
<name>hbase.zookeeper.property.datadir</name>
<value>/home/hadoop/zookeeper</value>
</property>
Step 8. Now start Apache HBase and verify it using the below commands.
cd hbase
bin/start-hbase.sh
jps
Step 9. After this, we can see from the jps command that the HBase services are running; now let us
start the HBase shell using the below command.
bin/hbase shell
Result:
Thus, the Apache HBase installation is completed successfully on Ubuntu.
Exp 8: Design and test various schema models to optimize data storage and
retrieval Using Hbase.
Aim:
To design and test various schema models to optimize data storage and retrieval Using
Hbase.
Procedure:
Step-by-step schema design experiment with HBase using sample commands:
1. Start HBase:
- If HBase is not running, start it with the following command:
start-hbase.sh
hbase shell
2. Create a Table:
Create a sample HBase table named "student" with a column family "info" using the
HBase shell:
hbase:002:0>create 'student', 'info'
3. Insert Data:
- Insert sample data into the "student" table. Let's assume you're storing student information
with student IDs as row keys:
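A few illustrative put commands (the column qualifiers and values are examples only):
hbase:003:0> put 'student', '1', 'info:name', 'Asha'
hbase:004:0> put 'student', '1', 'info:dept', 'ADS'
hbase:005:0> put 'student', '2', 'info:name', 'Bala'
hbase:006:0> put 'student', '2', 'info:dept', 'CSE'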
4. Query Data:
- Retrieve student information using the row key, or scan the whole table:
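hbase:007:0> get 'student', '1'
hbase:008:0> scan 'student'
5. Drop the Table:
- When the experiment is finished, disable and drop the table: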
hbase:017:0>disable 'student'
hbase:018:0>drop 'student'
Result:
Thus, the execution of various Hbase commands has been successfully completed.
Exp 9: Practice importing and exporting data from various databases.
Aim:
To practice importing and exporting data from various databases.
Procedure:
Sqoop is basically used to transfer data between relational databases such as MySQL and Oracle
and data warehouses such as Hadoop HDFS (Hadoop Distributed File System). When data is
transferred from a relational database to HDFS, we say we are importing data; when we transfer
data from HDFS to a relational database, we say we are exporting data.
Note: To import or export, the order of columns in both MySQL and Hive should be the
same.
i) Installation of Sqoop:
Step 1: Download the stable version of Apache Sqoop from the website:
URL: https://archive.apache.org/dist/sqoop/1.4.7/
$ wget https://archive.apache.org/dist/sqoop/1.4.7/sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz
Step 2: Unzip the downloaded file using the tar command
$tar -xvzf sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz
Step 3: Edit the .bashrc file by using the command
$nano .bashrc
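Append lines like the following to it (assuming Sqoop was extracted to /home/hadoop/sqoop-1.4.7.bin__hadoop-2.6.0; adjust the path to your setup):
export SQOOP_HOME=/home/hadoop/sqoop-1.4.7.bin__hadoop-2.6.0
export PATH=$PATH:$SQOOP_HOME/bin
Step 4: Reload the .bashrc file: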
$source .bashrc
Step 5: Check the installed sqoop version using the below command
$sqoop version
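ii) Importing data from MySQL to Hive:
Step 1: Create the source database and table in MySQL and load a few sample rows (the database and table names below match the Sqoop command in Step 4; the sample values are only illustrative):
mysql> create database importtohadoop;
mysql> use importtohadoop;
mysql> create table album_details(album_name varchar(65), year int, artist varchar(65));
mysql> insert into album_details values ('Thriller', 1982, 'Michael Jackson');
mysql> insert into album_details values ('Back in Black', 1980, 'AC/DC');
Step 2: Verify the data in MySQL:
mysql> select * from album_details;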
Step 3: Create a database and table in Hive into which the data should be imported.
hive>create database album_hive;
hive>use album_hive;
hive>create table album_details_hive(album_name varchar(65), year int, artist
varchar(65));
Step 4: Run this command in the terminal:
$ sqoop import --connect "jdbc:mysql://127.0.0.1:3306/importtohadoop?useSSL=false" \
--username root --password your_new_password \
--table album_details \
--hive-import --hive-table album_hive.album_details_hive \
-m 1
Step 5: Check in Hive whether the data has been imported successfully.
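For example:
hive> use album_hive;
hive> select * from album_details_hive;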
Result:
Thus, importing and exporting data between MySQL and Hive has been implemented successfully.