BDA Lab Manual
SE 3264 - Big Data Analytics Laboratory Manual
Department: Data Science Engineering and Computer Applications
No. of contact hours/week (L-T-P-C): 0-0-3-1
What is HDFS?
Hadoop comes with a distributed file system called HDFS. In HDFS, data is distributed over several
machines and replicated to ensure durability against failure and high availability to parallel applications.
It is cost effective because it runs on commodity hardware. HDFS is built around the concepts of blocks,
the name node and data nodes.
Blocks: A block is the minimum amount of data that HDFS can read or write. HDFS blocks are 128 MB by
default, and this is configurable. Files in HDFS are broken into block-sized chunks, which are stored as
independent units. Unlike a local file system, if a file in HDFS is smaller than the block size it does not
occupy the full block's size; for example, a 5 MB file stored in HDFS with a 128 MB block size takes only
5 MB of space. The HDFS block size is large in order to minimize the cost of seeks.
Name Node: HDFS works in a master-worker pattern where the name node acts as the master. The name node is
the controller and manager of HDFS, as it knows the status and the metadata of all the files in HDFS; the
metadata includes file permissions, names and the location of each block. The metadata is small, so it is
stored in the memory of the name node, allowing faster access to data. Moreover, the HDFS cluster is
accessed by multiple clients concurrently, so all this information is handled by a single machine. File
system operations such as opening, closing and renaming are executed by the name node.
Data Node: Data nodes store and retrieve blocks when they are told to by clients or the name node. They
report back to the name node periodically with the list of blocks they are storing. Being commodity
hardware, the data node also performs block creation, deletion and replication as directed by the
name node.
Secondary Name Node: It is a separate physical machine which acts as a helper to the name node. It
performs periodic checkpoints: it communicates with the name node and takes snapshots of the metadata,
which helps minimize downtime and loss of data.
• HDFS Write:
• To use the HDFS commands, first you need to start the Hadoop services using the following
command:
$sbin/start-all.sh
• To check that the Hadoop services are up and running, use the following command:
$jps
• Check that the datanode service is up by running the jps command on the client side.
• To perform the various file operations, use hadoop fs or hdfs dfs as a prefix for each command.
• To create a directory: in HDFS there is no home directory by default, so let's first create one
using the mkdir command, for example:
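A minimal sequence (assuming the Hadoop sbin and bin directories are on the PATH and /user/hdoop is the
desired home directory; adjust the names to your cluster):
$ sbin/start-all.sh
$ jps
$ hdfs dfs -mkdir -p /user/hdoop
$ hdfs dfs -ls /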
1. List all possible Linux file operations and execute each one of them in the Linux CLI.
2. Interact with HDFS using the command line interface to understand the basic working structure of a
Hadoop cluster. Using the Hadoop CLI, demonstrate commands to do the following (representative commands
are sketched after this list):
• Create a directory in HDFS.
• Create an empty file.
• Copy files/folders from the local file system to HDFS.
• Print file contents.
• Copy files/folders from HDFS to the local file system.
• Move a file from the local file system to HDFS.
• Copy files within HDFS.
• Move files within HDFS.
• Show the size of each file in a directory.
• Show the total size of a directory/file.
• Show the last modified time of a directory or path.
• Change the replication factor of a file/directory in HDFS.
• List the contents of a directory in HDFS.
• Remove a file from HDFS.
• Change file permissions.
• Change file ownership.
• Compute a file checksum.
• Concatenate files.
• Compress/decompress a file.
• Show file block location information.
• Encrypt/decrypt a file.
3. Use the web interfaces to monitor the name node, resource manager, and data node status.
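Representative commands for the operations in exercise 2 (a non-exhaustive sketch; the paths /bda1,
file.txt and localfile.txt are placeholders):
$ hdfs dfs -mkdir /bda1                              # create a directory
$ hdfs dfs -touchz /bda1/file.txt                    # create an empty file
$ hdfs dfs -put localfile.txt /bda1                  # copy from local file system to HDFS
$ hdfs dfs -cat /bda1/file.txt                       # print file contents
$ hdfs dfs -get /bda1/file.txt .                     # copy from HDFS to local file system
$ hdfs dfs -moveFromLocal localfile.txt /bda1        # move from local file system to HDFS
$ hdfs dfs -cp /bda1/file.txt /bda2                  # copy within HDFS
$ hdfs dfs -mv /bda1/file.txt /bda2                  # move within HDFS
$ hdfs dfs -du /bda1                                 # size of each file in a directory
$ hdfs dfs -du -s /bda1                              # total size of a directory/file
$ hdfs dfs -stat "%y" /bda1                          # last modified time
$ hdfs dfs -setrep 3 /bda1/file.txt                  # change the replication factor
$ hdfs dfs -ls /bda1                                 # list directory contents
$ hdfs dfs -rm /bda1/file.txt                        # remove a file
$ hdfs dfs -chmod 755 /bda1/file.txt                 # change file permissions
$ hdfs dfs -chown hdoop:supergroup /bda1/file.txt    # change file ownership
$ hdfs dfs -checksum /bda1/file.txt                  # checksum calculation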
MapReduce Concept:
1. OBJECTIVE: Run a basic word count Map Reduce program to understand Map Reduce Paradigm.
2. RESOURCES: VMWare stack (Hadoop), __ GB RAM, Web browser, Hard Disk __ GB.
The input value of the WordCount map task is a line of text from the input data file and the key is the
line number, i.e. <line_number, line_of_text>. The map task outputs <word, one> for each word in the
line of text.
Pseudo-code: mapper.py
#!/usr/bin/python3
"mapper.py"
import sys

for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # emit <word, 1> for each word
    for word in words:
        print('%s\t%s' % (word, 1))
Pseudo-code: reducer.py
#!/usr/bin/python3
"reducer.py"
import sys

current_word = None
current_count = 0

for line in sys.stdin:
    # parse the mapper output: word<TAB>count
    word, count = line.strip().split('\t', 1)
    count = int(count)
    # the input is sorted, so counts for the same word arrive consecutively
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print('%s\t%s' % (current_word, current_count))
        current_word = word
        current_count = count

# emit the last word
if current_word:
    print('%s\t%s' % (current_word, current_count))
4. Input/Output
To run the word count program locally, use the following commands.
hdoop@hadoop-client:~$ cat input.txt | python3 mapper.py
Output:
hi 1
how 1
are 1
you 1
i 1
am 1
good 1
hope 1
you 1
doing 1
good 1
too 1
how 1
about 1
you. 1
i 1
am 1
in 1
manipal 1
studying 1
Btech 1
in 1
Data 1
science. 1
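The aggregated counts below are the result of piping the mapper output through sort and reducer.py (this
command line is a sketch of the usual local test for Hadoop streaming jobs):
hdoop@hadoop-client:~$ cat input.txt | python3 mapper.py | sort | python3 reducer.py
Output: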
am 2
are 1
Btech 1
Data 1
doing 1
good 2
hi 1
hope 1
how 2
i 2
in 2
manipal 1
science. 1
studying 1
too 1
you 2
you. 1
1. Consider a text file of your choice (preferably a larger file) and perform word count using the
MapReduce technique.
2. Perform matrix operations using MapReduce by considering the 3 * 3 matrices given below and perform
the following operations (a Hadoop-streaming sketch of matrix multiplication follows this exercise list):
i. Matrix addition and subtraction
ii. Matrix Multiplication
iii. Matrix transpose
Note: Consider the 3*3 matrix content as shown below, in the format matrix,row,column,value:
a,0,0,10
a,0,1,20
a,0,2,30
a,1,0,40
a,1,1,50
a,1,2,60
a,2,0,70
a,2,1,80
a,2,2,90
b,0,0,1
b,0,1,2
b,0,2,3
b,1,0,4
b,1,1,5
b,1,2,6
b,2,0,7
b,2,1,8
b,2,2,9
3. Create a text file containing details of 20 students, such as registration number, name and marks
(e.g., 1001, john, 45). Write a MapReduce program to sort the data by student name.
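For exercise 2(ii) above, the following Hadoop-streaming-style mapper and reducer are one possible sketch
for multiplying the matrices a and b in the matrix,row,column,value format given above. The file names
matmul_mapper.py and matmul_reducer.py and the hard-coded dimension N = 3 are assumptions for illustration
only.

matmul_mapper.py
#!/usr/bin/python3
import sys

N = 3  # both input matrices are 3 x 3

for line in sys.stdin:
    matrix, row, col, value = line.strip().split(',')
    if matrix == 'a':
        # a[row][col] contributes to every output cell (row, j)
        for j in range(N):
            print('%s,%s\t%s,%s,%s' % (row, j, 'a', col, value))
    else:
        # b[row][col] contributes to every output cell (i, col)
        for i in range(N):
            print('%s,%s\t%s,%s,%s' % (i, col, 'b', row, value))

matmul_reducer.py
#!/usr/bin/python3
import sys
from collections import defaultdict

# buffer the contributions per output cell (fine for small matrices)
cells = defaultdict(lambda: ({}, {}))

for line in sys.stdin:
    key, val = line.strip().split('\t')
    matrix, k, value = val.split(',')
    a_vals, b_vals = cells[key]
    if matrix == 'a':
        a_vals[k] = float(value)
    else:
        b_vals[k] = float(value)

for key, (a_vals, b_vals) in sorted(cells.items()):
    # c[i][j] = sum over k of a[i][k] * b[k][j]
    total = sum(a_vals[k] * b_vals[k] for k in a_vals if k in b_vals)
    print('%s\t%s' % (key, total))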
1. Write a MapReduce program to find the unit-wise salary for the data given below.
EmpNo EmpName Unit Designation Salary
1001 John IMST TA 30000
1002 Jack CLOUD PM 80000
1003 Joshi FNPR TA 35000
1004 Jash ECSSAP PM 75000
1005 Yash FSADM SPM 60000
1006 Smith ICS TA 24000
1007 Lion IMST SPM 56000
1008 kate FNPR PM 76000
1009 cassy MFGADM TA 40000
1010 ronald ECSSAP SPM 65000
2. Consider the following sample text file and compute the average, minimum and maximum recorded
temperature year-wise using the MapReduce concept (a streaming-style sketch follows the sample data).
Temperature.txt
2014 44
2013 42
2012 30
2013 44
2010 45
2014 38
2011 42
2010 44
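A streaming-style sketch for this exercise (the file names temp_mapper.py and temp_reducer.py are
illustrative): the mapper forwards year<TAB>temperature pairs and the reducer computes the average,
minimum and maximum per year.

temp_mapper.py
#!/usr/bin/python3
import sys

for line in sys.stdin:
    parts = line.split()
    if len(parts) == 2:
        year, temp = parts
        print('%s\t%s' % (year, temp))

temp_reducer.py
#!/usr/bin/python3
import sys

def emit(year, temps):
    # report the aggregate statistics for one year
    print('%s\tavg=%.2f\tmin=%d\tmax=%d' % (year, sum(temps) / len(temps), min(temps), max(temps)))

current_year = None
temps = []

for line in sys.stdin:
    year, temp = line.strip().split('\t')
    temp = int(temp)
    if year == current_year:
        temps.append(temp)
    else:
        if current_year is not None:
            emit(current_year, temps)
        current_year = year
        temps = [temp]

if current_year is not None:
    emit(current_year, temps)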
2. RESOURCES: VMWare stack (Hadoop), __ GB RAM, Web browser, Hard Disk __ GB.
Execution Modes:
MapReduce Mode: This is the default mode, which needs access to a Hadoop cluster and an HDFS
installation. Both the input and the output files are on HDFS.
Local Mode: With access to a single machine, all files are installed and run using the local host and
local file system. Local mode is specified using the -x flag (i.e. pig -x local). The input and output
files are on the local file system.
Command: pig -x local
Running Modes:
Interactive mode: Run Pig in interactive mode by invoking the Grunt shell.
Batch mode: Create a Pig script to run in batch mode. Write the Pig Latin statements in a file and save
it with the .pig extension.
Step 1: Write all the Pig statements in a single file and save it with the .pig extension (e.g., pigscript.pig).
Step 2: Execute the Pig script choosing local or MapReduce mode, for example:
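A sketch of the two invocations (assuming the script is saved as pigscript.pig):
$ pig -x local pigscript.pig        # run in local mode
$ pig -x mapreduce pigscript.pig    # run in MapReduce mode (the default)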
Input(s):
Successfully read 0 records from: "/bda1/input.txt"
Output(s):
Successfully stored 0 records in: "/bda1/out"
Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_1705045524875_0010
7) hdoop@hadoop-master:~$ hdfs dfs -ls /bda1/out
Output: Found 2 items
-rw-r--r-- 2 hdoop supergroup 0 2024-01-12 15:05 /bda1/out/_SUCCESS
-rw-r--r-- 2 hdoop supergroup 112 2024-01-12 15:05 /bda1/out/part-m-00000
Week 5 Exercise
1. Create a dataset of your choice and perform the word count program using the Spark tool.
2. Given a dataset of employee records containing (name, age, salary), use the map transformation to
transform each record into a tuple of (name, age * 2, salary).
Reg.No EmpName Age Salary
24 John 26 30000
34 Jack 40 80000
61 Joshi 25 35000
45 Jash 35 75000
34 Yash 40 60000
67 Smith 20 24000
42 Lion 42 56000
62 kate 50 76000
21 cassy 51 40000
10 ronald 57 65000
24 John 26 30000
67 Smith 20 24000
45 Jash 35 75000
21 cassy 51 40000
3. From the same employee dataset, filter out the employees whose salary is greater than 50000 using
the filter transformation.
4. Create a text file that has a few sentences and use the flatMap transformation to split each sentence
into words.
5. Create a dataset of student details such as (name, subject, score), and from this dataset group
students by subject using the groupBy transformation.
6. From the employee dataset, collect the first 5 records as an array using the collect action.
7. Demonstrate the creation of an RDD using a parallelized collection and an existing RDD, by finding
the sum of all elements in RDD1 (which holds array elements). Also, create an RDD from external
sources. A PySpark sketch of several of these operations follows.
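A minimal PySpark sketch of several of the transformations and actions above. The inline sample records,
the file path hdfs:///bda1/input.txt and the application name are assumptions for illustration; the full
exercise data is given in the tables above.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Week5Sketch").getOrCreate()
sc = spark.sparkContext

# employee records as (name, age, salary)
employees = sc.parallelize([("John", 26, 30000), ("Jack", 40, 80000), ("Joshi", 25, 35000)])

# map: transform each record into (name, age * 2, salary)
doubled = employees.map(lambda r: (r[0], r[1] * 2, r[2]))

# filter: employees whose salary is greater than 50000
high_paid = employees.filter(lambda r: r[2] > 50000)

# flatMap: split sentences into words, then count the words
sentences = sc.parallelize(["hi how are you", "i am good"])
words = sentences.flatMap(lambda s: s.split())
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

# groupBy: group students by subject
students = sc.parallelize([("john", "maths", 45), ("kate", "maths", 60), ("yash", "physics", 50)])
by_subject = students.groupBy(lambda s: s[1])

# actions: collect and take
print(doubled.collect())
print(counts.collect())
print(employees.take(5))

# RDD from a parallelized collection, and the sum of its elements
rdd1 = sc.parallelize([1, 2, 3, 4, 5])
print(rdd1.sum())

# RDD from an external source (placeholder path)
lines = sc.textFile("hdfs:///bda1/input.txt")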
A. Consider the dataset given in Question B and perform the following operations (a short pair-RDD
sketch follows):
sortByKey()
groupByKey()
countByKey()
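A short sketch of these pair-RDD operations on an illustrative (key, value) dataset (sc is the
SparkContext from the sketch above; the sample pairs are placeholders):

pairs = sc.parallelize([("maths", 45), ("physics", 50), ("maths", 60)])

print(pairs.sortByKey().collect())                                # records sorted by key
print([(k, list(v)) for k, v in pairs.groupByKey().collect()])    # values grouped per key
print(pairs.countByKey())                                         # key -> number of records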
1) Assume you have a CSV file named clickstream_data.csv with the following columns:
user_id, page_id, timestamp, action (e.g., 'click', 'view', 'purchase').
2) Consider a scenario of web log analysis. Assume you have a log file named web_logs.txt with the
columns: Timestamp, user_id, page_id, action (e.g., 'click', 'view', 'purchase'). Identify the most
engaged users by calculating the total time spent on the website for each user. Implement the mentioned
case with PySpark and with Spark (Scala), as sketched below.
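A possible PySpark sketch for exercise 2, assuming the log file is comma-separated with a header row and
that time spent is approximated as the gap between a user's consecutive events; adjust the reader options
and the parsing of Timestamp to the real log format.

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("WebLogAnalysis").getOrCreate()

logs = spark.read.csv("web_logs.txt", header=True, inferSchema=True)

# gap between consecutive events of the same user
w = Window.partitionBy("user_id").orderBy("Timestamp")
gaps = (logs.withColumn("next_ts", F.lead("Timestamp").over(w))
            .withColumn("gap", F.unix_timestamp("next_ts") - F.unix_timestamp("Timestamp")))

# total time spent per user, most engaged first
engaged = (gaps.groupBy("user_id")
               .agg(F.sum("gap").alias("total_time_spent"))
               .orderBy(F.desc("total_time_spent")))
engaged.show()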
Week 8: Consider a Spark dataframe as shown below. Replace the string in the column Card-type from
Checking to Cash using PySpark and Spark with Scala, as sketched below.
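A minimal PySpark sketch, assuming the dataframe is named df and the column is literally named
"Card-type"; the Scala version is analogous using org.apache.spark.sql.functions.

from pyspark.sql import functions as F

# replace the substring "Checking" with "Cash" in the Card-type column
df2 = df.withColumn("Card-type", F.regexp_replace(F.col("Card-type"), "Checking", "Cash"))

# alternative: replace only exact matches of the whole value
df2 = df.withColumn("Card-type",
                    F.when(F.col("Card-type") == "Checking", "Cash")
                     .otherwise(F.col("Card-type")))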
1. Write a Pig Latin statement to display the names of all users who have sent emails and also a list
of all the people that they sent the email to.
2. Store the result in a file.
3. Execute the Pig script choosing local and MapReduce modes.
Week 10 A: Write a Pig script to split customers for a reward program based on their lifetime value.
Consider the following as the input file:
Week 11: Consider a scenario in Apache Spark where the objective is to split a single column into
multiple columns. Consider the data set below to perform the stated objective:
Name DOB_Year Gender Salary
James, A, Smith 2018 M 3000
Michael, Rose, Jones 2010 M 4000
Robert, K, Williams 2010 M 4000
Maria, Anne, Jones 2005 F 4000
Jen,Mary, Brown 2010 -2
Alexria, Anne, Smith 2007 -1
• Split the Name column into multiple columns (Firstname, Middlename, Lastname), as sketched below.
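A minimal PySpark sketch, assuming the dataframe is named df and the Name column holds comma-separated
parts as in the sample above:

from pyspark.sql import functions as F

# split "James, A, Smith" into its comma-separated parts
parts = F.split(F.col("Name"), ",")

df2 = (df.withColumn("Firstname", F.trim(parts.getItem(0)))
         .withColumn("Middlename", F.trim(parts.getItem(1)))
         .withColumn("Lastname", F.trim(parts.getItem(2)))
         .drop("Name"))
df2.show()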
Consider the given Employee data with the attributes employee_id, birthday, first_name, family_name,
gender, work_day. Perform the basic HiveQL operations as follows:
1. Create database with the name Employee.
2. Display available databases.
3. Choose the Employee database and create external and internal (managed) tables in it.
4. Load the given data into both the external and the managed table.
5. Perform partitioning by considering gender as the partition key.
6. Create buckets of a suitable size.
7. Find the oldest 10 employees from both the male and female categories (Note: here, refer to the
partitioned tables for the query).
8. Find the oldest 10 employees by considering the Employee table, and compare the time taken to perform
this operation between Question 7 and Question 8.
9. Perform drop and alter operations on the internal table.