
DSE-3264 - Big Data Analytics
Laboratory Manual

Department: Data Science Engineering and Computer Applications

Course Name & code: DSE-3264 & Big Data Analytics Laboratory

Semester & branch: VI Sem & BTech Data Science & Engineering

Name of the faculty: Dr. Rashmi Laxmikant Malghan, Mrs. Shavantrevva S B

No. of contact hours/week (L-T-P-C): 0-0-3-1

What is HDFS?

Hadoop comes with a distributed file system called HDFS (Hadoop Distributed File System). In HDFS, data is distributed over several machines and replicated to ensure durability against failures and high availability to parallel applications. It is cost effective because it runs on commodity hardware. It is built around the concepts of blocks, data nodes and the name node.

HDFS building Blocks:

Blocks: A Block is the minimum amount of data that it can read or write.HDFS blocks are 128 MB by
default and this is configurable.Files n HDFS are broken into block-sized chunks,which are stored as
independent units.Unlike a file system, if the file is in HDFS is smaller than block size, then it does not
occupy full block?s size, i.e. 5 MB of file stored in HDFS of block size 128 MB takes 5MB of space
only.The HDFS block size is large just to minimize the cost of seek.

Name Node: HDFS works in a master-worker pattern where the name node acts as the master. The name node is the controller and manager of HDFS, as it knows the status and the metadata of all the files in HDFS; the metadata includes file permissions, names and the location of each block. The metadata is small, so it is stored in the memory of the name node, allowing faster access to it. Moreover, since the HDFS cluster is accessed by multiple clients concurrently, all this information is handled by a single machine. File system operations like opening, closing and renaming are executed by the name node.

Data Node: Data nodes store and retrieve blocks when they are told to, by clients or the name node. They report back to the name node periodically with the list of blocks that they are storing. The data node, being commodity hardware, also does the work of block creation, deletion and replication as instructed by the name node.

Secondary Name Node: It is a separate physical machine which acts as a helper to the name node. It performs periodic checkpoints: it communicates with the name node and takes snapshots of the metadata, which helps minimize downtime and loss of data.

• HDFS DataNode and NameNode (figure)

• HDFS read workflow (figure)

• HDFS write workflow (figure)

• To use the HDFS commands, first start the Hadoop services using the following command:
$ sbin/start-all.sh
• To check that the Hadoop services are up and running, use the following command:
$ jps
• Check that the datanode service is up by running the jps command on the client side.

• To perform the various file operations, use hadoop fs or hdfs dfs as a prefix for each command.

• To create a directory: in HDFS there is no home directory by default, so first create it using the mkdir command, as in the examples below.
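For example (the directory and file names below are only illustrative; adjust them to your setup):

$ hdfs dfs -mkdir -p /user/hdoop
$ hdfs dfs -put /home/hdoop/input.txt /user/hdoop/
$ hdfs dfs -ls /user/hdoop
$ hdfs dfs -cat /user/hdoop/input.txt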

Week 1 Exercise: Hadoop Distributed File System

1. List all possible Linux file operations and execute each one of them in Linux CLI.

2. Interact with HDFS using command line interface to understand the basic working structure of
Hadoop cluster. Using Hadoop CLI, demonstrate the following commands to:
• Create a directory in HDFS.
• create an empty file
• copy files/folders from local file system to hdfs store.
• print file contents.
• copy files/folders from hdfs store to local file system.
• move file from local to hdfs
• copy files within hdfs
• move files within hdfs
• size of each file in directory
• total size of directory/file
• last modified time of directory or path
• change the replication factor of a file/directory in HDFS.
• List the contents of a directory in HDFS.
• Remove a file from HDFS.
• Change File Permissions
• Changing File Ownership
• Checksum Calculation
• File Concatenation
• File Compression/Decompression
• File Block Location Information
• File Encryption/Decryption
3. Use web interface to monitor Name node manager, resource manager, and Data node status.
MapReduce Concept:
1. OBJECTIVE: Run a basic word count MapReduce program to understand the MapReduce paradigm.

2. RESOURCES: VMWare stack (Hadoop), __ GB RAM, Web browser, Hard Disk __ GB.

3. PROGRAM LOGIC: MapReduce (WordCount): It consists of 4 phases (i.e. partitioning or splitting, mapping, sorting or shuffling, and reducing).
WordCount is a simple program which counts the number of occurrences of each word in a given text input data set. WordCount fits very well with the MapReduce programming model, making it a great example for understanding the Hadoop Map/Reduce programming style. Our implementation consists of three main parts:
Mapper
Reducer
Driver
Example:

Step-1. Write a Mapper


Mapper.py: Initially the content is partitioned; the number of partitions made equals the number of mapper tasks created. Within each mapper, the line.split() function breaks a line into words. The Mapper overrides the "map" function, which receives <key, value> pairs as input. Even if a key is repeated in the same or a different mapper task it does not matter, as the default count for every key is assigned to be "1". A Mapper implementation may output <key, value> pairs using the provided Context.

The input value of the WordCount map task will be a line of text from the input data file, and the key will be the line number, i.e. <line_number, line_of_text>. The map task outputs <word, one> for each word in the line of text.

Pseudo-code: mapper.py

#!/usr/bin/python3
"mapper.py"
import sys

for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # emit a count of 1 for each word
    for word in words:
        print('%s\t%s' % (word, 1))

Step-2. Write a Reducer


A Reducer collects the intermediate <key, value> output from multiple map tasks and assembles a single result. Here, the WordCount program sums up the occurrences of each word and outputs pairs of the form <word, occurrence>.

Pseudo-code: reducer.py

#!/usr/bin/python3
"reducer.py"
import sys

current_word = None
current_count = 0

for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # parse the input we got from mapper.py
    word, count = line.split('\t')
    count = int(count)

    # the shuffle sorts by key, so all counts for a word arrive together
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# emit the last word, if any
if current_word:
    print('%s\t%s' % (current_word, current_count))

4. Input/Output
To run the word count program locally, use the following commands (4 and 5):
hdoop@hadoop-client:~$ cat input.txt | python3 mapper.py
Output:
hi 1
how 1
are 1
you 1
i 1
am 1
good 1
hope 1
you 1
doing 1
good 1
too 1
how 1
about 1
you. 1
i 1
am 1
in 1
manipal 1
studying 1
Btech 1
in 1
Data 1
science. 1

hdoop@hadoop-client:~$ cat input.txt | python3 mapper.py | sort | python3 reducer.py


Output:
about 1

am 2

are 1

Btech 1

Data 1

doing 1

good 2

hi 1

hope 1

how 2

i 2
in 2

manipal 1

science. 1

studying 1

too 1

you 2

you. 1

To run the word count program on the Hadoop framework, use the following command:


hdoop@hadoop-client:~$ hadoop jar '/home/hdoop/hadoop/share/hadoop/tools/lib/hadoop-
streaming-3.3.6.jar' -file mapper.py -mapper mapper.py -file reducer.py -reducer reducer.py -
input /bda1/input.txt -output /bda1/oup1
Output:2024-01-16 10:18:00,489 WARN streaming.StreamJob: -file option is deprecated,
please use generic option -files instead.
packageJobJar: [mapper.py, reducer.py, /tmp/hadoop-unjar2657340332712108565/] []
/tmp/streamjob3503883941011863300.jar tmpDir=null
2024-01-16 10:18:01,069 INFO client.DefaultNoHARMFailoverProxyProvider: Connecting to
ResourceManager at /192.168.159.101:8032
2024-01-16 10:18:01,343 INFO client.DefaultNoHARMFailoverProxyProvider: Connecting to
ResourceManager at /192.168.159.101:8032
2024-01-16 10:18:01,544 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for
path: /tmp/hadoop-yarn/staging/hdoop/.staging/job_1705376153146_0001
2024-01-16 10:18:02,354 INFO mapred.FileInputFormat: Total input files to process : 1
2024-01-16 10:18:02,425 INFO mapreduce.JobSubmitter: number of splits:2
2024-01-16 10:18:02,577 INFO mapreduce.JobSubmitter: Submitting tokens for job:
job_1705376153146_0001
2024-01-16 10:18:02,577 INFO mapreduce.JobSubmitter: Executing with tokens: []
2024-01-16 10:18:02,786 INFO conf.Configuration: resource-types.xml not found
2024-01-16 10:18:02,786 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2024-01-16 10:18:02,975 INFO impl.YarnClientImpl: Submitted application
application_1705376153146_0001
2024-01-16 10:18:03,028 INFO mapreduce.Job: The url to track the job: http://hadoop-
master:8088/proxy/application_1705376153146_0001/
2024-01-16 10:18:03,029 INFO mapreduce.Job: Running job: job_1705376153146_0001
2024-01-16 10:18:09,113 INFO mapreduce.Job: Job job_1705376153146_0001 running in uber
mode : false
2024-01-16 10:18:09,115 INFO mapreduce.Job: map 0% reduce 0%
2024-01-16 10:18:14,186 INFO mapreduce.Job: map 100% reduce 0%
2024-01-16 10:18:18,219 INFO mapreduce.Job: map 100% reduce 100%
2024-01-16 10:18:19,248 INFO mapreduce.Job: Job job_1705376153146_0001 completed
successfully
2024-01-16 10:18:19,322 INFO mapreduce.Job: Counters: 54
File System Counters
FILE: Number of bytes read=214
FILE: Number of bytes written=843282
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=356
HDFS: Number of bytes written=127
HDFS: Number of read operations=11
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
HDFS: Number of bytes read erasure-coded=0
Job Counters
Launched map tasks=2
Launched reduce tasks=1
Data-local map tasks=2
Total time spent by all maps in occupied slots (ms)=6466
Total time spent by all reduces in occupied slots (ms)=1625
Total time spent by all map tasks (ms)=6466
Total time spent by all reduce tasks (ms)=1625
Total vcore-milliseconds taken by all map tasks=6466
Total vcore-milliseconds taken by all reduce tasks=1625
Total megabyte-milliseconds taken by all map tasks=6621184
Total megabyte-milliseconds taken by all reduce tasks=1664000
Map-Reduce Framework
Map input records=6
Map output records=24
Map output bytes=160
Map output materialized bytes=220
Input split bytes=188
Combine input records=0
Combine output records=0
Reduce input groups=18
Reduce shuffle bytes=220
Reduce input records=24
Reduce output records=18
Spilled Records=48
Shuffled Maps =2
Failed Shuffles=0
Merged Map outputs=2
GC time elapsed (ms)=1469
CPU time spent (ms)=3500
Physical memory (bytes) snapshot=1203916800
Virtual memory (bytes) snapshot=7675027456
Total committed heap usage (bytes)=1232601088
Peak Map Physical memory (bytes)=477769728
Peak Map Virtual memory (bytes)=2556215296
Peak Reduce Physical memory (bytes)=250322944
Peak Reduce Virtual memory (bytes)=2562641920
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=168
File Output Format Counters
Bytes Written=127
2024-01-16 10:18:19,322 INFO streaming.StreamJob: Output directory: /bda1/oup1
If the above 7th command runs successfully, then in the NameNode web UI (localhost -> Utilities -> Browse the file system -> the output folder) two files will be created (_SUCCESS and part-00000).
To view them: hdfs dfs -ls /bda1/oup1
To display the result: hdoop@hadoop-client:~$ hdfs dfs -cat /bda1/oup1/part-00000
Output:
Btech 1

Data 1

about 1

am 2

are 1

doing 1
good 2

hi 1

hope 1

how 2

i 2

in 2

manipal 1

science. 1

studying 1

too 1

you 2

you. 1

hdoop@hadoop-client:~$ hdfs dfs -get /bda1/oup1/part-00000 /home/hdoop

hdoop@hadoop-client:~$ cat part-00000

Week 2: Exercise: MapReduce with Python

1. Consider a text file of your choice (preferably a larger file) and perform word count using the MapReduce technique.
2. Perform matrix operations using MapReduce by considering 3 * 3 matrices and perform the following operations:
i. Matrix addition and subtraction
ii. Matrix multiplication
iii. Matrix transpose (a streaming-style sketch for this case follows question 3)
Note: Consider the 3 * 3 matrix content as shown below

a,0,0,10
a,0,1,20
a,0,2,30
a,1,0,40
a,1,1,50
a,1,2,60
a,2,0,70
a,2,1,80
a,2,2,90

b,0,0,1
b,0,1,2
b,0,2,3
b,1,0,4
b,1,1,5
b,1,2,6
b,2,0,7
b,2,1,8
b,2,2,9

3. Create a text file containing the details of 20 students, such as registration number, name and marks (ex: 1001, john, 45). Write a MapReduce program to sort the data by student name.
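For question 2.iii (transpose), a minimal Hadoop-streaming-style sketch is shown below; the script name transpose_mapper.py is only an illustration, and it assumes the comma-separated matrix format given in the note above. The transpose needs no reducer, since the mapper only swaps the row and column indices.

#!/usr/bin/python3
# transpose_mapper.py (illustrative name): map-only transpose of records
# of the form matrix_name,row,col,value
import sys

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    name, row, col, value = line.split(',')
    # swap the row and column indices to emit the transposed entry
    print('%s,%s,%s,%s' % (name, col, row, value))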

Week 3: Exercise: MapReduce with Python

1. Write a MapReduce program to find the unit-wise salary for the data given below.
EmpNo EmpName Unit Designation Salary
1001 John IMST TA 30000
1002 Jack CLOUD PM 80000
1003 Joshi FNPR TA 35000
1004 Jash ECSSAP PM 75000
1005 Yash FSADM SPM 60000
1006 Smith ICS TA 24000
1007 Lion IMST SPM 56000
1008 kate FNPR PM 76000
1009 cassy MFGADM TA 40000
1010 ronald ECSSAP SPM 65000

2. Consider the following sample text file and compute the average, minimum and maximum recorded temperature year-wise using the MapReduce concept (a reducer sketch follows the sample data).
Temperature.txt
2014 44
2013 42
2012 30
2013 44
2010 45
2014 38
2011 42
2010 44
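For question 2, a minimal streaming-style reducer sketch is shown below. It assumes the mapper emits "year<TAB>temperature" pairs and that the shuffle delivers them to the reducer sorted by year, as in the word count example; the script name temp_reducer.py is only an illustration.

#!/usr/bin/python3
# temp_reducer.py (illustrative name): average, minimum and maximum temperature per year
import sys

current_year = None
temps = []

def emit(year, values):
    print('%s\tavg=%.2f\tmin=%d\tmax=%d' % (year, sum(values) / len(values), min(values), max(values)))

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    year, temp = line.split('\t')
    if year != current_year:
        if current_year is not None:
            emit(current_year, temps)
        current_year, temps = year, []
    temps.append(int(temp))

# emit the last year seen
if current_year is not None:
    emit(current_year, temps)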

PIG TOOL: Pig Latin


1. OBJECTIVE: HOW TO EXECUTE & RUN THE PROGRAM LOCALLY & TEST IT ON HADOOP

2. RESOURCES: VMWare stack (Hadoop), __ GB RAM, Web browser, Hard Disk __ GB.

3. PROGRAM LOGIC: PIG (Wordcount, Identify Most Popular Movie)


Focus on the data transformations rather than the underlying MapReduce implementation.
Apache Pig's high-level dataflow engine simplifies the development of large-scale data
processing tasks on Hadoop clusters by providing an abstraction layer and leveraging the power
of MapReduce without requiring users to write complex Java code.

Execution Modes:
MapReduce Mode: This is the default mode, which needs access to a Hadoop cluster and an HDFS installation. The input and output files are both present in the HDFS environment.

Command: pig -x mapreduce

or
pig

Local Mode: With access to a single machine, all files are installed and run using the local host and file system. Local mode is specified using the "-x" flag (i.e. pig -x local). The input and output files are present on the local file system.
Command: pig -x local

Running Modes:
Interactive mode: Run Pig in interactive mode by invoking the Grunt shell.
Batch mode: Create a Pig script to run in batch mode. Write the Pig Latin statements in a file and save it with a .pig extension.

Executing Pig in "Batch Mode":

While executing Apache Pig statements (commands) in batch mode, perform the steps below:

Step 1: Write all the Pig statements in a single file and save it with a .pig extension. (Ex: pigscript.pig)
Step 2: Execute the Pig script choosing local or MapReduce mode.

Execution using Grunt mode:

grunt> exec /pigscript.pig

Executing a Pig Script from HDFS


We can also execute a Pig script that resides in the HDFS.
Suppose there is a Pig script with the name pigscript.pig in the HDFS directory named
/pig_data/. We can execute it as shown below
$ pig -x mapreduce hdfs://localhost:9000/pig_data/pigscript.pig

Procedure to execute a Pig program:


Step 1: Create the input.txt file

Step 2: Transfer it to HDFS

hdfs dfs -put /home/hdoop/input.txt bda1/
Step 3: Create the Pig script file
sudo gedit pigscript.pig
OR
vi pigscript.pig

Code to be typed in pigscript.pig:

record = load '/bda1/input.txt';
store record into '/bda1/out';
Step 4: Run the Pig script in MapReduce mode
pig -x mapreduce pigscript.pig
Step 5: Check the status of execution

Output: Found 2 items


-rw-r--r-- 2 hdoop supergroup 0 2024-01-12 15:05 /bda1/out/_SUCCESS
-rw-r--r-- 2 hdoop supergroup 112 2024-01-12 15:05 /bda1/out/part-m-00000

Step 6: View the output file


hdfs dfs -cat /bda1/out/part-m-00000

Sample Program with Execution: (Input and Output)


PIG EXECUTION : NORMAL TEXT FILE (HOW TO EXECUTE & RUN THE PROGRAM
LOCALLY & TEST IT ON HADOOP)
1) Create input.txt file :
Content :
hi how are you
i am good
hope you doing good too
how about you.
i am in manipal
studying Btech in Data science.

2) Transfer to HDFS: hdfs dfs -put /home/hdoop/input.txt bda1/


3) Create Pigscript file: sudo gedit pigscript.pig
Content: record = load '/bda1/input.txt';
store record into '/bda1/out';

4) hdfs dfs -ls bda1/input


For compilation of Program:
5) hdoop@hadoop-master:~$ pig pigscript.pig
For execution of Program:
6) hdoop@hadoop-master:~$ run pigscript.pig

HadoopVersion PigVersion UserId StartedAt FinishedAt Features


3.3.6 0.17.0 hdoop 2024-01-12 15:05:13 2024-01-12 15:06:55 UNKNOWN
Output:
Success!

Job Stats (time in seconds):


JobId Maps Reduces MaxMapTime MinMapTime AvgMapTime MedianMapTime
MaxReduceTime MinReduceTime AvgReduceTime MedianReducetime
Alias FeatureOutputs
job_1705045524875_0010 1 0 n/a n/a n/a n/a 0 0 0
0 record MAP_ONLY /bda1/out,

Input(s):
Successfully read 0 records from: "/bda1/input.txt"

Output(s):
Successfully stored 0 records in: "/bda1/out"

Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

Job DAG:
job_1705045524875_0010
7) hdoop@hadoop-master:~$ hdfs dfs -ls /bda1/out
Output: Found 2 items
-rw-r--r-- 2 hdoop supergroup 0 2024-01-12 15:05 /bda1/out/_SUCCESS
-rw-r--r-- 2 hdoop supergroup 112 2024-01-12 15:05 /bda1/out/part-m-00000

8) hdoop@hadoop-master:~$ hdfs dfs -cat /bda1/out/part-m-00000


Output:
hi how are you
i am good
hope you doing good too
how about you.
i am in manipal
studying Btech in Data science.

Week 4 A): Pig Execution


A. Consider a normal text file to learn the Pig running modes and execution modes. Run the program locally and test it on Hadoop.
B. Write a Pig program to count the number of word occurrences using Python, in different modes (local mode, MapReduce mode).
C. Execute a Pig script to find the "most popular movie in the dataset". In this example we will be dealing with 2 files (ratings.data and movies.item). Consider the dataset:
wget https://raw.githubusercontent.com/ashaypatil11/hadoop/main/movies.item
wget https://raw.githubusercontent.com/ashaypatil11/hadoop/main/ratings.data

Week 5 Exercise
1. Create a dataset of your choice and perform the word count program using the Spark tool.
2. Given a dataset of employee records containing (name, age, salary), use the map transformation to transform each record into a tuple of (name, age * 2, salary). (A PySpark sketch for questions 2 and 3 follows item A below.)
Reg.No EmpName Age Salary
24 John 26 30000
34 Jack 40 80000
61 Joshi 25 35000
45 Jash 35 75000
34 Yash 40 60000
67 Smith 20 24000
42 Lion 42 56000
62 kate 50 76000
21 cassy 51 40000
10 ronald 57 65000
24 John 26 30000
67 Smith 20 24000
45 Jash 35 75000
21 cassy 51 40000

3. From the same employee dataset, filter out employees whose salary is greater than 50000 using
the filter transformation.
4. Create a text file that will have few sentences, use flatMap transformation to split each sentence
into words.
5. Create a dataset having student details such as (name, subject, score), from this dataset group
students by subject using the groupBy transformation.
6. From the employee dataset, collect the first 5 records as an array using the collect action.
7. Demonstrate the creation of RDD using Parallelized collection, existing RDD by finding the sum
of all elements in an RDD1(which holds array elements). Also, create an RDD from external
sources.
A. Consider the dataset given in Question B. Perform the following operations:
sortByKey()
groupByKey()
countByKey()
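A minimal PySpark sketch for questions 2 and 3 (and for the parallelized-collection part of question 7) is shown below; the sample records are taken from the table above, and the application name is only an illustration.

# Run with spark-submit or paste into a pyspark shell.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Week5RDD").getOrCreate()
sc = spark.sparkContext

# RDD from a parallelized collection of (name, age, salary) records
employees = sc.parallelize([("John", 26, 30000), ("Jack", 40, 80000),
                            ("Joshi", 25, 35000), ("Jash", 35, 75000)])

# Question 2: map each record to (name, age * 2, salary)
doubled_age = employees.map(lambda rec: (rec[0], rec[1] * 2, rec[2]))
print(doubled_age.collect())

# Question 3: filter employees whose salary is greater than 50000
high_paid = employees.filter(lambda rec: rec[2] > 50000)
print(high_paid.collect())

spark.stop()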

Week 7 : Spark Execution and Scala

1) Assume you have a CSV file named clickstream_data.csv with the following columns:
user_id , page_id, timestamp, action (e.g., 'click', 'view', 'purchase').

• Load the data into a PySpark DataFrame (see the sketch after task 2).


• Display the schema and the first 5 rows of the DataFrame.
• Calculate the total number of clicks, views, and purchases for each user.
• Identify the most common sequence of actions performed by users (e.g., click -> view ->
purchase).

2) Consider a scenario of web log analysis. Assume you have a log file named web_logs.txt with the columns: timestamp, user_id, page_id, action (e.g., 'click', 'view', 'purchase'). Identify the most engaged users by calculating the total time spent on the website for each user. Implement the mentioned case with PySpark and with Spark (Scala).
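A minimal PySpark sketch for the first three bullets of task 1 is shown below; clickstream_data.csv is the file name stated in the exercise, while the path and the application name are assumptions.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Clickstream").getOrCreate()

# Load the CSV into a DataFrame and inspect it
df = spark.read.csv("clickstream_data.csv", header=True, inferSchema=True)
df.printSchema()
df.show(5)

# Total number of clicks, views and purchases for each user
per_user = df.groupBy("user_id").pivot("action", ["click", "view", "purchase"]).count()
per_user.show()

spark.stop()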

Week 8: Consider a Spark DataFrame as shown below. We need to replace a string in the column Card_type from Checking -> Cash using PySpark and Spark with Scala (a PySpark sketch follows the table).

Hint: Use method 1: na.replace, and method 2: regexp_replace


Customer_NO Card_type Date Category Transaction Type Amount

1000210 Platinum Card 3/17/2018 Fast Food Debit 23.34

1000210 Silver Card 3/19/2018 Restaurants Debit 36.48

1000210 Checking 3/19/2018 Utilities Debit 35

1000210 Platinum Card 3/20/2018 Shopping Debit 14.97

1000210 Silver Card 3/22/2018 Gas & Fuel Debit 30.55

1000210 Platinum Card 3/23/2018 Credit Card Payment Debit 559.91

1000210 Checking 3/23/2018 Credit Card Payment Debit 559.91
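A minimal PySpark sketch covering both hinted methods is shown below; only two rows of the table above are recreated, and the application name is an assumption. The Scala DataFrame API offers the same na.replace and regexp_replace calls.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("CardTypeReplace").getOrCreate()

# Recreate two sample rows of the DataFrame shown above
data = [(1000210, "Platinum Card", "3/17/2018", "Fast Food", "Debit", 23.34),
        (1000210, "Checking", "3/19/2018", "Utilities", "Debit", 35.0)]
cols = ["Customer_NO", "Card_type", "Date", "Category", "Transaction_Type", "Amount"]
df = spark.createDataFrame(data, cols)

# Method 1: na.replace, restricted to the Card_type column
df.na.replace("Checking", "Cash", subset=["Card_type"]).show()

# Method 2: regexp_replace on the same column
df.withColumn("Card_type", F.regexp_replace("Card_type", "Checking", "Cash")).show()

spark.stop()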


Week 9: Create a file which contains a bag dataset as shown below:
User ID    From                    To
user1001   [email protected]    {([email protected]),([email protected]),([email protected])}
user1002   [email protected]    {([email protected]),([email protected])}
user1003   [email protected]    {([email protected]),([email protected])}

1. Write a Pig Latin statement to display the names of all users who have sent emails and also a list
of all the people that they sent the email to.
2. Store the result in a file.
3. Execute the pig script choosing local and MapReduce mode

Week 10 A: Write a Pig script to split customers for a reward program based on their lifetime values. Consider the following as the input file:

Customers Life Time Value


Jack 25000
Smith 8000
David 35000
John 15000
Scott 10000
Joshi 28000
Ajay 12000
Vinay 30000
Joseph 21000

• If the Life Time Value is > 1000 and <= 2000 -> Silver Program

• If the Life Time Value is > 2000 -> Gold Program.

Week 10 B: Create data files for the below schemas:

Order: CustomerId, ItemId, ItemName, OrderDate, DeliveryDate
Customers: CustomerId, CustomerName, Address, City, State, Country
Load the Order and Customer data.
Write a Pig Latin script to determine the number of items bought by each customer.

Week 11: Consider a scenario in Apache Spark where the objective is to split a single column into multiple columns. Consider the below dataset to perform the stated objective (a PySpark sketch follows below):
Name DOB_Year Gender Salary
James, A, Smith 2018 M 3000
Michael, Rose, Jones 2010 M 4000
Robert, K, Williams 2010 M 4000
Maria, Anne, Jones 2005 F 4000
Jen,Mary, Brown 2010 -2
Alexria, Anne, Smith 2007 -1
• Split the Name column into multiple columns (Firstname, Middlename, Lastname)
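A minimal PySpark sketch for this split is shown below; only two rows of the sample data are recreated, and the application name is an assumption.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("SplitName").getOrCreate()

# Recreate two rows of the dataset shown above
data = [("James, A, Smith", 2018, "M", 3000),
        ("Michael, Rose, Jones", 2010, "M", 4000)]
df = spark.createDataFrame(data, ["Name", "DOB_Year", "Gender", "Salary"])

# Split "Name" on commas (with optional spaces) into three new columns
parts = F.split(F.col("Name"), r",\s*")
df = (df.withColumn("Firstname", parts.getItem(0))
        .withColumn("Middlename", parts.getItem(1))
        .withColumn("Lastname", parts.getItem(2))
        .drop("Name"))
df.show(truncate=False)

spark.stop()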

Week 12 : Data Processing and analysis using Apache Hive tool

Consider the given Employee data with the attributes employee_id, birthday, first_name, family_name,
gender, work_day. Perform the basic HiveQL operations as follows:
1. Create database with the name Employee.
2. Display available databases.
3. Choose the Employee database and Create external and internal table into it.
4. Load the given data to both external and managed table.
5. Perform partitioning by considering gender as a partition key.
6. Create buckets of a suitable size.
7. Find the oldest 10 employees from both the male and female categories (Note: here you will refer to the partitioned tables for the query).
8. Find the oldest 10 employees by considering the Employee table, and compare the time taken to perform this operation between Question 7 and Question 8.
9. Perform drop and alter operation on internal table.
