BDA Lab Record
INDEX
S.No | Date | Name of the Experiment | Page No | Marks | Remarks
Big Data & Hadoop lab Experiments
KKR & KSR INSTITUTE OF TECHNOLOGY AND SCIENCES
(Autonomous)
Accredited by NBA & NAAC with Grade “A” and Affiliated to JNTUK-Kakinada
Vinjanampadu, Vatticherukuru Mandal, Guntur, Andhra Pradesh - 522017
1a) Aim: To practice basic HDFS commands.
Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It
provides high-throughput access to application data and is suitable for applications with large data sets. Below are
some basic HDFS commands along with examples:
1. **Listing Files/Directories**:
```bash
hdfs dfs -ls <path>
```
Example: List the files and directories under the HDFS root directory.
```bash
hdfs dfs -ls /
```
2. **Creating a Directory**:
```bash
hdfs dfs -mkdir <directory>
```
Example: Create a directory named "data" at the root of HDFS.
```bash
hdfs dfs -mkdir /data
```
3. **Copying a File from the Local File System to HDFS**:
```bash
hdfs dfs -put <source> <destination>
```
Example: Copy a file named "file.txt" from the local file system to the "data" directory in HDFS.
```bash
hdfs dfs -put file.txt /data
```
4. **Copying a File from HDFS to the Local File System**:
```bash
hdfs dfs -get <source> <destination>
```
Example: Copy a file named "file.txt" from the "data" directory in HDFS to the local file system.
```bash
hdfs dfs -get /data/file.txt /local/path
```
5. **Removing Files/Directories**:
```bash
hdfs dfs -rm <path>
```
Example: Remove a file named "file.txt" from the "data" directory in HDFS.
```bash
hdfs dfs -rm /data/file.txt
```
6. **Removing Directories Recursively**:
```bash
hdfs dfs -rm -r <directory>
```
Example: Remove the "data" directory and all its contents recursively from HDFS.
```bash
hdfs dfs -rm -r /data
```
7. **Moving Files/Directories within HDFS**:
```bash
hdfs dfs -mv <source> <destination>
```
Example: Move a file named "file.txt" from the "data" directory to another directory within HDFS.
```bash
hdfs dfs -mv /data/file.txt /newlocation/
```
1b)
Aim: Run a basic word count MapReduce program to understand the MapReduce paradigm.
Map Function – Takes a set of data and converts it into another set of data, where individual elements are
broken down into tuples (Key, Value pairs).
Example – (Map function in Word Count)
Input
Set of data
Bus, Car, bus, car, train, car, bus, car, train, bus, TRAIN, BUS, buS, caR, CAR, car, BUS, TRAIN
Output
Converted into another set of data (Key, Value)
(Bus,1), (Car,1), (bus,1), (car,1), (train,1), (car,1), (bus,1), (car,1), (train,1), (bus,1),
(TRAIN,1),(BUS,1), (buS,1), (caR,1), (CAR,1), (car,1), (BUS,1), (TRAIN,1)
Reduce Function – Takes the output from Map as an input and combines those data tuples into a smaller set
of tuples.
Example – (Reduce function in Word Count)
Input Set of Tuples (output of Map function)
(Bus,1), (Car,1), (bus,1), (car,1), (train,1), (car,1), (bus,1), (car,1), (train,1), (bus,1), (TRAIN,1),(BUS,1),
(buS,1),(caR,1),(CAR,1), (car,1), (BUS,1), (TRAIN,1)
Output Converts into smaller set of tuples
(BUS,7), (CAR,7), (TRAIN,4)
Mapper code:
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
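The class body of the Mapper is not reproduced in the record; a minimal sketch of the WordMapper class (the name registered in the driver below; splitting on whitespace is an assumption) is:

```java
public class WordMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    // Emit (word, 1) for every word in the input line.
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        String line = value.toString();
        for (String word : line.split("\\s+")) {
            if (word.length() > 0) {
                output.collect(new Text(word), new IntWritable(1));
            }
        }
    }
}
```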
Reducer code:
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
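Likewise, only the imports of the Reducer appear in the record; a minimal sketch of the WordReducer class (the name registered in the driver below) is:

```java
public class WordReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    // Sum the counts emitted by the mapper for each word.
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}
```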
Driver Code:
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class WordCount extends Configured implements Tool {

    public int run(String[] args) throws Exception {
        if (args.length < 2) {
            System.out.println("Please provide valid input and output paths");
            return -1;
        }
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("WordCount");
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        conf.setMapperClass(WordMapper.class);
        conf.setReducerClass(WordReducer.class);
        conf.setMapOutputKeyClass(Text.class);
        conf.setMapOutputValueClass(IntWritable.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        JobClient.runJob(conf);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new WordCount(), args);
        System.exit(exitCode);
    }
}
I/P:
HDFS is a storage unit of Hadoop
MapReduce is a processing tool of Hadoop
O/P:
HDFS 1
Hadoop 2
MapReduce 1
a 2
is 2
of 2
processing 1
storage 1
tool 1
unit 1
2)
Aim: Write a Map Reduce program that mines weather data.
Here, we will write a Map-Reduce program for analyzing weather datasets to understand its data
processing programming model. Weather sensors are collecting weather information across the globe in a
large volume of log data. This weather data is semi-structured and record-oriented.
This data is stored in a line-oriented ASCII format, where each row represents a single record. Each row has
many fields, such as longitude, latitude, daily max/min temperature, daily average temperature, etc. For simplicity,
we will focus on the main element, i.e. temperature. We will use data from the National Centers for
Environmental Information (NCEI), which has a massive amount of historical weather data that we can use for
our data analysis.
Program:
Driver code:
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class HighestDriver extends Configured implements Tool {   // driver class name assumed; not shown in the record
    public int run(String[] args) throws Exception {
        JobConf conf = new JobConf(getConf(), HighestDriver.class);
        conf.setJobName("HighestDriver");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(HighestMapper.class);
        conf.setReducerClass(HighestReducer.class);
        Path inp = new Path(args[0]);
        Path out = new Path(args[1]);
        FileInputFormat.addInputPath(conf, inp);
        FileOutputFormat.setOutputPath(conf, out);
        JobClient.runJob(conf);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new HighestDriver(), args));
    }
}
Mapper code
import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
public class HighestMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text,
IntWritable>
{
public static final int MISSING = 9999;
public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter
reporter) throws IOException
{
String line = value.toString();
String year = line.substring(15,19);
int temperature;
if (line.charAt(87)=='+')
temperature = Integer.parseInt(line.substring(88, 92));
else
temperature = Integer.parseInt(line.substring(87, 92));
String quality = line.substring(92, 93);
if (temperature != MISSING && quality.matches("[01459]"))
output.collect(new Text(year), new IntWritable(temperature));
}
}
Reducer code:
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
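The Reducer body is not reproduced in the record; a minimal sketch (the class name HighestReducer is an assumption) that emits the highest temperature observed for each year is:

```java
public class HighestReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    // For each year, keep the maximum temperature seen across all records.
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int maxValue = Integer.MIN_VALUE;
        while (values.hasNext()) {
            maxValue = Math.max(maxValue, values.next().get());
        }
        output.collect(key, new IntWritable(maxValue));
    }
}
```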
I/P:
0057
332130 # USAF weather station identifier
99999 # WBAN weather station identifier
19500101 # observation date
0300 # observation time
4
+51317 # latitude (degrees x 1000)
+028783 # longitude (degrees x 1000)
FM-12
+0171 # elevation (meters)
99999
V020
320 # wind direction (degrees)
1 # quality code
N
0072
1
00450 # sky ceiling height (meters)
1 # quality code
CN
010000 # visibility distance (meters)
1 # quality code
N9
-0128 # air temperature (degrees Celsius x 10)
1 # quality code
-0139 # dew point temperature (degrees Celsius x 10)
1 # quality code
10268 # atmospheric pressure (hectopascals x 10)
1 # quality code
O/P:
1901 317
1902 244
1903 289
1904 256
1905 283
3)
Aim: Implement matrix multiplication with Hadoop Map Reduce.
Algorithm:
MapReduce is a technique in which a huge program is subdivided into small tasks that run in parallel to make
computation faster and save time; it is mostly used in distributed systems. It has 2 important parts:
● Mapper: It takes raw data as input and organizes it into (key, value) pairs. For example, in a dictionary you search
for the word “Data” and its associated meaning is “facts and statistics collected together for reference or
analysis”. Here the key is “Data” and the value associated with it is “facts and statistics collected together for
reference or analysis”.
● Reducer: It is responsible for processing the data in parallel and producing the final output.
Let us consider the matrix multiplication example to visualize MapReduce. Consider the matrices
A = [[1, 2], [3, 4]] and B = [[5, 6], [7, 8]] (the values used in the worked example below).
Here matrix A is a 2×2 matrix, which means the number of rows (i) = 2 and the number of
columns (j) = 2. Matrix B is also a 2×2 matrix, where the number of rows (j) = 2 and the number of
columns (k) = 2. Each cell of the matrices is labelled as Aij and Bjk; for example, element 3 in matrix A is called
A21, i.e. 2nd row, 1st column. One-step matrix multiplication has 1 mapper and 1 reducer.
The formula is:
Mapper for Matrix A: (k, v) = ((i, k), (A, j, Aij)) for all k
Mapper for Matrix B: (k, v) = ((i, k), (B, j, Bjk)) for all i
Therefore, computing the mapper for Matrix A: k, i and j each run over the matrix dimensions. Here all are 2,
so when k = 1, i can take the 2 values 1 and 2, and each case can have 2 further values of j = 1 and j = 2.
Substituting all values in the formula gives:
for k = 1: ((1, 1), (A, 1, 1)), ((1, 1), (A, 2, 2)), ((2, 1), (A, 1, 3)), ((2, 1), (A, 2, 4))
and similarly for k = 2: ((1, 2), (A, 1, 1)), ((1, 2), (A, 2, 2)), ((2, 2), (A, 1, 3)), ((2, 2), (A, 2, 4)).
The mapper for Matrix B likewise emits, for every i, ((i, k), (B, j, Bjk)); for example
((1, 1), (B, 1, 5)), ((1, 1), (B, 2, 7)), ((1, 2), (B, 1, 6)), ((1, 2), (B, 2, 8)), and the same values keyed by (2, k).
The reducer then, for each output cell (i, k), multiplies the matching A and B values (matching on j) and sums them:
(i) (1, 1): 1×5 + 2×7 = 19
(ii) (1, 2): 1×6 + 2×8 = 22
(iii) (2, 1): 3×5 + 4×7 = 43
(iv) (2, 2): 3×6 + 4×8 = 50
From (i), (ii), (iii) and (iv) we conclude that ((1, 1), 19)
((1, 2), 22)
((2, 1), 43)
((2, 2), 50)
Therefore the final matrix is A × B = [[19, 22], [43, 50]].
Driver Code:
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
//import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
public class MatDriver {
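    // The body of MatDriver is not shown in the record; this is a minimal sketch.
    // The mapper/reducer class names (MatMapper, MatReducer) are assumptions.
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        job.setJarByClass(MatDriver.class);
        job.setJobName("MatrixMultiplication");
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input files with the cells of A and B
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory
        job.setMapperClass(MatMapper.class);
        job.setReducerClass(MatReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}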
Mapper Code:
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
//import org.apache.hadoop.mapreduce.Mapper.Context;
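The mapper body is not reproduced in the record; a minimal sketch that produces the tagged values consumed by the reducer below is shown here. The class name MatMapper, the "row,col,value" input format, identifying the matrix by its input file name (e.g. a.txt and b.txt), and the 2×2 dimension are assumptions.

```java
public class MatMapper extends Mapper<LongWritable, Text, Text, Text> {

    private static final int N = 2;   // dimension of the square matrices in this example

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Which matrix does this record belong to? Taken from the input file name.
        String fileName = ((org.apache.hadoop.mapreduce.lib.input.FileSplit)
                context.getInputSplit()).getPath().getName();
        String[] cell = value.toString().split(",");
        int row = Integer.parseInt(cell[0].trim());
        int col = Integer.parseInt(cell[1].trim());
        String val = cell[2].trim();

        if (fileName.startsWith("a")) {
            // A[row][col] contributes to every result cell (row, k): emit "a,row,col,value"
            for (int k = 0; k < N; k++) {
                context.write(new Text(row + "," + k), new Text("a," + row + "," + col + "," + val));
            }
        } else {
            // B[row][col] contributes to every result cell (i, col): emit "b,row,col,value"
            for (int i = 0; i < N; i++) {
                context.write(new Text(i + "," + col), new Text("b," + row + "," + col + "," + val));
            }
        }
    }
}
```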
Reducer code:
import java.io.IOException;
//import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class MatReducer extends Reducer<Text, Text, Text, LongWritable> { // class name assumed; the record omits the declaration
public void reduce(Text key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {
int[] a = new int[5];
int[] b = new int[5];
// b, 2, 0, 30
for (Text value : values) {
System.out.println(value);
String cell[] = value.toString().split(",");
if (cell[0].contains("a")) // take rows here
{
int col = Integer.parseInt(cell[2].trim());
a[col] = Integer.parseInt(cell[3].trim());
}
else if (cell[0].contains("b")) // take col here
{
int row = Integer.parseInt(cell[1].trim());
b[row] = Integer.parseInt(cell[3].trim());
}
}
int total = 0;
for (int i = 0; i < 5; i++) {
int val = a[i] * b[i];
total += val;
}
context.write(key, new LongWritable(total));
}
}
I/P:
Matrix - A
0,0,1
0,1,2
1,0,3
1,1,4
Matrix – B
0,0,1
0,1,2
1,0,3
1,1,4
O/P:
0,0,7
0,1,10
1,0,15
1,1,22
4) Aim: Working with files in Hadoop file system: Reading, Writing and Copying.
Algorithm:
Step 1: The client opens the file it wishes to read by calling open() on the FileSystem object (which for HDFS is an
instance of DistributedFileSystem).
Step 2: DistributedFileSystem (DFS) calls the name node, using remote procedure calls (RPCs), to determine the
locations of the first few blocks in the file. For each block, the name node returns the addresses of the data nodes that
have a copy of that block. The DFS returns an FSDataInputStream to the client for it to read data from.
FSDataInputStream in turn wraps a DFSInputStream, which manages the data node and name node I/O.
Step 3: The client then calls read() on the stream. DFSInputStream, which has stored the data node addresses for the
first few blocks in the file, connects to the first (closest) data node for the first block in the file.
Step 4: Data is streamed from the data node back to the client, which calls read() repeatedly on the stream.
Step 5: When the end of a block is reached, DFSInputStream closes the connection to that data node and then finds
the best data node for the next block. This happens transparently to the client, which from its point of view is simply
reading a continuous stream. Blocks are read in order, with DFSInputStream opening new connections to data nodes
as the client reads through the stream. It will also call the name node to retrieve the data node locations for the next
batch of blocks as needed.
Step 6: When the client has finished reading the file, it calls close() on the FSDataInputStream.
Writing a file to HDFS follows a similar sequence of steps:
Step 1: The client creates the file by calling create() on DistributedFileSystem (DFS).
Step 2: DFS makes an RPC call to the name node to create a new file in the file system’s namespace, with no blocks
associated with it. The name node performs various checks to make sure the file doesn’t already exist and that the
client has the right permissions to create the file. If these checks pass, the name node makes a record of the new
file; otherwise, the file can’t be created and the client is thrown an error, i.e. an IOException. The DFS returns
an FSDataOutputStream for the client to start writing data to.
Step 3: As the client writes data, the DFSOutputStream splits it into packets, which it writes to an internal queue
called the data queue. The data queue is consumed by the DataStreamer, which is responsible for asking the name
node to allocate new blocks by picking a list of suitable data nodes to store the replicas. The list of data nodes forms
a pipeline, and here we’ll assume the replication level is three, so there are three nodes in the pipeline. The
DataStreamer streams the packets to the first data node in the pipeline, which stores each packet and forwards it to
the second data node in the pipeline.
Step 4: Similarly, the second data node stores the packet and forwards it to the third (and last) data node in the
pipeline.
Step 5: The DFSOutputStream maintains an internal queue of packets that are waiting to be acknowledged by data
nodes, called the “ack queue”.
Step 6: When the client has finished writing data, it calls close() on the stream. This flushes all the remaining packets
to the data node pipeline and waits for acknowledgments before contacting the name node to signal that the file is
complete.
HDFS follows the Write Once Read Many model. So, we can’t edit files that are already stored in HDFS, but we can
append to them by reopening the file. This design allows HDFS to scale to a large number of concurrent clients
because the data traffic is spread across all the data nodes in the cluster. Thus, it increases the availability, scalability,
and throughput of the system.
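The read and write paths described above can also be exercised programmatically through Hadoop's FileSystem API. The following is a minimal sketch (the class name, file paths, and file contents are illustrative; the Configuration is assumed to pick up the cluster settings from core-site.xml/hdfs-site.xml):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class HdfsFileDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // loads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        // Writing: create() returns an FSDataOutputStream (Step 2 of the write path)
        Path file = new Path("/data/sample.txt");
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeBytes("Hello HDFS\n");
        }

        // Reading: open() returns an FSDataInputStream (Step 2 of the read path)
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(fs.open(file)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }

        // Copying within HDFS
        FileUtil.copy(fs, file, fs, new Path("/data/sample_copy.txt"), false, conf);

        fs.close();
    }
}
```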
Reading:
Use the -cat command to display the contents of a file in HDFS; the syntax and an example are shown below.
Writing:
Create a small file locally using the echo command and write it into HDFS with the -put command (or append to an
existing file with -appendToFile).
Copying:
Use the -cp command to copy a file from one HDFS location to another.
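A minimal set of shell commands for these three operations (the file and directory names are illustrative) might be:

```bash
# Create a small file locally and write it into HDFS
echo "Hello Hadoop" > sample.txt
hdfs dfs -put sample.txt /data/

# Read the file back
hdfs dfs -cat /data/sample.txt

# Copy the file to another HDFS location
hdfs dfs -cp /data/sample.txt /data/backup/sample.txt
```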
Pig Installation
STEP:1
Download Pig from https://dlcdn.apache.org/pig/pig-0.16.0/ and download the file pig-0.16.0.tar.gz
STEP:2
Copy the downloaded file into the Hadoop installation directory (here, /home/student/Installations/hadoop-1.2.1)
STEP:3
Now extract the copied file using the following command tar -xzf pig-0.16.0.tar.gz
STEP:4
Edit the .bashrc file to update the environment variables for Apache Pig. Add the following variables at the end of
the .bashrc as shown below
# Set PIG_HOME
export PIG_HOME=/home/student/Installations/hadoop-1.2.1/pig-0.16.0
export PATH=$PATH:/home/student/Installations/hadoop-1.2.1/pig-0.16.0/bin
export PIG_CLASSPATH=$HADOOP_CONF_DIR
STEP:5
To apply the changes, run the following command: source /home/student/.bashrc
STEP:6
To verify that everything is working, check the Pig version: pig -version
STEP:7
To open the Grunt shell, run the command: pig -x local
TASK-5: Write Pig Latin scripts to sort, group, join, project, and filter your data.
Pig scripts are nothing but files with the extension .pig which contain Pig commands.
Instead of running each command individually, we write a script and execute it in batch mode.
Here is the dataset that we are going to work on.
Filter: FILTER is commonly used to select the data that you want; or, conversely, to filter out (remove) the data
you don’t want.
We are running this Pig script locally, so open a terminal in the directory containing the script
and type the command pig -x local filename.pig. A sketch of such a script follows.
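The dataset and the script itself appear only as screenshots in the original record; a minimal sketch of such a script, assuming relations students(regno, name, hobbies) and marks(regno, marks) loaded from illustrative file names, is:

```pig
-- File names and schemas are assumptions; the original dataset is not reproduced here.
students = LOAD 'students.txt' USING PigStorage(',')
           AS (regno:int, name:chararray, hobbies:chararray);
marks    = LOAD 'marks.txt' USING PigStorage(',')
           AS (regno:int, marks:int);

-- 1. Sort: order the students by regno
sorted = ORDER students BY regno;
DUMP sorted;

-- 2. Group: group the students by their hobbies
grouped = GROUP students BY hobbies;
DUMP grouped;

-- 3. Filter: keep only the students whose hobby is kabaddi
kabaddi = FILTER students BY hobbies == 'kabaddi';
DUMP kabaddi;

-- 4. Project: keep only the name and hobbies columns
projected = FOREACH students GENERATE name, hobbies;
DUMP projected;

-- 5. Joins: inner join and left outer join on regno
inner_joined = JOIN students BY regno, marks BY regno;
DUMP inner_joined;
left_joined  = JOIN students BY regno LEFT OUTER, marks BY regno;
DUMP left_joined;
```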
Output snippets:
1. Sort:
Output for ORDER: the students appear in random order in the dataset; here we order them by regno.
2. Group:
Grouping the students according to their hobbies.
3. Filter:
Filtering the students whose hobbies == 'kabaddi'.
4. Project:
Projecting only the name and hobbies columns.
5. Joins:
Inner join:
Displaying the students who have an entry in both the hobbies and marks tables by joining them.
Left join:
The left table is joined with the right table; if no matching entry is found in the right table, the corresponding
fields are left empty (null).
TASK 6: Run Pig Latin scripts to find the word count and the maximum temperature for each year.
DATASET
Pig script
Max example
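The dataset and the scripts appear only as screenshots in the record; a minimal sketch of the two scripts (input file names and schemas are assumptions) is:

```pig
-- Word count
lines  = LOAD 'input.txt' AS (line:chararray);
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group, COUNT(words);
DUMP counts;

-- Maximum temperature per year
weather  = LOAD 'temperature.txt' USING PigStorage(',') AS (year:int, temp:int);
by_year  = GROUP weather BY year;
max_temp = FOREACH by_year GENERATE group AS year, MAX(weather.temp) AS maxtemp;
DUMP max_temp;
```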
Task -7: Writing User Defined Functions/Eval functions for filtering unwanted data in Pig.
To write a UDF using Java, we have to integrate the jar file Pig-0.15.0.jar. In this section, we discuss how to write a
sample UDF using Eclipse. Before proceeding further, make sure you have installed Eclipse and Maven in your
system.
Create a new class file named Sample_Eval and copy the following content into it.
UDF IN JAVA:
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
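Only the imports appear in the record; a minimal sketch of the Sample_Eval class, which upper-cases its first argument to match the usage shown later, is:

```java
public class Sample_Eval extends EvalFunc<String> {

    // Return the first argument converted to upper case; null/empty input yields null.
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        String str = (String) input.get(0);
        return str.toUpperCase();
    }
}
```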
Step 1: Registering the JAR file
Start Apache Pig in local mode and register the jar file sample_udf.jar as shown below.
$ cd $PIG_HOME/bin
$ ./pig -x local
REGISTER '/$PIG_HOME/sample_udf.jar'
Note: we assume the JAR file is at the path /$PIG_HOME/sample_udf.jar.
Step 2: Defining Alias
After registering the UDF we can define an alias to it using the Define operator.
DEFINE sample_eval Sample_Eval();
Step 3: Using the UDF
After defining the alias you can use the UDF same as the built-in functions. Suppose there is a file named emp_data
in the HDFS /Pig_Data/ directory with the following content.
And assume we have loaded this file into Pig as shown below.
grunt>emp_data = LOAD 'hdfs://localhost:9000/pig_data/emp1.txt' USING PigStorage(',')
as (id:int, name:chararray, age:int, city:chararray);
Let us now convert the names of the employees into upper case using the UDF sample_eval.
grunt> Upper_case = FOREACH emp_data GENERATE sample_eval(name);
Verify the contents of the relation Upper_case as shown below.
TASK-8: Working with Hive QL, Use Hive to create, alter, and drop databases, tables, views, functions, and
indexes.
Program:
Creating, loading data into a table:
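The table used in the examples below can be created and loaded as follows (the column list and the data file path are illustrative):
CREATE TABLE my_table (id INT, name STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
LOAD DATA LOCAL INPATH '/home/student/data/my_table.csv' INTO TABLE my_table;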
Altering a table:
ALTER TABLE my_table ADD COLUMNS (age INT);
o/p: OK
Drop a table:
DROP TABLE my_table;
o/p: OK
Creating a view:
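For example, over the table created above:
CREATE VIEW my_view AS SELECT id, name FROM my_table;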
Drop a view:
DROP VIEW my_view;
o/p: OK
Alter a view:
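For example, ALTER VIEW ... AS redefines the view's query:
ALTER VIEW my_view AS SELECT id FROM my_table;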
Create a function:
CREATE FUNCTION my_function AS 'org.example.MyFunction' USING JAR 'my_functions.jar';
o/p: OK
Drop a function:
DROP FUNCTION my_function;
o/p: OK
Create an index:
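For example, using the compact index handler (indexes are supported in Hive releases before 3.0):
CREATE INDEX my_index ON TABLE my_table (name)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD;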
Drop index:
DROP INDEX my_index ON my_table;
o/p: OK
Task 9: Writing User Defined Functions in Hive.
Program:
1. Create a Java class for the User Defined Function which extends
org.apache.hadoop.hive.ql.exec.UDF and implements one or more evaluate() methods. Put in
your desired logic and you are almost there.
2. Package your Java class into a JAR file (I am using Maven)
3. Go to the Hive CLI, add your JAR, and verify that your JAR is in the Hive CLI classpath.
4. CREATE TEMPORARY FUNCTION in Hive which points to your Java class
5. Use it in Hive SQL.
package org.hardik.letsdobigdata;
import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;
// UDF class (the class declaration is not shown in the record; the name Strip is assumed)
public class Strip extends UDF {

    private Text result = new Text();

    // Trims the characters given in stripChars from both ends of str.
    public Text evaluate(Text str, String stripChars) {
        if (str == null)
            return null;
        result.set(StringUtils.strip(str.toString(), stripChars));
        return result;
    }

    // Trims leading and trailing spaces.
    public Text evaluate(Text str) {
        if (str == null)
            return null;
        result.set(StringUtils.strip(str.toString()));
        return result;
    }
}
evaluate(Text str, String stripChars) - trims the characters specified in stripChars from the first argument str.
evaluate(Text str) - trims leading and trailing spaces.
Run "mvn clean package". This will create a JAR file which contains our UDF class. Copy the
JAR's path.
Next, add the Hive JAR files to the project classpath.
Edit the .bashrc file to update the environment variables for Apache Hive. Add the following
variables at the end of the .bashrc as shown below.
export HIVE_HOME=/usr/local/hive
export HADOOP_HOME=/usr/local/hadoop
export CLASSPATH=$CLASSPATH:$HIVE_HOME/lib/*:$HADOOP_HOME/*:$HADOOP_HOME/lib/*
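The registration and test queries appear as screenshots in the record; assuming the UDF class above is org.hardik.letsdobigdata.Strip and an illustrative JAR path, they would look like:
ADD JAR /home/student/udfs/hive-udf-1.0.jar;
CREATE TEMPORARY FUNCTION strip AS 'org.hardik.letsdobigdata.Strip';
SELECT strip('hadoop', 'ha');
SELECT strip('   hadoop   ');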
The first query strips ‘ha’ from string ‘hadoop’ as expected (2 argument evaluate() in
code). The second query strips trailing and leading spaces as expected.
TASK 10: Understanding the processing of large dataset on Spark framework.
Prg:
The result is stored in a pyspark.sql.DataFrame variable. Now let us look into the data schema and the
number of records in it.
Let us compute the number of products per country to get an idea of the database composition:
from pyspark.sql.functions import col
BDD_countries = raw_data.groupBy("countries_tags").count().persist()
BDD_countries is also a PySpark DataFrame and has the following structure:
We can filter this new data frame to keep only the countries that have at least 5000 products recorded
in the database and plot the result:
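The notebook cells appear as screenshots in the record; a minimal PySpark sketch of the steps described above (the dataset path, separator, and the plotting call are assumptions) is:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("LargeDatasetProcessing").getOrCreate()

# Load the products dataset (path and options are assumptions)
raw_data = spark.read.csv("products.csv", header=True, sep="\t")

# Inspect the schema and the number of records
raw_data.printSchema()
print(raw_data.count())

# Number of products per country
BDD_countries = raw_data.groupBy("countries_tags").count().persist()
BDD_countries.show(10)

# Keep only the countries with at least 5000 products and plot (requires matplotlib)
top_countries = BDD_countries.filter(col("count") >= 5000).toPandas()
top_countries.plot.bar(x="countries_tags", y="count")
```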
TASK 11: Ingesting structured and unstructured data using Sqoop, Flume
Program:
Ingesting structured data using Sqoop
Sqoop is a tool that enables you to import data from structured data stores like relational databases into Hadoop.
Here are the basic steps to ingest data using Sqoop:
a. Install Sqoop: Install Sqoop on the machine that will run the Sqoop command.
b. Connect to the source database: Use the `sqoop import` command to connect to the source database, specify the
table to import, and provide the necessary credentials.
c. Specify the target Hadoop system: Specify the target Hadoop system to which you want to import the data
d. Define the target Hadoop destination: Define the Hadoop destination for the imported data, such as a Hive table.
Here is an example command to import data from a MySQL database into a Hive table:
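The command itself appears as a screenshot in the record; a hypothetical example (host, database, credentials, and table names are placeholders) is:

```bash
sqoop import \
  --connect jdbc:mysql://localhost/employees_db \
  --username root -P \
  --table employees \
  --hive-import \
  --hive-table employees \
  -m 1
```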
After executing this command, Map tasks are executed at the back end.
Once the job completes, you can check the HDFS Web UI at localhost:50070 to verify that the data has been imported.
Ingesting unstructured data using Flume
Flume is a tool that enables you to ingest unstructured data such as log files and event streams into Hadoop. Here
are the basic steps to ingest data using Flume:
a. Install Flume: Install Flume on the machine that will run the Flume command.
b. Configure Flume: Configure Flume to read data from the source and write it to the target Hadoop system. This
involves specifying the source type, the source location, and the target Hadoop destination.
c. Start the Flume agent: Start the Flume agent to begin the data ingestion process.
Here is an example configuration file for Flume that reads data from a log file and writes it to a Hadoop
destination:
agent.sources = logsource
agent.channels = memoryChannel
agent.sinks = hdfssink
agent.sources.logsource.type = exec
agent.sources.logsource.command = tail -F /var/log/messages
agent.sources.logsource.channels = memoryChannel
agent.sinks.hdfssink.type = hdfs
agent.sinks.hdfssink.hdfs.path = hdfs://hadoop.example.com/logs/
agent.sinks.hdfssink.channel = memoryChannel
agent.channels.memoryChannel.type = memory
agent.channels.memoryChannel.capacity = 1000
agent.sources.logsource.interceptors = timestampInterceptor
agent.sources.logsource.interceptors.timestampInterceptor.type = timestamp
This configuration reads data from the `/var/log/messages` file using the `tail -F` command and writes it to the
`/logs` directory in HDFS. It also adds a timestamp to each event using the `timestampInterceptor` interceptor.
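The agent defined above can then be started with the flume-ng command (the configuration file name is an assumption):

```bash
flume-ng agent --conf ./conf --conf-file flume.conf --name agent -Dflume.root.logger=INFO,console
```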