BDA LAB RECORD

CERTIFICATE

This is to certify that ……………………………………….. bearing the Regd. No. ……………........

is a student of ………………. B.Tech …………………………………….. semester and

has completed …………………. experiments in the ……………………………………………..

Laboratory during the academic year 2023-24.

Signature of Laboratory Incharge                    Signature of Head of the Department

Signature of External Examiner

INDEX

S.No | Date | Name of the Experiment | Page No | Marks | Remarks

Big Data & Hadoop Lab Experiments

1  a) Understanding and using basic HDFS commands
   b) Run a basic word count MapReduce program to understand the MapReduce paradigm
2  Write a MapReduce program that mines weather data
3  Implement matrix multiplication with Hadoop MapReduce
4  Working with files in the Hadoop file system: reading, writing and copying
5  Write Pig Latin scripts to sort, group, join, project, and filter your data
6  Run the Pig Latin scripts to find the word count and the maximum temperature for each year
7  Writing User Defined Functions/Eval functions for filtering unwanted data in Pig
8  Working with HiveQL: use Hive to create, alter, and drop databases, tables, views, functions, and indexes
9  Writing User Defined Functions in Hive
10 Understanding the processing of a large dataset on the Spark framework
11 Ingesting structured and unstructured data using Sqoop and Flume
KKR & KSR INSTITUTE OF TECHNOLOGY AND SCIENCES
(Autonomous)
Accredited by NBA & NAAC with Grade “A” and Affiliated to JNTUK-Kakinada
Vinjanampadu, Vatticherukuru Mandal, Guntur, Andhra Pradesh - 522017

DEPARTMENT OF CSE - DATA SCIENCE


SEMESTER - VI
Course Code Course Name L T P C
20CD6L01 BIG DATA AND HADOOP LAB 0 0 3 1.5
Course Objectives:
● Provide the knowledge to set up a Hadoop cluster.
● Impart knowledge to develop programs using MapReduce.
● Discuss Pig, Pig Latin and HiveQL to process big data.
● Present the latest big data frameworks and applications using Spark.
● Integrate Hadoop with R (RHadoop) to process and visualize data.
Course Outcomes:
CO-1: Understand the Hadoop working environment.
CO-2: Apply MapReduce programs to real-world problems.
CO-3: Implement scripts using Pig to solve real-world problems.
CO-4: Analyze datasets using Hive queries.
CO-5: Understand the Spark working environment and its integration with R.
Experiments:
TASK 1: a) Understanding and using basic HDFS commands
b) Run a basic word count MapReduce program to understand the MapReduce paradigm.
TASK 2: Write a MapReduce program that mines weather data.
TASK 3: Implement matrix multiplication with Hadoop MapReduce.
TASK 4: Working with files in the Hadoop file system: reading, writing and copying.
TASK 5: Write Pig Latin scripts to sort, group, join, project, and filter your data.
TASK 6: Run the Pig Latin scripts to find the word count and the maximum temperature for each year.
TASK 7: Writing User Defined Functions/Eval functions for filtering unwanted data in Pig.
TASK 8: Working with HiveQL: use Hive to create, alter, and drop databases, tables, views, functions, and indexes.
TASK 9: Writing User Defined Functions in Hive.
TASK 10: Understanding the processing of a large dataset on the Spark framework.
TASK 11: Ingesting structured and unstructured data using Sqoop and Flume.
Text Books:
1. Tom White, “Hadoop: The Definitive Guide”, 4th Edition, O’Reilly Inc,2015.
2. Tanmay Deshpande, “Hadoop Real-World Solutions Cookbook”, 2nd Edition, Packt Publishing, 2016.
Reference Books:
Edward Capriolo, Dean Wampler, and Jason Rutherglen, “Programming Hive”, O’Reilly
LAB EXPERIMENTS
Experiment 1
1a)
Aim: Understanding and working on the HDFS commands.

Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It
provides high-throughput access to application data and is suitable for applications with large data sets. Below are
some basic HDFS commands along with examples:

1. **Listing Files/Directories**:

```bash
hdfs dfs -ls <path>
```

Example: List files and directories in the root directory.

```bash
hdfs dfs -ls /
```

2. **Creating a Directory**:

```bash
hdfs dfs -mkdir <directory>
```

Example: Create a directory named "data" in the root directory.

```bash
hdfs dfs -mkdir /data
```

3. **Copying Files to HDFS**:

```bash
hdfs dfs -put <source> <destination>
```

Example: Copy a file named "file.txt" from the local file system to the "data" directory in HDFS.

```bash
hdfs dfs -put file.txt /data
```

4. **Copying Files from HDFS to Local File System**:

```bash
hdfs dfs -get <source> <destination>
```

Example: Copy a file named "file.txt" from the "data" directory in HDFS to the local file system.

```bash
hdfs dfs -get /data/file.txt /local/path
```

5. **Removing Files/Directories**:

```bash
hdfs dfs -rm <path>
```

Example: Remove a file named "file.txt" from the "data" directory in HDFS.

```bash
hdfs dfs -rm /data/file.txt
```

6. **Removing Directories Recursively**:

```bash
hdfs dfs -rm -r <directory>
```

Example: Remove the "data" directory and all its contents recursively from HDFS.

```bash
hdfs dfs -rm -r /data
```
7. **Moving Files/Directories within HDFS**:

```bash
hdfs dfs -mv <source> <destination>
```

Example: Move a file named "file.txt" from the "data" directory to another directory within HDFS.

```bash
hdfs dfs -mv /data/file.txt /newlocation/
```

1b)
Aim: Run a basic word count Map Reduce program to understand Map Reduce Paradigm.

Map Function – Takes a set of data and converts it into another set of data, where individual elements are broken into tuples (key, value).
Example – (Map function in Word Count)
Input (a set of data):
Bus, Car, bus, car, train, car, bus, car, train, bus, TRAIN, BUS, buS, caR, CAR, car, BUS, TRAIN
Output (converted into another set of (key, value) pairs):
(Bus,1), (Car,1), (bus,1), (car,1), (train,1), (car,1), (bus,1), (car,1), (train,1), (bus,1),
(TRAIN,1), (BUS,1), (buS,1), (caR,1), (CAR,1), (car,1), (BUS,1), (TRAIN,1)
Reduce Function – Takes the output from Map as an input and combines those data tuples into a smaller set
of tuples.
Example – (Reduce function in Word Count)
Input Set of Tuples (output of Map function)
(Bus,1), (Car,1), (bus,1), (car,1), (train,1), (car,1), (bus,1), (car,1), (train,1), (bus,1), (TRAIN,1),(BUS,1),
(buS,1),(caR,1),(CAR,1), (car,1), (BUS,1), (TRAIN,1)
Output Converts into smaller set of tuples
(BUS,7), (CAR,7), (TRAIN,4)

Mapper code:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;

import org.apache.hadoop.mapred.Reporter;

public class WordMapper extends MapReduceBase implements
        Mapper<LongWritable, Text, Text, IntWritable> {

    public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output, Reporter r)
            throws IOException {
        String s = value.toString();
        // Emit (word, 1) for every non-empty word in the line
        for (String word : s.split(" ")) {
            if (word.length() > 0) {
                output.collect(new Text(word), new IntWritable(1));
            }
        }
    }
}

Reducer code:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordReducer extends MapReduceBase implements
        Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter r)
            throws IOException {
        int count = 0;
        // Sum all the 1s emitted by the mapper for this word
        while (values.hasNext()) {
            count += values.next().get();
        }
        output.collect(key, new IntWritable(count));
    }
}

Driver Code:

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCount extends Configured implements Tool{

public int run(String[] args) throws Exception {

if (args.length<2){
System.out.println("plz give i/p and o/p properly");
return -1;
}
JobConf conf = new JobConf(WordCount.class);
conf.setJobName("WordCount");
FileInputFormat.setInputPaths(conf,new Path(args[0]));
FileOutputFormat.setOutputPath(conf,new Path(args[1]));

conf.setMapperClass(WordMapper.class);
conf.setReducerClass(WordReducer.class);
conf.setMapOutputKeyClass(Text.class);
conf.setMapOutputValueClass(IntWritable.class);
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
JobClient.runJob(conf);

return 0;
}

public static void main(String[] args) throws Exception {
int exitcode = ToolRunner.run(new WordCount(),args);
System.exit(exitcode);
}
}

I/P:
HDFS is a storage unit of Hadoop
MapReduce is a processing tool of Hadoop
O/P:
HDFS 1
Hadoop 2
MapReduce 1
a 2
is 2
of 2
processing 1
storage 1
tool 1
unit 1
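The record does not show how the job is compiled and submitted; a typical sequence of commands, with the jar name, class name and HDFS paths chosen only as examples, is:

```bash
# Compile the three classes against the Hadoop libraries and package them into a jar
mkdir -p classes
javac -classpath "$(hadoop classpath)" -d classes WordMapper.java WordReducer.java WordCount.java
jar cf wordcount.jar -C classes .

# Copy the input file into HDFS and run the job
hdfs dfs -mkdir -p /wordcount/input
hdfs dfs -put input.txt /wordcount/input
hadoop jar wordcount.jar WordCount /wordcount/input /wordcount/output

# Inspect the result
hdfs dfs -cat /wordcount/output/part-00000
```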

2)
Aim: Write a Map Reduce program that mines weather data.

Here, we will write a Map-Reduce program for analyzing weather datasets to understand its data
processing programming model. Weather sensors are collecting weather information across the globe in a
large volume of log data. This weather data is semi-structured and record-oriented.
This data is stored in a line-oriented ASCII format, where each row represents a single record. Each row has
many fields such as longitude, latitude, daily maximum and minimum temperature, daily average temperature, etc. For simplicity,
we will focus on the main element: temperature. We will use data from the National Centers for
Environmental Information (NCEI), which has a massive amount of historical weather data that we can use for our
data analysis.

Prg:

Driver code:

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class HighestDriver extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        int exitcode = ToolRunner.run(new HighestDriver(), args);
        System.exit(exitcode);
    }

    public int run(String[] args) throws Exception {
        JobConf conf = new JobConf(getConf(), HighestDriver.class);
        conf.setJobName("HighestDriver");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(HighestMapper.class);
        conf.setReducerClass(HighestReducer.class);

        // Input and output paths come from the command-line arguments
        Path inp = new Path(args[0]);
        Path out = new Path(args[1]);
        FileInputFormat.addInputPath(conf, inp);
        FileOutputFormat.setOutputPath(conf, out);

        JobClient.runJob(conf);
        return 0;
    }
}

Mapper code

import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
public class HighestMapper extends MapReduceBase implements
        Mapper<LongWritable, Text, Text, IntWritable> {

    public static final int MISSING = 9999;

    public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        String line = value.toString();
        String year = line.substring(15, 19);

        int temperature;
        if (line.charAt(87) == '+')
            temperature = Integer.parseInt(line.substring(88, 92));
        else
            temperature = Integer.parseInt(line.substring(87, 92));

        String quality = line.substring(92, 93);
        if (temperature != MISSING && quality.matches("[01459]"))
            output.collect(new Text(year), new IntWritable(temperature));
    }
}

Reducer code:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class HighestReducer extends MapReduceBase implements
        Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int max_temp = 0;
        while (values.hasNext()) {
            int current = values.next().get();
            if (max_temp < current)
                max_temp = current;
        }
        output.collect(key, new IntWritable(max_temp / 10));
    }
}
I/P:

0057
332130 # USAF weather station identifier
99999 # WBAN weather station identifier
19500101 # observation date
0300 # observation time
4
+51317 # latitude (degrees x 1000)
+028783 # longitude (degrees x 1000)
FM-12
+0171 # elevation (meters)
99999
V020
320 # wind direction (degrees)
1 # quality code
N
0072
1
00450 # sky ceiling height (meters)
1 # quality code
CN
010000 # visibility distance (meters)
1 # quality code
N9
-0128 # air temperature (degrees Celsius x 10)
1 # quality code
-0139 # dew point temperature (degrees Celsius x 10)
1 # quality code
10268 # atmospheric pressure (hectopascals x 10)
1 # quality code

O/P:
1901 317
1902 244
1903 289
1904 256
1905 283

3)
Aim: Implement matrix multiplication with Hadoop Map Reduce.

Algorithm:
MapReduce is a technique in which a large program is subdivided into small tasks that are run in parallel to make
computation faster and save time; it is mostly used in distributed systems. It has 2 important parts:
● Mapper: It takes raw data as input and organizes it into (key, value) pairs. For example, in a dictionary you search
for the word “Data” and its associated meaning is “facts and statistics collected together for reference or
analysis”. Here the key is “Data” and the value associated with it is “facts and statistics collected together for
reference or analysis”.
● Reducer: It is responsible for processing the data in parallel and producing the final output.
Let us consider the matrix multiplication example to visualize MapReduce. Consider the following two matrices:

A = | 1  2 |        B = | 5  6 |
    | 3  4 |            | 7  8 |

Here matrix A is a 2×2 matrix, which means the number of rows (i) = 2 and the number of
columns (j) = 2. Matrix B is also a 2×2 matrix, where the number of rows (j) = 2 and the number of
columns (k) = 2. Each cell of the matrices is labelled Aij and Bjk; for example, element 3 in matrix A is called
A21, i.e. 2nd row, 1st column. One-step matrix multiplication has 1 mapper and 1 reducer.
The formulas are:
Mapper for Matrix A: (k, v) = ((i, k), (A, j, Aij)) for all k
Mapper for Matrix B: (k, v) = ((i, k), (B, j, Bjk)) for all i

Therefore, computing the mapper for Matrix A:
# k, i and j each range over the matrix dimensions.
# Here all are 2; therefore when k=1, i can have 2 values (1 and 2),
# and each case can have 2 further values of j (1 and 2).
# Substituting all values in the formula:

k=1  i=1  j=1  ((1, 1), (A, 1, 1))
          j=2  ((1, 1), (A, 2, 2))
     i=2  j=1  ((2, 1), (A, 1, 3))
          j=2  ((2, 1), (A, 2, 4))

k=2  i=1  j=1  ((1, 2), (A, 1, 1))
          j=2  ((1, 2), (A, 2, 2))
     i=2  j=1  ((2, 2), (A, 1, 3))
          j=2  ((2, 2), (A, 2, 4))

Computing the mapper for Matrix B:

i=1  j=1  k=1  ((1, 1), (B, 1, 5))
          k=2  ((1, 2), (B, 1, 6))
     j=2  k=1  ((1, 1), (B, 2, 7))
          k=2  ((1, 2), (B, 2, 8))

i=2  j=1  k=1  ((2, 1), (B, 1, 5))
          k=2  ((2, 2), (B, 1, 6))
     j=2  k=1  ((2, 1), (B, 2, 7))
          k=2  ((2, 2), (B, 2, 8))

The formula for the Reducer is:
Reducer(k, v) = (i, k) => make a sorted Alist and Blist
(i, k) => Summation (Aij * Bjk) for all j
Output => ((i, k), sum)

Therefore, computing the reducer:
# We can observe from the Mapper computation that 4 keys are common: (1, 1), (1, 2), (2, 1) and (2, 2).
# Make a separate list for Matrix A and B with the adjoining values taken from the Mapper step above:

(1, 1) => Alist = {(A, 1, 1), (A, 2, 2)}
          Blist = {(B, 1, 5), (B, 2, 7)}
          Now Aij x Bjk: [(1*5) + (2*7)] = 19   (i)

(1, 2) => Alist = {(A, 1, 1), (A, 2, 2)}
          Blist = {(B, 1, 6), (B, 2, 8)}
          Now Aij x Bjk: [(1*6) + (2*8)] = 22   (ii)

(2, 1) => Alist = {(A, 1, 3), (A, 2, 4)}
          Blist = {(B, 1, 5), (B, 2, 7)}
          Now Aij x Bjk: [(3*5) + (4*7)] = 43   (iii)

(2, 2) => Alist = {(A, 1, 3), (A, 2, 4)}
          Blist = {(B, 1, 6), (B, 2, 8)}
          Now Aij x Bjk: [(3*6) + (4*8)] = 50   (iv)

From (i), (ii), (iii) and (iv) we conclude that ((1, 1), 19)
((1, 2), 22)
((2, 1), 43)
((2, 2), 50)
Therefore the final matrix is:

19  22
43  50

Driver Code:

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
//import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
public class MatDriver {

public static void main(String[] args)throws Exception {


//JobConf conf=new JobConf();
Job job=new Job();
try {
//job = new Job(conf,"Matrix Multiplication");
job.setJarByClass(MatDriver.class);
job.setMapperClass(MatMapper.class);
job.setReducerClass(MatReduce.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(LongWritable.class);
FileInputFormat.setInputPaths(job,new Path(args[0]));
FileOutputFormat.setOutputPath(job,new Path(args[1]));
try {
System.exit(job.waitForCompletion(true)?0:-1);
} catch (ClassNotFoundException | IOException | InterruptedException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}

Mapper Code:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
//import org.apache.hadoop.mapreduce.Mapper.Context;

public class MatMapper extends


Mapper<LongWritable, Text, Text, Text>
{
// lMax = number of columns of B, iMax = number of rows of A (2 for the 2x2 example below)
long lMax=2;
long iMax=2;
@Override
protected void map
(LongWritable key, Text value, Context context)
throws IOException, InterruptedException
{
// input format is ["a", 0, 0, 63]
String[] csv = value.toString().split(",");
String matrix = csv[0].trim();
int row = Integer.parseInt(csv[1].trim());
int col = Integer.parseInt(csv[2].trim());
if(matrix.contains("a"))
{
for (int i=0; i < lMax; i++)
{
String akey = Integer.toString(row) + "," +
Integer.toString(i);
context.write(new Text(akey), value);
}
}
if(matrix.contains("b"))
{
for (int i=0; i < iMax; i++)
{
String akey = Integer.toString(i) + "," +
Integer.toString(col);
context.write(new Text(akey), value);
}
}
}
}

Reducer code:

import java.io.IOException;
//import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MatReduce extends Reducer<Text, Text, Text, LongWritable> {


@Override
protected void reduce(Text key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {
int[] a = new int[5];
int[] b = new int[5];
// b, 2, 0, 30
for (Text value : values) {
System.out.println(value);
String cell[] = value.toString().split(",");
if (cell[0].contains("a")) // take rows here
{
int col = Integer.parseInt(cell[2].trim());
a[col] = Integer.parseInt(cell[3].trim());
}
else if (cell[0].contains("b")) // take col here
{
int row = Integer.parseInt(cell[1].trim());
b[row] = Integer.parseInt(cell[3].trim());
}
}
int total = 0;
for (int i = 0; i < 5; i++) {
int val = a[i] * b[i];
total += val;
}
context.write(key, new LongWritable(total));
}
}

I/P (each line is matrix,row,column,value, the format the mapper expects):
Matrix A
a,0,0,1
a,0,1,2
a,1,0,3
a,1,1,4
Matrix B
b,0,0,1
b,0,1,2
b,1,0,3
b,1,1,4

O/P:
0,0    7
0,1    10
1,0    15
1,1    22

4) Aim: Working with files in Hadoop file system: Reading, Writing and Copying.

Algorithm:

File Read in HDFS:

Step 1: The client opens the file it wishes to read by calling open() on the File System Object(which for HDFS is an
instance of Distributed File System).
Step 2: Distributed File System( DFS) calls the name node, using remote procedure calls (RPCs), to determine the
locations of the first few blocks in the file. For each block, the name node returns the addresses of the data nodes that
have a copy of that block. The DFS returns an FSDataInputStream to the client for it to read data from.
FSDataInputStream in turn wraps a DFSInputStream, which manages the data node and name node I/O.
Step 3: The client then calls read() on the stream. DFSInputStream, which has stored the data node addresses for the
first few blocks in the file, connects to the first (closest) data node for the first block in the file.
Step 4: Data is streamed from the data node back to the client, which calls read() repeatedly on the stream.
Step 5: When the end of the block is reached, DFSInputStream closes the connection to the data node and finds the best
data node for the next block. This happens transparently to the client, which from its point of view is simply reading
a continuous stream. Blocks are read in order, with DFSInputStream opening new connections to data nodes as the
client reads through the stream. It also calls the name node to retrieve the data node locations for the next batch of
blocks as needed.
Step 6: When the client has finished reading the file, it calls close() on the FSDataInputStream.

File Write in HDFS:

Step 1: The client creates the file by calling create() on DistributedFileSystem (DFS).
Step 2: DFS makes an RPC call to the name node to create a new file in the file system’s namespace, with no blocks
associated with it. The name node performs various checks to make sure the file doesn’t already exist and that the
client has the right permissions to create it. If these checks pass, the name node makes a record of the new
file; otherwise, the file can’t be created and the client is thrown an error, i.e. an IOException. The DFS returns
an FSDataOutputStream for the client to start writing data to.
Step 3: As the client writes data, the DFSOutputStream splits it into packets, which it writes to an internal queue
called the data queue. The data queue is consumed by the DataStreamer, which is responsible for asking the name node
to allocate new blocks by picking a list of suitable data nodes to store the replicas. The list of data nodes forms a
pipeline, and here we’ll assume the replication level is three, so there are three nodes in the pipeline. The
DataStreamer streams the packets to the first data node in the pipeline, which stores each packet and forwards it to
the second data node in the pipeline.
Step 4: Similarly, the second data node stores the packet and forwards it to the third (and last) data node in the
pipeline.
Step 5: The DFSOutputStream maintains an internal queue of packets that are waiting to be acknowledged by data
nodes, called the “ack queue”.
Step 6: When the client has finished writing data, it calls close() on the stream. This flushes all the remaining
packets to the data node pipeline and waits for acknowledgments before contacting the name node to signal that the
file is complete.
HDFS follows a write-once, read-many model, so we can’t edit files that are already stored in HDFS, but we can
append to them by reopening the file. This design allows HDFS to scale to a large number of concurrent clients
because the data traffic is spread across all the data nodes in the cluster. Thus, it increases the availability,
scalability, and throughput of the system.

Reading:
 Use the -cat command to display the contents of a file. The syntax is:

 $HADOOP_HOME/bin/hadoop fs -cat <hdfs-file-path>

Writing:

 echo "hello world" > hello_world.txt


 $HADOOP_HOME/bin/hadoop fs -put /home/hello_world.txt /user/hadoop
(here /user/hadoop is an example target HDFS directory)

Copying:
 Create a file using “echo” command

echo "hello world" > sample1.txt


 Now copy this file to another directory “/user/hadoop1”

$HADOOP_HOME/bin/hadoop fs -cp /user/data/sample1.txt /user/hadoop1
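Reading, writing and copying can also be done programmatically through the Hadoop FileSystem Java API (the same API the shell commands use). Below is a minimal sketch; the class name and the HDFS paths are only examples.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class HdfsFileOps {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        // Writing: create a file in HDFS and write a line into it
        Path out = new Path("/data/hello_world.txt");   // example HDFS path
        try (FSDataOutputStream os = fs.create(out, true)) {
            os.writeBytes("hello world\n");
        }

        // Reading: open the file and print its contents
        try (FSDataInputStream is = fs.open(out);
             BufferedReader reader = new BufferedReader(new InputStreamReader(is))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }

        // Copying: copy the file to another HDFS directory
        FileUtil.copy(fs, out, fs, new Path("/user/hadoop1/hello_world.txt"), false, conf);

        fs.close();
    }
}
```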

Pig Installation

STEP:1
Download Pig from https://dlcdn.apache.org/pig/pig-0.16.0/; the file needed is pig-0.16.0.tar.gz

STEP:2
Copy the above file into the Hadoop file system
STEP:3
Now extract the copied file using the following command tar -xzf pig-0.16.0.tar.gz

STEP:4
Edit the .bashrc file to update the environment variables of Apache Pig. Add the following variables at the end of
the .bashrc as shown in the image below

# Set PIG_HOME

export PIG_HOME=/home/student/Installations/hadoop-1.2.1/pig-0.16.0
export PATH=$PATH:/home/student/Installations/hadoop-1.2.1/pig-0.16.0/bin
export PIG_CLASSPATH=$HADOOP_CONF_DIR

STEP:5
To reflect the changes, run the following command:  source /home/student/.bashrc

STEP:6
To check that everything is working, check the Pig version: pig -version
STEP:7
To open the Grunt shell, run the command: pig -x local

TASK-5: Write Pig Latin scripts sort, group, join, project, and filter your data.

Pig scripts are files with the .pig extension that contain Pig commands.
Instead of running each command individually, we write a script and execute it in batch mode.
Here is the dataset that we are going to work on; it is loaded as data in the Pig script.

I have this dataset in /home/student/Installations/hadoop-1.2.1
1. Sort: To sort we use ORDER BY, which is similar to SQL. The syntax is:
alias = ORDER alias BY { * [ASC|DESC] | field_alias [ASC|DESC] [, field_alias [ASC|DESC] …] };
2.Group: Groups the data in one or more relations.
Syntax: alias = GROUP alias { ALL | BY expression} [, alias ALL | BY expression …] [USING 'collected' | 'merge']
[PARTITION BY partitioner]
3. Join: Use the JOIN operator to perform an inner equijoin of two or more relations based on common field
values.

4.Filter: FILTER is commonly used to select the data that you want; or, conversely, to filter out (remove) the data
you don’t want.

5. Project: Projection means selecting only the columns that we want.

The marks dataset is loaded as data1 in the Pig script.

Here is the pig script
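The script itself appears as a screenshot in the original record; the following is a minimal sketch of such a script, assuming a hobbies dataset (regno, name, hobbies) loaded as data and a marks dataset (regno, marks) loaded as data1, both comma-separated (the file names are illustrative):

```pig
-- Load the two example datasets (paths and schemas are assumptions)
data  = LOAD 'hobbies.txt' USING PigStorage(',') AS (regno:int, name:chararray, hobbies:chararray);
data1 = LOAD 'marks.txt' USING PigStorage(',') AS (regno:int, marks:int);

-- 1. Sort by register number
sorted = ORDER data BY regno ASC;
DUMP sorted;

-- 2. Group by hobbies
grouped = GROUP data BY hobbies;
DUMP grouped;

-- 3. Filter: keep only the students whose hobby is kabaddi
kabaddi = FILTER data BY hobbies == 'kabaddi';
DUMP kabaddi;

-- 4. Project: keep only the name and hobbies columns
projected = FOREACH data GENERATE name, hobbies;
DUMP projected;

-- 5. Join: inner join and left outer join of the two datasets on regno
inner_j = JOIN data BY regno, data1 BY regno;
DUMP inner_j;
left_j = JOIN data BY regno LEFT OUTER, data1 BY regno;
DUMP left_j;
```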

We run this Pig script in local mode, so open a terminal in the directory containing the script
and type the command: pig -x local filename.pig

Output snippets:
1. SORT:
Output of ORDER: the records appear in random order in the dataset, so we order them by regno.

2.Group:
Group according to their hobbies

3. Filter:
Filtering the students whose hobbies == 'kabaddi'

4. Project:
Projecting only the name and hobbies columns

5.Joins:
Innerjoin:
Displaying the students who have an entry in both the hobbies and marks tables by joining them

Leftjoin:

Every record of the left table is kept when joining with the right table; if no matching entry is found in the right
table, the corresponding fields are empty (null).

TASK 6: Run the Pig Latin Scripts to find Word Count and max. temp for each and every year.

DATASET

Pig script
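The script is shown only as a screenshot in the original record; a minimal sketch of the two parts, assuming a plain-text file lines.txt for the word count and a comma-separated file temps.txt with (year, temperature) records, is:

```pig
-- Word count
lines = LOAD 'lines.txt' AS (line:chararray);
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts = FOREACH grouped GENERATE group AS word, COUNT(words) AS cnt;
DUMP counts;

-- Maximum temperature for each year
temps = LOAD 'temps.txt' USING PigStorage(',') AS (year:int, temp:int);
by_year = GROUP temps BY year;
max_temp = FOREACH by_year GENERATE group AS year, MAX(temps.temp) AS maxt;
DUMP max_temp;
```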

Word count output, maximum temperature output, and MAX example (screenshots in the original record)

Task -7: Writing User Defined Functions/Eval functions for filtering unwanted data in Pig.

To write a UDF using Java, we have to integrate the jar file Pig-0.15.0.jar. In this section, we discuss how to write a
sample UDF using Eclipse. Before proceeding further, make sure you have installed Eclipse and Maven in your
system.
 Create a new class file with name Sample_Eval and copy the following content in it.

UDF IN JAVA:
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class Sample_Eval extends EvalFunc<String>{

public String exec(Tuple input) throws IOException {


if (input == null || input.size() == 0)
return null;
String str = (String)input.get(0);
return str.toUpperCase();
}
}
After writing the UDF and generating the Jar file, follow the steps given below −
Step 1: Registering the Jar file
After writing UDF (in Java) we have to register the Jar file that contain the UDF using the Register operator. By
registering the Jar file, users can intimate the location of the UDF to Apache Pig.
Syntax
Given below is the syntax of the Register operator.
 REGISTER path;

Start Apache Pig in local mode and register the jar file sample_udf.jar as shown below.
$cd PIG_HOME/bin
$./pig –x local
 REGISTER '/$PIG_HOME/sample_udf.jar'
Note − assume the Jar file in the path − /$PIG_HOME/sample_udf.jar
Step 2: Defining Alias
After registering the UDF we can define an alias to it using the Define operator.
 DEFINE sample_eval Sample_Eval();
Step 3: Using the UDF
After defining the alias you can use the UDF same as the built-in functions. Suppose there is a file named emp_data
in the HDFS /Pig_Data/ directory with the following content.

And assume we have loaded this file into Pig as shown below.
grunt>emp_data = LOAD 'hdfs://localhost:9000/pig_data/emp1.txt' USING PigStorage(',')
as (id:int, name:chararray, age:int, city:chararray);
Let us now convert the names of the employees in to upper case using the UDF sample_eval.
grunt> Upper_case = FOREACH emp_data GENERATE sample_eval(name);
Verify the contents of the relation Upper_case as shown below.
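The verification screenshot is not reproduced; in the Grunt shell the check is simply:

```pig
grunt> DUMP Upper_case;
```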

TASK-8: Working with Hive QL, Use Hive to create, alter, and drop databases, tables, views, functions, and
indexes.
Prg:
Creating, loading data into a table:
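The screenshot of this step is not reproduced here; a representative HiveQL sketch (the table name, columns and file path are assumptions) is:

```sql
CREATE DATABASE IF NOT EXISTS my_db;
USE my_db;

-- Create a simple comma-delimited table
CREATE TABLE my_table (
  id   INT,
  name STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Load data from a local file into the table
LOAD DATA LOCAL INPATH '/home/student/my_data.csv' INTO TABLE my_table;
```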

Altering a table:
 ALTER TABLE my_table ADD COLUMNS (age INT);
o/p: OK
Drop a table:
 DROP TABLE my_table;
o/p: OK

Creating a view:
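A representative example, using the columns of the table sketched above:

```sql
CREATE VIEW my_view AS
SELECT id, name FROM my_table;
```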

Drop a view:
 DROP VIEW my_view;
o/p: OK
Alter a view:
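For example, redefining the view to include the age column added earlier:

```sql
ALTER VIEW my_view AS
SELECT id, name, age FROM my_table;
```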

Create a function:
 CREATE FUNCTION my_function AS 'org.example.MyFunction' USING JAR 'my_functions.jar';
o/p: OK
Drop a function:
 DROP FUNCTION my_function;
o/p: OK

Create an index:
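A representative example; note that indexes of this kind exist only in Hive 2.x and earlier (they were removed in Hive 3.0):

```sql
CREATE INDEX my_index
ON TABLE my_table (id)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD;
```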

Drop index:
 DROP INDEX my_index ON my_table;
o/p: OK

Task 9: Writing User Defined Functions in Hive.

Prg:

1. Create a Java class for the User Defined Function which extends
org.apache.hadoop.hive.ql.exec.UDF and implements one or more evaluate() methods. Put in
your desired logic and you are almost there.
2. Package your Java class into a JAR file (I am using Maven)
3. Go to the Hive CLI, add your JAR, and verify that the JAR is in the Hive CLI class path.
4. CREATE TEMPORARY FUNCTION in Hive which points to your Java class
5. Use it in Hive SQL.

package org.hardik.letsdobigdata;

import org.apache.commons.lang.StringUtils;

import org.apache.hadoop.hive.ql.exec.UDF;

import org.apache.hadoop.io.Text;

public class Strip extends UDF {

    private Text result = new Text();

    // Trim the characters given in stripChars from both ends of str
    public Text evaluate(Text str, String stripChars) {
        if (str == null)
            return null;
        result.set(StringUtils.strip(str.toString(), stripChars));
        return result;
    }

    // Trim leading and trailing spaces
    public Text evaluate(Text str) {
        if (str == null)
            return null;
        result.set(StringUtils.strip(str.toString()));
        return result;
    }
}

 evaluate(Text str, String stripChars) - will trim specified characters in stripChars from
first argument str.
 evaluate(Text str) - will trim leading and trailing spaces.

Package Your Java Class into a JAR


 Please make sure you have Maven installed.
$ cd HiveUDFs

 run "mvn clean package". This will create a JAR file which contains our UDF class. Copy the
JAR's path.

Next, add the Hive JAR files to the project class path.
 Edit the .bashrc file to update the environment variables for Hive. Add the following
variables at the end of the .bashrc as shown below.

export HIVE_HOME=/usr/local/hive
export HADOOP_HOME=/usr/local/hadoop
export CLASSPATH=$CLASSPATH:$HIVE_HOME/lib/*:$HADOOP_HOME/*:$HADOOP_HOME/lib/*

 Alternatively, package the UDF class into a JAR file directly with the jar command

jar cf Strip.jar org/hardik/letsdobigdata/Strip.class


 Register the UDF in Hive:
$ hive
ADD JAR ~/my_udf/Strip.jar;

 The first query strips ‘ha’ from the string ‘hadoop’ as expected (the two-argument evaluate() in
the code). The second query strips leading and trailing spaces as expected. A sketch of this session is shown below.
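Assuming the JAR was built from the Strip class above, the session would look roughly like this (the exact output depends on your data):

```sql
ADD JAR /home/student/my_udf/Strip.jar;
CREATE TEMPORARY FUNCTION STRIP AS 'org.hardik.letsdobigdata.Strip';

-- 1. Strip the characters 'h' and 'a' from both ends of 'hadoop'
SELECT STRIP('hadoop', 'ha');   -- doop

-- 2. Strip leading and trailing spaces
SELECT STRIP('   hive   ');     -- hive
```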

TASK 10: Understanding the processing of large dataset on Spark framework.
Prg:

 importing the pyspark.sql module and creating a local SparkSession:

from pyspark.sql import SparkSession


sc = SparkSession.builder.master("local").appName("Test").getOrCreate()

 Read the data of the CSV file into a DataFrame using sc.read.


raw_data = sc.read.options(delimiter="\t",header=True).csv("en.openfoodfacts.org.products.csv")

 The result is stored in a pyspark.sql DataFrame. Now let us look at the data schema and the
number of records in it, as shown below.
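The corresponding calls are shown below (the screenshot of their output is not reproduced):

```python
# Print the schema inferred from the CSV header and count the records
raw_data.printSchema()
print(raw_data.count())
```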

 Let us compute the number of products per country to get an idea about the database composition :
from pyspark.sql.functions import col
BDD_countries = raw_data.groupBy("countries_tags").count().persist()

 BDD_countries is also a pyspark data frame and has the following structure :

 We can filter this new data frame to keep only the countries that have at least 5000 products recorded
in the database and plot the result, as sketched below:
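A minimal sketch of that filtering-and-plotting step, assuming pandas and matplotlib are available for the plot:

```python
import matplotlib.pyplot as plt

# Keep only the countries with at least 5000 recorded products
big_countries = BDD_countries.filter(col("count") >= 5000)

# The filtered aggregate is small, so convert it to pandas and plot it
pdf = big_countries.toPandas()
pdf.plot.bar(x="countries_tags", y="count", figsize=(12, 6))
plt.show()
```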

TASK 11: Ingesting structured and unstructured data using Sqoop, Flume
Prg:
Ingesting structured data using Sqoop
Sqoop is a tool that enables you to import data from structured data stores like relational databases into Hadoop.
Here are the basic steps to ingest data using Sqoop:
a. Install Sqoop: Install Sqoop on the machine that will run the Sqoop command.
b. Connect to the source database: Use the `sqoop import` command to connect to the source database, specify the
table to import, and provide the necessary credentials.
c. Specify the target Hadoop system: Specify the target Hadoop system to where you want to import the data
d. Define the target Hadoop destination: Define the Hadoop destination for the imported data, such as a Hive table.
Here is an example command to import data from a MySQL database into a Hive table:
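The command itself appears as a screenshot in the original record; a typical sqoop import into Hive looks like this (host, database, table and credentials are placeholders):

```bash
sqoop import \
  --connect jdbc:mysql://localhost/employees_db \
  --username root -P \
  --table employees \
  --hive-import \
  --create-hive-table \
  --hive-table default.employees \
  -m 1
```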

After executing this command, map tasks are launched at the back end to perform the import.

After the job completes, you can check the HDFS Web UI (localhost:50070) to see the imported data.

Ingesting unstructured data using Flume

Flume is a tool that enables you to ingest unstructured data such as log files and event streams into Hadoop. Here
are the basic steps to ingest data using Flume:

a. Install Flume: Install Flume on the machine that will run the Flume command.
b. Configure Flume: Configure Flume to read data from the source and write it to the target Hadoop system. This
involves specifying the source type, the source location, and the target Hadoop destination.
c. Start the Flume agent: Start the Flume agent to begin the data ingestion process.
Here is an example configuration file for Flume that reads data from a log file and writes it to a Hadoop
destination:

agent.sources = logsource
agent.channels = memoryChannel
agent.sinks = hdfssink

agent.sources.logsource.type = exec
agent.sources.logsource.command = tail -F /var/log/messages
agent.sources.logsource.channels = memoryChannel

agent.sinks.hdfssink.type = hdfs
agent.sinks.hdfssink.hdfs.path = hdfs://hadoop.example.com/logs/
agent.sinks.hdfssink.channel = memoryChannel

agent.channels.memoryChannel.type = memory
agent.channels.memoryChannel.capacity = 1000

agent.sources.logsource.interceptors = timestampInterceptor
agent.sources.logsource.interceptors.timestampInterceptor.type = timestamp

This configuration reads data from the `/var/log/messages` file using the `tail -F` command and writes it to the
`/logs` directory in HDFS. It also adds a timestamp to each event using the `timestampInterceptor` interceptor.

