BDA Lab Record
INDEX
S.No | Date | Name of the Experiment | Page No | Marks | Remarks
Big Data & Hadoop lab Experiments
KKR & KSR INSTITUTE OF TECHNOLOGY AND SCIENCES
(Autonomous)
Accredited by NBA & NAAC with Grade “A” and Affiliated to JNTUK-Kakinada
Vinjanampadu, Vatticherukuru Mandal, Guntur, Andhra Pradesh - 522017
1a) Aim: To practice basic HDFS commands.
Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It
provides high-throughput access to application data and is suitable for applications with large data sets. Below are
some basic HDFS commands along with examples:
1. **Listing Files/Directories**:
```bash
hdfs dfs -ls <path>
```
Example: List the files and directories under the HDFS root directory.
```bash
hdfs dfs -ls /
```
2. **Creating a Directory**:
```bash
hdfs dfs -mkdir <directory>
```
Example: Create a directory named "data" at the root of HDFS.
```bash
hdfs dfs -mkdir /data
```
3. **Copying a File from the Local File System to HDFS**:
```bash
hdfs dfs -put <source> <destination>
```
Example: Copy a file named "file.txt" from the local file system to the "data" directory in HDFS.
```bash
hdfs dfs -put file.txt /data
```
4. **Copying a File from HDFS to the Local File System**:
```bash
hdfs dfs -get <source> <destination>
```
Example: Copy a file named "file.txt" from the "data" directory in HDFS to the local file system.
```bash
hdfs dfs -get /data/file.txt /local/path
```
5. **Removing Files/Directories**:
```bash
hdfs dfs -rm <path>
```
Example: Remove a file named "file.txt" from the "data" directory in HDFS.
```bash
hdfs dfs -rm /data/file.txt
```
6. **Removing Directories Recursively**:
```bash
hdfs dfs -rm -r <directory>
```
Example: Remove the "data" directory and all its contents recursively from HDFS.
```bash
hdfs dfs -rm -r /data
```
7. **Moving Files/Directories within HDFS**:
```bash
hdfs dfs -mv <source> <destination>
```
Example: Move a file named "file.txt" from the "data" directory to another directory within HDFS.
```bash
hdfs dfs -mv /data/file.txt /newlocation/
```
1b)
Aim: Run a basic word count MapReduce program to understand the MapReduce paradigm.
Map Function – Takes a set of data and converts it into another set of data, where individual elements are
broken down into tuples (Key, Value pairs).
Example – (Map function in Word Count)
Input
Set of data
Bus, Car, bus, car, train, car, bus, car, train, bus, TRAIN, BUS, buS, caR, CAR, car, BUS, TRAIN
Output
Converted into another set of data (Key, Value)
(Bus,1), (Car,1), (bus,1), (car,1), (train,1), (car,1), (bus,1), (car,1), (train,1), (bus,1),
(TRAIN,1),(BUS,1), (buS,1), (caR,1), (CAR,1), (car,1), (BUS,1), (TRAIN,1)
Reduce Function – Takes the output from Map as an input and combines those data tuples into a smaller set
of tuples.
Example – (Reduce function in Word Count)
Input Set of Tuples (output of Map function)
(Bus,1), (Car,1), (bus,1), (car,1), (train,1), (car,1), (bus,1), (car,1), (train,1), (bus,1), (TRAIN,1),(BUS,1),
(buS,1),(caR,1),(CAR,1), (car,1), (BUS,1), (TRAIN,1)
Output Converts into smaller set of tuples
(BUS,7), (CAR,7), (TRAIN,4)
Mapper code:
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
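The class body of the Mapper is not reproduced in the record; a minimal sketch of the WordMapper class (the name registered in the driver below; splitting on whitespace is an assumption) is:

```java
public class WordMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    // Emit (word, 1) for every word in the input line.
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        String line = value.toString();
        for (String word : line.split("\\s+")) {
            if (word.length() > 0) {
                output.collect(new Text(word), new IntWritable(1));
            }
        }
    }
}
```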
Reducer code:
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
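Likewise, only the imports of the Reducer appear in the record; a minimal sketch of the WordReducer class (the name registered in the driver below) is:

```java
public class WordReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    // Sum the counts emitted by the mapper for each word.
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}
```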
Driver Code:
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class WordCount extends Configured implements Tool {

    public int run(String[] args) throws Exception {
        if (args.length < 2) {
            System.out.println("Please provide valid input and output paths");
            return -1;
        }
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("WordCount");
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        conf.setMapperClass(WordMapper.class);
        conf.setReducerClass(WordReducer.class);
        conf.setMapOutputKeyClass(Text.class);
        conf.setMapOutputValueClass(IntWritable.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        JobClient.runJob(conf);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new WordCount(), args);
        System.exit(exitCode);
    }
}
I/P:
HDFS is a storage unit of Hadoop
MapReduce is a processing tool of Hadoop
O/P:
HDFS 1
Hadoop 2
MapReduce 1
a 2
is 2
of 2
processing 1
storage 1
tool 1
unit 1
2)
Aim: Write a Map Reduce program that mines weather data.
Here, we will write a Map-Reduce program for analyzing weather datasets to understand its data
processing programming model. Weather sensors are collecting weather information across the globe in a
large volume of log data. This weather data is semi-structured and record-oriented.
This data is stored in a line-oriented ASCII format, where each row represents a single record. Each row has
many fields, such as longitude, latitude, daily max/min temperature, daily average temperature, etc. For simplicity,
we will focus on the main element, i.e. temperature. We will use data from the National Centers for
Environmental Information (NCEI), which has a massive amount of historical weather data that we can use for
our data analysis.
Program:
Driver code:
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class HighestDriver extends Configured implements Tool {   // driver class name assumed; not shown in the record
    public int run(String[] args) throws Exception {
        JobConf conf = new JobConf(getConf(), HighestDriver.class);
        conf.setJobName("HighestDriver");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(HighestMapper.class);
        conf.setReducerClass(HighestReducer.class);
        Path inp = new Path(args[0]);
        Path out = new Path(args[1]);
        FileInputFormat.addInputPath(conf, inp);
        FileOutputFormat.setOutputPath(conf, out);
        JobClient.runJob(conf);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new HighestDriver(), args));
    }
}
Mapper code
import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
public class HighestMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text,
IntWritable>
{
public static final int MISSING = 9999;
public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter
reporter) throws IOException
{
String line = value.toString();
String year = line.substring(15,19);
int temperature;
if (line.charAt(87)=='+')
temperature = Integer.parseInt(line.substring(88, 92));
else
temperature = Integer.parseInt(line.substring(87, 92));
String quality = line.substring(92, 93);
if (temperature != MISSING && quality.matches("[01459]"))
output.collect(new Text(year), new IntWritable(temperature));
}
}
Reducer code:
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
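The Reducer body is not reproduced in the record; a minimal sketch (the class name HighestReducer is an assumption) that emits the highest temperature observed for each year is:

```java
public class HighestReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    // For each year, keep the maximum temperature seen across all records.
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int maxValue = Integer.MIN_VALUE;
        while (values.hasNext()) {
            maxValue = Math.max(maxValue, values.next().get());
        }
        output.collect(key, new IntWritable(maxValue));
    }
}
```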
I/P:
0057
332130 # USAF weather station identifier
99999 # WBAN weather station identifier
19500101 # observation date
0300 # observation time
4
+51317 # latitude (degrees x 1000)
+028783 # longitude (degrees x 1000)
FM-12
+0171 # elevation (meters)
99999
V020
320 # wind direction (degrees)
1 # quality code
N
0072
1
00450 # sky ceiling height (meters)
1 # quality code
CN
010000 # visibility distance (meters)
1 # quality code
N9
-0128 # air temperature (degrees Celsius x 10)
1 # quality code
-0139 # dew point temperature (degrees Celsius x 10)
1 # quality code
10268 # atmospheric pressure (hectopascals x 10)
1 # quality code
O/P:
1901 317
1902 244
1903 289
1904 256
1905 283
3)
Aim: Implement matrix multiplication with Hadoop Map Reduce.
Algorithm:
MapReduce is a technique in which a huge program is subdivided into small tasks that run in parallel to make
computation faster and save time; it is mostly used in distributed systems. It has 2 important parts:
● Mapper: It takes raw data as input and organizes it into (key, value) pairs. For example, in a dictionary you search
for the word “Data” and its associated meaning is “facts and statistics collected together for reference or
analysis”. Here the key is “Data” and the value associated with it is “facts and statistics collected together for
reference or analysis”.
● Reducer: It is responsible for processing the data in parallel and producing the final output.
Let us consider the matrix multiplication example to visualize MapReduce. Consider the matrices
A = [[1, 2], [3, 4]] and B = [[5, 6], [7, 8]] (the values used in the worked example below).
Here matrix A is a 2×2 matrix, which means the number of rows (i) = 2 and the number of
columns (j) = 2. Matrix B is also a 2×2 matrix, where the number of rows (j) = 2 and the number of
columns (k) = 2. Each cell of the matrices is labelled as Aij and Bjk; for example, element 3 in matrix A is called
A21, i.e. 2nd row, 1st column. One-step matrix multiplication has 1 mapper and 1 reducer.
The formula is:
Mapper for Matrix A: (k, v) = ((i, k), (A, j, Aij)) for all k
Mapper for Matrix B: (k, v) = ((i, k), (B, j, Bjk)) for all i
Therefore, computing the mapper for Matrix A: k, i and j each run over the matrix dimensions. Here all are 2,
so when k = 1, i can take the 2 values 1 and 2, and each case can have 2 further values of j = 1 and j = 2.
Substituting all values in the formula gives:
for k = 1: ((1, 1), (A, 1, 1)), ((1, 1), (A, 2, 2)), ((2, 1), (A, 1, 3)), ((2, 1), (A, 2, 4))
and similarly for k = 2: ((1, 2), (A, 1, 1)), ((1, 2), (A, 2, 2)), ((2, 2), (A, 1, 3)), ((2, 2), (A, 2, 4)).
The mapper for Matrix B likewise emits, for every i, ((i, k), (B, j, Bjk)); for example
((1, 1), (B, 1, 5)), ((1, 1), (B, 2, 7)), ((1, 2), (B, 1, 6)), ((1, 2), (B, 2, 8)), and the same values keyed by (2, k).
The reducer then, for each output cell (i, k), multiplies the matching A and B values (matching on j) and sums them:
(i) (1, 1): 1×5 + 2×7 = 19
(ii) (1, 2): 1×6 + 2×8 = 22
(iii) (2, 1): 3×5 + 4×7 = 43
(iv) (2, 2): 3×6 + 4×8 = 50
From (i), (ii), (iii) and (iv) we conclude that ((1, 1), 19)
((1, 2), 22)
((2, 1), 43)
((2, 2), 50)
Therefore the final matrix is A × B = [[19, 22], [43, 50]].
Driver Code:
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
//import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
public class MatDriver {
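    // The body of MatDriver is not shown in the record; this is a minimal sketch.
    // The mapper/reducer class names (MatMapper, MatReducer) are assumptions.
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        job.setJarByClass(MatDriver.class);
        job.setJobName("MatrixMultiplication");
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input files with the cells of A and B
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory
        job.setMapperClass(MatMapper.class);
        job.setReducerClass(MatReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}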
Mapper Code:
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
//import org.apache.hadoop.mapreduce.Mapper.Context;
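The mapper body is not reproduced in the record; a minimal sketch that produces the tagged values consumed by the reducer below is shown here. The class name MatMapper, the "row,col,value" input format, identifying the matrix by its input file name (e.g. a.txt and b.txt), and the 2×2 dimension are assumptions.

```java
public class MatMapper extends Mapper<LongWritable, Text, Text, Text> {

    private static final int N = 2;   // dimension of the square matrices in this example

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Which matrix does this record belong to? Taken from the input file name.
        String fileName = ((org.apache.hadoop.mapreduce.lib.input.FileSplit)
                context.getInputSplit()).getPath().getName();
        String[] cell = value.toString().split(",");
        int row = Integer.parseInt(cell[0].trim());
        int col = Integer.parseInt(cell[1].trim());
        String val = cell[2].trim();

        if (fileName.startsWith("a")) {
            // A[row][col] contributes to every result cell (row, k): emit "a,row,col,value"
            for (int k = 0; k < N; k++) {
                context.write(new Text(row + "," + k), new Text("a," + row + "," + col + "," + val));
            }
        } else {
            // B[row][col] contributes to every result cell (i, col): emit "b,row,col,value"
            for (int i = 0; i < N; i++) {
                context.write(new Text(i + "," + col), new Text("b," + row + "," + col + "," + val));
            }
        }
    }
}
```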
Reducer code:
import java.io.IOException;
//import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class MatReducer extends Reducer<Text, Text, Text, LongWritable> { // class name assumed; the record omits the declaration
public void reduce(Text key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {
int[] a = new int[5];
int[] b = new int[5];
// b, 2, 0, 30
for (Text value : values) {
System.out.println(value);
String cell[] = value.toString().split(",");
if (cell[0].contains("a")) // take rows here
{
int col = Integer.parseInt(cell[2].trim());
a[col] = Integer.parseInt(cell[3].trim());
}
else if (cell[0].contains("b")) // take col here
{
int row = Integer.parseInt(cell[1].trim());
b[row] = Integer.parseInt(cell[3].trim());
}
}
int total = 0;
for (int i = 0; i < 5; i++) {
int val = a[i] * b[i];
total += val;
}
context.write(key, new LongWritable(total));
}
}
I/P:
Matrix - A
0,0,1
0,1,2
1,0,3
1,1,4
Matrix – B
0,0,1
0,1,2
1,0,3
1,1,4
O/P:
0,0,7
0,1,10
1,0,15
1,1,22
4) Aim: Working with files in Hadoop file system: Reading, Writing and Copying.
Algorithm:
Step 1: The client opens the file it wishes to read by calling open() on the FileSystem object (which for HDFS is an
instance of DistributedFileSystem).
Step 2: DistributedFileSystem (DFS) calls the name node, using remote procedure calls (RPCs), to determine the
locations of the first few blocks in the file. For each block, the name node returns the addresses of the data nodes that
have a copy of that block. The DFS returns an FSDataInputStream to the client for it to read data from.
FSDataInputStream in turn wraps a DFSInputStream, which manages the data node and name node I/O.
Step 3: The client then calls read() on the stream. DFSInputStream, which has stored the data node addresses for the
first few blocks in the file, connects to the first (closest) data node for the first block in the file.
Step 4: Data is streamed from the data node back to the client, which calls read() repeatedly on the stream.
Step 5: When the end of a block is reached, DFSInputStream closes the connection to that data node and then finds
the best data node for the next block. This happens transparently to the client, which from its point of view is simply
reading a continuous stream. Blocks are read in order, with DFSInputStream opening new connections to data nodes
as the client reads through the stream. It will also call the name node to retrieve the data node locations for the next
batch of blocks as needed.
Step 6: When the client has finished reading the file, it calls close() on the FSDataInputStream.
Writing a file to HDFS follows a similar sequence of steps:
Step 1: The client creates the file by calling create() on DistributedFileSystem (DFS).
Step 2: DFS makes an RPC call to the name node to create a new file in the file system’s namespace, with no blocks
associated with it. The name node performs various checks to make sure the file doesn’t already exist and that the
client has the right permissions to create the file. If these checks pass, the name node makes a record of the new
file; otherwise, the file can’t be created and the client is thrown an error, i.e. an IOException. The DFS returns
an FSDataOutputStream for the client to start writing data to.
Step 3: As the client writes data, the DFSOutputStream splits it into packets, which it writes to an internal queue
called the data queue. The data queue is consumed by the DataStreamer, which is responsible for asking the name
node to allocate new blocks by picking a list of suitable data nodes to store the replicas. The list of data nodes forms
a pipeline, and here we’ll assume the replication level is three, so there are three nodes in the pipeline. The
DataStreamer streams the packets to the first data node in the pipeline, which stores each packet and forwards it to
the second data node in the pipeline.
Step 4: Similarly, the second data node stores the packet and forwards it to the third (and last) data node in the
pipeline.
Step 5: The DFSOutputStream maintains an internal queue of packets that are waiting to be acknowledged by data
nodes, called the “ack queue”.
Step 6: When the client has finished writing data, it calls close() on the stream. This flushes all the remaining packets
to the data node pipeline and waits for acknowledgments before contacting the name node to signal that the file is
complete.
HDFS follows the Write Once Read Many model. So, we can’t edit files that are already stored in HDFS, but we can
append to them by reopening the file. This design allows HDFS to scale to a large number of concurrent clients
because the data traffic is spread across all the data nodes in the cluster. Thus, it increases the availability, scalability,
and throughput of the system.
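The read and write paths described above can also be exercised programmatically through Hadoop's FileSystem API. The following is a minimal sketch (the class name, file paths, and file contents are illustrative; the Configuration is assumed to pick up the cluster settings from core-site.xml/hdfs-site.xml):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class HdfsFileDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // loads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        // Writing: create() returns an FSDataOutputStream (Step 2 of the write path)
        Path file = new Path("/data/sample.txt");
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeBytes("Hello HDFS\n");
        }

        // Reading: open() returns an FSDataInputStream (Step 2 of the read path)
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(fs.open(file)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }

        // Copying within HDFS
        FileUtil.copy(fs, file, fs, new Path("/data/sample_copy.txt"), false, conf);

        fs.close();
    }
}
```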
Reading:
Use the -cat command to display the contents of a file in HDFS; the syntax and an example are shown below.
Writing:
Create a small file locally using the echo command and write it into HDFS with the -put command (or append to an
existing file with -appendToFile).
Copying:
Use the -cp command to copy a file from one HDFS location to another.
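A minimal set of shell commands for these three operations (the file and directory names are illustrative) might be:

```bash
# Create a small file locally and write it into HDFS
echo "Hello Hadoop" > sample.txt
hdfs dfs -put sample.txt /data/

# Read the file back
hdfs dfs -cat /data/sample.txt

# Copy the file to another HDFS location
hdfs dfs -cp /data/sample.txt /data/backup/sample.txt
```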
Pig Installation
STEP:1
Download Pig from https://dlcdn.apache.org/pig/pig-0.16.0/ and download the file pig-0.16.0.tar.gz
STEP:2
Copy the downloaded file into the Hadoop installation directory (here, /home/student/Installations/hadoop-1.2.1)
STEP:3
Now extract the copied file using the following command tar -xzf pig-0.16.0.tar.gz
STEP:4
Edit the .bashrc file to update the environment variables for Apache Pig. Add the following variables at the end of
the .bashrc as shown below
# Set PIG_HOME
export PIG_HOME=/home/student/Installations/hadoop-1.2.1/pig-0.16.0
export PATH=$PATH:/home/student/Installations/hadoop-1.2.1/pig-0.16.0/bin
export PIG_CLASSPATH=$HADOOP_CONF_DIR
STEP:5
To apply the changes, run the following command: source /home/student/.bashrc
STEP:6
To verify that everything is working, check the Pig version: pig -version
STEP:7
To open the Grunt shell, run the command: pig -x local
TASK-5: Write Pig Latin scripts to sort, group, join, project, and filter your data.
Pig scripts are nothing but files with the extension .pig which contain Pig commands.
Instead of running each command individually, we write a script and execute it in batch mode.
Here is the dataset that we are going to work on.
Filter: FILTER is commonly used to select the data that you want; or, conversely, to filter out (remove) the data
you don’t want.
We are running this Pig script locally, so open a terminal in the directory containing the script
and type the command pig -x local filename.pig. A sketch of such a script follows.
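The dataset and the script itself appear only as screenshots in the original record; a minimal sketch of such a script, assuming relations students(regno, name, hobbies) and marks(regno, marks) loaded from illustrative file names, is:

```pig
-- File names and schemas are assumptions; the original dataset is not reproduced here.
students = LOAD 'students.txt' USING PigStorage(',')
           AS (regno:int, name:chararray, hobbies:chararray);
marks    = LOAD 'marks.txt' USING PigStorage(',')
           AS (regno:int, marks:int);

-- 1. Sort: order the students by regno
sorted = ORDER students BY regno;
DUMP sorted;

-- 2. Group: group the students by their hobbies
grouped = GROUP students BY hobbies;
DUMP grouped;

-- 3. Filter: keep only the students whose hobby is kabaddi
kabaddi = FILTER students BY hobbies == 'kabaddi';
DUMP kabaddi;

-- 4. Project: keep only the name and hobbies columns
projected = FOREACH students GENERATE name, hobbies;
DUMP projected;

-- 5. Joins: inner join and left outer join on regno
inner_joined = JOIN students BY regno, marks BY regno;
DUMP inner_joined;
left_joined  = JOIN students BY regno LEFT OUTER, marks BY regno;
DUMP left_joined;
```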
Output snippets:
1. Sort:
Output for ORDER: the students appear in random order in the dataset; here we order them by regno.
2. Group:
Grouping the students according to their hobbies.
3. Filter:
Filtering the students whose hobbies == 'kabaddi'.
4. Project:
Projecting only the name and hobbies columns.
5. Joins:
Inner join:
Displaying the students who have an entry in both the hobbies and marks tables by joining them.
Left join:
The left table is joined with the right table; if no matching entry is found in the right table, the corresponding
fields are left empty (null).
TASK 6: Run Pig Latin scripts to find the word count and the maximum temperature for each year.
DATASET
Pig script
Max example
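The dataset and the scripts appear only as screenshots in the record; a minimal sketch of the two scripts (input file names and schemas are assumptions) is:

```pig
-- Word count
lines  = LOAD 'input.txt' AS (line:chararray);
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group, COUNT(words);
DUMP counts;

-- Maximum temperature per year
weather  = LOAD 'temperature.txt' USING PigStorage(',') AS (year:int, temp:int);
by_year  = GROUP weather BY year;
max_temp = FOREACH by_year GENERATE group AS year, MAX(weather.temp) AS maxtemp;
DUMP max_temp;
```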
Task -7: Writing User Defined Functions/Eval functions for filtering unwanted data in Pig.
To write a UDF using Java, we have to integrate the jar file Pig-0.15.0.jar. In this section, we discuss how to write a
sample UDF using Eclipse. Before proceeding further, make sure you have installed Eclipse and Maven in your
system.
Create a new class file named Sample_Eval and copy the following content into it.
UDF IN JAVA:
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
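Only the imports appear in the record; a minimal sketch of the Sample_Eval class, which upper-cases its first argument to match the usage shown later, is:

```java
public class Sample_Eval extends EvalFunc<String> {

    // Return the first argument converted to upper case; null/empty input yields null.
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        String str = (String) input.get(0);
        return str.toUpperCase();
    }
}
```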
Step 1: Registering the JAR file
Start Apache Pig in local mode and register the jar file sample_udf.jar as shown below.
$ cd $PIG_HOME/bin
$ ./pig -x local
REGISTER '/$PIG_HOME/sample_udf.jar'
Note: we assume the JAR file is at the path /$PIG_HOME/sample_udf.jar.
Step 2: Defining Alias
After registering the UDF we can define an alias to it using the Define operator.
DEFINE sample_eval Sample_Eval();
Step 3: Using the UDF
After defining the alias you can use the UDF same as the built-in functions. Suppose there is a file named emp_data
in the HDFS /Pig_Data/ directory with the following content.
And assume we have loaded this file into Pig as shown below.
grunt>emp_data = LOAD 'hdfs://localhost:9000/pig_data/emp1.txt' USING PigStorage(',')
as (id:int, name:chararray, age:int, city:chararray);
Let us now convert the names of the employees into upper case using the UDF sample_eval.
grunt> Upper_case = FOREACH emp_data GENERATE sample_eval(name);
Verify the contents of the relation Upper_case as shown below.
TASK-8: Working with Hive QL, Use Hive to create, alter, and drop databases, tables, views, functions, and
indexes.
Program:
Creating, loading data into a table:
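The table used in the examples below can be created and loaded as follows (the column list and the data file path are illustrative):
CREATE TABLE my_table (id INT, name STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
LOAD DATA LOCAL INPATH '/home/student/data/my_table.csv' INTO TABLE my_table;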
Altering a table:
ALTER TABLE my_table ADD COLUMNS (age INT);
o/p: OK
Drop a table:
DROP TABLE my_table;
o/p: OK
Creating a view:
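For example, over the table created above:
CREATE VIEW my_view AS SELECT id, name FROM my_table;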
Drop a view:
DROP VIEW my_view;
o/p: OK
Alter a view:
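For example, ALTER VIEW ... AS redefines the view's query:
ALTER VIEW my_view AS SELECT id FROM my_table;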
Create a function:
CREATE FUNCTION my_function AS 'org.example.MyFunction' USING JAR 'my_functions.jar';
o/p: OK
Drop a function:
DROP FUNCTION my_function;
o/p: OK
Create an index:
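For example, using the compact index handler (indexes are supported in Hive releases before 3.0):
CREATE INDEX my_index ON TABLE my_table (name)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD;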
Drop index:
DROP INDEX my_index ON my_table;
o/p: OK
Task 9: Writing User Defined Functions in Hive.
Program:
1. Create a Java class for the User Defined Function which extends
org.apache.hadoop.hive.ql.exec.UDF and implements one or more evaluate() methods. Put in
your desired logic and you are almost there.
2. Package your Java class into a JAR file (I am using Maven)
3. Go to the Hive CLI, add your JAR, and verify that your JAR is in the Hive CLI classpath.
4. CREATE TEMPORARY FUNCTION in Hive which points to your Java class
5. Use it in Hive SQL.
package org.hardik.letsdobigdata;
import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;
// UDF class (the class declaration is not shown in the record; the name Strip is assumed)
public class Strip extends UDF {

    private Text result = new Text();

    // Trims the characters given in stripChars from both ends of str.
    public Text evaluate(Text str, String stripChars) {
        if (str == null)
            return null;
        result.set(StringUtils.strip(str.toString(), stripChars));
        return result;
    }

    // Trims leading and trailing spaces.
    public Text evaluate(Text str) {
        if (str == null)
            return null;
        result.set(StringUtils.strip(str.toString()));
        return result;
    }
}
evaluate(Text str, String stripChars) - trims the characters specified in stripChars from the first argument str.
evaluate(Text str) - trims leading and trailing spaces.
Run "mvn clean package". This will create a JAR file which contains our UDF class. Copy the
JAR's path.
Next, add the Hive JAR files to the project classpath.
Edit the .bashrc file to update the environment variables for Apache Hive. Add the following
variables at the end of the .bashrc as shown below.
export HIVE_HOME=/usr/local/hive
export HADOOP_HOME=/usr/local/hadoop
export CLASSPATH=$CLASSPATH:$HIVE_HOME/lib/*:$HADOOP_HOME/*:$HADOOP_HOME/lib/*
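The registration and test queries appear as screenshots in the record; assuming the UDF class above is org.hardik.letsdobigdata.Strip and an illustrative JAR path, they would look like:
ADD JAR /home/student/udfs/hive-udf-1.0.jar;
CREATE TEMPORARY FUNCTION strip AS 'org.hardik.letsdobigdata.Strip';
SELECT strip('hadoop', 'ha');
SELECT strip('   hadoop   ');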
The first query strips ‘ha’ from string ‘hadoop’ as expected (2 argument evaluate() in
code). The second query strips trailing and leading spaces as expected.
TASK 10: Understanding the processing of large dataset on Spark framework.
Prg:
The result is stored in a pyspark.sql.DataFrame variable. Now let us look into the data schema and the
number of records in it.
Let us compute the number of products per country to get an idea of the database composition:
from pyspark.sql.functions import col
BDD_countries = raw_data.groupBy("countries_tags").count().persist()
BDD_countries is also a PySpark DataFrame and has the following structure:
We can filter this new data frame to keep only the countries that have at least 5000 products recorded
in the database and plot the result:
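The notebook cells appear as screenshots in the record; a minimal PySpark sketch of the steps described above (the dataset path, separator, and the plotting call are assumptions) is:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("LargeDatasetProcessing").getOrCreate()

# Load the products dataset (path and options are assumptions)
raw_data = spark.read.csv("products.csv", header=True, sep="\t")

# Inspect the schema and the number of records
raw_data.printSchema()
print(raw_data.count())

# Number of products per country
BDD_countries = raw_data.groupBy("countries_tags").count().persist()
BDD_countries.show(10)

# Keep only the countries with at least 5000 products and plot (requires matplotlib)
top_countries = BDD_countries.filter(col("count") >= 5000).toPandas()
top_countries.plot.bar(x="countries_tags", y="count")
```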
TASK 11: Ingesting structured and unstructured data using Sqoop, Flume
Program:
Ingesting structured data using Sqoop
Sqoop is a tool that enables you to import data from structured data stores like relational databases into Hadoop.
Here are the basic steps to ingest data using Sqoop:
a. Install Sqoop: Install Sqoop on the machine that will run the Sqoop command.
b. Connect to the source database: Use the `sqoop import` command to connect to the source database, specify the
table to import, and provide the necessary credentials.
c. Specify the target Hadoop system: Specify the target Hadoop system to which you want to import the data
d. Define the target Hadoop destination: Define the Hadoop destination for the imported data, such as a Hive table.
Here is an example command to import data from a MySQL database into a Hive table:
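The command itself appears as a screenshot in the record; a hypothetical example (host, database, credentials, and table names are placeholders) is:

```bash
sqoop import \
  --connect jdbc:mysql://localhost/employees_db \
  --username root -P \
  --table employees \
  --hive-import \
  --hive-table employees \
  -m 1
```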
After executing this command, Map tasks are executed at the back end.
Once the job completes, you can check the HDFS Web UI at localhost:50070 to verify that the data has been imported.
Ingesting unstructured data using Flume
Flume is a tool that enables you to ingest unstructured data such as log files and event streams into Hadoop. Here
are the basic steps to ingest data using Flume:
a. Install Flume: Install Flume on the machine that will run the Flume command.
b. Configure Flume: Configure Flume to read data from the source and write it to the target Hadoop system. This
involves specifying the source type, the source location, and the target Hadoop destination.
c. Start the Flume agent: Start the Flume agent to begin the data ingestion process.
Here is an example configuration file for Flume that reads data from a log file and writes it to a Hadoop
destination:
agent.sources = logsource
agent.channels = memoryChannel
agent.sinks = hdfssink
agent.sources.logsource.type = exec
agent.sources.logsource.command = tail -F /var/log/messages
agent.sources.logsource.channels = memoryChannel
agent.sinks.hdfssink.type = hdfs
agent.sinks.hdfssink.hdfs.path = hdfs://hadoop.example.com/logs/
agent.sinks.hdfssink.channel = memoryChannel
agent.channels.memoryChannel.type = memory
agent.channels.memoryChannel.capacity = 1000
agent.sources.logsource.interceptors = timestampInterceptor
agent.sources.logsource.interceptors.timestampInterceptor.type = timestamp
This configuration reads data from the `/var/log/messages` file using the `tail -F` command and writes it to the
`/logs` directory in HDFS. It also adds a timestamp to each event using the `timestampInterceptor` interceptor.
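The agent defined above can then be started with the flume-ng command (the configuration file name is an assumption):

```bash
flume-ng agent --conf ./conf --conf-file flume.conf --name agent -Dflume.root.logger=INFO,console
```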