
BIG DATA ANALYTICS LAB – 21CSL76

1. Hadoop Programming: Word Count MapReduce Program


import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in the input line
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer (also used as combiner): sums the counts of each word
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

EXECUTION STEPS

1. start-all.sh
2. jps
3. gedit WordCount.java
4. hadoop com.sun.tools.javac.Main WordCount.java
5. jar cf wcjava.jar WordCount*.class
6. Create the input file wc1.txt
7. hadoop fs -mkdir /wcj
8. hadoop fs -put wc1.txt /wcj
9. hadoop jar wcjava.jar WordCount /wcj/wc1.txt /wcj/output
10. hadoop fs -cat /wcj/output/part-r-00000

OUTPUT:
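For illustration, if wc1.txt contained the single line "big data big analytics", part-r-00000 would hold one word and its count per line, separated by a tab:

analytics 1
big 2
data 1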


2. Implementing Matrix Multiplication Using Map-Reduce.


THEORY:
In mathematics, matrix multiplication or the matrix product is a binary operation that produces a
matrix from two matrices. In more detail, if A is an n × m matrix and B is an m × p matrix, their
matrix product AB is an n × p matrix, in which the m entries across a row of A are multiplied
with the m entries down a column of B and summed to produce an entry of AB. When two
linear transformations are represented by matrices, then the matrix product represents the
composition of the two transformations.

Algorithm for Map Function:

for each element mij of M do
    produce the (key, value) pairs ((i,k), (M, j, mij)) for k = 1, 2, 3, ... up to the number of columns of N

for each element njk of N do
    produce the (key, value) pairs ((i,k), (N, j, njk)) for i = 1, 2, 3, ... up to the number of rows of M

return the set of (key, value) pairs, so that each key (i,k) has a list with the values (M, j, mij) and (N, j, njk) for all possible values of j.

Algorithm for Reduce Function:

for each key (i,k) do
    sort the values that begin with M by j into listM
    sort the values that begin with N by j into listN
    multiply mij and njk for the j-th value of each list
    sum up the products mij x njk
    return ((i,k), Σj mij x njk)
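As a brief worked illustration (using the 2 x 2 sample matrices given below as INPUT 1 (A) and INPUT 2 (B), so m = n = p = 2): the map function turns the element a01 = 2.0 of A into the pairs ((0,0), (A,1,2.0)) and ((0,1), (A,1,2.0)), and the element b10 = 7.0 of B into the pairs ((0,0), (B,1,7.0)) and ((1,0), (B,1,7.0)). The reducer for key (0,0) therefore receives (A,0,1.0), (A,1,2.0), (B,0,5.0) and (B,1,7.0), and computes 1.0 x 5.0 + 2.0 x 7.0 = 19.0.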

CODE:
import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class MatrixMulti {

  // Mapper: each element of A and B is emitted once for every
  // output cell (i,k) it contributes to.
  public static class Map extends Mapper<LongWritable, Text, Text, Text> {

    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      Configuration conf = context.getConfiguration();
      int m = Integer.parseInt(conf.get("m"));
      int p = Integer.parseInt(conf.get("p"));
      String line = value.toString();
      // Input format: matrixName,row,column,value
      String[] indicesAndValue = line.split(",");
      Text outputKey = new Text();
      Text outputValue = new Text();
      if (indicesAndValue[0].equals("A")) {
        for (int k = 0; k < p; k++) {
          outputKey.set(indicesAndValue[1] + "," + k);
          outputValue.set("A," + indicesAndValue[2] + "," + indicesAndValue[3]);
          context.write(outputKey, outputValue);
        }
      } else {
        for (int i = 0; i < m; i++) {
          outputKey.set(i + "," + indicesAndValue[2]);
          outputValue.set("B," + indicesAndValue[1] + "," + indicesAndValue[3]);
          context.write(outputKey, outputValue);
        }
      }
    }
  }

  // Reducer: for each output cell (i,k), match A and B entries by
  // their common index j and sum the products.
  public static class Reduce extends Reducer<Text, Text, Text, Text> {

    public void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      String[] value;
      HashMap<Integer, Float> hashA = new HashMap<Integer, Float>();
      HashMap<Integer, Float> hashB = new HashMap<Integer, Float>();
      int n = Integer.parseInt(context.getConfiguration().get("n"));
      float result = 0.0f;
      float a_ij;
      float b_jk;
      for (Text val : values) {
        value = val.toString().split(",");
        if (value[0].equals("A")) {
          hashA.put(Integer.parseInt(value[1]), Float.parseFloat(value[2]));
        } else {
          hashB.put(Integer.parseInt(value[1]), Float.parseFloat(value[2]));
        }
      }
      for (int j = 0; j < n; j++) {
        a_ij = hashA.containsKey(j) ? hashA.get(j) : 0.0f;
        b_jk = hashB.containsKey(j) ? hashB.get(j) : 0.0f;
        result += a_ij * b_jk;
      }
      if (result != 0.0f) {
        context.write(null,
            new Text(key.toString() + "," + Float.toString(result)));
      }
    }
  }

  public static void main(String[] args) throws Exception {
    if (args.length != 5) {
      System.err.println("Use: MatrixMulti <input> <output> <m> <n> <p>");
      System.exit(1);
    }
    Configuration conf = new Configuration();
    // A is an m-by-n matrix; B is an n-by-p matrix.
    conf.set("m", args[2]);
    conf.set("n", args[3]);
    conf.set("p", args[4]);
    Job job = Job.getInstance(conf);
    job.setJobName("MatrixMatrixMultiplicationOneStep");
    job.setJarByClass(MatrixMulti.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    job.setNumReduceTasks(1);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
  }
}

INPUT 1

A,0,0,1.0

A,0,1,2.0

A,1,0,3.0

A,1,1,4.0


INPUT 2

B,0,0,5.0

B,0,1,6.0

B,1,0,7.0

B,1,1,8.0

EXECUTION STEPS:

1. start-all.sh
2. jps
3. gedit MatrixMulti.java
4. hadoop com.sun.tools.javac.Main MatrixMulti.java
5. jar cf wcjava.jar MatrixMulti*.class
6. Create the input files mm1.txt and mm2.txt
7. hadoop fs -mkdir /mmj
8. hadoop fs -put mm1.txt mm2.txt /mmj
9. hadoop jar wcjava.jar MatrixMulti /mmj/mm*.txt /mmj/output 2 2 2
   (the program expects the matrix dimensions m, n and p as the last three arguments; 2 2 2 matches the sample 2 x 2 matrices)
10. hadoop fs -cat /mmj/output/part-r-00000

OUTPUT:
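For the sample matrices above, A = [[1,2],[3,4]] and B = [[5,6],[7,8]], the product AB is [[19,22],[43,50]], so the output file part-r-00000 should contain lines of the form:

0,0,19.0
0,1,22.0
1,0,43.0
1,1,50.0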


3. Implementing Relational Algorithm on Pig.


THEORY:
In this section, we will explore and understand a few important relational operators in Pig, which are widely used in the big data industry. Before we look at the relational operators, let us see what Pig is.

Apache Pig, developed by Yahoo!, helps in analyzing large datasets so that users spend less time writing mapper and reducer programs. Pig enables users to write complex data analysis code without prior knowledge of Java. Pig's simple SQL-like scripting language is called Pig Latin, and it has its own runtime environment in which Pig Latin programs are executed.

Below are the two datasets that will be used in this section.

Employee_details.txt: this dataset has 4 columns:
Emp_id: unique id of each employee
Name: name of the employee
Salary: salary of the employee
Ratings: rating of the employee


Employee_expenses.txt: this dataset has 2 columns:
Emp_id: id of the employee
Expense: expenses made by the employee

Relational Operators:

Load
Loads data from either the local filesystem or the Hadoop filesystem.
Syntax:
LOAD 'path_of_data' [USING function] [AS schema];
Where:
path_of_data : file/directory name in single quotes.
USING : keyword.
function : if you choose to omit this, the default load function PigStorage() is used.
AS : keyword.
schema : schema of your data along with the data types.
Eg:
The file named employee_details.txt is a comma-separated file and we are going to load it from the local file system:

A = LOAD '/home/sukcse/employee_details.txt' USING PigStorage(',') AS (id:int, name:chararray, salary:int, ratings:int);

DUMP A;

Similarly, you can load another dataset into another relation, say 'B':

B = LOAD '/home/acadgild/pig/employee_expenses.txt' USING PigStorage('\t') AS (id:int, expenses:int);

As the fields in this file are tab separated, you need to use '\t'.
NOTE: If you load this dataset into relation A, the earlier dataset will no longer be accessible.

Limit

Limits the number of output tuples to the desired number.

Syntax:
alias = LIMIT alias n;
Where:
alias : name of the relation.
n : number of tuples to be displayed.
Eg:
We will limit the result of relation A (described above) to 5:

limited_val = LIMIT A 5;

NOTE: there is no guarantee which 5 tuples will be output.

Order

Sorts a relation based on one or more fields.

Syntax:
alias = ORDER alias BY field_name [ASC | DESC];
Where:
alias : the relation.
ORDER : keyword.
BY : keyword.
field_name : column on which you want to sort the relation.
ASC : sort in ascending order.
DESC : sort in descending order.
Eg:
We will sort relation A based on the ratings field and get the details of the top 3 employees with the highest ratings:

Sorted = ORDER A BY ratings DESC;
Result = LIMIT Sorted 3;

You can also order the relation based on multiple fields. Let's order relation A by descending 'ratings' and ascending 'name' and generate the top 3 results:

Double_sorted = ORDER A BY ratings DESC, name ASC;
Final_result = LIMIT Double_sorted 3;

Now, compare the 'Result' and 'Final_result' relations.


Group

Groups the data based on one or more fields. It groups together tuples that have the same group key (key field). The key field will be a tuple if the group key has more than one field; otherwise it will be of the same type as the group key.
Syntax:
alias = GROUP alias {ALL | BY field};
Where:
alias : the relation.
GROUP : keyword.
ALL : keyword. Use ALL if you want all tuples to go into a single group.
BY : keyword.
field : field name on which you want to group your data.
Eg:
We will group our relation A based on ratings:

Grouped = GROUP A BY ratings;

You can see that the output is one tuple per 'ratings' value, each with a bag of the matching records. You can also group the data based on more than one field, for example:

Multi_group = GROUP A BY (ratings, salary);
Foreach

Generates data transformations based on desired columns of data.

Syntax:
alias = FOREACH alias GENERATE {expression | field};
Where:
alias : the relation.
FOREACH : keyword.
GENERATE : keyword.
Eg:
In our previous example we saw how to group a relation. Now, using FOREACH, we will generate the count of employees belonging to each group:

Result = FOREACH Grouped GENERATE group, COUNT(A.ratings);

From the result, we can conclude that there are 4 employees who got 1 as their rating; it is indicated by the first row. Basically, if you want to operate at the column level, you can use FOREACH.
Filter

Filters a relation based on a condition.

Syntax:
alias = FILTER alias BY expression;
Where:
alias : the relation.
FILTER : keyword.
BY : keyword.
expression : condition on which the filter is performed.
Eg:
We will filter our data (relation A) based on ratings greater than or equal to 4:

Filtered = FILTER A BY ratings >= 4;

Use the FILTER operator to work with tuples or rows of data (if you want to work with columns of data, use the FOREACH ... GENERATE operation).
FILTER is commonly used to select the data that you want or, conversely, to filter out (remove) the data you don't want.
We can also use multiple conditions to filter data in one go.
Eg:

Multi_condition = FILTER A BY (ratings >= 4) AND (salary > 1000);

This produces the tuples whose ratings are greater than or equal to 4 and whose salary is greater than 1000.

Store

Stores and saves data into a filesystem.
Syntax:
STORE alias INTO 'directory' [USING function];
Where:
STORE : keyword.
alias : the relation which you want to store.
INTO : keyword.
directory : name of the directory where you want to store your result.
NOTE: If the directory already exists, you will receive an error and the STORE operation will fail.
function : the store function. PigStorage is the default storage function, so it is not mandatory to mention it explicitly.
Eg:
We will store our result named "Multi_condition" (produced in the previous operation) into the local file system, as sketched below.
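A minimal sketch of the statement (the output directory /home/sukcse/filter_output is an assumed path; any directory that does not yet exist will work):

STORE Multi_condition INTO '/home/sukcse/filter_output' USING PigStorage('|');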

The only catch here is that PigStorage('|') is specified, which means the result is written with the pipe character '|' as the field delimiter.
Let's check the result.
4. Implementing Database Operations on Hive.
THEORY:
Hive defines a simple SQL-like query language for querying and managing large datasets, called Hive-QL (HQL). It is easy to use if you are familiar with SQL. Hive also allows programmers who are familiar with the MapReduce framework to plug in custom mappers and reducers to perform more sophisticated analysis.
Uses of Hive:
1. Apache Hive works on top of Hadoop's distributed storage.
2. Hive provides tools to enable easy data extract/transform/load (ETL).
3. It provides structure on a variety of data formats.
4. By using Hive, we can access files stored in the Hadoop Distributed File System (HDFS), which is used for querying and managing large datasets, or in other data storage systems such as Apache HBase.
Limitations of Hive:
• Hive is not designed for Online Transaction Processing (OLTP); it is only used for Online Analytical Processing.
• Hive supports overwriting or appending data, but not updates and deletes.
• In Hive, sub-queries are not supported.
Why is Hive used in spite of Pig?
The following are the reasons why Hive is used in spite of Pig's availability:
• Hive-QL is a declarative language like SQL; Pig Latin is a data-flow language.
• Pig: a data-flow language and environment for exploring very large datasets.
• Hive: a distributed data warehouse.
Components of Hive:
Metastore :

Hive stores the schema of the Hive tables in a Hive Metastore. The Metastore is used to hold all the information about the tables and partitions that are in the warehouse. By default, the Metastore runs in the same process as the Hive service, and the default Metastore database is Derby.
SerDe:
Serializer/Deserializer; it gives instructions to Hive on how to process a record.
Hive Commands :
Data Definition Language (DDL)

DDL statements are used to build and modify the tables and other objects in the database.
Example :

CREATE, DROP, TRUNCATE, ALTER, SHOW, DESCRIBE Statements.


Go to the Hive shell by giving the command sudo hive, and enter the command 'create database <database name>;' to create a new database in Hive.

To list the databases in the Hive warehouse, enter the command 'show databases;'.

The database is created in the default location of the Hive warehouse. In Cloudera, Hive databases are stored under /user/hive/warehouse.

The command to use a database is USE <database name>;
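For example (a minimal sketch; the database name retail matches the retail.db warehouse directory referred to below):

hive> CREATE DATABASE retail;
hive> SHOW DATABASES;
hive> USE retail;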

Copy the input data to HDFS from the local file system by using the copyFromLocal command.

When we create a table in Hive, it is created in the default location of the Hive warehouse, "/user/hive/warehouse". After creation of the table we can move the data from HDFS into the Hive table.
The following command creates a table within the location "/user/hive/warehouse/retail.db" (see the sketch below).
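A minimal sketch (the table name txnrecords and its columns are assumed purely for illustration):

hive> CREATE TABLE txnrecords (txnno INT, txndate STRING, custno INT, amount DOUBLE, category STRING, product STRING, city STRING)
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ','
      STORED AS TEXTFILE;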

Note : retail.db is the database created in the Hive warehouse.

Data Manipulation Language (DML)
DML statements are used to retrieve, store, modify, delete, insert and update data in the database.

Example :

LOAD, INSERT Statements.


Syntax:
LOAD DATA [LOCAL] INPATH '<file path>' INTO TABLE <tablename>;
The LOAD operation is used to move the data into the corresponding Hive table. If the keyword LOCAL is specified, the load command is given a local file system path. If the keyword LOCAL is not specified, we have to use the HDFS path of the file.

Here are examples of the LOAD DATA command, with and without the LOCAL keyword (sketched below).
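Both statements are illustrative sketches; the file paths and the table name txnrecords are assumed:

hive> LOAD DATA LOCAL INPATH '/home/cloudera/txns.txt' INTO TABLE txnrecords;
hive> LOAD DATA INPATH '/user/cloudera/txns.txt' INTO TABLE txnrecords;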

After loading the data into the Hive table, we can apply the Data Manipulation Statements or aggregate functions to retrieve the data.
Example to count the number of records:

The count aggregate function is used to count the total number of records in a table, as in the sketch below.
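A minimal sketch (table name assumed as above):

hive> SELECT COUNT(*) FROM txnrecords;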

'create external' Table:

The CREATE EXTERNAL keyword is used to create a table and provide a location where the table will be created, so that Hive does not use its default location for this table. An EXTERNAL table points to any HDFS location for its storage, rather than the default storage.
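A minimal sketch (the table name, columns, and HDFS location are assumed for illustration):

hive> CREATE EXTERNAL TABLE txnrecords_ext (txnno INT, custno INT, amount DOUBLE)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
      LOCATION '/user/cloudera/txn_data';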

Insert Command:
The insert command is used to load data into a Hive table. Inserts can be done into a table or a partition.
• INSERT OVERWRITE is used to overwrite the existing data in the table or partition.
• INSERT INTO is used to append data to the existing data in a table. (Note: the INSERT INTO syntax works from version 0.8 onwards.)
An illustrative sketch follows.
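A minimal sketch (the table txnrecords_copy is assumed to exist with the same schema as txnrecords):

hive> INSERT OVERWRITE TABLE txnrecords_copy SELECT * FROM txnrecords;
hive> INSERT INTO TABLE txnrecords_copy SELECT * FROM txnrecords WHERE amount > 1000;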

Example for 'Partitioned By' and 'Clustered By' Command:

'Partitioned By' is used to divide the table into partitions, and the partitions can further be divided into buckets by using the 'Clustered By' command.

When we insert the data, Hive may throw an error saying that the dynamic partition mode is strict and dynamic partitioning is not enabled. So we need to set the following parameters in the Hive shell, as sketched below.
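A minimal sketch (the table name and columns are assumed; the SET statements enable dynamic partitioning and bucketing):

hive> CREATE TABLE txnrecords_part (txnno INT, custno INT, amount DOUBLE)
      PARTITIONED BY (category STRING)
      CLUSTERED BY (custno) INTO 4 BUCKETS
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
hive> SET hive.exec.dynamic.partition=true;
hive> SET hive.exec.dynamic.partition.mode=nonstrict;
hive> SET hive.enforce.bucketing=true;
hive> INSERT OVERWRITE TABLE txnrecords_part PARTITION (category)
      SELECT txnno, custno, amount, category FROM txnrecords;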


5. Implementing Frequent Item set algorithm using Map-Reduce.

FREQUENT ITEM SET PROGRAM

//package freq_item;

import java.io.*;
import java.io.IOException;
import java.util.*;
import java.util.Scanner;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class FreqItem {
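// Approach: for each transaction the mapper first emits every single item,
// then repeatedly joins itemsets that agree on everything but their last item
// to form larger candidate itemsets, emitting each candidate with a count of 1.
// The reducer sums these counts and keeps only the itemsets whose support
// exceeds 9 (i.e. itemsets appearing in at least 10 transactions).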


static int ntrans=0;

public static class MapperClass


extends Mapper<Object, Text, Text, IntWritable>{
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();

public void map(Object key, Text value, Context context


) throws IOException, InterruptedException {
StringTokenizer items = new StringTokenizer(value.toString()," \t\n\r\f,.:;?![]'");
LinkedList <String>list = new LinkedList<String>();
LinkedList <String>clist = new LinkedList<String>();
LinkedList <String>templist1 = new LinkedList<String>();
LinkedList <String>templist2 = new LinkedList<String>();
ntrans++;
String str="";


int f=0;
int count=0,i=0,j=0,nitem=0;
Iterator iterator;

while (items.hasMoreTokens()) {
word.set(items.nextToken());
nitem++;
count=0;
context.write(word, one);
}
StringTokenizer items2 = new StringTokenizer(value.toString()," \t\n\r\f,.:;?![]'");
while (items2.hasMoreTokens()) {
list.add(items2.nextToken());
}

count=0;
clist.clear();

count=1;

while (count < nitem)


{
count=count+1;

for (i=0;i<(list.size()-1);i++)
{

items2 = new StringTokenizer(list.get(i));


while (items2.hasMoreTokens())
{
templist1.add(items2.nextToken());
}

for (j=i+1;j<list.size();j++)
{
items2 = new StringTokenizer(list.get(j));
while (items2.hasMoreTokens())
{
templist2.add(items2.nextToken());
}


f=0;
for (int k=0;k < (templist1.size()-1);k++)
{
if(!(templist1.get(k).equals(templist2.get(k))))
{
f=1;
break;
}
}

if(f == 0)
{
str="";
str=list.get(i)+" "+templist2.get(templist2.size()-1);

clist.add(str);
items2 = new StringTokenizer(str,"\n");
if(items2.hasMoreTokens()){
word.set(items2.nextToken());
context.write(word, one);
}
}
templist2.clear();
}

templist1.clear();
}
list.clear();
for(int k=0;k < clist.size();k++)
list.add(clist.get(k));
clist.clear();
}
}
}
public static class ReducerClass
extends Reducer<Text,IntWritable,Text,IntWritable> {
private IntWritable result = new IntWritable();

public void reduce(Text key, Iterable<IntWritable> values,


Context context) throws IOException, InterruptedException {


int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}

result.set(sum);
if(sum>9)
context.write(key, result);
}
}

public static void main(String[] args) throws Exception {

long start = System.currentTimeMillis( );


int m;
System.out.println("input 1 int");
Scanner in=new Scanner(System.in);

Configuration conf = new Configuration();


Job job = new Job(conf, "project");
job.setJarByClass(FreqItem.class);
job.setMapperClass(MapperClass.class);
job.setReducerClass(ReducerClass.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
if(job.waitForCompletion(true))
{

}
else
System.exit(1);
}
}


Input:

Item1 Item2 Item3 Item4 Item5 Item7 Item8 Item9


Item8
Item1 Item2 Item9
Item3
Item2 Item5 Item6 Item7 Item8 Item9
Item3 Item7 Item8
Item1 Item2 Item3 Item7 Item8
Item3 Item4 Item5 Item6 Item7
Item2 Item5 Item6 Item7 Item8 Item9
Item1 Item2 Item4 Item5 Item6 Item7 Item8
Item1 Item2 Item3 Item5
Item2 Item3 Item6 Item7 Item9
Item2 Item7
Item1 Item2 Item5 Item6
Item3 Item6 Item7 Item8
Item1 Item2 Item4 Item9
Item4 Item7 Item8 Item9
Item1 Item2 Item6 Item7 Item8
Item3 Item5 Item6 Item8
Item1 Item3 Item4 Item5 Item7 Item8
Item1 Item4 Item6
Item3 Item6 Item8
Item1 Item2 Item4 Item5
Item3 Item5
Item3 Item5 Item8 Item9
Item1 Item3 Item5 Item8
Item1 Item4 Item5 Item6
Item2 Item3 Item4 Item9
Item1 Item2 Item3 Item5 Item7 Item8
Item4 Item7 Item8 Item9
Item1 Item2 Item3 Item4 Item5 Item6 Item8 Item9
Item1 Item4 Item5 Item7
Item1 Item2 Item5 Item6 Item7 Item8
Item2 Item3 Item7
Item4 Item8 Item9
Item6 Item7
Item1 Item3 Item5 Item6 Item9

Faculty of Engineering &Technology, (Co-Ed)


Department of Computer Science & Engineering Page 40
BIG DATA ANALYTICS LAB – 21CSL76

EXECUTION STEPS:

1. start-all.sh
2. jps
3. gedit FreqItem.java
4. hadoop com.sun.tools.javac.Main FreqItem.java
5. jar cf fi.jar FreqItem*.class
6. Create the input file input.txt
7. hadoop fs -mkdir /fi
8. hadoop fs -put input.txt /fi
9. hadoop jar fi.jar FreqItem /fi/input.txt /fi/output
10. hadoop fs -cat /fi/output/part-r-00000


OUTPUT:


6. Implementing Clustering algorithm using Map-Reduce

import java.io.IOException;
import java.util.*;
import java.io.*;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.Reducer;

@SuppressWarnings("deprecation")
public class kMeansClustering {
public static String OUT = "outfile";
public static String IN = "inputlarger";
public static String CENTROID_FILE_NAME = "/centroid.txt";
public static String OUTPUT_FILE_NAME = "/part-00000";
public static String DATA_FILE_NAME = "/data.txt";
public static String JOB_NAME = "KMeans";
public static String SPLITTER = "\t| ";
public static List<Double> mCenters = new ArrayList<Double>();

/*
* In Mapper class we are overriding configure function. In this we are
* reading file from Distributed Cache and then storing that into instance

* variable "mCenters"
*/
public static class Map extends MapReduceBase implements
Mapper<LongWritable, Text, DoubleWritable, DoubleWritable> {
@Override
public void configure(JobConf job) {
try {
// Fetch the file from Distributed Cache Read it and store the
// centroid in the ArrayList
Path[] cacheFiles = DistributedCache.getLocalCacheFiles(job);
if (cacheFiles != null && cacheFiles.length > 0) {
String line;
mCenters.clear();
BufferedReader cacheReader = new BufferedReader(
new FileReader(cacheFiles[0].toString()));
try {
// Read the file split by the splitter and store it in
// the list
while ((line = cacheReader.readLine()) != null) {
String[] temp = line.split(SPLITTER);

mCenters.add(Double.parseDouble(temp[0]));
}
} finally {
cacheReader.close();
}
}
} catch (IOException e) {
System.err.println("Exception reading DistribtuedCache: " + e);
}
}

/*
* Map function will find the minimum center of the point and emit it to
* the reducer
*/
@Override
public void map(LongWritable key, Text value,
OutputCollector<DoubleWritable, DoubleWritable> output,
Reporter reporter) throws IOException {


String line = value.toString();


double point = Double.parseDouble(line);
double min1, min2 = Double.MAX_VALUE, nearest_center = mCenters
.get(0);
// Find the minimum center from a point
for (double c : mCenters) {
min1 = c - point;
if (Math.abs(min1) < Math.abs(min2)) {
nearest_center = c;
min2 = min1;
}
}
// Emit the nearest center and the point
output.collect(new DoubleWritable(nearest_center),
new DoubleWritable(point));
}
}

public static class Reduce extends MapReduceBase implements


Reducer<DoubleWritable, DoubleWritable, DoubleWritable, Text> {

/*
* Reduce function will emit all the points to that center and calculate
* the next center for these points
*/
@Override
public void reduce(DoubleWritable key, Iterator<DoubleWritable> values,
OutputCollector<DoubleWritable, Text> output, Reporter reporter)
throws IOException {
double newCenter;
double sum = 0;
int no_elements = 0;
String points = "";
while (values.hasNext()) {
double d = values.next().get();
points = points + " " + Double.toString(d);
sum = sum + d;
++no_elements;
}


// We have new center now


newCenter = sum / no_elements;

// Emit new center and point


output.collect(new DoubleWritable(newCenter), new Text(points));
}
}

public static void main(String[] args) throws Exception {


run(args);
}

public static void run(String[] args) throws Exception {


IN = args[0];
OUT = args[1];
String input = IN;
String output = OUT + System.nanoTime();
String again_input = output;

// Reiterating till the convergence


int iteration = 0;
boolean isdone = false;
while (isdone == false) {
JobConf conf = new JobConf(kMeansClustering.class);
if (iteration == 0) {
Path hdfsPath = new Path(input + CENTROID_FILE_NAME);
// upload the file to hdfs. Overwrite any existing copy.
DistributedCache.addCacheFile(hdfsPath.toUri(), conf);
} else {
Path hdfsPath = new Path(again_input +
OUTPUT_FILE_NAME);
// upload the file to hdfs. Overwrite any existing copy.
DistributedCache.addCacheFile(hdfsPath.toUri(), conf);
}

conf.setJobName(JOB_NAME);
conf.setMapOutputKeyClass(DoubleWritable.class);
conf.setMapOutputValueClass(DoubleWritable.class);
conf.setOutputKeyClass(DoubleWritable.class);
conf.setOutputValueClass(Text.class);


conf.setMapperClass(Map.class);
conf.setReducerClass(Reduce.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);

FileInputFormat.setInputPaths(conf,
new Path(input + DATA_FILE_NAME));
FileOutputFormat.setOutputPath(conf, new Path(output));

JobClient.runJob(conf);

Path ofile = new Path(output + OUTPUT_FILE_NAME);


FileSystem fs = FileSystem.get(new Configuration());
BufferedReader br = new BufferedReader(new InputStreamReader(
fs.open(ofile)));
List<Double> centers_next = new ArrayList<Double>();
String line = br.readLine();
while (line != null) {
String[] sp = line.split("\t| ");
double c = Double.parseDouble(sp[0]);
centers_next.add(c);
line = br.readLine();
}
br.close();

String prev;
if (iteration == 0) {
prev = input + CENTROID_FILE_NAME;
} else {
prev = again_input + OUTPUT_FILE_NAME;
}
Path prevfile = new Path(prev);
FileSystem fs1 = FileSystem.get(new Configuration());
BufferedReader br1 = new BufferedReader(new InputStreamReader(
fs1.open(prevfile)));
List<Double> centers_prev = new ArrayList<Double>();
String l = br1.readLine();
while (l != null) {
String[] sp1 = l.split(SPLITTER);
double d = Double.parseDouble(sp1[0]);


centers_prev.add(d);
l = br1.readLine();
}
br1.close();

// Sort the old centroid and new centroid and check for convergence
// condition
Collections.sort(centers_next);
Collections.sort(centers_prev);

Iterator<Double> it = centers_prev.iterator();
for (double d : centers_next) {
double temp = it.next();
if (Math.abs(temp - d) <= 0.1) { //convergence factor
isdone = true;
} else {
isdone = false;
break;
}
}
++iteration;
again_input = output;
output = OUT + System.nanoTime();
}
}
}

Input:

data.txt (the following 20 values, one per line):
20 23 19 29 33 29 43 35 18 25 27 47 55 63 59 69 15 25 54 89

centroid.txt (the following 4 initial centroids, one per line):
20.0 30.0 40.0 60.0

EXECUTION STEPS:

1. start-all.sh
2. jps
3. gedit kMeansClustering.java
4. hadoop com.sun.tools.javac.Main kMeansClustering.java
5. jar cf km.jar kMeansClustering*.class
6. Create the input files data.txt and centroid.txt
7. hadoop fs -mkdir /kmm
8. hadoop fs -put data.txt centroid.txt /kmm
9. hadoop jar km.jar kMeansClustering /kmm /kmm/output
10. hadoop fs -cat /kmm/output*/part-00000

OUTPUT:


7. Implementing Page Rank algorithm using Map-Reduce



Theory:
PageRank is a way of measuring the importance of website pages. PageRank works by
counting the number and quality of links to a page to determine a rough estimate of how
important the website is. The underlying assumption is that more important websites are
likely to receive more links from other websites.
In the general case, the PageRank value for any page u can be expressed as:
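PR(u) = Σ (v ∈ Bu) PR(v) / L(v)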

i.e. the PageRank value for a page u is dependent on the PageRank values for each
page v contained in the set Bu (the set containing all pages linking to page u),
divided by the number L(v) of links from page v.
Suppose consider a small network of four web pages: A, B, C and D. Links from a page to
itself, or multiple outbound links from one single page to another single page, are ignored.
PageRank is initialized to the same value for all pages. In the original form of PageRank,
the sum of PageRank over all pages was the total number of pages on the web at that time,
so each page in this example would have an initial value of 1.

The damping factor (generally set to 0.85) is subtracted from 1 (and in some variations of
the algorithm, the result is divided by the number of documents (N) in the collection) and
this term is then added to the product of the damping factor and the sum of the incoming
PageRank scores.
That is,
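PR(u) = (1 - d) + d * Σ (v ∈ Bu) PR(v) / L(v)
(in the variation mentioned above, the first term becomes (1 - d) / N, where d is the damping factor and N is the number of documents)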

So any page’s PageRank is derived in large part from the PageRanks of other
pages. The damping factor adjusts the derived value downward.

CODE:

# PageRank by power iteration on a small 3-page example
# (damping factor beta = 0.7)
import numpy as np
import scipy as sc
import pandas as pd
from fractions import Fraction

def display_format(my_vector, my_decimal):
    return np.round((my_vector).astype(float), decimals=my_decimal)

my_dp = Fraction(1, 3)
# Column-stochastic link matrix of the 3-page web graph
Mat = np.matrix([[0, 0, 1],
                 [Fraction(1, 2), 0, 0],
                 [Fraction(1, 2), 1, 0]])
# Teleportation matrix: every entry is 1/3
Ex = np.zeros((3, 3))
Ex[:] = my_dp
beta = 0.7
Al = beta * Mat + ((1 - beta) * Ex)
# Initial rank vector: 1/3 for every page
r = np.matrix([my_dp, my_dp, my_dp])
r = np.transpose(r)
previous_r = r
for i in range(1, 100):
    r = Al * r
    print(display_format(r, 3))
    # Stop when the rank vector no longer changes
    if (previous_r == r).all():
        break
    previous_r = r
print("Final:\n", display_format(r, 3))
print("sum", np.sum(r))

OUTPUT:
[[0.333]
[0.217]
[0.45 ]]
[[0.415]
[0.217]
[0.368]]
[[0.358]
[0.245]
[0.397]]
... (intermediate iterations omitted) ...
[[0.375]
[0.231]
[0.393]]

FINAL:
[[0.375]
[0.231]
[0.393]]
sum 0.9999999999999951


8. Develop a MapReduce to find the maximum electrical consumption in


each year given electrical consumption for each month in each year
package hadoop;
import java.util.*;
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;
public class ProcessUnits {
//Mapper class
public static class E_EMapper extends MapReduceBase implements
Mapper<LongWritable, /* Input key Type */
Text,                /* Input value Type */
Text,                /* Output key Type */
IntWritable>         /* Output value Type */
{
//Map function
public void map(LongWritable key, Text value,
OutputCollector<Text, IntWritable> output,
Reporter reporter) throws IOException {
String line = value.toString();
String lasttoken = null;
StringTokenizer s = new StringTokenizer(line,"\t");
String year = s.nextToken();
while(s.hasMoreTokens()) {
lasttoken = s.nextToken();
}
int avgprice = Integer.parseInt(lasttoken);
output.collect(new Text(year), new IntWritable(avgprice));
}
}
//Reducer class
public static class E_EReduce extends MapReduceBase implements
Reducer< Text, IntWritable, Text, IntWritable > {
//Reduce function
public void reduce( Text key, Iterator <IntWritable> values,


OutputCollector<Text, IntWritable> output, Reporter reporter)


throws IOException {
int maxavg = 30;
int val = Integer.MIN_VALUE;
while (values.hasNext()) {
if((val = values.next().get())>maxavg) {
output.collect(key, new IntWritable(val));
}
}
}
}
//Main function
public static void main(String args[])throws Exception {
JobConf conf = new JobConf(ProcessUnits.class);
conf.setJobName("max_eletricityunits");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(E_EMapper.class);
conf.setCombinerClass(E_EReduce.class);
conf.setReducerClass(E_EReduce.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
JobClient.runJob(conf);
}
}
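The mapper treats the first tab-separated field of each record as the year and the last field as the consumption value that the reducer compares against the threshold of 30. A minimal illustrative run (the figures are invented for illustration, and the fields must be tab-separated):

Sample input:
1979  23  43  24  25
1980  26  30  31  29
1981  31  36  34  34

Expected output (part-00000):
1981  34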


9. Develop a MapReduce to analyze a weather data set and print whether
the day is a sunny (hot) day or a cool (cold) day
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.conf.Configuration;

public class MyMaxMin {

// Mapper


/*MaxTemperatureMapper class is static


* and extends Mapper abstract class
* having four Hadoop generics type
* LongWritable, Text, Text, Text.
*/

public static class MaxTemperatureMapper extends


Mapper<LongWritable, Text, Text, Text> {

/**
* @method map
* This method takes the input as a text data type.
* Now leaving the first five tokens, it takes
* 6th token is taken as temp_max and
* 7th token is taken as temp_min. Now
* temp_max > 30 and temp_min < 15 are
* passed to the reducer.
*/

// the data in our data set with


// this value is inconsistent data
public static final int MISSING = 9999;

@Override
public void map(LongWritable arg0, Text Value, Context context)
throws IOException, InterruptedException {

// Convert the single row(Record) to


// String and store it in String
// variable name line

String line = Value.toString();

// Check for the empty line


if (!(line.length() == 0)) {

// from character 6 to 14 we have


// the date in our dataset
String date = line.substring(6, 14);


// similarly we have taken the maximum


// temperature from 39 to 45 characters
float temp_Max = Float.parseFloat(line.substring(39, 45).trim());

// similarly we have taken the minimum


// temperature from 47 to 53 characters

float temp_Min = Float.parseFloat(line.substring(47, 53).trim());

// if maximum temperature is
// greater than 30, it is a hot day
if (temp_Max > 30.0) {

// Hot day
context.write(new Text("The Day is Hot Day :" + date),
new
Text(String.valueOf(temp_Max)));
}

// if the minimum temperature is


// less than 15, it is a cold day
if (temp_Min < 15) {

// Cold day
context.write(new Text("The Day is Cold Day :" + date),
new Text(String.valueOf(temp_Min)));
}
}
}
}

// Reducer

/*MaxTemperatureReducer class is static


and extends Reducer abstract class
having four Hadoop generics type
Text, Text, Text, Text.
*/


public static class MaxTemperatureReducer extends


Reducer<Text, Text, Text, Text> {

/**
* @method reduce
* This method takes the input as key and
* list of values pair from the mapper,
* it does aggregation based on keys and
* produces the final context.
*/

public void reduce(Text Key, Iterator<Text> Values, Context context)


throws IOException, InterruptedException {

// putting all the values in


// temperature variable of type String
String temperature = Values.next().toString();
context.write(Key, new Text(temperature));
}
}

/**
* @method main
* This method is used for setting
* all the configuration properties.
* It acts as a driver for map-reduce
* code.
*/

public static void main(String[] args) throws Exception {

// reads the default configuration of the


// cluster from the configuration XML files
Configuration conf = new Configuration();


// Initializing the job with the


// default configuration of the cluster
Job job = new Job(conf, "weather example");

// Assigning the driver class name


job.setJarByClass(MyMaxMin.class);

// Key type coming out of mapper


job.setMapOutputKeyClass(Text.class);

// value type coming out of mapper


job.setMapOutputValueClass(Text.class);

// Defining the mapper class name


job.setMapperClass(MaxTemperatureMapper.class);

// Defining the reducer class name


job.setReducerClass(MaxTemperatureReducer.class);

// Defining input Format class which is


// responsible to parse the dataset
// into a key value pair
job.setInputFormatClass(TextInputFormat.class);

// Defining output Format class which is


// responsible to parse the dataset
// into a key value pair
job.setOutputFormatClass(TextOutputFormat.class);

// setting the second argument


// as a path in a path variable
Path OutputPath = new Path(args[1]);

// Configuring the input path


// from the filesystem into the job
FileInputFormat.addInputPath(job, new Path(args[0]));

// Configuring the output path from


// the filesystem into the job
FileOutputFormat.setOutputPath(job, new Path(args[1]));


// deleting the context path automatically


// from hdfs so that we don't have
// to delete it explicitly
OutputPath.getFileSystem(conf).delete(OutputPath);

// exiting the job only if the


// flag value becomes false
System.exit(job.waitForCompletion(true) ? 0 : 1);

}
}

Input:
processinput.txt (note: the mapper reads fixed-width fields, taking characters 7-14 as the date, 40-45 as the maximum temperature and 48-53 as the minimum temperature, so the records must follow that layout)
1979 23 23 2 43 24 25 26 26 26 26 25 26 25
1980 26 27 28 28 28 30 31 31 31 30 30 30 29
1981 31 32 32 32 33 34 35 36 36 34 34 34 34
1984 39 38 39 39 39 41 42 43 40 39 38 38 40
1985 38 39 39 39 39 41 41 41 00 40 39 39 45

EXECUTION STEPS:

1. start-all.sh
2. jps
3. gedit MyMaxMin.java
4. hadoop com.sun.tools.javac.Main MyMaxMin.java
5. jar cf wcjava.jar MyMaxMin*.class
6. Create the input file processinput.txt
7. hadoop fs -mkdir /mmm
8. hadoop fs -put processinput.txt /mmm
9. hadoop jar wcjava.jar MyMaxMin /mmm/processinput.txt /mmm/output
10. hadoop fs -cat /mmm/output/part-r-00000



OUTPUT:
