Big Data Lab
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordCount {
// Mapper: tokenizes each input line and emits (word, 1)
public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
context.write(word, one);
}
}
}
// Reducer (also used as combiner): sums the counts for each word
public static class IntSumReducer extends Reducer<Text,IntWritable,Text,IntWritable> {
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
}
// Driver: configures and submits the job
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "word count");
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
EXECUTION STEPS:
1. start-all.sh
2. jps
3. gedit WordCount.java
OUTPUT:
Map: for each element m_ij of M, produce the (key, value) pairs ((i,k), (M, j, m_ij)) for k = 1, 2, 3, ... up to the number of columns of N.
For each element n_jk of N, produce the (key, value) pairs ((i,k), (N, j, n_jk)) for i = 1, 2, 3, ... up to the number of rows of M.
Shuffle: return the set of (key, value) pairs in which each key (i,k) has a list with the values (M, j, m_ij) and (N, j, n_jk) for all possible values of j.
Reduce: for each key (i,k), multiply m_ij and n_jk for each value of j and sum the products.
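In other words, each reduce key (i,k) computes exactly one cell of the product P = M × N:
$p_{ik} = \sum_{j} m_{ij} \, n_{jk}$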
CODE:
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
int m = Integer.parseInt(conf.get("m"));
int p = Integer.parseInt(conf.get("p"));
// Map: each element of A is needed for p cells of the product,
// each element of B for m cells
String line = value.toString();
String[] indicesAndValue = line.split(",");
Text outputKey = new Text();
Text outputValue = new Text();
if (indicesAndValue[0].equals("A")) {
for (int k = 0; k < p; k++) {
outputKey.set(indicesAndValue[1] + "," + k);
outputValue.set("A," + indicesAndValue[2] + ","
+ indicesAndValue[3]);
context.write(outputKey, outputValue);
}
} else {
for (int i = 0; i < m; i++) {
outputKey.set(i + "," + indicesAndValue[2]);
outputValue.set("B," + indicesAndValue[1] + ","
+ indicesAndValue[3]);
context.write(outputKey, outputValue);
}
}
String[] value;
int n = Integer.parseInt(context.getConfiguration().get("n"));
float a_ij;
float b_jk;
// Reduce: for key (i,k), rebuild row i of A and column k of B,
// then compute the dot product over j
HashMap<Integer, Float> hashA = new HashMap<Integer, Float>();
HashMap<Integer, Float> hashB = new HashMap<Integer, Float>();
for (Text val : values) {
value = val.toString().split(",");
if (value[0].equals("A")) {
hashA.put(Integer.parseInt(value[1]),
Float.parseFloat(value[2]));
} else {
hashB.put(Integer.parseInt(value[1]),
Float.parseFloat(value[2]));
}
}
float result = 0.0f;
for (int j = 0; j < n; j++) {
a_ij = hashA.containsKey(j) ? hashA.get(j) : 0.0f;
b_jk = hashB.containsKey(j) ? hashB.get(j) : 0.0f;
result += a_ij * b_jk;
}
if (result != 0.0f) {
context.write(null,
new Text(key.toString() + "," + Float.toString(result)));
}
if (args.length != 5) {
System.exit(1);
}
conf.set("m", args[2]);
conf.set("n", args[3]);
conf.set("p", args[4]);
job.setJobName("MatrixMatrixMultiplicationOneStep");
job.setJarByClass(MatrixMulti.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
job.setNumReduceTasks(1);
job.waitForCompletion(true);
INPUT 1
A,0,0,1.0
A,0,1,2.0
A,1,0,3.0
A,1,1,4.0
INPUT 2
B,0,0,5.0
B,0,1,6.0
B,1,0,7.0
B,1,1,8.0
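For reference, these files encode A = [[1, 2], [3, 4]] and B = [[5, 6], [7, 8]] (format: matrix, row, column, value), so the expected product is
A × B = [[1·5 + 2·7, 1·6 + 2·8], [3·5 + 4·7, 3·6 + 4·8]] = [[19, 22], [43, 50]],
and the reducer should emit one line per non-zero cell, of the form 0,0,19.0 and so on (the exact layout depends on the output format).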
EXECUTION STEPS:
1. start-all.sh
2. jps
3. gedit MatrixMulti.java
OUTPUT:
Apache Pig, developed by Yahoo!, helps in analyzing large datasets while spending less time on
writing mapper and reducer programs. Pig enables users to write complex data analysis code
without prior knowledge of Java. Pig's simple SQL-like scripting language is called Pig
Latin, and it has its own runtime environment where Pig Latin programs are executed.
Below are the two datasets that will be used in this section.
Employee_details.txt has 4 columns: Emp_id (unique id for each employee), Name (name of the employee), Salary (salary of the employee) and Ratings (rating of the employee).
Employee_expenses.txt has 2 columns, id and expenses, and is used in a later LOAD example.
Relational Operators:
Load
To load data from either the local file system or the Hadoop file system (HDFS).
Syntax:
alias = LOAD 'path_of_data' [USING function] [AS schema];
Where;
alias : the relation into which the data is loaded.
path_of_data : file/directory name, in single quotes.
USING : is the keyword.
function : if you choose to omit this, the default load function PigStorage() is used.
AS : is the keyword.
schema : schema of your data, along with the data types.
Eg:
The file named employee_details.txt is a comma-separated file, and we load it from the local file
system into a relation, say 'A', and then inspect it with dump.
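A minimal sketch of that LOAD statement (the path and field names here are assumptions based on the dataset description above):
A = LOAD '/home/acadgild/pig/employee_details.txt' USING PigStorage(',') AS (emp_id:int, name:chararray, salary:int, ratings:int);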
dump A
Similarly, you can load another dataset into another relation, say 'B':
B = LOAD '/home/acadgild/pig/employee_expenses.txt' USING PigStorage('\t') AS
(id:int, expenses:int);
As the fields in this file are tab-separated, you need to use '\t' as the delimiter.
NOTE: If you load this dataset into relation A, the earlier dataset will no longer be accessible through A.
Limit
Limits the number of output tuples.
Order
Sorts a relation based on one or more fields.
Group
Groups the data based on one or multiple fields. It groups together tuples that have the same
group key (key field). The key field will be a tuple if the group key has more than one field,
otherwise it will be the same type as that of the group key.
Syntax:
alias = GROUP alias {ALL | BY field};
Where;
alias : is the relation.
GROUP : is the keyword.
ALL : keyword; use ALL if you want all tuples to go to a single group.
BY : keyword.
field : field name(s) on which you want to group your data.
Ex:
We will group our relation A based on ratings.
Grouped = GROUP A BY ratings;
You can see that the output is a tuple based on 'ratings' with multiple bags. You can also
group the data based on more than one field. For example,
Multi_group = GROUP A BY (ratings, salary);
Foreach
FOREACH is used when you want to operate at the column level: for each tuple (or group) it generates the specified expressions. Applying it to the grouped relation above lets us count the employees for each rating; from that result we can conclude that 4 employees got 1 as their rating, as indicated by the first row.
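A sketch of the kind of FOREACH statement that produces those per-rating counts (the alias Rating_count is an assumption):
Rating_count = FOREACH Grouped GENERATE group AS ratings, COUNT(A) AS cnt;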
Filter
Use the FILTER operator to work with tuples or rows of data (if you want to work with
columns of data, use the FOREACH …GENERATE operation).
FILTER is commonly used to select the data that you want; or, conversely, to filter out
(remove) the data you don’t want.
We can also use multiple conditions to filter data at one go.
Eg:
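A plausible multi-condition filter on our relation A (the threshold values are assumptions; the alias Multi_condition is the one referred to by the STORE example below):
Multi_condition = FILTER A BY (ratings >= 3) AND (salary > 20000);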
Store
Stores and saves the data into a filesystem.
Syntax:
STORE alias INTO 'directory' [USING function];
Where;
STORE : is a keyword.
alias : is the relation which you want to store.
INTO : is the keyword.
directory : name of the directory where you want to store your result.
NOTE: If the directory already exists, you will receive an error and the STORE operation will fail.
function : the store function. PigStorage() is the default storage function, so it is not mandatory to mention it explicitly.
Ex:
We will store our result named "Multi_condition" (obtained in the previous operation) into the local
file system.
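A plausible form of the statement (the output directory is an assumption):
STORE Multi_condition INTO '/home/acadgild/pig/output' USING PigStorage('|');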
The only catch here is that I have specified PigStorage('|'), which means the result will be written to the local file system with the pipe character '|' as the field delimiter.
Let’s check the result.
4. Implementing Database Operations on Hive.
THEORY:
Hive defines a simple SQL-like query language for querying and managing large datasets, called
Hive-QL (HQL). It is easy to use if you are familiar with SQL. Hive also allows
programmers who are familiar with the MapReduce framework to plug in their own custom mappers
and reducers to perform more sophisticated analysis.
Uses of Hive:
1. Apache Hive enables analysis of large datasets kept in distributed storage.
2. Hive provides tools to enable easy data extract/transform/load (ETL).
3. It provides structure on a variety of data formats.
4. By using Hive, we can access files stored in the Hadoop Distributed File System (HDFS) or in
other data storage systems such as Apache HBase.
Limitations of Hive:
• Hive is not designed for online transaction processing (OLTP); it is only used for
online analytical processing (OLAP).
• Hive supports overwriting or appending data, but not updates and deletes.
• In Hive, subqueries are not supported.
Why is Hive used in spite of Pig?
The following are the reasons why Hive is used in spite of Pig's availability:
• Hive-QL is a declarative language like SQL, whereas PigLatin is a data flow language.
• Pig: a data-flow language and environment for exploring very large datasets.
• Hive: a distributed data warehouse.
Components of Hive:
Metastore :
Hive stores the schema of the Hive tables in the Hive Metastore. The Metastore holds all the
information about the tables and partitions that are in the warehouse. By default, the Metastore
runs in the same process as the Hive service, and the default Metastore database is Derby.
SerDe :
The SerDe (Serializer/Deserializer) gives instructions to Hive on how to process a record.
Hive Commands :
Data Definition Language (DDL)
DDL statements are used to build and modify the tables and other objects in the database.
Example :
To list the databases in the Hive warehouse, enter the command 'show databases'.
Copy the input data to HDFS from the local file system by using the copyFromLocal command.
When we create a table in Hive, it is created in the default location of the Hive warehouse,
"/user/hive/warehouse". After creating the table, we can move the data from HDFS into the
Hive table.
The following command creates a table within the location "/user/hive/warehouse/retail.db":
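A hedged sketch of these DDL steps (the database name comes from the note below; the table and column names are assumptions for illustration):
show databases;
create database retail;
use retail;
create table txnrecords(txnno INT, txndate STRING, custno INT, amount DOUBLE, category STRING)
row format delimited fields terminated by ','
stored as textfile;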
Note : retail.db is the database created in the Hive warehouse.
Data Manipulation Language (DML)
DML statements are used to retrieve, store, modify, delete, insert and update data in the database.
Example :
Here are some examples of the LOAD DATA LOCAL command:
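One possible form, assuming the table sketched above and a local input file (both names are assumptions):
LOAD DATA LOCAL INPATH '/home/hadoop/txns.txt' OVERWRITE INTO TABLE txnrecords;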
After loading the data into the Hive table, we can apply Data Manipulation statements
or aggregate functions to retrieve the data.
Example, to count the number of records:
The COUNT aggregate function is used to count the total number of records in a table.
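For example (table name assumed as before):
select count(*) from txnrecords;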
Insert Command:
The INSERT command is used to load data into a Hive table. Inserts can be done to a table or
a partition.
• INSERT OVERWRITE is used to overwrite the existing data in the table or partition.
• INSERT INTO is used to append the new data to the existing data in a table.
When we insert data into a partitioned table, Hive throws an error if the dynamic partition
mode is strict and dynamic partitioning is not enabled. So we need to set the following
parameters in the Hive shell.
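The standard properties to set are:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;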
//package freq_item;
import java.io.*;
import java.io.IOException;
import java.util.*;
import java.util.Scanner;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
int f=0;
int count=0,i=0,j=0,nitem=0;
Iterator iterator;
while (items.hasMoreTokens()) {
word.set(items.nextToken());
nitem++;
count=0;
context.write(word, one);
}
StringTokenizer items2 = new StringTokenizer(value.toString()," \t\n\r\f,.:;?![]'");
while (items2.hasMoreTokens()) {
list.add(items2.nextToken());
}
count=0;
clist.clear();
count=1;
for (i=0;i<(list.size()-1);i++)
{
for (j=i+1;j<list.size();j++)
{
items2 = new StringTokenizer(list.get(j));
while (items2.hasMoreTokens())
{
templist2.add(items2.nextToken());
}
f=0;
for (int k=0;k < (templist1.size()-1);k++)
{
if(!(templist1.get(k).equals(templist2.get(k))))
{
f=1;
break;
}
}
if(f == 0)
{
str="";
str=list.get(i)+" "+templist2.get(templist2.size()-1);
clist.add(str);
items2 = new StringTokenizer(str,"\n");
if(items2.hasMoreTokens()){
word.set(items2.nextToken());
context.write(word, one);
}
}
templist2.clear();
}
templist1.clear();
}
list.clear();
for(int k=0;k < clist.size();k++)
list.add(clist.get(k));
clist.clear();
}
}
}
public static class ReducerClass
extends Reducer<Text,IntWritable,Text,IntWritable> {
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
// Sum the occurrences of each candidate itemset
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
// Emit only itemsets whose support count is above the threshold (here 9)
if(sum>9)
context.write(key, result);
}
}
}
else
System.exit(1);
}
}
Input:
EXECUTION STEPS:
1. start-all.sh
2. jps
3. gedit FreqItem.java
import java.io.IOException;
import java.util.*;
import java.io.*;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.Reducer;
@SuppressWarnings("deprecation")
public class kMeansClustering {
public static String OUT = "outfile";
public static String IN = "inputlarger";
public static String CENTROID_FILE_NAME = "/centroid.txt";
public static String OUTPUT_FILE_NAME = "/part-00000";
public static String DATA_FILE_NAME = "/data.txt";
public static String JOB_NAME = "KMeans";
public static String SPLITTER = "\t| ";
public static List<Double> mCenters = new ArrayList<Double>();
/*
* In Mapper class we are overriding configure function. In this we are
* reading file from Distributed Cache and then storing that into instance
* variable "mCenters"
*/
public static class Map extends MapReduceBase implements
Mapper<LongWritable, Text, DoubleWritable, DoubleWritable> {
@Override
public void configure(JobConf job) {
try {
// Fetch the file from Distributed Cache Read it and store the
// centroid in the ArrayList
Path[] cacheFiles = DistributedCache.getLocalCacheFiles(job);
if (cacheFiles != null && cacheFiles.length > 0) {
String line;
mCenters.clear();
BufferedReader cacheReader = new BufferedReader(
new FileReader(cacheFiles[0].toString()));
try {
// Read the file split by the splitter and store it in
// the list
while ((line = cacheReader.readLine()) != null) {
String[] temp = line.split(SPLITTER);
mCenters.add(Double.parseDouble(temp[0]));
}
} finally {
cacheReader.close();
}
}
} catch (IOException e) {
System.err.println("Exception reading DistributedCache: " + e);
}
}
/*
* Map function will find the minimum center of the point and emit it to
* the reducer
*/
@Override
public void map(LongWritable key, Text value,
OutputCollector<DoubleWritable, DoubleWritable> output,
Reporter reporter) throws IOException {
double point = Double.parseDouble(value.toString().trim());
double nearest_center = mCenters.get(0);
double min_distance = Double.MAX_VALUE;
// Find the center closest to this point
for (double c : mCenters) {
double distance = Math.abs(c - point);
if (distance < min_distance) {
nearest_center = c;
min_distance = distance;
}
}
// Emit the nearest center as the key and the point as the value
output.collect(new DoubleWritable(nearest_center),
new DoubleWritable(point));
}
}
public static class Reduce extends MapReduceBase implements
Reducer<DoubleWritable, DoubleWritable, DoubleWritable, Text> {
/*
* Reduce function will emit all the points to that center and calculate
* the next center for these points
*/
@Override
public void reduce(DoubleWritable key, Iterator<DoubleWritable> values,
OutputCollector<DoubleWritable, Text> output, Reporter reporter)
throws IOException {
double newCenter;
double sum = 0;
int no_elements = 0;
String points = "";
while (values.hasNext()) {
double d = values.next().get();
points = points + " " + Double.toString(d);
sum = sum + d;
++no_elements;
}
// The new center is the mean of all points assigned to the old center
newCenter = sum / no_elements;
// Emit the new center together with the points assigned to it
output.collect(new DoubleWritable(newCenter), new Text(points));
}
}
// Driver: the run() method loops until the centroids converge. Each
// iteration pushes the current centroid file into the DistributedCache,
// configures the job as below and runs it.
conf.setJobName(JOB_NAME);
conf.setMapOutputKeyClass(DoubleWritable.class);
conf.setMapOutputValueClass(DoubleWritable.class);
conf.setOutputKeyClass(DoubleWritable.class);
conf.setOutputValueClass(Text.class);
conf.setMapperClass(Map.class);
conf.setReducerClass(Reduce.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf,
new Path(input + DATA_FILE_NAME));
FileOutputFormat.setOutputPath(conf, new Path(output));
JobClient.runJob(conf);
// Read the new centroids produced by this iteration's job
Path nextfile = new Path(output + OUTPUT_FILE_NAME);
FileSystem fs = FileSystem.get(new Configuration());
BufferedReader br = new BufferedReader(new InputStreamReader(
fs.open(nextfile)));
List<Double> centers_next = new ArrayList<Double>();
String line = br.readLine();
while (line != null) {
String[] sp = line.split(SPLITTER);
centers_next.add(Double.parseDouble(sp[0]));
line = br.readLine();
}
br.close();
// Read the centroids that were used for this iteration (the initial
// centroid file on the first pass, otherwise the previous job's output)
String prev;
if (iteration == 0) {
prev = input + CENTROID_FILE_NAME;
} else {
prev = again_input + OUTPUT_FILE_NAME;
}
Path prevfile = new Path(prev);
FileSystem fs1 = FileSystem.get(new Configuration());
BufferedReader br1 = new BufferedReader(new InputStreamReader(
fs1.open(prevfile)));
List<Double> centers_prev = new ArrayList<Double>();
String l = br1.readLine();
while (l != null) {
String[] sp1 = l.split(SPLITTER);
double d = Double.parseDouble(sp1[0]);
centers_prev.add(d);
l = br1.readLine();
}
br1.close();
// Sort the old centroid and new centroid and check for convergence
// condition
Collections.sort(centers_next);
Collections.sort(centers_prev);
Iterator<Double> it = centers_prev.iterator();
for (double d : centers_next) {
double temp = it.next();
if (Math.abs(temp - d) <= 0.1) { //convergence factor
isdone = true;
} else {
isdone = false;
break;
}
}
++iteration;
again_input = output;
output = OUT + System.nanoTime();
}
}
}
Input:
data.txt centroid.txt
20 20.0
23 30.0
19 40.0
29 60.0
33
29
43
35
18
25
27
47
55
63
59
69
15
25
54
89
EXECUTION STEPS:
1. start-all.sh
2. jps
3. gedit kMeansClustering.java
Output:
Theory:
PageRank is a way of measuring the importance of website pages. PageRank works by
counting the number and quality of links to a page to determine a rough estimate of how
important the website is. The underlying assumption is that more important websites are
likely to receive more links from other websites.
In the general case, the PageRank value for any page u can be expressed as:
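In symbols (using the notation described below):
$PR(u) = \sum_{v \in B_u} \frac{PR(v)}{L(v)}$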
i.e. the PageRank value for a page u is dependent on the PageRank values for each
page v contained in the set Bu (the set containing all pages linking to page u),
divided by the number L(v) of links from page v.
Consider a small network of four web pages: A, B, C and D. Links from a page to
itself, or multiple outbound links from one single page to another single page, are ignored.
PageRank is initialized to the same value for all pages. In the original form of PageRank,
the sum of PageRank over all pages was the total number of pages on the web at that time,
so each page in this example would have an initial value of 1.
The damping factor (generally set to 0.85) is subtracted from 1 (and in some variations of
the algorithm, the result is divided by the number of documents (N) in the collection) and
this term is then added to the product of the damping factor and the sum of the incoming
PageRank scores.
That is,
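$PR(u) = (1 - d) + d \sum_{v \in B_u} \frac{PR(v)}{L(v)}$
where d is the damping factor (typically 0.85); in the variant mentioned above the first term is divided by N, i.e. (1 - d)/N.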
So any page’s PageRank is derived in large part from the PageRanks of other
pages. The damping factor adjusts the derived value downward.
CODE:
import numpy as np
import scipy as sc
import pandas as pd
from fractions import Fraction
def display_format(my_vector, my_decimal):
    # Round the rank vector for printing
    return np.round(my_vector.astype(float), decimals=my_decimal)

my_dp = Fraction(1, 3)
# Column-stochastic link matrix of the 3-page graph
Mat = np.matrix([[0, 0, 1],
                 [Fraction(1, 2), 0, 0],
                 [Fraction(1, 2), 1, 0]])
# Teleportation matrix: every entry is 1/3
Ex = np.zeros((3, 3))
Ex[:] = my_dp
beta = 0.7  # damping factor
# Damped transition ("Google") matrix
Al = beta * Mat + ((1 - beta) * Ex)
# Initial rank vector: 1/3 for each page
r = np.matrix([my_dp, my_dp, my_dp])
r = np.transpose(r)
previous_r = r
for i in range(1, 100):
    r = Al * r
    print(display_format(r, 3))
    # Stop once the rank vector no longer changes
    if (previous_r == r).all():
        break
    previous_r = r
print("Final:\n", display_format(r, 3))
print("sum", np.sum(r))
OUTPUT:
[[0.333]
[0.217]
[0.45 ]]
[[0.415]
[0.217]
[0.368]]
[[0.358]
[0.245]
[0.397]]
...
...
[[0.375]
[0.231]
[0.393]]
FINAL:
[[0.375]
[0.231]
[0.393]]
sum 0.9999999999999951
// Mapper
/**
* @method map
* This method takes the input as a text data type.
* Skipping the first five tokens, it takes the
* 6th token as temp_max and the
* 7th token as temp_min. Records with
* temp_max > 30 or temp_min < 15 are
* passed to the reducer.
*/
@Override
public void map(LongWritable arg0, Text Value, Context context)
throws IOException, InterruptedException {
String line = Value.toString();
String[] tokens = line.split("\\s+");
String date = tokens[0]; // first token: the year, used as the label
float temp_Max = Float.parseFloat(tokens[5]); // 6th token
float temp_Min = Float.parseFloat(tokens[6]); // 7th token
// if maximum temperature is
// greater than 30, it is a hot day
if (temp_Max > 30.0) {
// Hot day
context.write(new Text("The Day is Hot Day :" + date),
new
Text(String.valueOf(temp_Max)));
}
// if minimum temperature is
// less than 15, it is a cold day
if (temp_Min < 15.0) {
// Cold day
context.write(new Text("The Day is Cold Day :" + date),
new Text(String.valueOf(temp_Min)));
}
}
}
}
// Reducer
/**
* @method reduce
* This method takes the input as key and
* list of values pair from the mapper,
* it does aggregation based on keys and
* produces the final context.
*/
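// A minimal sketch of a reducer matching the description above (the class
// name and the pass-through behaviour are assumptions, not the original code)
public static class MaxMinReducer extends Reducer<Text, Text, Text, Text> {
public void reduce(Text key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {
// Forward each reported temperature for the day to the output
for (Text value : values) {
context.write(key, value);
}
}
}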
/**
* @method main
* This method is used for setting
* all the configuration properties.
* It acts as a driver for map-reduce
* code.
*/
}
}
Input:
Processinput.txt
1979 23 23 2 43 24 25 26 26 26 26 25 26 25
1980 26 27 28 28 28 30 31 31 31 30 30 30 29
1981 31 32 32 32 33 34 35 36 36 34 34 34 34
1984 39 38 39 39 39 41 42 43 40 39 38 38 40
1985 38 39 39 39 39 41 41 41 00 40 39 39 45
EXECUTION STEPS:
1. start-all.sh
2. jps
3. gedit MyMaxMin.java
OUTPUT: