BDA Lab

The document outlines the installation, configuration, and operation of Hadoop 2.8.0 on Windows 10, including steps for setting up Java, configuring Hadoop files, and running MapReduce programs. It also provides R programming examples for implementing various statistical methods like linear regression, logistic regression, and decision trees, as well as data visualization techniques. Additionally, it includes code snippets for clustering and support vector machine implementations.


AIM

To install, configure, and run Hadoop and HDFS.

The following software is required to install Hadoop 2.8.0 on Windows 10 (64-bit):

1) Download Hadoop 2.8.0


(Link: http://wwweu.apache.org/dist/hadoop/common/hadoop-2.8.0/hadoop-2.8.0.tar.gz
OR http://archive.apache.org/dist/hadoop/core/hadoop-2.8.0/hadoop-2.8.0.tar.gz)

2) Java JDK 1.8.0.zip


(Link: http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html)

Set up:

1) Check whether Java 1.8.0 is already installed on your system by running "javac -version" to see the Java version.

2) If Java is not installed on your system, first install Java under "C:\Java".

3) Extract hadoop-2.8.0.tar.gz (or Hadoop-2.8.0.zip) and place the contents under "C:\Hadoop-2.8.0".

4) Set the HADOOP_HOME environment variable on Windows 10.

5) Set the JAVA_HOME environment variable on Windows 10.

6) Next, add the Hadoop bin directory and the Java bin directory to the PATH variable; a sketch of the commands follows below.
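As a minimal sketch (assuming the install locations used above; adjust the paths to your own layout), the variables can be set from an administrative Command Prompt:

:: set Hadoop and Java home directories (paths are assumptions from the steps above)
setx HADOOP_HOME "C:\Hadoop-2.8.0"
setx JAVA_HOME "C:\Java"
:: append the Hadoop bin/sbin and Java bin directories to PATH
setx PATH "%PATH%;C:\Hadoop-2.8.0\bin;C:\Hadoop-2.8.0\sbin;C:\Java\bin"

Open a new Command Prompt afterwards so the updated variables take effect.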
Configuration:

a) Edit the file C:/Hadoop-2.8.0/etc/hadoop/core-site.xml, paste the XML paragraph below, and save the file.
<configuration>
<property>
<name>fs.defaultFS</name><value>hdfs://localhost:9000</value>
</property>
</configuration>

b) Rename "mapred-
site.xml.template" to "mapred-site.xml" and edit this file C:/Hadoop-
2.8.0/etc/hadoop/mapred-site.xml, paste below xml paragraph and save
this file.

<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>

c) Create a folder "data" under "C:\Hadoop-2.8.0"

1) Create a folder "datanode" under "C:\Hadoop-2.8.0\data"

2) Create a folder "namenode" under "C:\Hadoop-2.8.0\data"

d) Edit the file C:/Hadoop-2.8.0/etc/hadoop/hdfs-site.xml, paste the XML paragraph below, and save the file.
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>C:\hadoop-2.8.0\data\namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>C:\hadoop-2.8.0\data\datanode</value>
</property>
</configuration>
e) Edit the file C:/Hadoop-2.8.0/etc/hadoop/yarn-site.xml, paste the XML paragraph below, and save the file.
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>
f) Edit the file C:/Hadoop-2.8.0/etc/hadoop/hadoop-env.cmd: replace the line set JAVA_HOME=%JAVA_HOME% with set JAVA_HOME=C:\Java (the path under which JDK 1.8.0 was installed).
Hadoop Configuration
7) Download the file Hadoop Configuration.zip (Link: https://github.com/MuhammadBilalYar/HADOOP-INSTALLATION-ON-WINDOW-10/blob/master/Hadoop%20Configuration.zip)
8) Delete the bin folder at C:\Hadoop-2.8.0\bin and replace it with the bin folder from the Hadoop Configuration.zip just downloaded.
9) Open cmd and run the command "hdfs namenode -format" to format the NameNode.
Testing
10) Open cmd, change directory to "C:\Hadoop-2.8.0\sbin", and type "start-all.cmd" to start Hadoop.
11) Make sure these daemons are running: a) NameNode b) DataNode c) YARN ResourceManager d) YARN NodeManager
12) Open: http://localhost:8088 (YARN ResourceManager web UI)

13) Open: http://localhost:50070 (NameNode web UI)


AIM:

To implement a word count / word frequency program using MapReduce.


Procedure:
Prepare:
1. Download MapReduceClient.jar
(Link: https://github.com/MuhammadBilalYar/HADOOP-INSTALLATION-ON-WINDOW-10/blob/master/MapReduceClient.jar)
2. Download input_file.txt
(Link: https://github.com/MuhammadBilalYar/HADOOP-INSTALLATION-ON-WINDOW-10/blob/master/input_file.txt)
Place both files in "C:/"
Hadoop Operation:
1. Open cmd in administrative mode, move to "C:/Hadoop-2.8.0/sbin", and start the cluster:
start-all.cmd

2. Create an input directory in HDFS.

hadoop fs -mkdir /input_dir


3. Copy the input text file input_file.txt into the HDFS input directory (input_dir).
hadoop fs -put C:/input_file.txt /input_dir
4. Verify that input_file.txt is available in the HDFS input directory (input_dir).
hadoop fs -ls /input_dir/

5. Verify the content of the copied file.

hdfs dfs -cat /input_dir/input_file.txt


6. Run MapReduceClient.jar, providing the input and output directories.

hadoop jar C:/MapReduceClient.jar wordcount /input_dir /output_dir

7. Verify the content of the generated output file.

hdfs dfs -cat /output_dir/*


Some other useful commands
8) To leave safe mode:

hdfs dfsadmin -safemode leave


9) To delete a file from an HDFS directory:

hadoop fs -rm -r /input_dir/input_file.txt
10) To delete a directory from HDFS:

hadoop fs -rm -r /output_dir


AIM:

To implement a MapReduce (MR) program that processes a weather dataset.

Program :

AverageMapper.java

import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import java.io.IOException;

public class AverageMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    public static final int MISSING = 9999;

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        String year = line.substring(15, 19);        // year field of the record
        int temperature;
        if (line.charAt(87) == '+')                  // signed temperature field
            temperature = Integer.parseInt(line.substring(88, 92));
        else
            temperature = Integer.parseInt(line.substring(87, 92));
        String quality = line.substring(92, 93);
        if (temperature != MISSING && quality.matches("[01459]"))
            context.write(new Text(year), new IntWritable(temperature));
    }
}

AverageReducer.java

import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import java.io.IOException;

public class AverageReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;     // running total of temperatures for this year
        int count = 0;   // number of readings for this year
        for (IntWritable value : values) {
            sum += value.get();
            count += 1;
        }
        context.write(key, new IntWritable(sum / count));   // average temperature
    }
}

AverageDriver.java

import org.apache.hadoop.fs.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class AverageDriver {

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Please enter the input and output parameters");
            System.exit(-1);
        }
        Job job = new Job();
        job.setJarByClass(AverageDriver.class);
        job.setJobName("Average temperature");
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setMapperClass(AverageMapper.class);
        job.setReducerClass(AverageReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
AIM:

To write an R program implementing linear and logistic regression.

PROGRAM:

****SIMPLE LINEAR REGRESSION****


dataset = read.csv("data-marketing-budget-12mo.csv", header=T,
                   colClasses = c("numeric", "numeric", "numeric"))
head(dataset,5)

#/////Simple Regression/////
simple.fit = lm(Sales~Spend, data=dataset)
summary(simple.fit)
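As a quick visual check (a sketch; Sales and Spend are the column names used above), the fitted line can be overlaid on the raw data:

plot(dataset$Spend, dataset$Sales)   # scatter of spend vs. sales
abline(simple.fit, col="red")        # overlay the fitted regression line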

OUTPUT:
****Logistic Regression ****
# select some columns from mtcars
input <- mtcars[,c("am","cyl","hp","wt")]
print(head(input))

am.data = glm(formula = am ~ cyl+hp+wt, data = input, family = binomial)
print(summary(am.data))
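To turn the fitted model into probabilities (a sketch reusing the am.data model above), predict() with type = "response" returns the estimated probability of a manual transmission for each car:

probs <- predict(am.data, type = "response")   # fitted P(am = 1) per car
head(round(probs, 3))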

OUTPUT:
AIM

To implement a support vector machine (SVM) that finds the optimum hyperplane (a line in 2D, a plane in 3D) maximizing the margin between two classes.

Program

library(e1071)
plot(iris)
iris
plot(iris$Sepal.Length, iris$Sepal.Width, col=iris$Species)
plot(iris$Petal.Length, iris$Petal.Width, col=iris$Species)
s <- sample(150,100)
col <- c("Petal.Length", "Petal.Width", "Species")
iris_train <- iris[s,col]
iris_test <- iris[-s,col]
svmfit <- svm(Species ~ ., data = iris_train, kernel = "linear", cost = .1, scale = FALSE)
print(svmfit)
plot(svmfit, iris_train[,col])
tuned <- tune(svm, Species~., data = iris_train, kernel = "linear",
              ranges = list(cost=c(0.001,0.01,0.1,1,10,100)))
summary(tuned)
p <- predict(svmfit, iris_test[,col], type="class")
plot(p)
table(p, iris_test[,3])
mean(p == iris_test[,3])
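The tuning step only reports the best cost; as a sketch, the refitted best model can be extracted from the tune object (best.model is the e1071 accessor) and used for prediction:

best <- tuned$best.model                  # SVM refit at the best cost found by tune()
p.best <- predict(best, iris_test[,col])
mean(p.best == iris_test$Species)         # test accuracy of the tuned model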

OUTPUT:
AIM

To implement a decision tree, which represents a decision situation visually and shows all the factors in the analysis that are considered relevant to the decision.

PROGRAM

library(MASS)
library(rpart)
head(birthwt)
hist(birthwt$bwt)
table(birthwt$low)
cols <- c('low', 'race', 'smoke', 'ht', 'ui')
birthwt[cols] <- lapply(birthwt[cols], as.factor)
set.seed(1)
train <- sample(1:nrow(birthwt), 0.75 * nrow(birthwt))
birthwtTree <- rpart(low ~ . - bwt, data = birthwt[train, ], method = 'class')
plot(birthwtTree)
text(birthwtTree, pretty = 0)
summary(birthwtTree)
birthwtPred <- predict(birthwtTree, birthwt[-train, ], type = 'class')
table(birthwtPred, birthwt[-train, ]$low)
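The confusion table above can be collapsed into a single accuracy figure (a sketch reusing the objects above):

mean(birthwtPred == birthwt[-train, ]$low)   # proportion of test cases classified correctly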


OUTPUT:
AIM:

To write an R program implementing clustering techniques.

PROGRAM:

library(datasets)
head(iris)
library(ggplot2)
ggplot(iris, aes(Petal.Length, Petal.Width, color = Species)) + geom_point()
set.seed(20)
irisCluster <- kmeans(iris[, 3:4], 3, nstart = 20)
irisCluster
table(irisCluster$cluster, iris$Species)
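To compare the clusters with the true species visually (a sketch reusing the objects above), color the points by cluster assignment instead of species:

ggplot(iris, aes(Petal.Length, Petal.Width,
                 color = factor(irisCluster$cluster))) +
  geom_point()   # points colored by k-means cluster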
OUTPUT:
AIM

To implement data visualization, which provides an efficient graphical display for summarizing and reasoning about quantitative information.

1. Histogram

A histogram is basically a plot that breaks the data into bins (or breaks) and shows the frequency distribution of these bins. You can change the breaks and see the effect this has on the understandability of the visualization.

Note: We have used the par(mfrow=c(2,3)) command to fit multiple graphs on the same page for the sake of clarity (see the code below).

PROGRAM:
library(RColorBrewer)
data(VADeaths)
par(mfrow=c(2,3))
hist(VADeaths, breaks=10, col=brewer.pal(3,"Set3"), main="Set3 3 colors")
hist(VADeaths, breaks=3, col=brewer.pal(3,"Set2"), main="Set2 3 colors")
hist(VADeaths, breaks=7, col=brewer.pal(3,"Set1"), main="Set1 3 colors")
hist(VADeaths, breaks=2, col=brewer.pal(8,"Set3"), main="Set3 8 colors")
hist(VADeaths, col=brewer.pal(8,"Greys"), main="Greys 8 colors")
hist(VADeaths, col=brewer.pal(8,"Greens"), main="Greens 8 colors")
OUTPUT:
2.1. Line Chart

Below is the line chart showing the increase in air passengers over a given time period. Line charts are commonly preferred when analyzing a trend spread over a time period. Furthermore, a line plot is also suitable where we need to compare relative changes in quantities across some variable (like time). Below is the code:

PROGRAM:

data(AirPassengers)
plot(AirPassengers, type="l")   # Simple Line Plot

2.2. Bar Chart

Bar plots are suitable for showing comparisons between cumulative totals across several groups. Stacked plots are used for bar plots across various categories. Here's the code:
PROGRAM:
data("iris")

barplot(iris$Petal.Length) #Creating simple Bar Graph

barplot(iris$Sepal.Length,col=brewer.pal(3,"Set1"))
barplot(table(iris$Species,iris$Sepal.Length),col = brewer.pal(3,"Set1"))
#Stacked Plot
OUTPUT:

3. Box Plot

A box plot shows five statistically significant numbers: the minimum, the 25th percentile, the median, the 75th percentile, and the maximum. It is thus useful for visualizing the spread of the data and deriving inferences accordingly.

PROGRAM:

data(iris)
par(mfrow=c(2,2))
boxplot(iris$Sepal.Length, col="red")
boxplot(iris$Sepal.Length~iris$Species, col="red")
boxplot(iris$Sepal.Length~iris$Species, col=heat.colors(3))
boxplot(iris$Sepal.Length~iris$Species, col=topo.colors(3))
boxplot(iris$Petal.Length~iris$Species)   # Creating Box Plot between two variables
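The five numbers a box plot draws can also be printed directly (a sketch; fivenum() is base R):

fivenum(iris$Sepal.Length)   # minimum, lower hinge, median, upper hinge, maximum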
OUTPUT:

4. Scatter Plot (including 3D and other features)

Scatter plots help in visualizing data easily and in simple data inspection. Here's the code for a simple scatter plot and a multivariate scatter plot:
PROGRAM:

plot(x=iris$Petal.Length)   # Simple Scatter Plot
plot(x=iris$Petal.Length, y=iris$Species)   # Multivariate Scatter Plot
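For the 3D case mentioned in the heading, a minimal sketch (assuming the scatterplot3d package is installed) plots three iris measurements at once:

library(scatterplot3d)
scatterplot3d(iris$Sepal.Length, iris$Sepal.Width, iris$Petal.Length,
              color = as.numeric(iris$Species))   # one color per species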

OUTPUT:

5. Heat Map

One of the most innovative data visualizations in R, the heat map uses color intensity to visualize relationships between multiple variables. The result is an attractive 2D image that is easy to interpret. As a basic example, a heat map can highlight the popularity of competing items by ranking them according to their original market launch date, breaking this down further with sales statistics and figures over the course of time.
PROGRAM:

# simulate a dataset of 10 points
x <- rnorm(10, mean=rep(1:5, each=2), sd=0.7)
y <- rnorm(10, mean=rep(c(1,9), each=5), sd=0.1)
dataFrame <- data.frame(x=x, y=y)
set.seed(143)
dataMatrix <- as.matrix(dataFrame)[sample(1:10),]   # convert to class 'matrix', then shuffle the rows
heatmap(dataMatrix)   # visualize hierarchical clustering via a heatmap
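To show the color matrix alone without the clustering dendrograms (a sketch; Rowv and Colv are standard heatmap() arguments):

heatmap(dataMatrix, Rowv = NA, Colv = NA)   # same matrix, no row/column reordering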

OUTPUT:
6. Correlogram

Correlated data is best visualized through corrplot. The 2D format is similar to a heat map, but it highlights statistics that are directly related. Most correlograms highlight the amount of correlation between datasets at various points in time. Comparing sales data between different months or years is a basic example.

PROGRAM:

#data("mtcars")

corr_matrix <‐

cor(mtcars)
# with circles corrplot(corr_matrix)

# with numbers and lower corrplot(corr_matrix,method = 'number',type = "lower")
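Since the point of a correlogram is to surface groups of related variables, ordering the matrix by hierarchical clustering often helps (a sketch; order = "hclust" is a standard corrplot option):

corrplot(corr_matrix, order = "hclust")   # variables reordered so correlated ones sit together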

OUTPUT:

7. Area Chart

Area charts express continuity between different variables or data sets. They are akin to the traditional line chart you know from grade school and are used in a similar fashion. Most area charts highlight trends and their evolution over the course of time, making them highly effective when trying to expose underlying trends, whether positive or negative.

PROGRAM:

data("airquality")

#dataset used

airquality %>%

group_by(Day)%>%

summarise(mean_wind=mean(

Wind)) %>% ggplot() +

geom_area(aes(x = Day, y =

mean_wind)) + labs(title = "Area

Chart of Average Wind per Day",

subtitle = "using airquality data", y =

"Mean Wind")
OUTPUT:
1) To use MongoDB with R, first download and install MongoDB. Next, start MongoDB. We can start MongoDB like so:

mongod

2) Inserting data

Let's insert the crimes data from data.gov into MongoDB. The dataset reflects reported incidents of crime (with the exception of murders, where data exists for each victim) that have occurred in the City of Chicago since 2001.

library(ggplot2)
library(dplyr)
library(maps)
library(ggmap)
library(mongolite)
library(lubridate)
library(gridExtra)

crimes = data.table::fread("Crimes_2001_to_present.csv")
names(crimes)

OUTPUT:
'ID' 'Case Number' 'Date' 'Block' 'IUCR' 'Primary Type' 'Description' 'Location Description' 'Arrest' 'Domestic' 'Beat' 'District' 'Ward' 'Community Area' 'FBI Code' 'X Coordinate' 'Y Coordinate' 'Year' 'Updated On' 'Latitude' 'Longitude' 'Location'

3) Let’s remove spaces in the column names to avoid any problems when
we query it from MongoDB.

names(crimes) = gsub(" ", "", names(crimes))
names(crimes)

4) Let's use the insert function from the mongolite package to insert rows into a collection in MongoDB. Let's create a database called Chicago and call the collection crimes.

my_collection = mongo(collection = "crimes", db = "Chicago")   # create connection, database and collection
my_collection$insert(crimes)

OUTPUT:
'ID' 'CaseNumber''Date' 'Block''IUCR' 'PrimaryType' 'Description'
'LocationDescription' 'Arrest' 'Domestic' 'Beat' 'District' 'Ward' 'CommunityArea'
'FBICode' 'XCoordinate' 'YCoordinate' 'Year' 'UpdatedOn' 'Latitude' 'Longitude'
'Location'
5) Let’s check if we have inserted the “crimes” data.

my_collection$count()

OUTPUT:

6261148

We see that the collection has 6261148 records.

6) First, let's look at what the data looks like by displaying one record:

my_collection$iterate()$one()

OUTPUT:
$ID
1454164
$CaseNumber
'G185744'
$Date
'04/01/2001 06:00:00 PM'
$Block
'049XX N MENARD AV'
$IUCR
'0910'
$PrimaryType
'MOTOR VEHICLE THEFT'
$Description
'AUTOMOBILE'
$LocationDescription
'STREET'
$Arrest
'false'
$Domestic
'false'
$Beat
1622
$District
16
$FBICode
'07'
$XCoordinate
1136545
$YCoordinate
1932203
$Year
2001
$UpdatedOn
'08/17/2015 03:03:40 PM'
$Latitude
41.970129962
$Longitude
-87.773302309
$Location
'(41.970129962, -87.773302309)'

7) How many distinct "PrimaryType" values do we have?

length(my_collection$distinct("PrimaryType"))
OUTPUT:

35
As shown above, there are 35 different crime primary types in the
database. We will see the patterns of the most common crime types below.
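To list the types themselves (a sketch reusing the same distinct() call):

head(my_collection$distinct("PrimaryType"), 10)   # first 10 of the 35 distinct crime types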

8) Now, let's see how many domestic assaults there are in the collection.

my_collection$count('{"PrimaryType":"ASSAULT", "Domestic":"true"}')

OUTPUT:

8247
9) To get the filtered data, we can also retrieve only the columns of interest.

query1 = my_collection$find('{"PrimaryType":"ASSAULT", "Domestic":"true"}')
query2 = my_collection$find('{"PrimaryType":"ASSAULT", "Domestic":"true"}',
                            fields = '{"_id":0, "PrimaryType":1, "Domestic":1}')
ncol(query1)   # with all the columns
ncol(query2)   # only the selected columns

OUTPUT:

22
2

10) To find out "Where do most crimes take place?", use the following command.

my_collection$aggregate('[{"$group": {"_id":"$LocationDescription", "Count": {"$sum":1}}}]') %>%
  na.omit() %>%
  arrange(desc(Count)) %>% head(10) %>%
  ggplot(aes(x=reorder(`_id`, Count), y=Count)) +
  geom_bar(stat="identity", color='skyblue', fill='#b35900') +
  geom_text(aes(label = Count), color = "blue") +
  coord_flip() +
  xlab("Location description")

11) If loading the entire dataset does not slow down our analysis, we can use data.table or dplyr, but when dealing with big data, MongoDB can give us a performance boost because the whole dataset is not loaded into memory. We can reproduce the above plot without using MongoDB, like so:

crimes %>% group_by(`LocationDescription`) %>%
  summarise(Total = n()) %>%
  arrange(desc(Total)) %>% head(10) %>%
  ggplot(aes(x=reorder(`LocationDescription`, Total), y=Total)) +
  geom_bar(stat="identity", color='skyblue', fill='#b35900') +
  geom_text(aes(label = Total), color = "blue") +
  coord_flip() +
  xlab("Location Description")
