BDA Lab
Set up:
2) If Java is not installed on your system, first install Java under
"C:\JAVA".
Java setup
6) Next, set the Hadoop bin directory path and the Java bin directory path.
Configuration :
b) Rename "mapred-site.xml.template" to "mapred-site.xml", then edit
C:/Hadoop-2.8.0/etc/hadoop/mapred-site.xml: paste the XML below and save
the file.
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
hadoop fs -rm -r /input_dir/input_file.txt
10) To delete a directory from an HDFS directory
Program :
AverageMapper.java
import org.apache.hadoop.io.*;
int temperature;
if (line.charAt(87) == '+')
    temperature = Integer.parseInt(line.substring(88, 92));
else
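The sign check in the mapper fragment above can be tried on its own, without Hadoop. The following is a minimal sketch, assuming the sign character sits at index 87 and the four digits at indices 88 to 91, as the fragment suggests. The class name `TemperatureField`, the filler helper, and the handling of the non-`+` branch are assumptions made for illustration, since the original listing is cut off at that point.

```java
// Sketch only: parses the signed temperature field of one fixed-width record.
// Assumption: sign at index 87, digits at 88..91 (taken from the mapper fragment).
final class TemperatureField {

    static int parse(String line) {
        if (line.charAt(87) == '+') {
            // Skip the '+' sign (older Java releases did not accept it in parseInt).
            return Integer.parseInt(line.substring(88, 92));
        }
        // Assumed branch: keep the sign character so '-' gives a negative value.
        return Integer.parseInt(line.substring(87, 92));
    }

    // Hypothetical helper: builds 87 filler characters in front of the field.
    static String filler() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 87; i++) {
            sb.append('0');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(parse(filler() + "+0022")); // prints 22
        System.out.println(parse(filler() + "-0011")); // prints -11
    }
}
```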
AverageReducer.java
import org.apache.hadoop.mapreduce.*;
import java.io.IOException;
public class AverageReducer extends Reducer<Text, IntWritable, Text, IntWritable>
InterruptedException
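int max_temp = 0; // running sum of temperatures, despite the name
int count = 0;
for (IntWritable value : values) {
    max_temp += value.get();
    count += 1; // must be inside the loop so every value is counted
}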
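The reducer loop above accumulates a sum and a count, and the average is then sum divided by count. A minimal sketch of that arithmetic in plain Java, without the Hadoop types, is given below. The class name `AverageSketch` and the use of `List<Integer>` in place of `Iterable<IntWritable>` are assumptions for illustration, and integer division is assumed because the reducer's output value type is `IntWritable`.

```java
import java.util.Arrays;
import java.util.List;

// Sketch only: the sum/count arithmetic of the reducer, without Hadoop types.
final class AverageSketch {

    static int average(List<Integer> values) {
        if (values.isEmpty()) {
            throw new IllegalArgumentException("at least one value is required");
        }
        int sum = 0;
        int count = 0;
        for (int value : values) {
            sum += value;
            count += 1;
        }
        return sum / count; // integer division: the fractional part is dropped
    }

    public static void main(String[] args) {
        System.out.println(average(Arrays.asList(10, 20, 33))); // prints 21
    }
}
```

If a fractional average is needed, the output value type would have to change (for example to a floating-point writable), which is outside what the listing shows.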
AverageDriver.java
import org.apache.hadoop.io.*;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
{
if (args.length != 2)
parameters"); System.exit(-1);
}
job.setJarByClass(AverageDriver.class);
job.setJobName("Average temperature");
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapperClass(AverageMapper.class);
job.setReducerClass(AverageReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
System.exit(job.waitForCompletion(true)?0:1);
}
AIM:
PROGRAM:
#/////Simple Regression/////
simple.fit = lm(Sales~Spend, data=dataset)
summary(simple.fit)
OUTPUT:
****Logistic Regression ****
# select some columns from mtcars
input <- mtcars[, c("am","cyl","hp","wt")]
print(head(input))
print(summary(am.data))
OUTPUT:
AIM
plot(iris$Sepal.Length, iris$Sepal.Width, col=iris$Species)
plot(iris$Petal.Length,
sample(150,100)
iris_test<- iris[-s,col]
summary(tuned)
p <- predict(svmfit, iris_test[,col], type="class")
plot(p)
table(p, iris_test[,3])
mean(p == iris_test[,3])
OUTPUT:
AIM
PROGRAM
hist(birthwt$bwt)
table(birthwt$low)
lapply(birthwt[cols], as.factor)
set.seed(1)
'class')
plot(birthwtTree)
text(birthwtTree, pretty = 0)
summary(birthwtTree)
PROGRAM:
library(datasets)
head(iris)
library(ggplot2)
geom_point()
set.seed(20)
table(irisCluster$cluster, iris$Species)
OUTPUT:
AIM
1. Histogram
A histogram is a plot that divides the data into bins (or breaks) and
shows how many observations fall into each bin. You can change the number
of breaks and see how that affects the readability of the visualization.
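To make the idea of bins concrete, here is a minimal sketch in plain Java of how values can be counted into equal-width bins. It is not the algorithm R's hist uses (R treats breaks as a suggestion and closes intervals on the right by default); the class name `HistogramBins`, the left-closed bins, and the sample data are assumptions for illustration.

```java
import java.util.Arrays;

// Sketch only: counts values into equal-width bins over [min, max].
// Assumption: bins are left-closed, and a value equal to max goes in the last bin.
final class HistogramBins {

    static int[] binCounts(double[] data, double min, double max, int bins) {
        int[] counts = new int[bins];
        double width = (max - min) / bins;
        for (double v : data) {
            if (v < min || v > max) {
                continue; // ignore values outside the plotted range
            }
            int index = (int) ((v - min) / width);
            if (index == bins) {
                index = bins - 1; // v == max falls into the last bin
            }
            counts[index]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        double[] data = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}; // made-up sample values
        System.out.println(Arrays.toString(binCounts(data, 0, 10, 2))); // [4, 6]
        System.out.println(Arrays.toString(binCounts(data, 0, 10, 5))); // [1, 2, 2, 2, 3]
    }
}
```

The same data gives a coarse picture with two bins and a finer one with five, which is the effect the breaks argument controls.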
PROGRAM:
library(RColorBrewer)
data(VADeaths)
par(mfrow=c(2,3))
hist(VADeaths, breaks=10, col=brewer.pal(3,"Set3"), main="Set3 3 colors")
hist(VADeaths, breaks=3, col=brewer.pal(3,"Set2"), main="Set2 3 colors")
hist(VADeaths, breaks=7, col=brewer.pal(3,"Set1"), main="Set1 3 colors")
hist(VADeaths, breaks=2, col=brewer.pal(8,"Set3"), main="Set3 8 colors")
hist(VADeaths,col=brewer.pal(8,"Greys"),main="Greys 8 colors")
hist(VADeaths,col=brewer.pal(8,"Greens"),main="Greens 8 colors")
OUTPUT:
2.1. Line Chart
Below is a line chart showing the increase in air passengers over a given
time period. Line charts are commonly preferred when we want to analyse
a trend over time. A line plot is also suitable when we need to compare
relative changes in quantities across some variable (like time). Below is
the code:
PROGRAM:
data(AirPassengers)
plot(AirPassengers,type="l")
Bar plots are suitable for comparing cumulative totals across several
groups. Stacked plots are bar plots that break each bar down by category.
Here’s the code:
PROGRAM:
data("iris")
barplot(iris$Sepal.Length,col=brewer.pal(3,"Set1"))
barplot(table(iris$Species,iris$Sepal.Length),col = brewer.pal(3,"Set1")) # Stacked plot
OUTPUT:
PROGRAM:
data(iris)
par(mfrow=c(2,2))
boxplot(iris$Sepal.Length,col="red")
boxplot(iris$Sepal.Length~iris$Species,col="red")
boxplot(iris$Sepal.Length~iris$Species,col=heat.colors(3))
boxplot(iris$Sepal.Length~iris$Species,col=topo.colors(3))
boxplot(iris$Petal.Length~iris$Species) # Creating a box plot between two variables
OUTPUT:
Plot
OUTPUT:
5. Heat Map
One of the most innovative data visualizations in R, the heat map
emphasizes color intensity to visualize relationships between multiple
variables. The result is an attractive 2D image that is easy to interpret.
As a basic example, a heat map highlights the popularity of competing items
by ranking them according to their original market launch date. It breaks
this down further by providing sales statistics and figures over time.
PROGRAM:
y <- rnorm(10, mean=rep(c(1,9),each=5), sd=0.1)
dataFrame <- data.frame(x=x, y=y)
set.seed(143)
OUTPUT:
6. Correlogram
Correlated data is best visualized through corrplot. The 2D format is
similar to a heat map, but it highlights statistics that are directly
related. Most correlograms highlight the amount of correlation between
datasets at various points in time. Comparing sales data between different
months or years is a basic example.
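Each cell of a correlogram is one correlation coefficient between two variables. As a minimal sketch of what such a coefficient is, the plain Java below computes the Pearson correlation of two equally long series. The class name `PearsonCorrelation` and the sample values are assumptions for illustration, not part of the lab program.

```java
// Sketch only: Pearson correlation coefficient of two equally long series.
final class PearsonCorrelation {

    static double correlation(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need two series of equal length >= 2");
        }
        int n = x.length;
        double meanX = 0;
        double meanY = 0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        double covariance = 0;
        double varianceX = 0;
        double varianceY = 0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            covariance += dx * dy;
            varianceX += dx * dx;
            varianceY += dy * dy;
        }
        // The 1/(n-1) factors cancel, so the raw sums can be used directly.
        return covariance / Math.sqrt(varianceX * varianceY);
    }

    public static void main(String[] args) {
        double[] a = {1, 2, 3, 4};      // made-up sample values
        double[] b = {2, 4, 6, 8};
        System.out.println(correlation(a, b));
    }
}
```

A value near 1 means the two series rise together, a value near -1 means one falls as the other rises, and a value near 0 means no linear relationship.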
PROGRAM:
# data("mtcars")
corr_matrix <- cor(mtcars)
# with circles
corrplot(corr_matrix)
OUTPUT:
PROGRAM:
data("airquality")
#dataset used
airquality %>%
group_by(Day)%>%
summarise(mean_wind=mean(
geom_area(aes(x = Day, y =
"Mean Wind")
OUTPUT:
1) To use MongoDB with R, we first have to download and install
MongoDB. Next, start MongoDB. We can start MongoDB like so:
mongod
2) Inserting data
Let’s insert the crimes data from data.gov to MongoDB. The dataset
reflects reported incidents of crime (with the exception of murders where
data exists for each victim) that occurred in the City of Chicago since
2001.
library(ggplot2)
library(dplyr)
library(maps)
library(ggmap)
library(mongolite)
library(lubridate)
library(gridExtra)
crimes=data.table::fread("Crimes_2001_to_pr
OUTPUT:
'ID' 'Case Number' 'Date' 'Block' 'IUCR' 'Primary Type' 'Description'
'Location Description' 'Arrest' 'Domestic' 'Beat' 'District' 'Ward'
'Community Area' 'FBI Code' 'X Coordinate' 'Y Coordinate' 'Year'
'Updated On' 'Latitude' 'Longitude' 'Location'
3) Let’s remove spaces in the column names to avoid any problems when
we query it from MongoDB.
names(crimes) = gsub(" ", "", names(crimes))
names(crimes)
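The renaming step above simply deletes every space from each column name. A minimal sketch of the same transformation in plain Java is shown below; the class name `ColumnNames` is an assumption for illustration, and the sample names are taken from the column list printed earlier.

```java
import java.util.Arrays;

// Sketch only: removes all spaces from each column name, like gsub(" ", "", ...).
final class ColumnNames {

    static String[] removeSpaces(String[] names) {
        String[] cleaned = new String[names.length];
        for (int i = 0; i < names.length; i++) {
            // String.replace substitutes every occurrence, not just the first one.
            cleaned[i] = names[i].replace(" ", "");
        }
        return cleaned;
    }

    public static void main(String[] args) {
        String[] names = {"Case Number", "Primary Type", "ID"};
        System.out.println(Arrays.toString(removeSpaces(names)));
        // prints [CaseNumber, PrimaryType, ID]
    }
}
```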
4) Let’s use the insert function from the mongolite package to insert rows
into a collection in MongoDB. Let’s create a database called Chicago and
call the collection crimes.
my_collection$insert(crimes)
OUTPUT:
'ID' 'CaseNumber' 'Date' 'Block' 'IUCR' 'PrimaryType' 'Description'
'LocationDescription' 'Arrest' 'Domestic' 'Beat' 'District' 'Ward' 'CommunityArea'
'FBICode' 'XCoordinate' 'YCoordinate' 'Year' 'UpdatedOn' 'Latitude' 'Longitude'
'Location'
5) Let’s check if we have inserted the “crimes” data.
my_collection$count()
OUTPUT:
6261148
OUTPUT:
$ID
1454164
$Case Number
' G185744'
$Date
' 04/01/2001 06:00:00 PM'
$Block
' 049XX N MENARD AV'
$IUCR
' 0910'
$Primary Type
' MOTOR VEHICLE THEFT'
$Description
' AUTOMOBILE'
$Location Description
' STREET'
$Arrest
' false'
$Domestic
' false'
$Beat
1622
$District
16
$FBICode
' 07'
$XCoordinate
1136545
$YCoordinate
1932203
$Year
2001
$Updated On
' 08/17/2015 03:03:40 PM'
$Latitude
41.970129962
$Longitude
-87.773302309
$Location
'(41.970129962, -87.773302309)'
length(my_collection$distinct("PrimaryType"))
OUTPUT:
35
As shown above, there are 35 different crime primary types in the
database. We will see the patterns of the most common crime types below.
8) Now, let’s see how many domestic assaults there are in the
collection.
my_collection$count('{"PrimaryType":"ASSAULT",
OUTPUT:
8247
9) To get the filtered data and we can also retrieve only the columns
ncol(query1) # with all the columns
ncol(query2) # only the selected columns
OUTPUT:
22
2
10) To find out “Where do most crimes take place?” use the
{"_id":"$LocationDescription", "Count":
{"$sum":1}}}]') %>% na.omit() %>% arrange(desc(Count)) %>% head(10) %>%
ggplot(aes(x=reorder(`_id`,Count), y=Count)) +
geom_bar(stat="identity", color='skyblue', fill='#b35900') +
geom_text(aes(label = Count), color = "blue") + coord_flip() +
xlab("Location description")
11) If loading the entire dataset does not slow down our analysis, we can
use data.table or dplyr. When dealing with big data, however, MongoDB can
give us a performance boost because the whole dataset is not loaded into
memory. We can reproduce the above plot without using MongoDB, like so:
crimes %>% group_by(`LocationDescription`) %>%
summarise(Total=n()) %>% arrange(desc(Total)) %>%
head(10) %>%
ggplot(aes(x=reorder(`LocationDescription`,Total), y=Total)) +
geom_bar(stat="identity", color='skyblue', fill='#b35900') +
xlab("Location Description")
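Both versions of the plot rest on the same computation: group the rows by location description, count each group, sort by count in descending order, and keep the first ten. A minimal in-memory sketch of that computation in plain Java follows. The class name `TopLocations`, the tie-break by name, and the sample values are assumptions for illustration; missing values are dropped, in the spirit of na.omit().

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.function.Function;
import java.util.stream.Collectors;

// Sketch only: group by value, count, sort by count descending, keep the top n.
final class TopLocations {

    static List<Map.Entry<String, Long>> topN(List<String> locations, int n) {
        Map<String, Long> counts = locations.stream()
                .filter(Objects::nonNull) // drop missing values
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));

        return counts.entrySet().stream()
                .sorted((a, b) -> {
                    int byCount = Long.compare(b.getValue(), a.getValue());
                    // Assumed tie-break: alphabetical order of the location name.
                    return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
                })
                .limit(n)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList( // made-up sample rows
                "STREET", "RESIDENCE", "STREET", "APARTMENT", "STREET", "RESIDENCE");
        for (Map.Entry<String, Long> entry : topN(sample, 2)) {
            System.out.println(entry.getKey() + " " + entry.getValue());
        }
    }
}
```

Like the dplyr version, this sketch holds every row in memory, so it illustrates the logic rather than the memory advantage of running the aggregation inside MongoDB.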