DA Lab Manual Final
4. Download Hadoop-2.6.x:
a. Put the extracted Hadoop-2.6.x files into the D drive.
b. Download “hadoop-common-2.6.0-bin-master” and paste all of its files into the
“bin” folder of Hadoop-2.6.x.
c. Create a “data” folder inside Hadoop-2.6.x, and inside it create two more folders
named “data” and “name.”
d. Create a folder to store temporary data during execution of a project, such as
“D:\hadoop\temp.”
e. Create a log folder, such as “D:\hadoop\userlog”
f. Go to Hadoop-2.6.x -> etc -> hadoop and edit the following four files:
i. core-site.xml
ii. hdfs-site.xml
iii. mapred-site.xml
iv. yarn-site.xml
core-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either
express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file. -->
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>D:\hadoop\temp</value>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:50071</value>
</property>
</configuration>
hdfs-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either
express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/hadoop-2.6.0/data/name</value>
<final>true</final>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/hadoop-2.6.0/data/data</value>
<final>true</final>
</property>
</configuration>
mapred-site.xml
<?xml version="1.0"?>
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
<property>
<name>mapreduce.application.classpath</name>
<value>/hadoop-2.6.0/share/hadoop/mapreduce/*,
/hadoop-2.6.0/share/hadoop/mapreduce/lib/*,
/hadoop-2.6.0/share/hadoop/common/*,
/hadoop-2.6.0/share/hadoop/common/lib/*,
/hadoop-2.6.0/share/hadoop/yarn/*,
/hadoop-2.6.0/share/hadoop/yarn/lib/*,
/hadoop-2.6.0/share/hadoop/hdfs/*,
/hadoop-2.6.0/share/hadoop/hdfs/lib/*,
</value>
</property>
</configuration>
yarn-site.xml
<?xml version="1.0"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.nodemanager.log-dirs</name>
<value>D:\hadoop\userlog</value><final>true</final>
</property>
<property>
<name>yarn.nodemanager.local-dirs</name>
<value>D:\hadoop\temp\nm-local-dir</value>
</property>
<property>
<name>yarn.nodemanager.delete.debug-delay-sec</name>
<value>600</value>
</property>
<property>
<name>yarn.application.classpath</name>
<value>/hadoop-2.6.0/,/hadoop-2.6.0/share/hadoop/common/*,/hadoop-2.6.0/share/hadoop/common/lib/*,/hadoop-2.6.0/share/hadoop/hdfs/*,/hadoop-2.6.0/share/hadoop/hdfs/lib/*,/hadoop-2.6.0/share/hadoop/mapreduce/*,/hadoop-2.6.0/share/hadoop/mapreduce/lib/*,/hadoop-2.6.0/share/hadoop/yarn/*,/hadoop-2.6.0/share/hadoop/yarn/lib/*</value>
</property>
</configuration>
g. Go to the location “Hadoop-2.6.0 -> etc -> hadoop” and edit “hadoop-env.cmd” by adding the line
set JAVA_HOME=C:\java\jdk1.8.0_91
(adjust this path to the JDK installed on your machine).
h. Set environment variables: My Computer -> Properties -> Advanced system
settings -> Advanced -> Environment Variables
i. User variables:
● Variable: HADOOP_HOME
● Value: D:\hadoop-2.6.0
ii. System variables:
● Variable: Path
● Value: D:\hadoop-2.6.0\bin
D:\hadoop-2.6.0\sbin
D:\hadoop-2.6.0\share\hadoop\common\*
D:\hadoop-2.6.0\share\hadoop\hdfs
D:\hadoop-2.6.0\share\hadoop\hdfs\lib\*
D:\hadoop-2.6.0\share\hadoop\hdfs\*
D:\hadoop-2.6.0\share\hadoop\yarn\lib\*
D:\hadoop-2.6.0\share\hadoop\yarn\*
D:\hadoop-2.6.0\share\hadoop\mapreduce\lib\*
D:\hadoop-2.6.0\share\hadoop\mapreduce\*
D:\hadoop-2.6.0\share\hadoop\common\lib\*
i. Check the installation from the Command Prompt.
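For example (with the paths above in place), open a new Command Prompt and run:
hadoop version
hdfs namenode -format
start-dfs.cmd
start-yarn.cmd
jps
hadoop version confirms that HADOOP_HOME and the Path entries are picked up, the format command initialises the “name” folder, the two .cmd scripts start the HDFS and YARN daemons, and jps should then list NameNode, DataNode, ResourceManager, and NodeManager.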
Result:-
Thus the Hadoop environment was installed and configured.
EX.2 Implementation of word count/frequency using MapReduce
Aim:-
To implement a word count program using MapReduce.
Procedure:-
Step 1: Write the Mapper, Reducer, and driver classes for word count (a completed sketch follows the import listing below).
Step 2: Compile the classes and package them into a jar file.
Step 3: Copy the input text file into HDFS.
Step 4: Run the jar with the hadoop jar command and inspect the output directory.
Program:-
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
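The listing above stops at the imports. The classes that complete the program, following the standard Hadoop WordCount example (the class names are illustrative), are sketched below:
public class WordCount {

  // Mapper: splits each input line into tokens and emits (word, 1) pairs.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sums the counts received for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Driver: configures the job and points it at the input and output paths.
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
Package the class into a jar and run it with, for example, hadoop jar wordcount.jar WordCount /input /output.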
Result:-
Thus the word count program was executed using the Hadoop environment.
EX.3 Implementation of MR program using Weather dataset
Aim:-
To write a program that finds the maximum temperature per year from a sensor
temperature dataset using the Hadoop MapReduce framework.
Procedure:-
Implement the Mapper, Reducer, and driver classes for finding the maximum temperature in Java (a completed sketch follows the partial listing below).
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
//Mapper class
class MaxTemperatureMapper
extends Mapper<LongWritable, Text, Text, IntWritable> {
@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
//Reducer class
class MaxTemperatureReducer
extends Reducer<Text, IntWritable, Text, IntWritable> {
@Override
public void reduce(Text key, Iterable<IntWritable> values,
Context context)
throws IOException, InterruptedException {
//Driver Class
job.setMapperClass(MaxTemperatureMapper.class);
job.setReducerClass(MaxTemperatureReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.submit();
}
}
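The method bodies and the Job construction are omitted above. A sketch that completes them is given below; it assumes the fixed-width NCDC record layout used in the classic MaxTemperature example, so the substring offsets are assumptions that must match the actual input file.
// Mapper body: parse the year and the air temperature from one fixed-width record.
public void map(LongWritable key, Text value, Context context)
    throws IOException, InterruptedException {
  String line = value.toString();
  String year = line.substring(15, 19);                 // assumed year columns
  int airTemperature;
  if (line.charAt(87) == '+') {                         // assumed sign column
    airTemperature = Integer.parseInt(line.substring(88, 92));
  } else {
    airTemperature = Integer.parseInt(line.substring(87, 92));
  }
  String quality = line.substring(92, 93);
  if (airTemperature != 9999 && quality.matches("[01459]")) {  // skip missing or bad readings
    context.write(new Text(year), new IntWritable(airTemperature));
  }
}

// Reducer body: keep the largest temperature seen for each year.
public void reduce(Text key, Iterable<IntWritable> values, Context context)
    throws IOException, InterruptedException {
  int maxValue = Integer.MIN_VALUE;
  for (IntWritable value : values) {
    maxValue = Math.max(maxValue, value.get());
  }
  context.write(key, new IntWritable(maxValue));
}

// Driver class: the fragment above shows the same class settings followed by job.submit().
public class MaxTemperature {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance();
    job.setJarByClass(MaxTemperature.class);
    job.setJobName("Max temperature");
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.setMapperClass(MaxTemperatureMapper.class);
    job.setReducerClass(MaxTemperatureReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}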
Input
Output
The output text contains the year and the maximum temperature recorded in that year, e.g. 1902 33.
Result:-
Thus the maximum temperature per year in the weather dataset was obtained using MapReduce.
EX.4 INSTALL, CONFIGURE AND RUN SPARK
Aim:-
To install and configure Spark on a standalone machine.
Procedure:-
Step 1: Install Java 8
Apache Spark requires Java 8. You can check to see if Java is installed using the
command prompt.
Open the command line by clicking Start > type cmd > click Command Prompt.
Type the following command in the command prompt:
java -version
If Java is installed, the command responds with the installed Java version details.
Step 2: Install Python
6. You can leave all the boxes checked at this step, or uncheck the options you do
not want.
7. Click Next.
8. Select the box Install for all users and leave other boxes as they are.
9. Under Customize install location, click Browse and navigate to the C drive. Add a
new folder and name it Python.
10. Select that folder and click OK.
Step 3: Download Apache Spark
4. A page with a list of mirrors loads where you can see different servers to download
from. Pick any from the list and save the file to your Downloads folder.
Step 4: Verify Spark Software File
1. Verify the integrity of your download by checking the checksum of the file. This
ensures you are working with unaltered, uncorrupted software.
2. Navigate back to the Spark Download page and open the Checksum link, preferably
in a new tab.
3. Next, open a command line and enter the following command:
certutil -hashfile c:\users\username\Downloads\spark-2.4.5-bin-hadoop2.7.tgz
SHA512
4. Change the username to your username. The system displays a long alphanumeric
code, along with the message Certutil: -hashfile completed successfully.
5. Compare the code to the one you opened in a new browser tab. If they match, your
download file is uncorrupted.
Step 5: Install Apache Spark
Installing Apache Spark involves extracting the downloaded file to the desired location.
1. Create a new folder named Spark in the root of your C: drive. From a command line,
enter the following:
cd \
mkdir Spark
2. In Explorer, locate the Spark file you downloaded.
3. Right-click the file and extract it to C:\Spark using the tool you have on your system
(e.g., 7-Zip).
4. Now, your C:\Spark folder has a new folder spark-2.4.5-bin-hadoop2.7 with the
necessary files inside.
Step 6: Add winutils.exe File
Download the winutils.exe file that matches the Hadoop version underlying the Spark
package you downloaded.
1. Navigate to this URL https://github.com/cdarlint/winutils and, inside the bin folder,
locate winutils.exe, and click it.
2. Find the Download button on the right side to download the file.
3. Now, create a new folder hadoop on C: with a bin folder inside it, using Windows
Explorer or the Command Prompt.
4. Copy the winutils.exe file from the Downloads folder to C:\hadoop\bin.
Step 7: Configure Environment Variables
Configuring environment variables in Windows adds the Spark and Hadoop locations to
your system PATH. It allows you to run the Spark shell directly from a command prompt
window.
1. Click Start and type environment.
2. Select the result labeled Edit the system environment variables.
3. A System Properties dialog box appears. In the lower-right corner, click Environment
Variables and then click New in the next window.
7. You should see a box with entries on the left. On the right, click New.
8. The system highlights a new line. Enter the path to the Spark folder
C:\Spark\spark-2.4.5-bin-hadoop2.7\bin. We recommend using %SPARK_HOME%\bin
to avoid possible issues with the path.
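With the variables in place, you can verify the setup by opening a new command-prompt window and starting the Spark shell:
spark-shell
If the configuration is correct, the Spark banner and a scala> prompt appear.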
7. To exit Spark and close the Scala shell, press ctrl-d in the command-prompt window.
Result:-
Thus, Spark was installed and configured successfully.
EX.5 IMPLEMENT WORD COUNT / FREQUENCY PROGRAMS USING SPARK
Aim:-
To implement a word count / frequency program using Spark.
Program:-
package org.apache.spark.examples;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
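Only the imports are reproduced above. A minimal sketch of the rest of the class, patterned on Spark's standard JavaWordCount example (the class name, the local master setting, and the way the result is printed are illustrative), is:
public final class JavaWordCount {
  private static final Pattern SPACE = Pattern.compile(" ");

  public static void main(String[] args) throws Exception {
    // Local Spark context; pass the input text file as the first argument.
    SparkConf conf = new SparkConf().setAppName("JavaWordCount").setMaster("local[*]");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // Read the file, split each line into words, and count each word.
    JavaRDD<String> lines = sc.textFile(args[0]);
    JavaRDD<String> words = lines.flatMap(s -> Arrays.asList(SPACE.split(s)).iterator());
    JavaPairRDD<String, Integer> counts =
        words.mapToPair(w -> new Tuple2<>(w, 1)).reduceByKey((a, b) -> a + b);

    // Collect and print the (word, count) pairs.
    List<Tuple2<String, Integer>> output = counts.collect();
    for (Tuple2<String, Integer> tuple : output) {
      System.out.println(tuple._1() + "=" + tuple._2());
    }
    sc.stop();
  }
}
The program can be run with spark-submit, passing the input text file as the first argument.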
Output:
{e=1, h=2, b=2, j=1, m=1, d=1, a=2, i=2, c=1, l=2, f=1}
Result:-
Thus, the word count program was executed successfully using Spark.
EX.6 IMPLEMENT MACHINE LEARNING USING SPARK
Aim:-
To implement machine learning using Spark MLlib.
Procedure:-
Spark MLlib is a module on top of Spark Core that provides machine learning primitives
as APIs.
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-mllib_2.11</artifactId>
<version>2.4.3</version>
<scope>provided</scope>
</dependency>
Spark MLlib offers several data types, both local and distributed, to represent the input
data and the corresponding labels. The simplest of these data types is Vector:
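For example, a dense local vector holding the four measurements of one Iris flower could be built as follows (the values are illustrative):
// Requires org.apache.spark.mllib.linalg.Vector and Vectors.
Vector features = Vectors.dense(5.1, 3.5, 1.4, 0.2);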
A training example typically consists of multiple input features and a label, represented
by the class LabeledPoint:
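For instance, a point with label 1.0 and three features (values illustrative):
// Requires org.apache.spark.mllib.regression.LabeledPoint.
LabeledPoint point = new LabeledPoint(1.0, Vectors.dense(0.0, 1.0, 0.3));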
Map<String, Integer> map = new HashMap<>();
map.put("Iris-setosa", 0);
map.put("Iris-versicolor", 1);
map.put("Iris-virginica", 2);
Another important metric to analyze is the correlation between features in the input
data:
Matrix correlMatrix = Statistics.corr(inputData.rdd(), "pearson");
System.out.println("Correlation Matrix:");
System.out.println(correlMatrix.toString());
Iris.data (Input Dataset)
Correlation Matrix:
1.0 -0.10936924995064387 0.8717541573048727 0.8179536333691672
-0.10936924995064387 1.0 -0.4205160964011671 -0.3565440896138163
0.8717541573048727 -0.4205160964011671 1.0 0.9627570970509661
0.8179536333691672 -0.3565440896138163 0.9627570970509661 1.0
Result:-
Thus, machine learning using Spark MLlib was implemented successfully.
Coefficients:
(Intercept) x
-38.4551 0.6746
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-6.3002 -1.6629 0.0412 1.8944 3.9775
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -38.45509 8.04901 -4.778 0.00139 **
x 0.67461 0.05191 12.997 1.16e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
1
76.22869
Logistic Regression:
logistic.r
# Select some columns from mtcars.
input <- mtcars[,c("am","cyl","hp","wt")]
print(head(input))
# Fit the logistic regression model (variable name illustrative; the formula matches the Call: shown in the output below).
am.data <- glm(formula = am ~ cyl + hp + wt, data = input, family = binomial)
print(summary(am.data))
Output:
am cyl hp wt
Mazda RX4 1 6 110 2.620
Mazda RX4 Wag 1 6 110 2.875
Datsun 710 1 4 93 2.320
Hornet 4 Drive 0 6 110 3.215
Hornet Sportabout 0 8 175 3.440
Valiant 0 6 105 3.460
Call:
glm(formula = am ~ cyl + hp + wt, family = binomial, data = input)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.17272 -0.14907 -0.01464 0.14116 1.27641
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 19.70288 8.11637 2.428 0.0152 *
cyl 0.48760 1.07162 0.455 0.6491
hp 0.03259 0.01886 1.728 0.0840 .
wt -9.14947 4.15332 -2.203 0.0276 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Output:-
null device
1
Loading required package: methods
Loading required package: grid
Loading required package: mvtnorm
Loading required package: modeltools
Loading required package: stats4
Loading required package: strucchange
Loading required package: zoo
as.Date, as.Date.numeric
> names(cars.pam)
> table(groups.3,cars.pam$clustering)
groups.3 1 2 3
1 8 0 0
2 0 19 1
3 0 0 10
> cars$Car[groups.3 != cars.pam$clustering]
[1] Audi 5000
> cars$Car[cars.pam$id.med]
[1] Dodge St Regis Dodge Omni Ford Mustang Ghia
> plot(cars.pam)
Result:-
Thus, the clustering using PAM was implemented using R.
EX.10 IMPLEMENTATION OF DATA VISUALIZATION
Aim:-
To implement the Data Visualization using R Program.
Procedure:-
Step 1: Read the input
Step 2: Visualize the data using
i) Pie chart
ii) 3D pie chart
iii) Boxplot
iv) Histogram
v) Line chart
vi) Scatterplot
Program :-
1.Piechart.r
# Create data for the graph.
x <- c(21, 62, 10, 53)
labels <- c("London", "New York", "Singapore", "Mumbai")
# Draw the pie chart with the city labels.
pie(x, labels)
Output:-
Executing the program....
$Piechart.r
2.ThreeDPiechart.r
# Get the library.
library(plotrix)
# Draw a 3D pie chart of the same city data (pie3D comes from plotrix).
pie3D(c(21, 62, 10, 53), labels = c("London", "New York", "Singapore", "Mumbai"), explode = 0.1)
Output:-
Executing the program....
$ThreeDPiechart.r
3.Boxplot.r
Output
4.Histogram.r
# Create data for the graph.
v <- c(9,13,21,8,36,22,12,41,31,33,19)
# Draw the histogram of the values.
hist(v)
output
5.Linechart.r
# Create the data for the chart.
v <- c(7,12,28,3,41)
# Draw the line chart; type "o" plots both points and connecting lines.
plot(v, type = "o")
output
6.Scatterplot.r
# Get the input values.
input <- mtcars[,c('wt','mpg')]
# Plot the chart for cars with weight between 2.5 to 5 and mileage between 15 and 30.
plot(x = input$wt,y = input$mpg,
xlab = "Weight",
ylab = "Mileage",
xlim = c(2.5,5),
ylim = c(15,30),
main = "Weight vs Mileage"
)
output
Result:-
Thus the different data visualization techniques were implemented using R.
EX.11 Implementation of an Application
Aim:-
To implement survival analysis using R
Procedure:-
Step 1: Install survival package
Step 2: Display input to check details
Step 3: Create survival object
Step 4: Display the output
Program:-
Survival.r
# Load the library.
library("survival")
# Display the first rows of the pbc dataset bundled with the package (shown in the output below).
print(head(pbc))
Output:-
>print(head(pbc))
id time status trt age sex ascites hepato spiders edema bili chol
1 1 400 2 1 58.76523 f 1 1 1 1.0 14.5 261
2 2 4500 0 1 56.44627 f 0 1 1 0.0 1.1 302
3 3 1012 2 1 70.07255 m 0 0 0 0.5 1.4 176
4 4 1925 2 1 54.74059 f 0 1 1 0.5 1.8 244
5 5 1504 1 2 38.10541 f 0 1 1 0.0 3.4 279
6 6 2503 2 2 66.25873 f 0 1 0 0.0 0.8 248
albumin copper alk.phos ast trig platelet protime stage
1 2.60 156 1718.0 137.95 172 190 12.2 4
2 4.14 54 7394.8 113.52 88 221 10.6 3
3 3.48 210 516.0 96.10 55 151 12.0 4
4 2.54 64 6121.8 60.63 92 183 10.3 4
5 3.53 143 671.0 113.15 72 136 10.9 3
6 3.98 50 944.0 93.00 63 NA 11.0 3
Result:-
Thus, the survival analysis was implemented using R.