DS&BDA
LABORATORY MANUAL
Third Year Information Technology (2019 Course)
Semester - VI
[Subject Code: 314457]
Prepared By:
Mrs. S. R. Hiray
Dr. A. M. Bagade
Mr. R. B. Murumkar
Mr. A. C. Karve
Part-A
Assignments based on Hadoop
Assignment: 1
AIM: Study of Hadoop Installation.
OBJECTIVE:
● To study the installation and configuration of Hadoop in single-node and multi-node mode.
THEORY:
a) Single Node Installation
------------------------------------------------------------------------------------------------------------------
Introduction
Apache Hadoop is an open-source software framework written in Java for distributed storage
and distributed processing of very large data sets on computer clusters built from commodity
hardware. All the modules in Hadoop are designed with a fundamental assumption that hardware
failures are common and should be automatically handled by the framework. The core of Apache
Hadoop consists of a storage part, known as Hadoop Distributed File System (HDFS), and a
processing part called MapReduce. Hadoop splits files into large blocks and distributes them
across nodes in a cluster. To process data, Hadoop transfers packaged code for nodes to process
in parallel based on the data that needs to be processed. This approach takes advantage of data
locality— nodes manipulating the data they have access to— to allow the dataset to be processed
faster and more efficiently than it would be in a more conventional supercomputer architecture
that relies on a parallel file system where computation and data are distributed via high-speed
networking. The base Apache Hadoop framework is composed of the following modules:
● Hadoop Common – contains libraries and utilities needed by other Hadoop modules;
● Hadoop Distributed File System (HDFS) – a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster;
● Hadoop YARN – a resource-management platform responsible for managing computing resources in clusters and using them for scheduling of users' applications; and
● Hadoop MapReduce – an implementation of the MapReduce programming model for large-scale data processing.
Open Terminal
# Update the source list
aaditya@laptop:~$ sudo apt-get update
Installing SSH
aaditya@laptop:~$ su hduser
hduser@laptop:~$ sudo apt-get install ssh
Create and Setup SSH Certificates
hduser@laptop:~$ cd /home/hduser/
hduser@laptop:~$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser/.ssh/id_rsa): (**press enter)
Created directory '/home/hduser/.ssh'.
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.
The key fingerprint is:
50:6b:f3:fc:0f:32:bf:30:79:c2:41:71:26:cc:7d:e3 hduser@laptop
The key's randomart image is:
+--[ RSA 2048]----+
| .oo.o |
| . .o=. o |
| .+. o.|
| o= E |
| S+ |
| .+ |
| O+ |
| Oo |
| o.. |
+-----------------+
hduser@laptop:$ exit
hduser@laptop:$ cd /home/hduser
Download Hadoop
hduser@laptop:~$ wget http://mirrors.sonic.net/apache/hadoop/common/hadoop-2.7.2/hadoop-2.7.2.tar.gz
Or from http://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-2.7.2/hadoop-2.7.2.tar.gz
hduser@laptop:~$ tar xvzf hadoop-2.7.2.tar.gz
hduser@laptop:~$ cd hadoop-2.7.2
hduser@laptop:~$ vim /usr/local/hadoop/etc/hadoop/yarn-site.xml
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
hduser@laptop:~$ cp /usr/local/hadoop/etc/hadoop/mapred-site.xml.template
/usr/local/hadoop/etc/hadoop/mapred-site.xml
hduser@laptop:~$ vim /usr/local/hadoop/etc/hadoop/mapred-site.xml
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
Save and Exit
hduser@laptop$: jps
9026 NodeManager
7348 NameNode
9766 SecondaryNameNode
8887 ResourceManager
7507 DataNode
Conclusion: In this way, single-node Hadoop was installed and configured on Ubuntu for Big Data analytics.
Reference: http://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-common/SingleCluster.html
Learning Goals
In this activity, you will download and install VirtualBox and the Cloudera QuickStart VM, and boot the VM to explore the Cloudera environment.
Instructions
Please use the following instructions to download and install the Cloudera QuickStart VM with VirtualBox before proceeding to the Getting Started with the Cloudera VM Environment video. The screenshots are from a Mac, but the instructions should be the same for Windows. Please see the discussion boards if you have any issues.
For Windows, select the link "VirtualBox 5.1.X for Windows hosts x86/amd64", where 'X' is the latest version.
For Ubuntu, select VirtualBox 5.2.44 (released July 14, 2020) for Linux hosts: Ubuntu 32-bit | 64-bit.
Open Terminal -
$ sudo apt-get update
$ sudo apt-get install virtualbox-5.2
$ sudo apt-get -f install
4. Start VirtualBox.
9. Click Import.
10. The virtual machine image will be imported. This can take several minutes.
11. Launch Cloudera VM. When the importing is finished, the quickstart-vm-5.4.2-0 VM will
appear on the left in the VirtualBox window. Select it and click the Start button to launch the VM.
12. Cloudera VM booting. It will take several minutes for the Virtual Machine to start. The booting
process takes a long time since many Hadoop tools are started.
13. The Cloudera VM desktop. Once the booting process is complete, the desktop will appear
with a browser.
b) Multiple Nodes:
Install a Multi-Node Hadoop Cluster on Ubuntu 16.04
This article is about the multi-node installation of a Hadoop cluster. You need a minimum of 2 Ubuntu machines or virtual images to complete a multi-node installation. If you just want to try out a single-node cluster, follow the single-node installation in part (a) above. I used Hadoop stable version 2.6.0 for this article and did this setup on a 3-node cluster. For simplicity, I will designate one node as master and 2 nodes as slaves (slave-1 and slave-2). Make sure all slave nodes are reachable from the master node. To avoid any unreachable-host errors, add the slave hostnames and IP addresses to the /etc/hosts file. Similarly, slave nodes should be able to resolve the master hostname.
Installing Java on Master and Slaves
$ sudo add-apt-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java7-installer
# Update Java runtime
$ sudo update-java-alternatives -s java-7-oracle
Disable IPv6
As of now Hadoop does not support IPv6, and is tested to work only on IPv4 networks. If
you are using IPv6, you need to switch Hadoop host machines to use IPv4. The Hadoop
Wiki link provides a one liner command to disable the IPv6. If you are not using IPv6,
skip this step:
sudo sed -i 's/net.ipv6.bindv6only\ =\ 1/net.ipv6.bindv6only\ =\ 0/' \
/etc/sysctl.d/bindv6only.conf && sudo invoke-rc.d procps restart
Setting up a Hadoop User
Hadoop talks to other nodes in the cluster using password-less ssh. By having Hadoop run under a specific user context, it is easy to distribute the ssh keys around the Hadoop cluster. Let's create a user hadoopuser on the master as well as the slave nodes.
# Create hadoopgroup
$ sudo addgroup hadoopgroup
# Create hadoopuser user
$ sudo adduser --ingroup hadoopgroup hadoopuser
Our next step will be to generate an ssh key for password-less login between the master and slave nodes. Run the following commands only on the master node. Run the last two commands for each slave node. Password-less ssh must be working before you can proceed with further steps.
# Login as hadoopuser
$ su - hadoopuser
# Generate an ssh key for the user
$ ssh-keygen -t rsa -P ""
# Authorize the key to enable password-less ssh
$ cat /home/hadoopuser/.ssh/id_rsa.pub >> /home/hadoopuser/.ssh/authorized_keys
$ chmod 600 /home/hadoopuser/.ssh/authorized_keys
# Copy this key to slave-1 to enable password-less ssh
$ ssh-copy-id -i ~/.ssh/id_rsa.pub slave-1
# Make sure you can do a password-less ssh using the following command.
$ ssh slave-1
Download and Install Hadoop binaries on Master and Slave nodes
Pick the best mirror site to download the binaries from Apache Hadoop, and download stable/hadoop-2.6.0.tar.gz for your installation. Do this step on the master and every slave node. You can also download the file once and then distribute it to each slave node using the scp command.
$ cd /home/hadoopuser
$ wget http://www.webhostingjams.com/mirror/apache/hadoop/core/stable/hadoop-2.6.0.tar.gz
$ tar xvf hadoop-2.6.0.tar.gz
$ mv hadoop-2.6.0 hadoop
Setup Hadoop Environment on Master and Slave Nodes
Copy and paste following lines into your .bashrc file under /home/hadoopuser. Do this step
on master and every slave node.
# Set HADOOP_HOME
export HADOOP_HOME=/home/hadoopuser/hadoop
# Set JAVA_HOME
export JAVA_HOME=/usr/lib/jvm/java-7-oracle
# Add Hadoop bin and sbin directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
Update hadoop-env.sh on Master and Slave Nodes
Update the JAVA_HOME setting in /home/hadoopuser/hadoop/etc/hadoop/hadoop-env.sh so that it points to the installed JDK (/usr/lib/jvm/java-7-oracle). Do this step on the master and every slave node.
Common Terminologies
Before we start getting into configuration details, lets discuss some of the basic
terminologies used in Hadoop.
● Hadoop Distributed File System: A distributed file system that provides high-throughput access to application data. An HDFS cluster primarily consists of a NameNode that manages the file system metadata and DataNodes that store the actual data.
Add/update core-site.xml on Master and Slave Nodes with the following option.
/home/hadoopuser/hadoop/etc/hadoop/core-site.xml
<property>
<name>fs.defaultFS</name>
<value>hdfs://master:54310</value>
<description>Use HDFS as file storage engine</description>
</property>
Add/update mapred-site.xml on the Master node only with the following options.
/home/hadoopuser/hadoop/etc/hadoop/mapred-site.xml (Other Options)
<property>
<name>mapreduce.jobtracker.address</name>
<value>master:54311</value>
<description>The host and port that the MapReduce job tracker runs
at. If “local”, then jobs are run in-process as a single map
and reduce task.
</description>
</property>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
<description>The framework for running mapreduce jobs</description>
</property>
Add/update hdfs-site.xml on Master and Slave Nodes. We will be adding the following three entries to the file.
● dfs.replication – Here I am using a replication factor of 2. That means for every file stored in HDFS, there will be one redundant replica of that file on some other node in the cluster.
● dfs.namenode.name.dir – This directory is used by the NameNode to store its metadata files. I manually created the directory /hadoop-data/hadoopuser/hdfs/namenode on the master and slave nodes, and use that directory location for this configuration.
● dfs.datanode.data.dir – This directory is used by the DataNode to store HDFS data blocks. I manually created the directory /hadoop-data/hadoopuser/hdfs/datanode on the master and slave nodes, and use that directory location for this configuration.
/home/hadoopuser/hadoop/etc/hadoop/hdfs-site.xml (Other Options)
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/hadoop-data/hadoopuser/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/hadoop-data/hadoopuser/hdfs/datanode</value>
</property>
Add yarn-site.xml on Master and Slave Nodes. This file is required for a Node to work as
a Yarn Node. Master and slave nodes should all be using the same value for the following
properties, and should be pointing to master node only.
/home/hadoopuser/hadoop/etc/hadoop/yarn-site.xml
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>master:8030</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>master:8032</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>master:8088</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>master:8031</value>
</property>
<property>
<name>yarn.resourcemanager.admin.address</name>
<value>master:8033</value>
</property>
Add/update the slaves file on the Master node only. Add the hostnames or IP addresses of the master and all slave nodes. If the file has an entry for localhost, you can remove it. This file is just a helper file that is used by Hadoop scripts to start the appropriate services on the master and slave nodes.
/home/hadoopuser/hadoop/etc/hadoop/slaves
master
slave-1
slave-2
Format the NameNode and start HDFS
Format the NameNode (first time only), start the HDFS daemons from the master node, and then verify the running processes with jps.
$ su - hadoopuser
$ /home/hadoopuser/hadoop/bin/hdfs namenode -format
$ /home/hadoopuser/hadoop/sbin/start-dfs.sh
$ jps
The output of this command should list NameNode, SecondaryNameNode,
DataNode on master node, and DataNode on all slave nodes. If you don’t see the
expected output, review the log files listed in Troubleshooting section.
Start the Yarn MapReduce Job tracker
Run the following command to start the Yarn mapreduce framework.
$ /home/hadoopuser/hadoop/sbin/start-yarn.sh
To validate success, run the jps command again on the master node and the slave nodes. The output should list NodeManager and ResourceManager on the master node, and NodeManager on all slave nodes. If you don't see the expected output, review the log files listed in the Troubleshooting section.
Review Yarn Web console
If all the services started successfully on all nodes, then you should see all of your nodes
listed under Yarn nodes. You can hit the following url on your browser and verify that:
http://master:8088/cluster/nodes
Let's execute a MapReduce example now
You should be all set to run a MapReduce example now. Run one of the examples bundled with the Hadoop distribution (for instance, from the hadoop-mapreduce-examples jar) and monitor its progress at:
http://master:8088/cluster/apps
Troubleshooting
Hadoop writes its log files to the $HADOOP_HOME/logs directory. If you run into any issues with your installation, that should be the first place to look.
REFERENCE:
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/ClusterSetup.html
CONCLUSION:
Thus, we have successfully installed and tested single-node and multi-node Hadoop clusters.
Assignment: 2
AIM: Design a distributed application using MapReduce.
PROBLEM STATEMENT /DEFINITION
Design a distributed application using MapReduce which processes a log file of a system. List out the users who have logged in for the maximum period on the system. Use a simple log file from the Internet and process it using pseudo-distributed mode on the Hadoop platform.
OBJECTIVE:
● To understand the concept of MapReduce.
● To understand the details of the Hadoop file system.
● To understand the technique for log file processing.
● To analyze the performance of the Hadoop file system.
● To understand the use of distributed processing.
THEORY:
What is MapReduce?
MapReduce is a processing technique and a programming model for distributed computing based on
Java. The MapReduce algorithm contains two important tasks, namely Map and Reduce. Map
takes a set of data and converts it into another set of data, where individual elements are broken
down into tuples (key/value pairs). Secondly, reduce task, which takes the output from a map
as an input and combines those data tuples into a smaller set of tuples. As the sequence of the
name MapReduce implies, the reduce task is always performed after the map job.
The major advantage of MapReduce is that it is easy to scale data processing over multiple
computing nodes. Under the MapReduce model, the data processing primitives are called
mappers and reducers. Decomposing a data processing application into mappers and reducers
is sometimes nontrivial. But, once we write an application in the MapReduce form, scaling the
application to run over hundreds, thousands, or even tens of thousands of machines in a cluster
is merely a configuration change. This simple scalability is what has attracted many
programmers to use the MapReduce model.
The Algorithm
● Generally, the MapReduce paradigm is based on sending the computation to where the data resides.
● A MapReduce program executes in three stages, namely the map stage, the shuffle stage, and the reduce stage.
o Map stage: The map or mapper's job is to process the input data. Generally the input data is in the form of a file or directory and is stored in the Hadoop Distributed File System (HDFS). The input file is passed to the mapper function line by line. The mapper processes the data and creates several small chunks of data.
o Reduce stage: This stage is the combination of the Shuffle stage and the Reduce stage. The Reducer's job is to process the data that comes from the mapper. After processing, it produces a new set of output, which will be stored in HDFS.
● During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate servers in the
cluster.
● The framework manages all the details of data-passing such as issuing tasks, verifying task
completion, and copying data around the cluster between the nodes.
● Most of the computing takes place on nodes with data on local disks that reduces the network traffic.
● After completion of the given tasks, the cluster collects and reduces the data to form an appropriate
result, and sends it back to the Hadoop server.
Inputs and Outputs
The MapReduce framework operates on key-value pairs: the Map stage takes <k1, v1> pairs and produces a list of <k2, v2> pairs, and the Reduce stage takes <k2, list(v2)> pairs and produces a list of <k3, v3> pairs.
Terminology
● PayLoad - Applications implement the Map and the Reduce functions, and form the core of the job.
● Mapper - Mapper maps the input key/value pairs to a set of intermediate key/value pair.
● NamedNode - Node that manages the Hadoop Distributed File System (HDFS).
● DataNode - Node where data is presented in advance before any processing takes place.
● MasterNode - Node where JobTracker runs and which accepts job requests from clients.
● SlaveNode - Node where Map and Reduce program runs.
● JobTracker - Schedules jobs and tracks the assigned jobs to the Task Tracker.
● Task Tracker - Tracks the task and reports status to the JobTracker.
● Job - An execution of a Mapper and Reducer across a dataset.
● Task - An execution of a Mapper or a Reducer on a slice of data.
● Task Attempt - A particular instance of an attempt to execute a task on a SlaveNode.
Example Scenario
Given below is the data regarding the electrical consumption of an organization. It contains the
monthly electrical consumption and the annual average for various years.
Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Avg
1979 23 23 2 43 24 25 26 26 26 26 25 26 25
1980 26 27 28 28 28 30 31 31 31 30 30 30 29
1981 31 32 32 32 33 34 35 36 36 34 34 34 34
1984 39 38 39 39 39 41 42 43 40 39 38 38 40
1985 38 39 39 39 39 41 41 41 00 40 39 39 45
If the above data is given as input, we have to write applications to process it and produce results
such as finding the year of maximum usage, year of minimum usage, and so on. This is a
walkover for the programmers with finite number of records. They will simply write the logic
to produce the required output, and pass the data to the application written.
But think of the data representing the electrical consumption of all the large-scale industries of a particular state since its formation. When we write applications to process such bulk data, they will take a lot of time to execute and will generate heavy network traffic as data is moved between nodes. To solve these problems, we have the MapReduce framework.
Input Data
The above data is saved as sample.txt and given as input. The input file looks as shown below.
1979 23 23 2 43 24 25 26 26 26 26 25 26 25
1980 26 27 28 28 28 30 31 31 31 30 30 30 29
1981 31 32 32 32 33 34 35 36 36 34 34 34 34
1984 39 38 39 39 39 41 42 43 40 39 38 38 40
1985 38 39 39 39 39 41 41 41 00 40 39 39 45
Example Program
Given below is the program to process the sample data using the MapReduce framework.
package hadoop;

import java.util.*;
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class ProcessUnits {

   //Mapper class
   public static class E_EMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {

      //Map function
      public void map(LongWritable key, Text value,
         OutputCollector<Text, IntWritable> output,
         Reporter reporter) throws IOException {

         String line = value.toString();
         String lasttoken = null;
         StringTokenizer s = new StringTokenizer(line, "\t");
         String year = s.nextToken();

         while (s.hasMoreTokens()) {
            lasttoken = s.nextToken();
         }
         int avgprice = Integer.parseInt(lasttoken);
         output.collect(new Text(year), new IntWritable(avgprice));
      }
   }

   //Reducer class
   public static class E_EReduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {

      //Reduce function
      public void reduce(Text key, Iterator<IntWritable> values,
         OutputCollector<Text, IntWritable> output,
         Reporter reporter) throws IOException {

         int maxavg = 30;
         int val = Integer.MIN_VALUE;

         while (values.hasNext()) {
            // emit only values above the threshold (average units > 30)
            if ((val = values.next().get()) > maxavg) {
               output.collect(key, new IntWritable(val));
            }
         }
      }
   }

   //Main function
   public static void main(String args[]) throws Exception {
      JobConf conf = new JobConf(ProcessUnits.class);

      conf.setJobName("max_eletricityunits");
      conf.setOutputKeyClass(Text.class);
      conf.setOutputValueClass(IntWritable.class);
      conf.setMapperClass(E_EMapper.class);
      conf.setCombinerClass(E_EReduce.class);
      conf.setReducerClass(E_EReduce.class);
      conf.setInputFormat(TextInputFormat.class);
      conf.setOutputFormat(TextOutputFormat.class);

      FileInputFormat.setInputPaths(conf, new Path(args[0]));
      FileOutputFormat.setOutputPath(conf, new Path(args[1]));

      JobClient.runJob(conf);
   }
}
Save the above program as ProcessUnits.java. The compilation and execution of the program
is explained below.
Step 1
The following command is to create a directory to store the compiled java classes.
$ mkdir units
Step 2
Download Hadoop-core-1.2.1.jar, which is used to compile and execute the MapReduce program. Visit the following link http://mvnrepository.com/artifact/org.apache.hadoop/hadoop-core/1.2.1 to download the jar. Let us assume the downloaded folder is /home/hadoop/.
Step 3
The following commands are used for compiling the ProcessUnits.java program and creating a jar for the program.
$ javac -classpath hadoop-core-1.2.1.jar -d units ProcessUnits.java
$ jar -cvf units.jar -C units/ .
Step 4
The following command is used to create an input directory in HDFS.
$ $HADOOP_HOME/bin/hadoop fs -mkdir input_dir
Step 5
The following command is used to copy the input file named sample.txt into the input directory of HDFS.
$ $HADOOP_HOME/bin/hadoop fs -put /home/hadoop/sample.txt input_dir
Step 6
The following command is used to verify the files in the input directory.
$ $HADOOP_HOME/bin/hadoop fs -ls input_dir/
Step 7
The following command is used to run the Eleunit_max application by taking the input files from the input directory.
$ $HADOOP_HOME/bin/hadoop jar units.jar hadoop.ProcessUnits input_dir output_dir
Wait for a while until the file is executed. After execution, as shown below, the output will
contain the number of input splits, the number of Map tasks, the number of reducer tasks, etc.
completed successfully
14/10/31 06:02:52 INFO mapreduce.Job: Counters: 49
File System Counters
Map-Reduce Framework
Spilled Records=10
Shuffled Maps=2
Failed Shuffles=0
Merged Map outputs=2
File Output Format Counters
Bytes Written=40
Step 8
The following command is used to verify the resultant files in the output folder.
$ $HADOOP_HOME/bin/hadoop fs -ls output_dir/
Step 9
The following command is used to see the output in the part-00000 file. This file is generated by HDFS.
$ $HADOOP_HOME/bin/hadoop fs -cat output_dir/part-00000
1981 34
1984 40
1985 45
Step 10
The following command is used to copy the output folder from HDFS to the local file system for analyzing.
$ $HADOOP_HOME/bin/hadoop fs -get output_dir /home/hadoop
Important Commands
All Hadoop commands are invoked by the $HADOOP_HOME/bin/hadoop command.
Running the Hadoop script without any arguments prints the description for all commands.
Usage : hadoop [--config confdir] COMMAND
The following table lists the options available and their description.
Options Description
classpath Prints the class path needed to get the Hadoop jar and the required
libraries.
GENERIC_OPTIONS Description
-status <job-id> Prints the map and reduce completion percentage and all job counters.
-events <job-id> <from-event-#> <#-of-events> Prints the events' details received by the jobtracker for the given range.
-history [all] <jobOutputDir> Prints job details, failed and killed tip details. More details about the job, such as successful tasks and task attempts made for each task, can be viewed by specifying the [all] option.
-list [all] Displays all jobs. -list displays only jobs which are yet to complete.
-kill-task <task-id> Kills the task. Killed tasks are NOT counted against failed attempts.
-fail-task <task-id> Fails the task. Failed tasks are counted against failed attempts.
-set-priority <job-id> <priority> Changes the priority of the job. Allowed priority values are VERY_HIGH, HIGH, NORMAL, LOW, VERY_LOW.
To compile the log-processing example below, the following jars from the Hadoop distribution are needed on the classpath:
● hadoop-2.2.0.tar.gz\hadoop-2.2.0.tar\hadoop-2.2.0\share\hadoop\common\hadoop-common-2.2.0.jar
● hadoop-2.2.0.tar.gz\hadoop-2.2.0.tar\hadoop-2.2.0\share\hadoop\mapreduce\hadoop-mapreduce-client-core-2.2.0.jar
I also use regular expressions in Java to analyze each line in the log. Regular expressions can be
more resilient to variations and allow for grouping, which gives easy access to specific data
elements. As always, I used Kodos to develop the regular expression.
In the example below, I don’t actually use the log value, but instead I just count up how many
occurrences there are by key.
Mapper class
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TomcatLogErrorMapper extends Mapper<LongWritable, Text, Text, Text> {

    // Illustrative pattern (the original regex is not reproduced in this manual):
    // group(1) timestamp, group(2) log level, group(3) class, group(4) message
    private static final Pattern r = Pattern.compile(
            "^(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2},\\d{3})\\s+(\\S+)\\s+(\\S+)\\s+-\\s+(.*)$");

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        Matcher m = r.matcher(line);
        if (m.find()) {
            // only consider ERRORs for this example
            if (m.group(2).contains("ERROR")) {
                // example log line:
                // 2013-11-08 04:06:56,586 DEBUG component.helpers.GenericSOAPConnector - Attempting to connect to:
                // https://remotehost.com/app/rfc/entry/msg_status
                // group(1) = date, group(2) = log level, group(3) = class, group(4) = message
                context.write(new Text(m.group(1)),
                        new Text(m.group(2) + m.group(3) + m.group(4)));
            }
        }
    }
}
Reducer class
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class TomcatLogErrorReducer extends Reducer<Text, Text, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // count how many error records share the same key (timestamp)
        int countValue = 0;
        for (Text value : values) {
            countValue++;
        }
        context.write(key, new IntWritable(countValue));
    }
}
Driver class
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TomcatLogError {
    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: TomcatLogError <input path> <output path>");
            System.exit(-1);
        }
        Job job = new Job();
        job.setJarByClass(TomcatLogError.class);
        job.setJobName("Tomcat Log Error");
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setMapperClass(TomcatLogErrorMapper.class);
        job.setReducerClass(TomcatLogErrorReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Running Hadoop
In NetBeans, I made sure that the Main Class was TomcatLogError in the compiled jar. I then ran Clean and Build to get a jar, which I transferred to the server where I installed Hadoop.
The output folder now contains a file named part-r-00000 with the results of the processing.
Based on this analysis, there were a number of errors produced around the hour 18:51:17. It is then
easy to change the Mapper class to emit based on a different key, such as Class or Message to
identify more precisely what the error is, now that I know when the errors happened.
References:
https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
CONCLUSION:
Assignment: 3
AIM: Perform DDL and DML operations using HBase and HiveQL, and integrate Hive with HBase.
THEORY:
· Hive and HBase Architecture
· Explanation of Hive Architecture
· HBase Architecture
· Explanation of HBase Architecture
· List and details of DDL and DML Commands in HBase and Hive
SQL queries are submitted to Hive and they are executed as follows:
1. Hive compiles the query.
2. An execution engine, such as Tez or MapReduce, executes the compiled query.
3. The resource manager, YARN, allocates resources for applications across the
cluster.
4. The data that the query acts upon resides in HDFS (Hadoop Distributed File
System). Supported data formats are ORC, AVRO, Parquet, and text.
5. Query results are then returned over a JDBC/ODBC connection.
The following diagram shows a detailed view of the HDP query execution architecture:
The following sections explain major parts of the query execution architecture.
Hive Clients
You can connect to Hive using a JDBC/ODBC driver with a BI tool, such as Microstrategy,
Tableau, BusinessObjects, and others, or from another type of application that can access Hive
over a JDBC/ODBC connection. In addition, you can also use a command-line tool, such as
Beeline, that uses JDBC to connect to Hive. The Hive command-line interface (CLI) can also
be used, but it has been deprecated in the current release and Hortonworks does not
recommend that you use it for security reasons.
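For instance, a client application can submit queries to HiveServer2 programmatically. The sketch below uses the third-party PyHive package over HiveServer2's Thrift interface; the package choice, host, port and username are assumptions for illustration and are not part of this manual.

from pyhive import hive  # third-party package: PyHive

# Connect to a running HiveServer2 instance (host/port/user are illustrative)
conn = hive.Connection(host='localhost', port=10000, username='hduser', database='default')
cursor = conn.cursor()

# Submit a query; HiveServer2 compiles it and executes it on the cluster
cursor.execute('SHOW TABLES')
for (table_name,) in cursor.fetchall():
    print(table_name)

cursor.close()
conn.close()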
SQL in Hive
Hive supports a large number of standard SQL dialects. In a future release, when SQL:2011 is
adopted, Hive will support ANSI-standard SQL.
HiveServer2
Clients communicate with HiveServer2 over a JDBC/ODBC connection, which can handle
multiple user sessions, each with a different thread. HiveServer2 can also handle long-running
sessions with asynchronous threads. An embedded metastore, which is different from the
MetastoreDB, also runs in HiveServer2. This metastore performs the following tasks:
● Get statistics and schema from the MetastoreDB
● Compile queries
● Generate query execution plans
● Submit query execution plans
● Return query results to the client
Because HiveServer2 uses its own settings file, using one for ETL operations and another for
interactive queries is a common practice. All HiveServer2 instances can share the same
MetastoreDB. Consequently, setting up multiple HiveServer2 instances that have embedded
metastores is a simple operation.
Tez Execution
After query compilation, HiveServer2 generates a Tez graph that is submitted to YARN. A Tez
Application Master (AM) monitors the query while it is running.
Security
HiveServer2 performs standard SQL security checks when a query is submitted, including
connection authentication. After the connection authentication check, the server runs
authorization checks to make sure that the user who submits the query has permission to access
the databases, tables, columns, views, and other resources required by the query. Hortonworks
recommends that you use SQLStdAuth or Ranger to implement security. Storage-based access
controls, which is suitable for ETL workloads only, is also available.
File Formats
Hive supports many file formats. You can write your own SerDes (Serializers, Deserializers)
interface to support new file formats.
HBase architecture has 3 important components- HMaster, Region Server and ZooKeeper.
i. HMaster
HBase HMaster is a lightweight process that assigns regions to region servers in the Hadoop cluster for load balancing. Responsibilities of HMaster include managing and monitoring the Hadoop cluster, performing administration (an interface for creating, updating and deleting tables), and controlling failover.
ii. Region Server
These are the worker nodes which handle read, write, update, and delete requests from clients. The Region Server process runs on every node in the Hadoop cluster. A Region Server runs on an HDFS DataNode and consists of the following components –
● Block Cache – This is the read cache. Most frequently read data is stored in the read
cache and whenever the block cache is full, recently used data is evicted.
● MemStore- This is the write cache and stores new data that is not yet written to the
disk. Every column family in a region has a MemStore.
● Write Ahead Log (WAL) is a file that stores new data that is not persisted to permanent
storage.
● HFile is the actual storage file that stores the rows as sorted key values on a disk.
iii. ZooKeeper
HBase uses ZooKeeper as a distributed coordination service for region assignments and to
recover any region server crashes by loading them onto other region servers that are
functioning. ZooKeeper is a centralized monitoring server that maintains configuration
information and provides distributed synchronization. Whenever a client wants to
communicate with regions, they have to approach Zookeeper first. HMaster and Region servers
are registered with ZooKeeper service, client needs to access ZooKeeper quorum in order to
connect with region servers and HMaster. In case of node failure within an HBase cluster, the ZooKeeper quorum will trigger error messages and start repairing the failed nodes.
ZooKeeper service keeps track of all the region servers that are there in an HBase cluster-
tracking information about how many region servers are there and which region servers are
holding which DataNode. HMaster contacts ZooKeeper to get the details of region servers.
Various services that ZooKeeper provides include establishing client communication with region servers, tracking server failures and network partitions, maintaining configuration information, and providing ephemeral nodes that represent different region servers.
Implementation
#Drop Table
#Create Table for dropping
hbase(main):047:0> list
TABLE
flight
table1
table2
tb1
#Disable and drop table
hbase(main):048:0> disable 'tb1'
hbase(main):050:0> drop 'tb1'
hbase(main):051:0> list
TABLE
flight
table1
table2
Managed table and External table in Hive: There are two types of tables in Hive, one is a managed table and the second is an external table. The difference is that when you drop a table, if it is a managed table Hive deletes both data and metadata; if it is an external table, Hive only deletes the metadata.
hive> create table empdbnew(eno int, ename string, esal int) row format delimited fields terminated by ',';
OK
1 deepali 120000
2 mahesh 30000
3 mangesh 25000
4 ram 39000
5 brijesh 40000
6 john 300000
OK
1 deepali 120000
6 john 300000
#Check HBase for updates (the records are available in the associated HBase table)
hive> CREATE TABLE hbase_flight(fno int, fsource string, fdest string, fsh_at string, fsh_dt string, fsh_delay string, delay int)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,finfo:source,finfo:dest,fsch:at,fsch:dt,fsch:delay,delay:dl");
hive> CREATE external TABLE hbase_flight_new(fno int, fsource string, fdest string, fsh_at string, fsh_dt string, fsh_delay string, delay int)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,finfo:source,finfo:dest,fsch:at,fsch:dt,fsch:delay,delay:dl")
TBLPROPERTIES ("hbase.table.name" = "flight");
OK
hive> show tables;
OK
abc
ddl_hive
emp
empdata
empdata1
empdata2
empdbnew
hbase_flight
hbase_flight1
hbase_flight_new
hbase_table_1
hive_table_emp
OK
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future
versions.
Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = hduser_20180328130004_47384e9a-7490-4dfb-809d-ae240507bfab
Total jobs = 1
set hive.exec.reducers.bytes.per.reducer=<number>
set hive.exec.reducers.max=<number>
set mapreduce.job.reduces=<number>
http://localhost:8088/proxy/application_1522208646737_0003/
2018-03-28 13:00:28,747 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.68 sec
2018-03-28 13:00:35,101 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 6.26 sec
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 6.26 sec HDFS Read: 9095 HDFS Write: 102 SUCCESS
OK
35
hive>
hive> CREATE INDEX hbasefltnew_index ON TABLE hbase_flight_new(delay)
    > AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' WITH DEFERRED REBUILD;
OK
hive> show index on hbase_flight_new;
OK
default__hbase_flight_new_hbasefltnew_index__ compact
hive> create table empinfo(empno int, empgrade string) row format delimited fields terminated by ',';
OK
# Table A empdbnew
OK
1 deepali 120000
2 mahesh 30000
3 mangesh 25000
4 ram 39000
5 brijesh 40000
6 john 300000
# Table B empinfo
OK
1 A
2 B
3 B
4 B
5 B
6 A
hive> SELECT eno, ename, empno, empgrade FROM empdbnew JOIN empinfo ON eno
= empno;
#Join==> Result
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future
versions.
Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = hduser_20180328153258_bc345f46-a1f1-4589-ac5e-4c463834731a
Total jobs = 1
Number of reduce tasks not specified. Estimated from input data size: 1
set hive.exec.reducers.bytes.per.reducer=<number>
set hive.exec.reducers.max=<number>
set mapreduce.job.reduces=<number>
http://localhost:8088/proxy/application_1522208646737_0005/
2018-03-28 15:33:18,231 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 8.17 sec
2018-03-28 15:33:24,476 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 10.61 sec
Stage-Stage-1: Map: 2 Reduce: 1 Cumulative CPU: 10.61 sec HDFS Read: 15336 HDFS Write: 235 SUCCESS
OK
1 deepali 1 A
2 mahesh 2 B
3 mangesh 3 B
4 ram 4 B
5 brijesh 5 B
6 john 6 A
REFERENCES:
https://cwiki.apache.org/confluence/display/Hive/Tutorial
https://cwiki.apache.org/confluence/display/Hive/GettingStarted
https://docs.cloudera.com/HDPDocuments/HDP2/HDP-2.5.0/bk_hive-performance-tuning/content/ch_hive_architectural_overview.html
https://hbase.apache.org/book.html
CONCLUSION:
Part-B
Assignments based on Data Analytics using Python
Assignment: 4
AIM: Perform the different operations using Python on the Facebook metrics data set.
THEORY:
a. Create data subsets
The pandas iloc[] indexer enables us to create a subset by choosing specific values from rows and columns based on indexes. That is, unlike the loc[] indexer, which works on labels, iloc[] works on integer index positions. We can choose and create a subset of a pandas dataframe by providing the index numbers of the rows and columns.
Syntax: pandas.dataframe.iloc[]
In a simple manner, we can make use of an indexing operator i.e. square brackets to
create a subset of the data.
Syntax: dataframe[['col1','col2','colN']]
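A minimal sketch of these subsetting operations, assuming the semicolon-separated Facebook metrics file and the 'like'/'share' columns used in the implementation further below:

import pandas as pd

# Load the Facebook metrics dataset (semicolon-separated, as in the implementation below)
df = pd.read_csv('dataset_Facebook.csv', sep=';')

# Subset by position with iloc: first 5 rows and first 3 columns
subset_pos = df.iloc[0:5, 0:3]

# Subset by column labels using the indexing operator
subset_cols = df[['like', 'share']]

# Subset rows by a condition
popular = df[df['like'] > 100]
print(subset_pos.shape, subset_cols.shape, popular.shape)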
b. Merge Data
Pandas has full-featured, high-performance in-memory join operations idiomatically very similar to relational databases like SQL. Pandas provides a single function, merge, as the entry point for all standard database join operations between DataFrame objects:
pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=True)
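A short sketch of a merge on a small toy example; the frames and the 'id' key column below are purely illustrative, not part of the Facebook metrics dataset:

import pandas as pd

left = pd.DataFrame({'id': [1, 2, 3], 'Type': ['Photo', 'Status', 'Video']})
right = pd.DataFrame({'id': [1, 2, 4], 'like': [120, 45, 300]})

# Inner join on the common 'id' column; only ids present in both frames survive
merged = pd.merge(left, right, on='id', how='inner')
print(merged)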
d. Transposing Data
pandas.DataFrame.transpose
DataFrame.transpose(*args, copy=False)
Reflect the DataFrame over its main diagonal by writing rows as columns and vice-versa. The property T is an accessor to the method transpose().
Parameters
*args : tuple, optional. Accepted for compatibility with NumPy.
copy : bool, default False. Whether to copy the data after transposing, even for DataFrames with a single dtype. Note that a copy is always required for mixed-dtype DataFrames, or for DataFrames with any extension types.
Returns
DataFrame: The transposed DataFrame.
e. Reshaping Data
pandas.DataFrame.melt
DataFrame.melt(id_vars=None, value_vars=None, var_name=None, value_name='value', col_level=None, ignore_index=True)
Unpivot a DataFrame from wide to long format, optionally leaving identifiers set. This function is useful to massage a DataFrame into a format where one or more columns are identifier variables (id_vars), while all other columns, considered measured variables (value_vars), are "unpivoted" to the row axis, leaving just two non-identifier columns, 'variable' and 'value'.
Parameters
id_vars : column(s) to use as identifier variables.
value_vars : column(s) to unpivot. If not specified, uses all columns that are not set as id_vars.
var_name : scalar. Name to use for the 'variable' column.
value_name : scalar, default 'value'. Name to use for the 'value' column.
ignore_index : bool, default True. If True, the original index is ignored. If False, the original index is retained; index labels will be repeated as necessary.
Returns
DataFrame: Unpivoted DataFrame.
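A brief illustration of melt on a toy frame (the column names here are illustrative):

import pandas as pd

wide = pd.DataFrame({'Type': ['Photo', 'Video'],
                     'like': [120, 300],
                     'share': [10, 25]})

# Keep 'Type' as the identifier; unpivot 'like' and 'share' into variable/value pairs
long_format = wide.melt(id_vars=['Type'], value_vars=['like', 'share'],
                        var_name='metric', value_name='count')
print(long_format)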
pandas.DataFrame.pivot
DataFrame.pivot(index=None, columns=None, values=None)
Reshape data (produce a "pivot" table) based on column values. Uses unique values from the specified index / columns to form axes of the resulting DataFrame. This function does not support data aggregation; multiple values will result in a MultiIndex in the columns. See the pandas User Guide for more on reshaping.
Parameters
index : str or object, optional. Column to use to make the new frame's index. If None, uses the existing index.
columns : str or object or a list of str. Column to use to make the new frame's columns.
values : str, object or a list of the previous, optional. Column(s) to use for populating the new frame's values. If not specified, all remaining columns will be used and the result will have hierarchically indexed columns.
Returns
DataFrame
Raises
ValueError: when there are any index, columns combinations with multiple values. Use DataFrame.pivot_table when you need to aggregate.
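A brief sketch of pivot and pivot_table on a toy frame; the column choices are illustrative, and the pivot_table variable mirrors the one printed in the implementation below:

import pandas as pd

records = pd.DataFrame({'Type': ['Photo', 'Photo', 'Video', 'Video'],
                        'Category': [1, 2, 1, 2],
                        'like': [120, 45, 300, 80]})

# pivot: exactly one value per (index, columns) pair
wide = records.pivot(index='Category', columns='Type', values='like')

# pivot_table: aggregates when a pair occurs more than once
pivot_table = pd.pivot_table(records, values='like', index='Category',
                             columns='Type', aggfunc='mean')
print(wide)
print(pivot_table)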
Implementation
import pandas as pd
from google.colab import files
import io

# Upload and read the Facebook metrics dataset (semicolon-separated)
uploaded = files.upload()
df = pd.read_csv(io.BytesIO(uploaded['dataset_Facebook.csv']), sep=';')
df.head(5)

# a. Create data subsets
video_sub = df['Type']
print(df.columns)
df_subset = df[['like', 'share']]
print(df_subset)
df_subset1 = df[df['like'] > 100]
print(df_subset1)
df.loc[1:7, ['like', 'share']]

# c. Sort data
df.sort_values("like")

# d. Transpose data
result = df.transpose()
print(result)

# Select specific columns
selective_df = pd.DataFrame(df, columns=['like', 'share', 'Category', 'Type'])
selective_df.head(5)

# e. Reshape data: pivot table of average likes per Category and Type
pivot_table = pd.pivot_table(df, values='like', index='Category', columns='Type', aggfunc='mean')
print(pivot_table)
REFERENCES:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pivot.html
CONCLUSION:
Assignment: 5
AIM: Perform different data cleaning and data model building operations using Python
PROBLEM STATEMENT /DEFINITION
Perform the following operations using Python on the Air quality and Heart Diseases data sets
a. Data cleaning
b. Data integration
c. Data transformation
d. Error correcting
e. Data model building
OBJECTIVE:
● To learn data processing methods
● To learn how to build a data model
THEORY:
Statistical analysis in five steps
In this tutorial, a statistical analysis is viewed as the result of a number of data processing steps where each step increases the "value" of the data: Raw data → Technically correct data → Consistent data → Statistical results → Formatted output, with the corresponding activities being type checking and normalizing; fixing and imputing; estimating, analyzing and deriving; and tabulating and plotting. The first two steps together constitute data cleaning.
Figure 1 shows an overview of a typical data analysis project. Each rectangle represents data in a certain state, while each arrow represents the activities needed to get from one state to the other. The first state (Raw data) is the data as it comes in. Raw data files may lack headers, contain wrong data types (e.g. numbers stored as strings), wrong category labels, unknown or unexpected character encoding, and so on.
If you had to choose to be proficient in just one R-skill, it should be indexing. By indexing we
mean all the methods and tricks in R that allow you to select and manipulate data using logical,
integer or named indices. Since indexing skills are important for data cleaning, we quickly
review vectors, data.frames and indexing techniques. The most basic variable in R is a vector.
An R vector is a sequence of values of the same type. All basic operations in R act on vectors
(think of the element-wise arithmetic, for example). The basic types in R are as follows:
● numeric: Numeric data (approximations of the real numbers, ℝ)
● integer: Integer data (whole numbers, ℤ)
● factor: Categorical data (simple classifications, like gender)
● ordered: Ordinal data (ordered classifications, like educational level)
● character: Character data (strings)
● raw: Binary data
All basic operations in R work element-wise on vectors, where the shortest argument is recycled if
necessary. This goes for arithmetic operations (addition, subtraction,…), comparison operators
(==, <=,…), logical operators (&, |, !,…) and basic math functions like sin, cos, exp and so on.
If you want to brush up your basic knowledge of vector and recycling properties, you can execute
the following code and think about why it works the way it does.
(1:4) * (1:3)
Each element of a vector can be given a name. This can be done by passing named arguments to
the c() function or later with the names function. Such names can be helpful giving meaning to
your variables. For example, compare the vector
x <- c("red", "green", "blue", "yellow")
with the named version
capColor <- c(huey = "red", duey = "blue", louie = "green")
Obviously the second version is much more suggestive of its meaning. The names of a vector
need not be unique, but in most applications you'll want unique names (if any). Elements of a
vector can be selected or replaced using the square bracket operator [ ]. The square brackets
accept either a vector of names, index numbers, or a logical. In the case of a logical, the index is
recycled if it is shorter than the indexed vector. In the case of numerical indices, negative indices
omit, in stead of select elements. Negative and positive indices are not allowed in the same index
vector. You can repeat a name or an index number, which results in multiple instances of the
same value. You may check the above by predicting and then verifying the result of the following
statements.
capColor["louie"]
names(capColor)[capColor == "blue"]
x <- c(4, 7, 6, 5, 2, 8)
I <- x < 6
J <- x > 7
x[I | J]
x[c(TRUE, FALSE)]
x[c(-1, -2)]
Replacing values in vectors can be done in the same way. For example, you may check that in
the following assignment
x <- 1:10
x[c(TRUE, FALSE)] <- 1
every other value of x is replaced with 1. A list is a generalization of a vector in that it can contain
objects of different types, including other lists. There are two ways to index a list. The single
bracket operator always returns a sub-list of the indexed list. That is, the resulting type is again
a list. The double bracket operator ([[ ]]) may only result in a single item, and it returns the object
in the list itself. Besides indexing, the dollar operator $ can be used to retrieve a single element.
To understand the above, check the results of the following statements.
L <- list(x = c(1:5), y = c("a", "b", "c"), z = capColor)
L[["z"]]
Especially, use the class function to determine the type of the result of each statement. A
data.frame is not much more than a list of vectors, possibly of different types, but with every
vector (now columns) of the same length. Since data.frames are a type of list, indexing them
with a single index returns a sub-data.frame; that is, a data.frame with less columns. Likewise,
the dollar operator returns a vector, not a sub-data.frame. Rows can be indexed using two indices
in the bracket operator, separated by a comma. The first index indicates rows, the second
indicates columns. If one of the indices is left out, no selection is made (so everything is
returned). It is important to realize that the result of a two-index selection is simplified by R as
much as possible. Hence, selecting a single column using a two-index results in a vector. This
behaviour may be switched off using drop=FALSE as an extra parameter. Here are some short
examples demonstrating the above.
d <- data.frame(x = 1:10, y = letters[1:10], z = LETTERS[1:10])
d[1]
d[, "x", drop = FALSE]
d[2, ]
1.2.2 Special values
Like most programming languages, R has a number of special values that are exceptions to the normal values of a type. These are NA, NULL, ±Inf and NaN. Below, we
quickly illustrate the meaning and differences between them. NA Stands for not available. NA
is a placeholder for a missing value. All basic operations in R handle NA without crashing and
mostly return NA as an answer whenever one of the input arguments is NA. If you understand
NA, you should be able to predict the result of the following R statements.
NA + 1
sum(c(NA, 1, 2))
median(c(NA, 1, 2, 3), na.rm = TRUE)
length(c(NA, 2, 3, 4))
3 == NA
NA == NA
TRUE | NA
The function is.na can be used to detect NA's. NULL You may think of NULL as the empty set
from mathematics. NULL is special since it has no class (its class is NULL) and has length 0 so
it does not take up any space in a vector. In particular, if you understand NULL, the result of the
following statements should be clear to you without starting R.
x <- NULL
length(x)
c(x, 2)
The function is.null can be used to detect NULL variables. Inf Stands for infinity and only applies
to vectors of class numeric. A vector of class integer can never be Inf. This is because the Inf in
R is directly derived from the international standard for floating point arithmetic 1 . Technically,
Inf is a valid numeric that results from calculations like division of a number by zero. Since Inf
is a numeric, operations between Inf and a finite numeric are well-defined and comparison
operators work as expected. If you understand Inf, the result of the following statements should
be clear to you.
pi/0
2 * Inf
Inf - 1e+10
Inf + Inf
3 < -Inf
Inf == Inf
NaN Stands for not a number. This is generally the result of a calculation of which the result is
unknown, but it is surely not a number. In particular operations like 0/0, Inf-Inf and Inf/Inf result
in NaN. Technically, NaN is of class numeric, which may seem odd since it is used to indicate
that something is not numeric. Computations involving numbers and NaN always result in NaN,
so the result of the following computations should be clear.
NaN + 1
exp(NaN)
c. Data Transformations
A number of reasons can be attributed to when a predictive model crumples, such as inadequate pre-processing of the data, inadequate model validation, unjustified extrapolation, and over-fitting of the model to the existing data (Kuhn, 2013).
Before we dive into data preprocessing, let me quickly define a few terms that I will be commonly using.
● Predictor/Independent/Attributes/Descriptors – are the different terms that are used as input for the
prediction equation.
● Response/Dependent/Target/Class/Outcome – are the different terms that are referred to the
outcome event that is to be predicted.
In this article, I am going to summarize some common data pre-processing approaches with examples in R.
1. Centering and Scaling
Variable centering is perhaps the most intuitive approach used in predictive modeling. To center a predictor
variable, the average predictor value is subtracted from all the values. as a result of centering, the predictor
has zero mean.
To scale the data, each predictor value is divided by its standard deviation (sd). This helps in coercing the
predictor value to have a sd of one. Needless to mention, centering and scaling will work for continuous
data. The drawback of this activity is loss of interpretability of the individual values.
An R example:
# Load the default datasets
> library(datasets)
> data(mtcars)
> dim(mtcars)
[1] 32 11
> str(mtcars)
'data.frame': 32 obs. of 11 variables:
 $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num 160 160 108 258 360 ...
 $ hp  : num 110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num 2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num 16.5 17 18.6 19.4 17 ...
 $ vs  : num 0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num 1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
> cov(mtcars$disp, mtcars$cyl) # check for covariance
[1] 199.6603
> mtcars$disp.scl <- scale(mtcars$disp, center = TRUE, scale = TRUE)
> mtcars$cyl.scl <- scale(mtcars$cyl, center = TRUE, scale = TRUE)
> cov(mtcars$disp.scl, mtcars$cyl.scl) # check for covariance in scaled data
          [,1]
[1,] 0.9020329
2. Resolving Skewness
Skewness is a measure of shape. A common approach to check for skewness is to plot the predictor
variable. As a rule, negative skewness indicates that the mean of the data values is less than the median, and
the data distribution is left-skewed. Positive skewness would indicates that the mean of the data values is
larger than the median, and the data distribution is right-skewed.
● If the skewness of the predictor variable is less than -1 or greater than +1, the data is highly skewed,
● If the skewness of the predictor variable is between -1 and -1/2 or between +1 and +1/2 then the
data is moderately skewed,
● If the skewness of the predictor variable is -1/2 and +1/2, the data is approximately symmetric.
I will use the function skewness from the e1071 package to compute the skewness coefficient
An R example:
> library(e1071)
> engine.displ <- skewness(mtcars$disp)
> engine.displ
[1] 0.381657
3. Resolving Outliers
The outliers package provides a number of useful functions to systematically extract outliers. Some of these are convenient and come in handy, especially the outlier() and scores() functions.
Outliers
The function outlier() gets the most extreme observation from the mean.
If you set the argument opposite=TRUE, it fetches the extreme from the other side.
An R example:
> set.seed(4680) # for code reproducibility
> y <- rnorm(100) # create some dummy data
> library(outliers) # load the library
> outlier(y)
[1] 3.581686
> dim(y) <- c(20, 5) # convert it to a matrix
> head(y, 2) # look at the first 2 rows of data
           [,1]       [,2]      [,3]      [,4]       [,5]
[1,]  0.5850232  1.7782596  2.051887  1.061939 -0.4421871
[2,]  0.5075315 -0.4786253 -1.885140 -0.582283  0.8159582
> outlier(y) # now, check for outliers in the matrix
[1] -1.902847 -2.373839  3.581686  1.583868  1.877199
> outlier(y, opposite = TRUE)
[1]  1.229140  2.213041 -1.885140 -1.998539 -1.571196
> set.seed(4680)
> x = rnorm(10)
> scores(x) # z-scores => (x-mean)/sd
[1] 0.9510577 0.8691908 0.6148924 -0.4336304 -1.6772781 ...
> scores(x, type="chisq") # chi-sq scores => (x - mean(x))^2/var(x)
[1] 0.90451084 0.75549262 0.37809269 0.18803531 2.81326197 ...
> scores(x, type="t") # t scores
[1] 0.9454321 0.8562050 0.5923010 -0.4131696 -1.9073009
> scores(x, type="chisq", prob=0.9) # beyond 90th %ile based on chi-sq
[1] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
> scores(x, type="chisq", prob=0.95) # beyond 95th %ile
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> scores(x, type="z", prob=0.95) # beyond 95th %ile based on z-scores
[1] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
> scores(x, type="t", prob=0.95) # beyond 95th %ile based on t-scores
[1] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
Once outliers are identified, they can be treated in one of the following ways:
● A. Imputation: impute the outlier values with the mean, median or mode, in the same way as missing values (see the note on imputation below).
● B. Capping: For observations that lie outside the 1.5 * IQR limits, we could cap them by replacing those observations below the lower limit with the value of the 5th percentile and those above the upper limit with the value of the 95th percentile. For example, it can be done as shown below:
> par(mfrow=c(1, 2)) # for side by side plotting
> x <- mtcars$mpg
> plot(x)
> qnt <- quantile(x, probs=c(.25, .75), na.rm = T)
> caps <- quantile(x, probs=c(.05, .95), na.rm = T)
> H <- 1.5 * IQR(x, na.rm = T)
> x[x < (qnt[1] - H)] <- caps[1]
> x[x > (qnt[2] + H)] <- caps[2]
> plot(x)
Use the library DMwR or mice or rpart. If using DMwR, for every observation to be imputed, it identifies
‘k’ closest observations based on the euclidean distance and computes the weighted average (weighted based
on distance) of these ‘k’ obs. The advantage is that you could impute all the missing values in all variables
with one call to the function. It takes the whole data frame as the argument and you don’t even have to
specify which variable you want to impute. But be cautious not to include the response variable while
imputing.
There are many other types of transformations, like treating collinearity, dummy variable encoding, and covariance treatment.
REFERENCES:
https://cran.r-project.org/doc/contrib/de_Jonge+van_der_Loo-Introduction_to_data_cleaning_with_R.pdf
https://www.r-bloggers.com/data-transformations/
CONCLUSION:
Assignment: 6
AIM: Integrate Python and Hadoop and perform the following operations on the forest fire dataset
a. Data analysis using MapReduce in PyHadoop
b. Data mining in Hive
THEORY:
Example:
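The theory and example for this assignment are left blank in this manual. As a starting sketch, Python MapReduce code is commonly run on Hadoop through Hadoop Streaming, shown below for the UCI forest fires dataset; the file name forestfires.csv, the column positions (month in column 3, burned area as the last of 13 columns), and all paths are assumptions, not part of the original manual.

#!/usr/bin/env python3
# mapper.py - emit (month, burned_area) pairs from forest fire records
# Assumes comma-separated lines: X,Y,month,day,FFMC,DMC,DC,ISI,temp,RH,wind,rain,area
import sys

for line in sys.stdin:
    fields = line.strip().split(',')
    if len(fields) < 13 or fields[2] == 'month':   # skip short lines and the header
        continue
    month, area = fields[2], fields[12]
    try:
        print(f"{month}\t{float(area)}")
    except ValueError:
        continue

#!/usr/bin/env python3
# reducer.py - sum the burned area per month (streaming input arrives sorted by key)
import sys

current, total = None, 0.0
for line in sys.stdin:
    key, value = line.rstrip('\n').split('\t')
    if key != current:
        if current is not None:
            print(f"{current}\t{total}")
        current, total = key, 0.0
    total += float(value)
if current is not None:
    print(f"{current}\t{total}")

A job like this can be submitted with the Hadoop Streaming jar, for example: hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar -input /forestfire/forestfires.csv -output /forestfire/out -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py (the HDFS paths are illustrative).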
REFERENCES:
CONCLUSION:
Assignment: 6
AIM: To perform data preprocessing and model building.
PROBLEM STATEMENT /DEFINITION
Perform the following operations using Python on the Air quality and Heart Diseases data sets
1. Data cleaning
2. Data integration
3. Data transformation
4. Error correcting
5. Data model building
OBJECTIVE
● To understand the concept of Data Preprocessing
● To understand the methods of Data preprocessing.
THEORY:
Data Preprocessing:
Data preprocessing is the process of transforming raw data into an understandable format. It is
also an important step in data mining as we cannot work with raw data. The quality of the data
should be checked before applying machine learning or data mining algorithms.
Preprocessing of data is mainly done to check the data quality. The quality can be checked by measures such as accuracy, completeness, consistency, timeliness, believability, and interpretability.
1.Data cleaning:
Data Cleaning means the process of identifying the incorrect, incomplete, inaccurate, irrelevant
or missing part of the data and then modifying, replacing or deleting them according to the
necessity. Data cleaning is considered a foundational element of basic data science.
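For instance, a first cleaning pass on the heart disease dataset typically starts by inspecting the structure, missing values and duplicates; the file name heart.csv below is an assumption, and df.info() produces the summary shown next.

import pandas as pd

df = pd.read_csv('heart.csv')   # heart disease dataset (file name assumed)
print(df.isnull().sum())        # count missing values per column
df = df.drop_duplicates()       # remove exact duplicate rows
df.info()                       # structure summary (output shown below)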
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 303 non-null int64
1 sex 303 non-null int64
2 cp 303 non-null int64
3 trestbps 303 non-null int64
4 chol 303 non-null int64
5 fbs 303 non-null int64
6 restecg 303 non-null int64
7 thalach 303 non-null int64
8 exang 303 non-null int64
9 oldpeak 303 non-null float64
10 slope 303 non-null int64
11 ca 303 non-null int64
12 thal 303 non-null int64
13 target 303 non-null int64
dtypes: float64(1), int64(13)
memory usage: 33.3 KB
Inconsistent columns:
If your DataFrame contains columns that are irrelevant or that you are never going to use, you can drop them to give more focus to the columns you will work on. Let's see an example of how to deal with such a dataset (see the sketch below).
Changing the types to a uniform format: When you look at the dataset, you may notice that the 'type' column has values such as 'Industrial Area' and 'Industrial Areas', which actually mean the same thing, so let's standardize them to a single value.
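A minimal sketch of both clean-up steps on the air quality DataFrame; the dropped column names are assumptions, while the 'Industrial Areas' replacement follows the text above:
# drop columns that are irrelevant for the analysis (these column names are assumed)
df = df.drop(columns=['stn_code', 'sampling_date'], errors='ignore')

# standardize the 'type' column to a single spelling
df['type'] = df['type'].replace({'Industrial Areas': 'Industrial Area'})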
Creating a year column: To view the trend over a period of time, we need a year value for each row, and most of the values in the date column carry only the year anyway. So, let's create a new column holding year values.
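A minimal sketch, assuming the column holding the dates is called date:
# extract the year from the date column; unparseable dates become NaT/NaN
df['year'] = pd.to_datetime(df['date'], errors='coerce').dt.year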
Missing data:
It is rare to have a real-world dataset without any missing values, and when you start to work with real-world data you will find that most datasets contain them. Handling missing values is very important because, if you leave them as they are, they may affect your analysis and machine learning models. So, you need to check whether your dataset contains missing values, and if it does, you can perform any of these three tasks:
1. Leave them as they are
2. Fill in the missing values
3. Drop them
To fill the missing values we can use different methods. For example, Figure 4 shows that the air quality dataset has missing values. Columns such as SO2, NO2, rspm, spm and pm2_5 are the ones that contribute most to our analysis, so we need to remove the nulls from those columns to avoid inaccuracy in the prediction. We use the imputer from scikit-learn (Imputer in older releases, SimpleImputer in current ones) to fill the missing values in every column with the mean.
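A minimal sketch of the mean imputation described above. Recent scikit-learn releases replace the old sklearn.preprocessing.Imputer with SimpleImputer from sklearn.impute; the lower-case column names are assumptions about the air quality file:
from sklearn.impute import SimpleImputer

cols = ['so2', 'no2', 'rspm', 'spm', 'pm2_5']    # assumed column names
imputer = SimpleImputer(strategy='mean')          # fill NaNs with the column mean
df[cols] = imputer.fit_transform(df[cols])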
Outliers:
“In statistics, an outlier is a data point that differs significantly from other observations.”
That means an outlier is a data point that is significantly different from the other data points in the data set. Outliers can arise from errors in the experiments or from variability in the measurements. Let's look at an example to make the concept clear.
So, the question now is: how can we detect the outliers in the dataset?
For detecting the outliers we can use:
1. Box Plot
2. Scatter plot
3. Z-score etc.
Before we plot the outliers, let's change the labeling of the heart dataset for better visualization and interpretation.
df['target'] = df.target.replace({1: "Disease", 0: "No_disease"})
df['sex'] = df.sex.replace({1: "Male", 0: "Female"})
df['cp'] = df.cp.replace({0: "typical_angina",
1: "atypical_angina",
2:"non-anginal pain",
3: "asymtomatic"})
df['exang'] = df.exang.replace({1: "Yes", 0: "No"})
df['fbs'] = df.fbs.replace({1: "True", 0: "False"})
df['slope'] = df.slope.replace({0: "upsloping", 1:
"flat",2:"downsloping"})
df['thal'] = df.thal.replace({1: "fixed_defect", 2: "reversable_defect",
3:"normal"})
outliers(df[continous_features])
Drop Outliers
outliers(df[continous_features],drop=True)
Outliers from age feature removed
Outliers from trestbps feature removed
Outliers from chol feature removed
Outliers from thalach feature removed
Outliers from oldpeak feature removed
Duplicate rows:
Datasets may contain duplicate entries. Deleting duplicate rows is one of the easiest tasks: you can use dataset_name.drop_duplicates().
duplicated = df.duplicated().sum()
if duplicated:
    print("Duplicated rows :", duplicated)
else:
    print("No duplicates")
Duplicated rows : 1
duplicates = df[df.duplicated(keep=False)]
duplicates.head()
2.Data integration
So far, we've made sure to remove the impurities in data and make it clean. Now, the next step is to
combine data from different sources to get a unified structure with more meaningful and valuable
information. This is mostly used if the data is segregated into different sources. To make it simple,
let's assume we have data in CSV format in different places, all talking about the same scenario.
Say we have some data about an employee in a database. We can't expect all the data about the
employee to reside in the same table. It's possible that the employee's personal data will be located
in one table, the employee's project history will be in a second table, the employee's time-in and
time-out details will be in another table, and so on. So, if we want to do some analysis about the
employee, we need to get all the employee data in one common place. This process of bringing data
together in one place is called data integration. To do data integration, we can merge multiple
pandas DataFrames using the merge function.
In this exercise, we'll merge the details of students from two datasets, namely student.csv and
marks.csv. The student dataset contains columns such as Age, Gender, Grade, and Employed. The
marks.csv dataset contains columns such as Mark and City. The Student_id column is common
between the two datasets. Follow these steps to complete this exercise:
dataset1 ="https://github.com/TrainingByPackt/Data-Science-with-
Python/blob/master/Chapter01/Data/student.csv"
dataset2 = "https://github.com/TrainingByPackt/Data-Science-with-
81
PICT, TE-IT DS&BDA Laboratory
Python/blob/master/Chapter01/Data/mark.csv"
df1 = pd.read_csv(dataset1, header = 0)
df2 = pd.read_csv(dataset2, header = 0)
To print the first five rows of the first DataFrame, add the following code:
df1.head()
The preceding code generates the following output:
To print the first five rows of the second DataFrame, add the following code:
df2.head()
The preceding code generates the following output:
Student_id is common to both datasets. Perform data integration on both the DataFrames with
respect to the Student_id column using the pd.merge() function, and then print the first 10 values
of the new DataFrame:
df = pd.merge(df1, df2, on = 'Student_id')
df.head(10)
We have now learned how to perform data integration. In the next section, we'll explore another
pre-processing task, data transformation.
3.Data transformation
Previously, we saw how we can combine data from different sources into a unified dataframe. Now,
we have a lot of columns that have different types of data. Our goal is to transform the data into a
machine-learning-digestible format. All machine learning algorithms are based on mathematics. So,
we need to convert all the columns into numerical format. Before that, let's see all the different
types of data we have.
Taking a broader perspective, data is classified into numerical and categorical data:
● Discrete: To explain in simple terms, any numerical data that is countable is called
discrete, for example, the number of people in a family or the number of students in a
class. Discrete data can only take certain values (such as 1, 2, 3, 4, etc).
● Continuous: Any numerical data that is measurable is called continuous, for example,
the height of a person or the time taken to reach a destination. Continuous data can take
virtually any value (for example, 1.25, 3.8888, and 77.1276).
● Ordered: Any categorical data that has some order associated with it is called ordered
categorical data, for example, movie ratings (excellent, good, bad, worst) and feedback
(happy, not bad, bad). You can think of ordered data as being something you could
mark on a scale.
● Nominal: Any categorical data that has no order is called nominal categorical data.
Examples include gender and country.
From these different types of data, we will focus on categorical data. In the next section, we'll
discuss how to handle categorical data.
● High cardinality: Cardinality means uniqueness in data. The data column, in this case,
will have a lot of different values. A good example is User ID – in a table of 500
different users, the User ID column would have 500 unique values.
● Rare occurrences: These data columns might have variables that occur very rarely and
therefore would not be significant enough to have an impact on the model.
● Frequent occurrences: There might be a category in the data columns that occurs many
times with very low variance, which would fail to make an impact on the model.
● Won't fit: This categorical data, left unprocessed, won't fit our model.
Encoding
To address the problems associated with categorical data, we can use encoding. This is the
process by which we convert a categorical variable into a numerical form. Here, we will look
at three simple methods of encoding categorical data.
Replacing
This is a technique in which we replace the categorical data with a number. This is a simple
replacement and does not involve much logical processing. Let's look at an exercise to get a
better idea of this.
df['type'].value_counts()
RRO 179013
I 148069
RO 86791
S 15010
RIRUO 1304
R 158
The column type in the dataframe has 6 unique values, which we will be replacing with
numbers.
df['type'].replace({"RRO":1, "I":2, "RO":3,"S":4,"RIRUO":5,"R":6},
inplace= True)
df['type']
0 1.0
1 2.0
2 1.0
3 1.0
4 2.0
...
435734 5.0
435735 5.0
435736 5.0
435737 5.0
435738 5.0
Name: type, Length: 435735, dtype: float64
Label Encoding
This is a technique in which we replace each value in a categorical column with numbers from 0 to N-1. For example, say we've got a list of employee names in a column. After performing label encoding, each employee name will be assigned a numeric label. But this might not be suitable for all cases, because the model might consider the numeric values to be weights assigned to the data. Label encoding is the best method to use for ordinal data. The scikit-learn library provides LabelEncoder(), which helps with label encoding.
df['state'].value_counts()
Maharashtra 60382
Uttar Pradesh 42816
Andhra Pradesh 26368
Punjab 25634
Rajasthan 25589
Kerala 24728
Himachal Pradesh 22896
West Bengal 22463
Gujarat 21279
Tamil Nadu 20597
Madhya Pradesh 19920
Assam 19361
Odisha 19278
Karnataka 17118
Delhi 8551
Chandigarh 8520
Chhattisgarh 7831
Goa 6206
Jharkhand 5968
Mizoram 5338
Telangana 3978
Meghalaya 3853
Puducherry 3785
Haryana 3420
Nagaland 2463
Bihar 2275
Uttarakhand 1961
Jammu & Kashmir 1289
Daman & Diu 782
Dadra & Nagar Haveli 634
Uttaranchal 285
Arunachal Pradesh 90
Manipur 76
Sikkim 1
Name: state, dtype: int64
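A minimal sketch of label encoding the state column with scikit-learn's LabelEncoder (alphabetical fitting is what makes the df['state'] == 0 filter used further below select Andhra Pradesh):
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df['state'] = le.fit_transform(df['state'].astype(str))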
One-Hot Encoding
In label encoding, categorical data is converted to numerical data and the values are assigned labels (such as 1, 2, and 3). Predictive models that use this numerical data for analysis might sometimes mistake these labels for some kind of order (for example, a model might think that a label of 3 is "better" than a label of 1, which is incorrect). In order to avoid this confusion, we can use one-hot encoding. Here, the label-encoded data is further divided into n columns, where n denotes the total number of unique labels generated while performing label encoding. For example, say that three new labels are generated through label encoding. Then, while performing one-hot encoding, the data is divided into three columns, so the value of n is 3.
dfAndhra=df[(df['state']==0)]
dfAndhra
dfAndhra['location'].value_counts()
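A minimal sketch of one-hot encoding the location column with scikit-learn's OneHotEncoder (pd.get_dummies(dfAndhra['location']) is an equivalent pandas one-liner):
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(handle_unknown='ignore')
location_encoded = ohe.fit_transform(dfAndhra[['location']]).toarray()
print(location_encoded.shape)    # one column per unique location value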
You have successfully converted categorical data to numerical data using the OneHotEncoder
method.
4.Error Correction
In the heart dataset it can be observed that the feature 'ca' should range from 0–3; however, the data also contains the value 4. So let's find the '4' entries and change them to NaN.
df['ca'].unique()
df[df['ca']==4]
df.loc[df['ca']==4,'ca']=np.NaN
Similarly, the feature 'thal' should range from 1–3; however, the data contains two rows with the value '0'. They are also changed to NaN using the above-mentioned technique.
df = df.fillna(df.median())
df.isnull().sum()
Once you've pre-processed your data into a format that's ready to be used by your model, you
need to split up your data into train and test sets. This is because your machine learning algorithm
will use the data in the training set to learn what it needs to know. It will then make a prediction
about the data in the test set, using what it has learned. You can then compare this prediction
against the actual target variables in the test set in order to see how accurate your model is. The
exercise in the next section will give more clarity on this.
We will do the train/test split in proportions. The larger portion of the data split will be the train
set and the smaller portion will be the test set. This will help to ensure that you are using enough
data to accurately train your model.
In general, we carry out the train-test split with an 80:20 ratio, as per the Pareto principle. The
Pareto principle states that "for many events, roughly 80% of the effects come from 20% of the
causes." But if you have a large dataset, it really doesn't matter whether it's an 80:20 split or
90:10 or 60:40. (It can be better to use a smaller portion for the training set if the process is computationally intensive, but that might cause the problem of overfitting; this is discussed later.)
Create a variable called X to store the independent features. Use the drop() function to include
all the features, leaving out the dependent or the target variable, which in this case is named
‘target’ for heart dataset. Then, print out the top five instances of the variable. Add the
following code to do this:
X = df.drop('target', axis=1)
X.head()
The preceding code generates the following output:
1. Print the shape of your newly created feature matrix using the X.shape command:
X.shape
The preceding code generates the following output:
from sklearn import preprocessing
df = df.apply(preprocessing.LabelEncoder().fit_transform)
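The split itself can then be done with scikit-learn's train_test_split; a minimal sketch of the 80:20 split discussed above (random_state is an arbitrary choice here):
from sklearn.model_selection import train_test_split

X = df.drop('target', axis=1)
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)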
5. Print the shape of X_train, X_test, y_train, and y_test. Add the following code to do
this:
print("X_train : ",X_train.shape)
print("X_test : ",X_test.shape)
print("y_train : ",y_train.shape)
print("y_test : ",y_test.shape)
The preceding code generates the following output:
In the next section, you will complete an activity wherein you'll perform pre-processing on
a dataset.
Supervised Learning
Supervised learning is a learning system that trains using labeled data (data in which the target
variables are already known). The model learns how patterns in the feature matrix map to the
target variables. When the trained machine is fed with a new dataset, it can use what it has
learned to predict the target variables. This can also be called predictive modeling.
Supervised learning is broadly split into two categories. These categories are as follows:
Classification mainly deals with categorical target variables. A classification algorithm helps to
predict which group or class a data point belongs to.
When the prediction is between two classes, it is known as binary classification. An example is
predicting whether or not a person has a heart disease (in this case, the classes are yes and no).
If the prediction involves more than two target classes, it is known as multiclass classification; for example, predicting all the items that a customer will buy.
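As a hedged illustration of the model-building step from the problem statement (logistic regression is only one possible choice, not a model prescribed by the manual), a binary classifier for the heart data can be fitted on the split created earlier:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

model = LogisticRegression(max_iter=1000)    # plain logistic regression as a baseline
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Test accuracy:", accuracy_score(y_test, y_pred))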
Regression deals with numerical target variables. A regression algorithm predicts the
numerical value of the target variable based on the training dataset.
Linear regression measures the link between one or more predictor variables and one outcome
variable. For example, linear regression could help to enumerate the relative impacts of age,
gender, and diet (the predictor variables) on height (the outcome variable).
Time series analysis, as the name suggests, deals with data that is distributed with respect to
time, that is, data that is in a chronological order. Stock market prediction and customer churn
prediction are two examples of time series data. Depending on the requirement or the
necessities, time series analysis can be either a regression or classification task.
Unsupervised Learning
Unlike supervised learning, the unsupervised learning process involves data that is neither
classified nor labeled. The algorithm will perform analysis on the data without guidance. The
job of the machine is to group unclustered information according to similarities in the data. The
aim is for the model to spot patterns in the data in order to give some insight into what the data
is telling us and to make predictions.
An example is taking a whole load of unlabeled customer data and using it to find patterns to
cluster customers into different groups. Different products could then be marketed to the
different groups for maximum profitability.
● Clustering: A clustering procedure helps to discover the inherent patterns in the data.
● Association: An association rule is a unique way to find patterns associated with a large
amount of data, such as the supposition that when someone buys product 1, they also
tend to buy product 2.
CONCLUSION:
This assignment covers data preprocessing, which is one of the most crucial steps in machine learning.
References:
[1]What is Data Cleaning? How to Process Data for Analytics and Machine Learning
Modeling? | by Awan-Ur-Rahman | Towards Data Science
Assignment: 7
Aim: Visualize the data using R/Python by plotting the graphs for assignment no. 2 and 3
OBJECTIVE:
1. To understand and apply the analytical concept of Big Data using Python.
2. To study detailed data visualization techniques in Python programming.
SOFTWARE REQUIREMENTS:
1. Ubuntu 16.04 / 18.04
2. Python
THEORY:
1) R – Pie Charts
In R the pie chart is created using the pie() function which takes positive numbers as a vector
input. The additional parameters are used to control labels, color, title etc.
Syntax
The basic syntax for creating a pie-chart using the R is –
pie(x, labels, radius, main, col, clockwise)
Following is the description of the parameters used –
● x is a vector containing the numeric values used in the pie chart.
● labels is used to give description to the slices.
● radius indicates the radius of the circle of the pie chart (value between −1 and +1).
● main indicates the title of the chart.
● col indicates the color palette.
● clockwise is a logical value indicating if the slices are drawn clockwise or anti
clockwise.
Example
A very simple pie-chart is created using just the input vector and labels. The below script will
create and save the pie chart in the current R working directory.
# Create data for the graph.
x <- c(21, 62, 10, 53)
labels <- c("London", "New York", "Singapore", "Mumbai")
# Give the chart file a name.
png(file = "city.jpg")
# Plot the chart.
pie(x, labels)
# Save the file.
dev.off()
When we execute the above code, it produces the following result –
2) R – Bar Charts
Syntax
The basic syntax to create a bar-chart in R is –
barplot(H, xlab, ylab, main, names.arg, col)
Following is the description of the parameters used –
● H is a vector or matrix containing numeric values used in bar chart.
● xlab is the label for the x axis.
● ylab is the label for the y axis.
● main is the title of the bar chart.
● names.arg is a vector of names appearing under each bar.
● col is used to give colors to the bars in the graph.
Example
A simple bar chart is created using just the input vector and the name of each bar.
The below script will create and save the bar chart in the current R working directory.
# Create the data for the chart.
H <- c(7,12,28,3,41)
# Give the chart file a name.
png(file = "barchart.png")
# Plot the bar chart.
barplot(H)
# Save the file.
dev.off()
When we execute the above code, it produces the following result –
3) R – Boxplots
Example
We use the data set “mtcars” available in the R environment to create a basic boxplot. Let’s look
at the columns “mpg” and “cyl” in mtcars.
input <- mtcars[,c('mpg','cyl')]
print(head(input))
When we execute above code, it produces following result –
mpg cyl
Mazda RX4 21.0 6
Mazda RX4 Wag 21.0 6
Datsun 710 22.8 4
Hornet 4 Drive 21.4 6
Hornet Sportabout 18.7 8
Valiant 18.1 6
Creating the Boxplot
The below script will create a boxplot graph for the relation between mpg (miles per gallon) and
cyl (number of cylinders).
# Give the chart file a name.
png(file = "boxplot.png")
# Plot the chart.
boxplot(mpg ~ cyl, data = mtcars, xlab = "Number of Cylinders",
        ylab = "Miles Per Gallon", main = "Mileage Data")
# Save the file.
dev.off()
When we execute the above code, it produces the following result –
4) R – Histograms
A histogram represents the frequencies of values of a variable bucketed into ranges. A histogram is similar to a bar chart, but the difference is that it groups the values into continuous ranges. Each bar in a histogram represents the number of values present in that range.
R creates a histogram using the hist() function. This function takes a vector as an input and uses some more parameters to plot histograms.
Syntax
The basic syntax for creating a histogram using R is –
hist(v,main,xlab,xlim,ylim,breaks,col,border)
Following is the description of the parameters used –
● v is a vector containing numeric values used in histogram.
● main indicates title of the chart.
● col is used to set color of the bars.
● border is used to set border color of each bar.
● xlab is used to give description of x-axis.
● xlim is used to specify the range of values on the x-axis.
● ylim is used to specify the range of values on the y-axis.
● breaks is used to control the number (or the boundaries) of the bins.
Example
A simple histogram is created using the input vector and the label, col and border parameters. Such a script creates and saves the histogram in the current R working directory.
Fig. 4 Histogram
5) R – Line Charts
Syntax
The basic syntax to create a line chart in R is –
plot(v,type,col,xlab,ylab)
Following is the description of the parameters used –
● v is a vector containing the numeric values.
● type takes the value “p” to draw only the points, “l” to draw only the lines and “o” to
draw both points and lines.
● xlab is the label for x axis.
● ylab is the label for y axis.
● main is the Title of the chart.
● col is used to give colors to both the points and lines.
Example
A simple line chart is created using the input vector and the type parameter as "o". The below script will create and save a line chart in the current R working directory.
# Create the data for the chart.
v <- c(7,12,28,3,41)
# Give the chart file a name.
png(file = "line_chart.jpg")
# Plot the line chart.
plot(v, type = "o")
# Save the file.
dev.off()
6) R – Scatterplots
Example
We use the data set “mtcars” available in the R environment to create a basic scatterplot. Let’s
use the columns “wt” and “mpg” in mtcars.
input <- mtcars[,c('wt','mpg')]
print(head(input))
When we execute the above code, it produces the following result –
wt mpg
Mazda RX4 2.620 21.0
Mazda RX4 Wag 2.875 21.0
Datsun 710 2.320 22.8
Hornet 4 Drive 3.215 21.4
Hornet Sportabout 3.440 18.7
Valiant 3.460 18.1
Creating the Scatter plot
The below script will create a scatterplot graph for the relation between wt(weight) and
mpg(miles per gallon).
Example
Each variable is paired up with each of the remaining variables, and a scatterplot is plotted for each pair.
# Give the chart file a name.
png(file = "scatterplot_matrices.png")
# Plot the matrices between 4 variables giving 12 plots.
# One variable with 3 others and total 4 variables.
pairs(~wt + mpg + disp + cyl, data = mtcars,
      main = "Scatterplot Matrix")
# Save the file.
dev.off()
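Although the examples above use R, the aim allows Python as well; a minimal matplotlib/pandas sketch of the same chart types (assuming the mtcars data has been exported to a local mtcars.csv, since Python does not ship it built in):
import pandas as pd
import matplotlib.pyplot as plt

mtcars = pd.read_csv("mtcars.csv")                     # assumed local export of R's mtcars data
fig, axes = plt.subplots(2, 2, figsize=(10, 8))
axes[0, 0].pie([21, 62, 10, 53],
               labels=["London", "New York", "Singapore", "Mumbai"])   # pie chart
axes[0, 1].bar(range(5), [7, 12, 28, 3, 41])           # bar chart
axes[1, 0].hist(mtcars["mpg"])                         # histogram of mpg
axes[1, 1].scatter(mtcars["wt"], mtcars["mpg"])        # scatter plot of wt vs mpg
axes[1, 1].set_xlabel("wt")
axes[1, 1].set_ylabel("mpg")
plt.tight_layout()
plt.savefig("charts.png")                              # save the figure, like png()/dev.off() in R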
CONCLUSION: Thus we have learnt to visualize data using R/Python by plotting different graphs.
Assignment: 8
AIM: Perform different data visualization operations using Tableau
PROBLEM STATEMENT /DEFINITION
Study following data visualization operations using Tableau on Adult, Iris datasets or retail
dataset.
1) 1D (Linear) Data visualization
2) 2D (Planar) Data Visualization
3) 3D (Volumetric) Data Visualization
4) Temporal Data Visualization
5) Multidimensional Data Visualization
6) Tree/ Hierarchical Data visualization
7) Network Data visualization
Perform the data visualization operations using Tableau to get answers to various
business questions on Retail dataset.
a. Find and Plot top 10 products based on total sale
b. Find and Plot product contribution to total sale
c. Find and Plot the month wise sales in year 2010 in descending order
d. Find and Plot most loyal customers based on purchase order
e. Find and Plot yearly sales comparison
f. Find and Plot country wise total sales price and show on Geospatial graph
g. Find and Plot country wise popular product
h. Find and Plot bottom 10 products based on total sale
i. Find and Plot top 5 purchase order
j. Find and Plot most popular products based on sales
k. Find and Plot half yearly sales for the year 2011
l. Find and Plot country wise total sales quantity and show on Geospatial graph
OBJECTIVE:
To learn specialized data visualization tools.
To learn different types of data visualization techniques.
SOFTWARE REQUIREMENTS:
1. Ubuntu 16.04 / 18.04
2. Tableau
THEORY:
Introduction:
Data visualization (or data visualisation) is viewed by many disciplines as a modern equivalent of visual communication. It involves the creation and study of the visual representation of data, meaning "information that has been abstracted in some schematic form, including attributes or variables for the units of information".
Data visualization refers to the techniques used to communicate data or information by encoding it as visual objects (e.g., points, lines or bars) contained in graphics. The goal is to communicate information clearly and efficiently to users. It is one of the steps in data analysis or data science.
1D/Linear
Examples:
● lists of data items, organized by a single feature (e.g., alphabetical order) (not
commonly visualized)
3D/Volumetric
Examples:
● Computer Simulations
Temporal
Examples:
● Timeline
Fig. 3 Time Series
Multidimensional:
Examples (category proportions, counts):
● Histogram
Fig. 4 Histogram
● pie chart
● Dendrogram:
Fig. 8 Dendrogram
● Network:
o node-link diagram (link-based layout algorithm)
There are three basic steps involved in creating any Tableau data analysis report.
These three steps are −
● Connect to a data source − It involves locating the data and using an appropriate type of connection to read the data.
● Choose dimensions and measures − This involves selecting the required columns from the source data for analysis.
● Apply visualization technique − This involves applying the required visualization methods, such as a specific chart or graph type, to the data being analyzed.
You can apply a technique of adding another dimension to the existing data. This will add
more colors to the existing bar chart as shown in the following screenshot.
Conclusion: Thus we have learnt how to visualize different types of data (1D/Linear, 2D/Planar, 3D/Volumetric, Temporal, Multidimensional, Tree/Hierarchical and Network data) using the Tableau software.
Assignment: 9
AIM : To create a review scraper for any ecommerce website
PROBLEM STATEMENT /DEFINITION
Create a review scraper for any ecommerce website to fetch real time comments, reviews, ratings, comment tags, customer name using Python
OBJECTIVE
● To understand the concept of Web scraping
● To understand the methods used in web scraping.
THEORY:
Web scraping, web harvesting, or web data extraction is data scraping used for extracting data
from websites. The web scraping software may directly access the World Wide Web using the
Hypertext Transfer Protocol or a web browser. While web scraping can be done manually by a
software user, the term typically refers to automated processes implemented using a bot or web
crawler. It is a form of copying in which specific data is gathered and copied from the web,
typically into a central local database or spreadsheet, for later retrieval or analysis. Web scraping
a web page involves fetching it and extracting data from it. [1]
The beautifulsoup4 library is used here to scrape the data. The required Python libraries are:
1. The requests library to make network requests
To scrape data from a website, we need to extract the content of the webpage. Once the request
is made to a website, the entire content of the webpage is available, and we can then evaluate the
web content to extract data out from it. The content is made available in the form of plain text.
2. The html5lib library for parsing HTML
Once the content is available, we need to specify the library that represents the parsing logic for
the text available. We’ll be using the html5lib library to parse the text content to HTML DOM-
based representation.
3. The beautifulsoup4 library for navigating the HTML tree structure
beautifulsoup4 takes the raw text content and parsing library as the input parameters. In our
example, we have exposed html5lib as a parsing library. It can then be used to navigate and
search for elements from the parsed HTML nodes. It can pull data out from the HTML nodes
and extract/search required nodes from HTML structure.
Making the Request for the Web Content
Let's make the web request for the website to be scraped. We will be using the requests library.
To start using the requests library, we need to install the third-party library using the following
command
pip install requests
We will be scraping the Amazon e-commerce website for customer names, ratings and reviews.
Let’s first make a request to extract the content for the specified website. request.get makes a
request to the webpage, which returns back the raw HTML content.
CODE:
from bs4 import BeautifulSoup as bs
import requests
link='https://www.amazon.in/OnePlus-Mirror-Black-128GB-Storage/product-reviews/B07DJHV6VZ/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews'
page = requests.get(link)
page
soup = bs(page.content,'html.parser')
print(soup.prettify())
names = soup.find_all('span',class_='a-profile-name')
names
[<span class="a-profile-name">Tanmay Shukla</span>,
<span class="a-profile-name">Surbhi Garg</span>,
<span class="a-profile-name">Tanmay Shukla</span>,
<span class="a-profile-name">Surbhi Garg</span>,
<span class="a-profile-name">Saroj N.</span>,
<span class="a-profile-name">klknow</span>,
<span class="a-profile-name">abdulkadir garari</span>,
<span class="a-profile-name">Mani</span>,
<span class="a-profile-name">Anshu K.</span>,
<span class="a-profile-name">Sneha</span>,
<span class="a-profile-name">nagaraj s.</span>,
<span class="a-profile-name">Aakash Sinha</span>]
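The plain list of names shown next is obtained by extracting the text from each span tag; a minimal sketch, storing the result as cust_name for the DataFrame built later:
cust_name = [tag.get_text() for tag in names]    # plain customer names
cust_name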
['Tanmay Shukla',
'Surbhi Garg',
'Tanmay Shukla',
'Surbhi Garg',
'Saroj N.',
'klknow',
'abdulkadir garari',
'Mani',
'Anshu K.',
'Sneha',
'nagaraj s.',
'Aakash Sinha']
</a>,
<a class="a-size-base a-link-normal review-title a-color-base review-
title-content a-text-bold" data-hook="review-title" href="/gp/customer-
reviews/R15DF987OXR3ES?ASIN=B07DJHV6VZ">
<span>Worst phone</span>
</a>,
<a class="a-size-base a-link-normal review-title a-color-base review-
title-content a-text-bold" data-hook="review-title" href="/gp/customer-
reviews/R33GOMNOW0H21X?ASIN=B07DJHV6VZ">
<span>Dead on arrival</span>
</a>,
<a class="a-size-base a-link-normal review-title a-color-base review-
title-content a-text-bold" data-hook="review-title" href="/gp/customer-
reviews/R22HPI22GRD028?ASIN=B07DJHV6VZ">
<span>Not worth to buy 6T</span>
</a>,
<a class="a-size-base a-link-normal review-title a-color-base review-
title-content a-text-bold" data-hook="review-title" href="/gp/customer-
reviews/R1EJO3TGNM3R9X?ASIN=B07DJHV6VZ">
<span>Battery problem and disappointing customer support.</span>
</a>,
<a class="a-size-base a-link-normal review-title a-color-base review-
title-content a-text-bold" data-hook="review-title" href="/gp/customer-
reviews/RP84LWZU2465S?ASIN=B07DJHV6VZ">
<span>Beautiful phone</span>
</a>,
<a class="a-size-base a-link-normal review-title a-color-base review-
title-content a-text-bold" data-hook="review-title" href="/gp/customer-
reviews/R2LUMMIRPSJBC0?ASIN=B07DJHV6VZ">
<span>Only one side of speakers is working</span>
</a>,
<a class="a-size-base a-link-normal review-title a-color-base review-
title-content a-text-bold" data-hook="review-title" href="/gp/customer-
reviews/R17C4CZRJA22GP?ASIN=B07DJHV6VZ">
<span>Awesome value for the money</span>
</a>,
<a class="a-size-base a-link-normal review-title a-color-base review-
title-content a-text-bold" data-hook="review-title" href="/gp/customer-
reviews/R3E54TMP6XTNA4?ASIN=B07DJHV6VZ">
<span>OnePlust 6T McLaren Edition - Salute to Speed!</span>
</a>]
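The review-title anchors shown above, along with the remaining fields, can be collected the same way as the names. A minimal sketch follows; the class names used for the rating and review-body tags are assumptions about Amazon's markup and may need adjusting:
title_tags = soup.find_all('a', class_='review-title')
review_title = [t.get_text().strip() for t in title_tags]

rate_tags = soup.find_all('i', class_='review-rating')        # assumed class name for star ratings
rate = [r.get_text().strip() for r in rate_tags]

content_tags = soup.find_all('span', class_='review-text')    # assumed class name for review bodies
review_content = [c.get_text().strip() for c in content_tags]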
#Similarly all the required fields are scraped and dataframe is formed as
given below
import pandas as pd
df = pd.DataFrame()
df['Customer Name']=cust_name
df['Review title']=review_title
df['Ratings']=rate
df['Reviews']=review_content
df
CONCLUSION:
Beautiful Soup was used to scrape customer reviews from the Amazon website.
REFERENCES :
[1] Web scraping, Wikipedia
Part C
Model Implementation
Assignment: 09
Aim: Create a review scraper for any ecommerce website to fetch real time comments, reviews, ratings, comment tags, customer name using Python.
Theory:
Example:
References:
Assignment: 10
AIM: Design and Implement Mini Project in Data Science
PROBLEM STATEMENT /DEFINITION
Design and Implement any Data science Application using R/Python. Obtain Data, Scrub data
(Data cleaning), Explore data, Prepare and validate data model and Interpret data (Data
Visualization). Visualize data using any visualization tool like Matplotlib, ggplot, Tableau etc.
Prepare Project Report.
OBJECTIVE:
1. To explore the Data science project life cycle.
2. To identify need of project and define problem statement.
3. To extract and process data.
4. To interpret and analyze results using data visualization.