Data Using R
SYLLABUS
What is Data Science, Scenarios on Data Science, Data Science and Organization, Different
types of data, Structured data, Unstructured data, Machine generated data, Understanding on
Data Science Process, Explain on Research Goal, Data Processing on Data Science, Getting
Start With R, Overview of R, Why R for Data Science, Eclipse, Live-R, Project Workspace
Setup, Understanding on R Packages, Load Libraries and Installed Packages.
Data Types and Syntax, Processing on Variables, Data Items on Structure, Classes and
Manipulate Objects, Control statements IF, ELSE, SWITCH, Loop statements, FOR, WHILE,
REPEAT, Working with String and Date, Understanding on Vector, List, Data Frames,
Working with Arrays, Read and Write data from CSV, Tabular Data and Database.
UNIT 3: CLASSIFICATION IN R
UNIT 4: CLUSTERING IN R
Overview of Data Visualization, Packages, Interactive Graphics, Plotting, Scatter plot, Box
plot, Bar plot, Pie chart, Histogram, XKCD-Style Plots, Heat Maps, Introduction to predictive
models, What is a Model and how to build a model.
MODULE 1
MODULE 2
MODULE 3
MODULE 5
MODULE 6
MODULE 7
MODULE 8
MODULE 10
MODULE 1
Data Science
Data science is the deep study of massive amounts of data. It involves
extracting meaningful insights from raw, structured, and unstructured data
that is processed using the scientific method, different technologies, and
algorithms.
o Gaming world:
In the gaming world, the use of machine learning algorithms is
increasing day by day. EA Sports, Sony, and Nintendo are widely using
data science to enhance the user experience.
o Internet Search:
While searching on the internet, we use different types of search
engines such as Google, Yahoo, Bing, Ask, etc. All these search engines
use data science technology to make the search experience better,
and you can get search results within a fraction of a second.
o Recommendation Systems:
Most companies, such as Amazon, Netflix, and Google Play, use
data science technology to create a better user experience
with personalized recommendations.
o Risk Detection:
Finance industries have always had issues with fraud and the risk of losses,
but with the help of data science these can be reduced. Most finance
companies are looking for data scientists to help them avoid risk and
losses while increasing customer satisfaction.
1.5 DATA
1) Qualitative Data
2) Quantitative Data
1.5.1.1 Qualitative Data:
1. Nominal Data
2. Ordinal Data
1. Nominal Data
Nominal Data: Nominal data don't provide any quantitative value, nor can we perform any arithmetic operations on them.
Ordinal Data: Ordinal data provide a sequence; we can assign numbers to ordinal data, but we still cannot perform arithmetic operations on them.
Quantitative data can be measured, not simply observed. They can be
represented numerically, and calculations can be performed on them.
It answers the questions like “how much,” “how many,” and “how often.” For
example, the price of a phone, the computer’s ram, the height or weight of a
person, etc., falls under quantitative data.
1. Discrete Data
2. Continuous Data
1. Discrete Data
• The discrete data contain the values that fall under integers or whole
numbers.
• These data can’t be broken into decimal or fraction values.
• The discrete data are countable and have finite values; their
subdivision is not possible.
• These data are represented mainly by a bar graph, number line, or
frequency table.
Examples of Discrete Data: the number of students in a class, the number of
pages in a book, the total number of players in a team.

2. Continuous Data
Continuous data can take any value, including fractional and decimal values,
within a given range.
Examples of Continuous Data:
• Height of a person
• Speed of a vehicle
• "Time-taken" to finish the work
• Wi-Fi frequency
• Market share price
Discrete Data vs. Continuous Data:
• Discrete data are represented mainly by bar graphs; continuous data are represented in the form of a histogram.
• Discrete values cannot be divided into smaller subdivisions; continuous values can be divided into smaller subdivisions.
• Discrete data have spaces between the values; continuous data are in the form of a continuous sequence.
Machine-generated data is a term often used to describe data that has
been generated by an organization's industrial control systems and
mechanical devices that are designed to carry out a single function.
Computers generate log files that include information about the system's
operation. A log file is made up of a series of log lines that show various
system actions, such as saving or deleting a file, connecting to a Wi-Fi
network, installing new software, opening an application, attaching a
Bluetooth device, emptying a recycle bin, and more.
Some types of computer log data are shared with the manufacturers of
computers, operating systems, applications, and programs, while others are
kept locally and confidentially.
3. Geotag Data
The Machine Data connected with telephone calls is referred to as a call log
or call detail record. The automated process of gathering, recording, and
evaluating data regarding phone calls is known as call logging.
The call duration, start and finish times of the call, the caller and recipient's
locations, as well as the network utilized, are all recorded in the logs.
An application log is a file that keeps track of the activities that occur within a
software application. Despite the fact that human users initiate the actions,
the Machine Data referred to here is generated automatically rather than
being manually entered.
2.2 Introduction to R
2.3 Applications of R
2.4 R Installation
2.5 Packages in R
2.6 Working With R Environment
2.3 APPLICATIONS OF R:
• Tech giants like Google, Facebook, Bing, Twitter, Accenture, Wipro
and many more use R nowadays.
• The Consumer Financial Protection Bureau uses R for data analysis
• Bank of America uses R for reporting.
• R is part of the technology stack behind Foursquare's famed
recommendation engine.
2.4 R INSTALLATION
1. Go to www.cran.r-project.org website
Run the R executable file to start installation, and allow the app to
make changes to your device.
1. Go to www.posit.com website:
Repositories:
A repository is a place where packages are located and stored, and from
which they can be installed. Organizations and developers can maintain
local repositories, but repositories are typically online and accessible to
everyone. Some of the most popular repositories for R packages are:
• Installing a package from CRAN is the most common and easiest way, as
we just have to use a single command. In order to install more than one
package at a time, we just have to pass them as a character vector in the
first argument of the install.packages() function:
Example:
install.packages("ggplot2")
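To install more than one package in a single call, pass a character vector (the package names here are just illustrative):

install.packages(c("ggplot2", "dplyr"))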
Under Packages, type and search for the package which we want to install,
and then click on the Install button.
Loading a Package
Syntax:
library(“package_name”)
Example:
library("ggplot2")
• After pressing the Enter key, the R interpreter executes and returns the
answer to the user.
Output:

> 1/2
[1] 0.5
> 11^2
[1] 121
var_name, var.name - Valid: a variable can start with a dot, but the dot should
not be followed by a number; in that case, the variable name is invalid.
Assignment of variables
In R programming, there are three operators which can be used to assign
values to a variable: the leftward (<-), rightward (->), and equal to (=)
operators.
There are two functions which are used to print the value of a variable, i.e.,
print() and cat(). The cat() function combines multiple values into a
continuous print output, as sketched below.
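A short sketch of all three assignment operators together with both printing functions:

# leftward, equal-to, and rightward assignment
x <- 10
y = 20
30 -> z
print(x)
cat("y is", y, "and z is", z, "\n")

Output:
[1] 10
y is 20 and z is 30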
Variables can store data of different types, and different types can do
different things.
In R, variables do not need to be declared with any particular type, and can
even change type after they have been set:
Example
# numeric
x <- 10.5
class(x)
# integer
x <- 1000L
class(x)
# complex
x <- 9i + 3
class(x)
# character/string
x <- "R is exciting"
class(x)
# logical/boolean
x <- TRUE
class(x)
[1] "numeric"
[1] "integer"
[1] "complex"
[1] "character"
[1] "logical"
Control statements are expressions used to control the execution and flow
of the program based on the conditions provided in the statements.
R if Statement
The block of code inside the if statement will be executed only when the
boolean expression evaluates to true. If the expression evaluates to false, the
code after the if block will run instead.

if (boolean_expression) {
  # If the boolean expression is true, then the statement(s) will be executed.
}
Let's see some examples to understand how if statements work and perform
certain tasks in R.
Example 1
x <- 24L
y <- "shubham"
if (is.integer(x)) {
  print("x is an Integer")
}

Output:
[1] "x is an Integer"
If-else statement
if (boolean_expression) {
  # statement(s) will be executed if the boolean expression is true.
} else {
  # statement(s) will be executed if the boolean expression is false.
}
Flow Chart
Example 1
# local variable definition
a <- 100
# checking the boolean condition
if (a < 20) {
  # if the condition is true then print the following
  cat("a is less than 20\n")
} else {
  # if the condition is false then print the following
  cat("a is not less than 20\n")
}
cat("The value of a is", a)

Output:
a is not less than 20
The value of a is 100
1. An if statement can have zero or one else statement, and it must
come after any else if statements.
2. An if statement can have many else if statements, and they come before
the else statement.
if (boolean_expression_1) {
  # This block executes when boolean expression 1 is true.
} else if (boolean_expression_2) {
  # This block executes when boolean expression 2 is true.
} else if (boolean_expression_3) {
  # This block executes when boolean expression 3 is true.
} else {
  # This block executes when none of the above conditions is true.
}
Example 1
age <- readline(prompt = "Enter age: ")
age <- as.integer(age)
if (age < 18) {
  print("You are child")
} else if (age > 30) {
  print("You are old guy")
} else {
  print("You are adult")
}

Output (for an input of 24):
Enter age: 24
[1] "You are adult"
o If there is more than one match, the first match element is used.
There are basically two ways in which one of the cases is selected:
1) Based on Index
If the cases are values like a character vector, and the expression evaluates
to a number, then the expression's result is used as an index to select the case.

2) Based on Matching Value
When the cases have both a case value and an output value, like
["case_1" = "value1"], then the expression value is matched against the case
values. If there is a match with a case, the corresponding value is the
output.
Flow Chart
Example 1
x <- switch(3, "Shubham", "Nishka", "Gunjan", "Sumit")
print(x)

Output:
[1] "Gunjan"
A repeat loop is constructed with the help of the repeat keyword in R. It is
very easy to construct an infinite loop in R.
repeat {
  commands
  if (condition) {
    break
  }
}
1. First, we initialize the variables; then the flow enters the repeat loop.
2. The group of statements inside the loop is executed.
3. The condition is checked; if it is true, a break statement executes and
exits the loop.
4. If the condition is false, the statements inside the repeat loop are
executed again.
Example 1:
v <- c("Hello", "repeat", "loop")
cnt <- 2
repeat {
  print(v)
  cnt <- cnt + 1
  if (cnt > 5) {
    break
  }
}
In a while loop, the condition is checked first, and the body of the loop
executes only while it remains true. For a loop body that runs n times, the
condition is checked n+1 times, since the final check fails and terminates
the loop.
while (test_expression) {
  statement
}
Example 1:
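A minimal sketch of a counting loop:

cnt <- 1
while (cnt <= 5) {
  print(cnt)
  cnt <- cnt + 1
}

Output:
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5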
3.8 R – STRINGS
• Strings are sequences of characters. A string behaves as a one-dimensional
array of characters.
Creation of String
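A sketch consistent with the output shown below (the variable names are inferred from that output):

# strings can be created with double or single quotes
str1 <- "OK1"
str2 <- 'OK2'
# double quotes can enclose single quotes
str3 <- "This is 'acceptable and 'allowed' in R"
# single quotes can enclose double quotes
str4 <- 'Hi, Wondering "if this "works"'
cat("String 1 is:", str1, "\n")
cat("String 2 is:", str2, "\n")
cat("String 3 is:", str3, "\n")
cat("String 4 is:", str4, "\n")
# mixing the delimiters incorrectly raises an error:
# str5 <- 'hi, ' this"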
Output:
String 1 is: OK1
String 2 is: OK2
String 3 is: This is 'acceptable and 'allowed' in R
String 4 is: Hi, Wondering "if this "works"
Error: unexpected symbol in " str5 <- 'hi, ' this"
Execution halted
Length of String
The length of strings indicates the number of characters present in the string.
The function str_length(), belonging to the 'stringr' package,
or the nchar() built-in function of R can be used to determine the length of
strings in R.
Example 1: Using the str_length() function

# Importing package
library(stringr)
# an illustrative five-character string
str_length("Learn")

Output:
[1] 5

Example 2: Using the nchar() function

# an illustrative six-character string
nchar("String")

Output:
[1] 6
# R program to access characters in a string
str <- "Learn Code"   # string inferred from the slicing examples below
print(substr(str, 1, 1))

Output:
[1] "L"
If the starting index is equal to the ending index, the corresponding character
of the string is accessed. In this case, the first character, ‘L’ is printed.
# substring() function (indices assumed)
print(substring(str, 2, 2))

Output:
[1] "e"
The following R code indicates the mechanism of String Slicing, where in the
substrings of a string are extracted:
print(substr(str, 1, 4))
print(substring(str, 8, 10))

Output:
[1] "Lear"
The first print statement prints the first four characters of the string. The
second print statement prints the substring from the indexes 8 to 10, which
is “ode”.
• These functions are used to format and convert the date from one
form to another form.
Specifier Description
%a Abbreviated weekday
%A Full weekday
%b Abbreviated month
%B Full month
%C Century
Note: To get today's date, R provides a function called Sys.Date(), which
returns the current date.
Weekday:
Here we look into the %a, %A, and %u specifiers, which give the abbreviated
weekday, the full weekday, and the numbered weekday starting from Monday.
Example:
# today's date
date <- Sys.Date()
# abbreviated weekday
format(date, format="%a")
# full weekday
format(date, format="%A")
# numbered weekday
format(date, format="%u")
Output
[1] "Sat"
[1] "Saturday"
[1] "6"
Date:
Let’s look into the day, month, and year format specifiers to represent dates
in different formats.
Example:
# today date
date<-Sys.Date()
date
# day in month
format(date,format="%d")
# month in year
format(date,format="%m")
format(date,format="%b")
# full month
format(date,format="%B")
# Date
format(date,format="%D")
format(date,format="%d-%b-%y")
Output
[1] "2022-04-02"
[1] "02"
[1] "04"
[1] "Apr"
[1] "April"
[1] "04/02/22"
[1] "02-Apr-22"
Year:
The year can be formatted in different forms. %y, %Y, and %C are a few
format specifiers that return the year without the century, the year with the
century, and the century of the given date, respectively.
Example:
date<-Sys.Date()
format(date,format="%y")
format(date,format="%Y")
# century
format(date,format="%C")
Output
[1] "22"
[1] "2022"
[1] "20"
4.1 R DATA STRUCTURES
R's base data structures are often organized by their dimensionality (1D,
2D, or nD) and by whether they're homogeneous (all elements must be of the
identical type) or heterogeneous (the elements can be of various types).
This gives rise to the six data types which are most frequently utilized in
data analysis.
1. Vector
2. List
3. Array
4. Matrices
5. Data Frame
6. Factor
4.1.1 R VECTORS
A vector is the basic data structure in R that stores data of similar types.
For example,
Create a Vector in R
For example,
# create vector of string types
employees <- c("Max", "James", "Stacy")
print(employees)
Output:
[1] "Max"   "James" "Stacy"

In the above example, we created a vector named employees with the
elements Max, James, and Stacy.
Here, the c() function creates a vector by combining three different elements
of employees together.
# create vector of string types (the middle element is assumed)
languages <- c("Swift", "Python", "R")
print(languages[1])  # "Swift"
print(languages[3])  # "R"

In the above example, we created a vector named languages. Each element of
the vector is associated with an integer number (its index).
Vector Indexing in R
Here, we use the vector index to access the vector elements.
Note: In R, the vector index always starts with 1. Hence, the first element of a
vector is present at index 1, the second element at index 2, and so on.
Numeric Vector in R
Here, the c() function creates a vector of numeric values called numbers, and
the : operator creates the vector named num with numerical values in
sequence, i.e., 1 to 10, as sketched below.
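A minimal sketch of both forms (the specific values in numbers are assumed):

# numeric vector using c()
numbers <- c(1, 5, 9)
print(numbers)
# numeric sequence using the : operator
num <- 1:10
print(num)

Output:
[1] 1 5 9
 [1]  1  2  3  4  5  6  7  8  9 10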
Repeat Vectors in R
The rep() function repeats the elements of a vector. For example (the values
in numbers are assumed):

numbers <- c(2, 4, 6)
number <- rep(numbers, times = 2)
print(number)

Output:
[1] 2 4 6 2 4 6

Here, times = 2 repeats the whole vector twice.
We use the length() function to find the number of elements present inside a
vector. For example,

lang <- c("R", "Swift", "Python", "Java")
# find total elements in lang using length()
cat("Total Elements:", length(lang))

Output:
Total Elements: 4
4.1.2 R LIST
A List is a collection of similar or different types of data. In R, we use
the list() function to create a list. For example,
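A minimal sketch (the values are illustrative and reuse the list1 that appears later in this section):

# create a list with a number and a string
list1 <- list(26, "Sam")
print(list1)

Output:
[[1]]
[1] 26

[[2]]
[1] "Sam"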
Here, the list() function combines values of different data types into a single list.
Access elements of a list using the index number (1, 2, 3 …). For example,
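For instance, with list1 from above:

# access the first element
print(list1[[1]])

Output:
[1] 26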
Note: In R, the list index always starts with 1. Hence, the first element of a list
is present at index 1, second element at index 2 and so on.
To change a list element, simply reassign a new value at the specific index.
For example,

# change the second element (illustrative)
list1[[2]] <- "Max"
print(list1)

Output:
[[1]]
[1] 26

[[2]]
[1] "Max"
We use the append() function to add an item at the end of a list. For example,
list1 <- list(26, "Sam")
# using append() function
append(list1, 3.14)
Output:
[[1]]
[1] 26

[[2]]
[1] "Sam"

[[3]]
[1] 3.14

In the above example, we have created a list named list1. Notice the line
append(list1, 3.14): it returns a new list with 3.14 added at the end.
Length of R List
In R, we use the length() function to find the number of elements present
inside a list. For example,

print(length(list1))

Output:
[1] 2
In R, the for loop is used to access all the elements in a list. For example,
items <- list(24, "Nova", 5.4, "SriLanka")
# iterate through each elements of numbers
for (i in items) {
print(i)
}
Output:
[1] 24
[1] "Nova"
[1] 5.4
[1] "SriLanka"
The general form of the data.frame() function is:

data.frame(
  first_col = c(val1, val2, ...),
  second_col = c(val1, val2, ...),
  ...
)

Here,
• first_col - a vector with values val1, val2, ... of the same data type
• second_col - another vector with values val1, val2, ... of the same data type,
and so on
Let's see an example,

dataframe1 <- data.frame(
  Name = c("James", "Harry", "Steve"),
  Age = c(25, 10, 34),
  Vote = c(TRUE, FALSE, TRUE)
)
print(dataframe1)
Here, Name, Age, and Vote are column names for vectors of String, Numeric,
and Boolean type respectively.
Finally, the data is printed in tabular format:

   Name Age  Vote
1 James  25  TRUE
2 Harry  10 FALSE
3 Steve  34  TRUE
There are different ways to extract columns from a data frame. We can use
[ ], [[ ]], or $ to access a specific column of a data frame in R. For example,

# three ways to reach a column
dataframe1$Name
dataframe1[["Name"]]
dataframe1["Name"]

Output:
[1] "James" "Harry" "Steve"
[1] "James" "Harry" "Steve"
   Name
1 James
2 Harry
3 Steve
In R, we use the rbind() and cbind() functions to combine two data frames
together.
• rbind() - combines two data frames vertically
• cbind() - combines two data frames horizontally
To combine two data frames vertically, the column names of the two
data frames must be the same. The cbind() function combines two or more
data frames horizontally. A sketch of both follows.
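A minimal sketch (dataframe2 and the extra Marks column are assumed):

# combine vertically: same column names required
dataframe2 <- data.frame(
  Name = c("Kate"),
  Age = c(40),
  Vote = c(TRUE)
)
rbind(dataframe1, dataframe2)

# combine horizontally: same number of rows required
marks <- data.frame(Marks = c(92, 75, 63))
cbind(dataframe1, marks)

Output:
   Name Age  Vote
1 James  25  TRUE
2 Harry  10 FALSE
3 Steve  34  TRUE
4  Kate  40  TRUE

   Name Age  Vote Marks
1 James  25  TRUE    92
2 Harry  10 FALSE    75
3 Steve  34  TRUE    63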
In R, we use the length() function to find the number of columns in a data
frame. For example,

length(dataframe1)

Output:
[1] 3
Here, length() is used to find the total number of columns in dataframe1. Since
there are 3 columns, the length() function returns 3.
Example
# Create a matrix
thismatrix <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 3, ncol = 2)

# Print the matrix
thismatrix
Output
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6
The items are accessed using [ ] brackets. The first number "1" in the bracket
specifies the row-position, while the second number "2" specifies the
column-position:
Example
thismatrix <- matrix(c("apple", "banana", "cherry", "orange"), nrow = 2, ncol = 2)
thismatrix[1, 2]

Output:
[1] "cherry"
The whole row can be accessed by specifying a comma after the number in
the bracket:
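For instance, with the 2x2 matrix above:

thismatrix[2,]

Output:
[1] "banana" "orange"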
More than one row can be accessed by using the c() function:
thismatrix <-
matrix(c("apple", "banana", "cherry", "orange","grape", "pineapple", "pear", "
melon", "fig"), nrow = 3, ncol = 3)
thismatrix[c(1,2),]
More than one column can be accessed if you use the c() function:
Example
thismatrix <-
matrix(c("apple", "banana", "cherry", "orange","grape", "pineapple", "pear", "
melon", "fig"), nrow = 3, ncol = 3)
thismatrix[, c(1,2)]
Output:
     [,1]     [,2]
[1,] "apple"  "orange"
[2,] "banana" "grape"
[3,] "cherry" "pineapple"
Example
thismatrix <-
matrix(c("apple", "banana", "cherry", "orange", "grape", "pineapple", "pear", "melon", "fig"), nrow = 3, ncol = 3)

# Add a new column with cbind()
newmatrix <- cbind(thismatrix, c("strawberry", "blueberry", "raspberry"))

# Print the new matrix
newmatrix

Output:
     [,1]     [,2]        [,3]    [,4]
[1,] "apple"  "orange"    "pear"  "strawberry"
[2,] "banana" "grape"     "melon" "blueberry"
[3,] "cherry" "pineapple" "fig"   "raspberry"
Note: The cells in the new column must be of the same length as the existing
matrix.
Example
thismatrix <-
matrix(c("apple", "banana", "cherry", "orange","grape", "pineapple", "pear", "
melon", "fig"), nrow = 3, ncol = 3)
newmatrix <- rbind(thismatrix, c("strawberry", "blueberry", "raspberry"))
# Print the new matrix
newmatrix
Note: The cells in the new row must be of the same length as the existing
matrix.
Example
thismatrix <-
matrix(c("apple", "banana", "cherry", "orange", "mango", "pineapple"), nrow
= 3, ncol =2)
#Remove the first row and the first column
thismatrix <- thismatrix[-c(1), -c(1)]
thismatrix
Output:
[1] "mango"     "pineapple"
To find out if a specified item is present in a matrix, use the %in% operator:
Example
thismatrix <- matrix(c("apple", "banana", "cherry", "orange"), nrow = 2, ncol = 2)
"apple" %in% thismatrix

Output:
[1] TRUE
Use the dim() function to find the number of rows and columns in a Matrix:
Example
dim(thismatrix)
Output
[1] 2 2
Matrix Length
Example
length(thismatrix)
Output
[1] 4
You can loop through a matrix with a for loop. The loop starts at the first
row, moving right:

Example
thismatrix <- matrix(c("apple", "banana", "cherry", "orange"), nrow = 2, ncol = 2)
for (rows in 1:nrow(thismatrix)) {
  for (columns in 1:ncol(thismatrix)) {
    print(thismatrix[rows, columns])
  }
}

Output:
[1] "apple"
[1] "cherry"
[1] "banana"
[1] "orange"
Example
# Combine matrices
Matrix1 <- matrix(c("apple", "banana", "cherry", "grape"), nrow = 2, ncol = 2)
Matrix2 <- matrix(c("orange", "mango", "pineapple", "watermelon"), nrow = 2, ncol = 2)

# Adding as rows
Matrix_Combined <- rbind(Matrix1, Matrix2)
Matrix_Combined

# Adding as columns
Matrix_Combined <- cbind(Matrix1, Matrix2)
Matrix_Combined
Output
[,1] [,2]
[1,] "apple" "cherry"
[2,] "banana" "grape"
[3,] "orange" "pineapple"
[4,] "mango" "watermelon"
[,1] [,2] [,3] [,4]
[1,] "apple" "cherry" "orange" "pineapple"
[2,] "banana" "grape" "mango" "watermelon"
4.1.4 R ARRAYS
Arrays can have more than two dimensions. We use the array() function to
create an array and the dim parameter to specify the dimensions.
Example
# An array with one dimension with values ranging from 1 to 24
thisarray <- c(1:24)
thisarray

# An array with more than one dimension
multiarray <- array(thisarray, dim = c(4, 3, 2))
multiarray
Output
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
, , 1

     [,1] [,2] [,3]
[1,]    1    5    9
[2,]    2    6   10
[3,]    3    7   11
[4,]    4    8   12

, , 2

     [,1] [,2] [,3]
[1,]   13   17   21
[2,]   14   18   22
[3,]   15   19   23
[4,]   16   20   24
Example Explained
dim=c(4,3,2)
The first and second numbers in the bracket specify the number of rows and
columns.
The last number in the bracket specifies how many dimensions we want.
Example
multiarray[2, 3, 2]

Output:
[1] 22
Example
2 %in% multiarray
Output
[1] TRUE
Use the dim() function to find the number of rows and columns in an array:

Example
dim(multiarray)

Output:
[1] 4 3 2
Array Length
Example
length(multiarray)

Output:
[1] 24
You can loop through the array items by using a for loop:

Example
# values 1:5 are recycled to fill the 4x4 array (shape assumed from the output below)
newarray <- array(c(1:5), dim = c(4, 4))
for (x in newarray) {
  print(x)
}

Output:
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 1
4.5 R FACTORS
Factors are used to categorize data. Examples of factors are:
• Demography: Male/Female
• Music: Rock, Pop, Classic, Jazz
• Training: Strength, Stamina
To create a factor, use the factor() function and add a vector as argument:
Example
# Create a factor
music_genre <-
factor(c("Jazz", "Rock", "Classic", "Classic", "Pop", "Jazz", "Rock", "Jazz"))
# Print the factor
music_genre
Output:
[1] Jazz    Rock    Classic Classic Pop     Jazz    Rock    Jazz
Levels: Classic Jazz Pop Rock

From the example above, we can see that the factor has four levels
(categories): Classic, Jazz, Pop and Rock.
To only print the levels, use the levels() function:

Example
music_genre <-
factor(c("Jazz", "Rock", "Classic", "Classic", "Pop", "Jazz", "Rock", "Jazz"))
levels(music_genre)

Output:
[1] "Classic" "Jazz"    "Pop"     "Rock"
Factor Length
Use the length() function to find out how many items there are in the factor:
Example
music_genre <-
factor(c("Jazz", "Rock", "Classic", "Classic", "Pop", "Jazz", "Rock", "Jazz"))
length(music_genre)
Output
[1] 8
R allows us to read data from files which are stored outside the R
environment.
In R, getwd() and setwd() are two useful functions. The getwd() function
is used to check which directory the R workspace is pointing to, and the
setwd() function is used to set a new working directory in order to read and
write files from that directory.
Let's see an example to understand how getwd() and setwd() functions are
used.
Example

# check the current working directory
print(getwd())

# set a new working directory (the path is hypothetical)
setwd("/home/user/R-data")
print(getwd())

Output:
[1] "/home/user"
[1] "/home/user/R-data"
Example: record.csv

id,name,salary,start_date,dept
1,Shubham,613.3,2012-01-01,IT
2,Arpita,525.2,2013-09-23,Operations
3,Vaishali,63,2014-11-15,IT
4,Nishka,749,2014-05-11,HR
5,Gunjan,863.25,2015-03-27,Finance
6,Sumit,588,2013-05-21,IT
7,Anisha,932.8,2013-07-30,Operations
8,Akash,712.5,2014-06-17,Finance
Let's use our record.csv file to read records from it using read.csv() function.
Example
csv_data <- read.csv("record.csv")
print(csv_data)

Output:
  id     name salary start_date       dept
1  1  Shubham 613.30 2012-01-01         IT
2  2   Arpita 525.20 2013-09-23 Operations
3  3 Vaishali  63.00 2014-11-15         IT
4  4   Nishka 749.00 2014-05-11         HR
5  5   Gunjan 863.25 2015-03-27    Finance
6  6    Sumit 588.00 2013-05-21         IT
7  7   Anisha 932.80 2013-07-30 Operations
8  8    Akash 712.50 2014-06-17    Finance
When we read data from the .csv file using read.csv() function, by default, it
gives the output as a data frame. Before analyzing data, let's start checking
the form of our output with the help of is.data.frame() function. After that,
we will check the number of rows and number of columns with the help
of nrow() and ncol() function.
Example
csv_data<- read.csv("record.csv")
print(is.data.frame(csv_data))
print(ncol(csv_data))
print(nrow(csv_data))
Output:
[1] TRUE
[1] 5
[1] 8
Example: Getting the maximum salary

sal <- max(csv_data$salary)
print(sal)

Output:
[1] 932.8

Example: Getting the details of the person who has the maximum salary

details <- subset(csv_data, salary == max(salary))
print(details)

Output:
  id   name salary start_date       dept
7  7 Anisha  932.8 2013-07-30 Operations

Example: Getting the details of all the persons who are working in the IT
department

details <- subset(csv_data, dept == "IT")
print(details)

Output:
  id     name salary start_date dept
1  1  Shubham  613.3 2012-01-01   IT
3  3 Vaishali   63.0 2014-11-15   IT
6  6    Sumit  588.0 2013-05-21   IT

Example: Getting the details of the persons whose salary is greater than
600 and who work in the IT department

details <- subset(csv_data, salary > 600 & dept == "IT")
print(details)

Output:
  id    name salary start_date dept
1  1 Shubham  613.3 2012-01-01   IT
Like reading and analyzing, R also allows us to write into a .csv file. For this
purpose, R provides the write.csv() function. This function creates a CSV file
from an existing data frame, and it creates the file in the current
working directory.
Example

csv_data <- read.csv("record.csv")
# Getting details of those people who joined on or after 2014
details <- subset(csv_data, as.Date(start_date) > as.Date("2014-01-01"))
# Writing the filtered data into a new CSV file (the file name is illustrative)
write.csv(details, "output.csv")
newdata <- read.csv("output.csv")
print(newdata)

Output:
  X id     name salary start_date    dept
1 3  3 Vaishali  63.00 2014-11-15      IT
2 4  4   Nishka 749.00 2014-05-11      HR
3 5  5   Gunjan 863.25 2015-03-27 Finance
4 8  8    Akash 712.50 2014-06-17 Finance
y = f(x), where y = categorical output
1. Lazy Learners: A lazy learner first stores the training dataset and
waits until it receives the test dataset. In the lazy learner's case,
classification is done on the basis of the most related data stored in
the training dataset. It takes less time in training but more time for
predictions.
Example: K-NN algorithm, Case-based reasoning
Classification algorithms can be further divided into two main categories:
o Linear Models
o Logistic Regression
o Non-linear Models
o Kernel SVM
o Naïve Bayes
1. Confusion Matrix:
o The ROC curve is plotted with TPR and FPR, where TPR (True Positive
Rate) on Y-axis and FPR(False Positive Rate) on X-axis.
o Speech Recognition
o Drugs Classification
Decision Tree is a Supervised learning technique that can be used for both
classification and Regression problems, but mostly it is preferred for solving
Classification problems. It is a tree-structured classifier, where internal
nodes represent the features of a dataset, branches represent the
decision rules and each leaf node represents the outcome.
o In a Decision tree, there are two nodes, which are the Decision
Node and Leaf Node. Decision nodes are used to make any decision
and have multiple branches, whereas Leaf nodes are the output of
those decisions and do not contain any further branches.
In a decision tree, for predicting the class of the given dataset, the algorithm
starts from the root node of the tree. This algorithm compares the values of
root attribute with the record (real dataset) attribute and, based on the
comparison, follows the branch and jumps to the next node.
For the next node, the algorithm again compares the attribute value with the
other sub-nodes and move further. It continues the process until it reaches
the leaf node of the tree. The complete process can be better understood
using the below algorithm:
o Step-1: Begin the tree with the root node, say S, which contains the
complete dataset.
o Step-2: Find the best attribute in the dataset using an Attribute
Selection Measure (ASM).
o Step-3: Divide S into subsets that contain possible values for the
best attributes.
o Step-4: Generate the decision tree node, which contains the best
attribute.
Example: Suppose there is a candidate who has a job offer and wants to
decide whether he should accept the offer or Not. So, to solve this problem,
the decision tree starts with the root node (Salary attribute by ASM). The
root node splits further into the next decision node (distance from the office)
and one leaf node based on the corresponding labels. The next decision node
further gets split into one decision node (Cab facility) and one leaf node.
Finally, the decision node splits into two leaf nodes (Accepted offers and
Declined offer). Consider the below diagram:
While implementing a decision tree, the main issue that arises is how to select
the best attribute for the root node and for the sub-nodes. To solve such
problems there is a technique called the Attribute Selection Measure, or
ASM. With this measurement, we can easily select the best attribute for the
nodes of the tree. The two popular techniques for ASM are:
o Information Gain
o Gini Index
1. Information Gain:
Information gain measures the change in entropy after the dataset is split on
an attribute. According to the value of information gain, we split the node and
build the decision tree. It is calculated as:

Information Gain = Entropy(S) - [(Weighted Avg) x Entropy(each feature)]
Entropy(S) = -P(yes) log2 P(yes) - P(no) log2 P(no)

Where,
o S = total number of samples
o P(yes) = probability of yes
o P(no) = probability of no
2. Gini Index:
The Gini index is a measure of impurity used while creating a decision tree.
It is calculated as:

Gini Index = 1 - Σj Pj²

Pruning: Getting an optimal decision tree
A too-large tree increases the risk of overfitting, and a small tree may not
capture all the important features of the dataset. Therefore, a technique that
decreases the size of the learning tree without reducing accuracy is known as
pruning. There are mainly two types of tree pruning technology used:
o Cost Complexity Pruning
o Reduced Error Pruning
EXAMPLE:
The data used for this analysis is from the National Institute of Diabetes and
Digestive and Kidney Diseases and is made available on Kaggle.
The Dataset contains entries from only women of at least 21 years of age with
Pima Indian heritage with the following features;
> df <- read.csv("diabetes.csv")   # loading step assumed; file name illustrative
> df$Outcome <- factor(df$Outcome, labels = c("No", "Yes"))   # recoding assumed, consistent with the output below
> set.seed(123)
> n <- nrow(df)
> train <- sample(n, trunc(0.70*n))
> df_train <- df[train, ]
> df_test <- df[-train, ]
> install.packages("rpart")
> library(rpart)
> model<-rpart(Outcome ~ .,data=df_train)
> p<-predict(model,df_test,type="class")
> library(caret)
> confusionMatrix(p,df_test$Outcome)
Confusion Matrix and Statistics
Reference
Prediction No Yes
No 129 46
Yes 21 35
Accuracy : 0.71
95% CI : (0.6468, 0.7676)
No Information Rate : 0.6494
P-Value [Acc > NIR] : 0.029993
Kappa : 0.3144
'Positive' Class : No
The Naïve Bayes algorithm is comprised of two words, Naïve and Bayes,
which can be described as:
Bayes' Theorem:

P(A|B) = P(B|A) P(A) / P(B)

Where,
o P(A|B) is the posterior probability of hypothesis A given the observed event B.
o P(B|A) is the likelihood probability of the evidence given that the hypothesis is true.
o P(A) is the prior probability of the hypothesis before observing the evidence.
o P(B) is the marginal probability of the evidence.
Working of Naïve Bayes' Classifier can be understood with the help of the
below example:
Problem: If the weather is sunny, then the Player should play or not?
   Outlook  Play
0  Rainy    Yes
1  Sunny    Yes
2  Overcast Yes
3  Overcast Yes
4  Sunny    No
5  Rainy    Yes
6  Sunny    Yes
7  Overcast Yes
8  Rainy    No
9  Sunny    No
10 Sunny    Yes
11 Rainy    No
12 Overcast Yes
13 Overcast Yes
Frequency table of the weather conditions:

Weather  Yes No
Overcast 5   0
Rainy    2   2
Sunny    3   2
Total    10  4
Likelihood table of the weather conditions:

Weather  No          Yes
Overcast 0           5            5/14 = 0.35
Rainy    2           2            4/14 = 0.29
Sunny    2           3            5/14 = 0.35
All      4/14 = 0.29 10/14 = 0.71
Applying Bayes' theorem:

P(Yes|Sunny) = P(Sunny|Yes) * P(Yes) / P(Sunny)
P(Sunny|Yes) = 3/10 = 0.3
P(Sunny) = 0.35
P(Yes) = 0.71
So, P(Yes|Sunny) = 0.3 * 0.71 / 0.35 = 0.60

P(No|Sunny) = P(Sunny|No) * P(No) / P(Sunny)
P(Sunny|No) = 2/4 = 0.5
P(No) = 0.29
P(Sunny) = 0.35
So, P(No|Sunny) = 0.5 * 0.29 / 0.35 = 0.41

Since P(Yes|Sunny) > P(No|Sunny), the player can play on a sunny day.
EXAMPLE:
NAÏVE BAYES CLASSIFIER FOR IRIS DATASET
ABOUT DATASET
Iris dataset consists of 50 samples from each of 3 species of Iris(Iris setosa,
Iris virginica, Iris versicolor)
# Loading data
data(iris)
# Structure
str(iris)
# Installing Packages
install.packages("e1071")
install.packages("caTools")
install.packages("caret")
# Loading package
library(e1071)
library(caTools)
library(caret)
# Splitting data into train
# and test data
split <- sample.split(iris, SplitRatio = 0.7)
train_cl <- subset(iris, split == "TRUE")
test_cl <- subset(iris, split == "FALSE")
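The fitting and evaluation steps are sketched below, using the naiveBayes() function from e1071 and confusionMatrix() from caret (the object names y_pred and classifier_cl are assumptions):

# Fitting the Naive Bayes model to the training set
classifier_cl <- naiveBayes(Species ~ ., data = train_cl)

# Predicting on the test data
y_pred <- predict(classifier_cl, newdata = test_cl)

# Confusion matrix and accuracy
confusionMatrix(table(test_cl$Species, y_pred))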
• The model achieved 90% accuracy with a p-value of less than 1. Given its
sensitivity, specificity, and balanced accuracy, the model built is good.
o K-NN algorithm stores all the available data and classifies a new data
point based on the similarity. This means when new data appears then
it can be easily classified into a well suite category by using K- NN
algorithm.
o KNN algorithm at the training phase just stores the dataset and when
it gets new data, then it classifies that data into a category that is much
similar to the new data.
Suppose there are two categories, i.e., Category A and Category B, and we
have a new data point x1, so this data point will lie in which of these
categories. To solve this type of problem, we need a K-NN algorithm. With the
help of K-NN, we can easily identify the category or class of a particular
dataset. Consider the below diagram:
The K-NN working can be explained on the basis of the below algorithm:
o Step-1: Select the number K of the neighbors.
o Step-2: Calculate the Euclidean distance of the K number of neighbors.
o Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
o Step-4: Among these K neighbors, count the number of the data points
in each category.
o Step-5: Assign the new data points to the category for which the
number of neighbors is maximum.
o Step-6: Our model is ready.
Suppose we have a new data point and we need to put it in the required
category. Consider the below image:
o It is simple to implement.
EXAMPLE:
The data used for this analysis is from the National Institute of Diabetes and
Digestive and Kidney Diseases and is made available on Kaggle.
library(dplyr)
library(tidyr)
library(forcats)
library(ggplot2)
library(janitor)
library(gmodels)
library(class)
library(corrplot)
# loading step assumed; file name illustrative
diabetes_df <- read.csv("diabetes.csv")
str(diabetes_df)
## $ Glucose : int 148 85 183 89 137 116 78 115 197 125 ...
## $ BMI : num 33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
The dataset contains data entries from 768 patients, with all the
features/attributes being numeric values.
get_dupes(diabetes_df)
## [10] dupe_count
Here, the values of the Outcome attribute are transformed from 1s and 0s
to "Diabetic" and "Non Diabetic" to make the outcomes/diagnoses clearer
to understand.
There are moderate positive correlations between the Age and Pregnancy
attributes, and between the Insulin and Skin Thickness attributes. This
indicates that as the age of the patients increased, so did the number of
pregnancies; likewise, as the quantity of insulin increased, so did the skin
thickness.
The ages of the patients are skewed to the right with most of the patients
being between the ages of 20 to 40.
From the histogram above, the BMI attribute is symmetric but it is quite
visible that outliers exist in the dataset having BMI’s with 0 values. To have a
BMI of Zero(0) is impossible, indicating that there might be an error in this
field.
The numeric attributes are rescaled with min-max normalization:

(x - min(x)) / (max(x) - min(x))
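A sketch of this step (the helper function and the column selection are assumptions):

# min-max normalization helper
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# apply to the eight numeric predictor columns
diabetes_norm <- as.data.frame(lapply(diabetes_df[1:8], normalize))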
Finally, the dataset is then split into two where the larger half will be used to
train the model and the second half utilized to test the accuracy of the model.
The kNN factor is utilized above to train the model, the value used for k is the
square-root of the total sample size used for the analysis(768).
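A sketch of that call (the object names train_set, test_set, and train_labels are assumptions):

# k-nearest neighbours from the class package
pred <- knn(train = train_set,
            test = test_set,
            cl = train_labels,   # class labels of the training rows
            k = 27)              # approximately sqrt(768)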
Removing Outliers
Removing the outliers did not help the model: accuracy actually dropped to
76% once the outliers were removed.
A greater number of trees in the forest leads to higher accuracy and
prevents the problem of overfitting.
The below diagram explains the working of the Random Forest algorithm:
The working process can be explained in the below steps and diagram:
Step-1: Select random K data points from the training set.
Step-2: Build the decision trees associated with the selected data points
(subsets).
Step-3: Choose the number N of decision trees that you want to build.
Step-4: Repeat Step 1 and Step 2.
Step-5: For new data points, find the predictions of each decision tree, and
assign the new data points to the category that wins the majority of votes.
Example: Suppose there is a dataset that contains multiple fruit images. So,
this dataset is given to the Random forest classifier. The dataset is divided
into subsets and given to each decision tree. During the training phase, each
decision tree produces a prediction result, and when a new data point occurs,
then based on the majority of results, the Random Forest classifier predicts
the final decision. Consider the below image:
There are mainly four sectors where Random Forest is mostly used:
1. Banking: The banking sector mostly uses this algorithm for the
identification of loan risk.
2. Medicine: With the help of this algorithm, disease trends and risks of
the disease can be identified.
3. Land Use: We can identify areas of similar land use with this
algorithm.
4. Marketing: Marketing trends can be identified using this algorithm.
The Pima Indians Diabetes database to predict the onset of diabetes based
on diagnostic
measures: https://www.kaggle.com/hconner2/diabetes/data
About Dataset
head(diabetes)
##   Pregnancies Glucose BloodPressure SkinThickness Insulin  BMI
## 1           6     148            72            35       0 33.6
## 2           1      85            66            29       0 26.6
## 3           8     183            64             0       0 23.3
## 4           1      89            66            23      94 28.1
## 5           0     137            40            35     168 43.1
## 6           5     116            74             0       0 25.6
##   DiabetesPedigreeFunction Age Outcome
## 1                    0.627  50       1
## 2                    0.351  31       0
## 3                    0.672  32       1
## 4                    0.167  21       0
## 5                    2.288  33       1
## 6                    0.201  30       0
summary(diabetes)
## 1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 62.00 1st Qu.: 0.00
## 3rd Qu.: 6.000 3rd Qu.:140.2 3rd Qu.: 80.00 3rd Qu.:32.00
## Outcome
## Min. :0.000
## 1st Qu.:0.000
## Median :0.000
## Mean :0.349
## 3rd Qu.:1.000
## Max. :1.000
str(diabetes)
## $ Glucose : int 148 85 183 89 137 116 78 115 197 125 ...
## $ BMI : num 33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
Here, we have to be careful about how the data has been coded. First, we
see that the Outcome is numeric while it should be categorical:
diabetes$Outcome = factor(diabetes$Outcome)
summary(diabetes$Outcome)
## 0 1
## 500 268
Secondly, in many variables there are zeros that do not make sense, for
example in BloodPressure, BMI, etc. Assume that the zeros represent
NA values and recode them correctly:
diabetes$Glucose[diabetes$Glucose == 0 ] = NA
diabetes$BloodPressure[diabetes$BloodPressure == 0 ] = NA
diabetes$SkinThickness[diabetes$SkinThickness == 0 ] = NA
diabetes$Insulin[diabetes$Insulin == 0 ] = NA
diabetes$BMI[diabetes$BMI == 0 ] = NA
diabetes = na.omit(diabetes)
summary(diabetes)
## 3rd Qu.: 5.000 3rd Qu.:143.0 3rd Qu.: 78.00 3rd Qu.:37.00
## Outcome
## negative:262
## positive:130
Divide the data into two different sets: a training dataset that will be used
to build the model and a test dataset that will be used to estimate the
predictive accuracy of the model.
The dataset will be divided into training (70%) and testing (30%) sets, we
create the data sets using the caret package:
library(caret)
set.seed(123)
# 70% training index (call assumed: caret's createDataPartition)
train_ind = createDataPartition(diabetes$Outcome, p = 0.7, list = FALSE)
train = diabetes[train_ind,]
head(train)
## 4 1 89 66 23 94 28.1
## 19 1 103 30 38 83 43.3
## 20 1 115 70 30 96 34.6
## 4 0.167 21 negative
## 5 2.288 33 positive
## 14 0.398 59 positive
## 15 0.587 51 positive
## 19 0.183 33 negative
## 20 0.529 32 positive
test = diabetes[-train_ind,]
head(test)
## 7 3 78 50 32 88 31.0
## 7 0.248 26 positive
## 9 0.158 53 positive
## 17 0.551 31 positive
## 21 0.704 27 negative
## 25 0.254 51 positive
## 26 0.205 41 positive
The training set has 275 samples, and the testing set has 117 samples.
- ntree: number of trees
#install.packages('randomForest')
library(randomForest)
# fit the model (call assumed; ntree = 50 matches the output below)
rf = randomForest(Outcome ~ ., data = train, ntree = 50)
rf
## Call:
## Number of trees: 50
## Confusion matrix:
## positive 38 53 0.4175824
plot(rf)
# predictions on the test set (call assumed)
predictions = predict(rf, newdata = test)
head(predictions)
## 7 9 17 21 25 26
library(caret)
confu1 = confusionMatrix(predictions, test$Outcome)   # call assumed
confu1
## Reference
## negative 68 15
## positive 10 24
## Accuracy : 0.7863
##
## Kappa : 0.5033
##
## Specificity : 0.8718
## Prevalence : 0.3333
# refit the model (assumed: default ntree = 500)
rf = randomForest(Outcome ~ ., data = train)
rf
## Call:
## Confusion matrix:
## positive 37 54 0.4065934
plot(rf)
predictions = predict(rf, newdata = test)
head(predictions)
## 7 9 17 21 25 26
library(caret)
confu2 = confusionMatrix(predictions, test$Outcome)   # call assumed
confu2
## Reference
## negative 70 13
## positive 8 26
## Accuracy : 0.8205
## Kappa : 0.5828
## Sensitivity : 0.6667
## Specificity : 0.8974
## Prevalence : 0.3333
MODULE 7
The below diagram explains the working of the clustering algorithm. We can
see the different fruits are divided into several groups with similar
properties.
o Market Segmentation
o Image segmentation
Hence each cluster has datapoints with some commonalities, and it is away
from other clusters.
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points or centroids. (They can be points other than
those from the input dataset.)
Step-3: Assign each data point to its closest centroid, which will form the
predefined K clusters.
Step-4: Calculate the variance and place a new centroid in each cluster.
Step-5: Repeat the third step, which means reassigning each data point to the
new closest centroid of each cluster.
Step-6: If any reassignment occurs, go to Step-4; otherwise, the model is ready.
Suppose we have two variables, M1 and M2. The x-y axis scatter plot of these
two variables is given below:
o Now assign each data point of the scatter plot to its closest K-point or
centroid. Consider the below image:
o Next, reassign each datapoint to the new centroid. For this, repeat the
same process of finding a median line. The median will be like below
From the above image, we can see, one yellow point is on the left side of the
line, and two blue points are right to the line. So, these three points will be
assigned to new centroids.
o We have new centroids, so we again draw the median line and reassign the
data points. The image will be:
o In the above image; there are no dissimilar data points on either side
of the line, which means our model is formed. Consider the below
image:
The model is ready, so remove the assumed centroids, and the two final
clusters will be as shown in the below image:
data(iris)
# Structure
str(iris)
# Installing Packages
install.packages("ClusterR")
install.packages("cluster")
# Loading package
library(ClusterR)
library(cluster)
# Removing the Species label and fitting the K-Means model (calls assumed; centers = 3)
iris_1 <- iris[, -5]
kmeans.re <- kmeans(iris_1, centers = 3, nstart = 20)
kmeans.re
# each observation
kmeans.re$cluster
# Confusion Matrix
cm <-table(iris$Species, kmeans.re$cluster)
cm
plot(iris_1[c("Sepal.Length", "Sepal.Width")])
plot(iris_1[c("Sepal.Length", "Sepal.Width")],
     col = kmeans.re$cluster)
plot(iris_1[c("Sepal.Length", "Sepal.Width")],
     col = kmeans.re$cluster,
     main = "K-means with 3 clusters")

## Plotting cluster centers
kmeans.re$centers

## Visualizing cluster centers on the scatter plot (call assumed)
points(kmeans.re$centers[, c("Sepal.Length", "Sepal.Width")],
       col = 1:3, pch = 8, cex = 3)
Output:
• Model kmeans.re:
The 3 clusters are of sizes 50, 62, and 38 respectively. The between-cluster
sum of squares over the total sum of squares is 88.4%.
• Cluster identification:
• Confusion Matrix:
In the plot, centers of clusters are marked with cross signs with the same
color of the cluster.
• Plot of clusters:
So, 3 clusters are formed with varying sepal length and sepal width. Hence,
the K-Means clustering algorithm is widely used in the industry.
8.1 DBSCAN Clustering
DBScan Algorithm:
EXAMPLE :
THE DATASET
Iris dataset consists of 50 samples from each of 3 species of Iris(Iris setosa,
Iris virginica, Iris versicolor).
# Loading data
data(iris)
# Structure
str(iris)
# Installing Packages
install.packages("fpc")
# Loading package
library(fpc)
iris_1 <-iris[-5]
# Fitting the DBSCAN clustering model to the dataset (call assumed; eps and MinPts match the notes below)
Dbscan_cl <- dbscan(iris_1, eps = 0.5, MinPts = 5)
Dbscan_cl
Dbscan_cl$cluster
# Table
table(Dbscan_cl$cluster, iris$Species)
# Plotting the cluster (call assumed)
plot(Dbscan_cl, iris_1, main = "DBScan")
Output:
• Model Dbscan_cl:
In the model, there are 150 points, with MinPts = 5 and eps = 0.5.
• Cluster identification:
So, the DBScan clustering algorithm can also form unusual shapes that are
useful for finding a cluster of non-linear shapes in the industry.
In this algorithm, we develop the hierarchy of clusters in the form of a tree,
and this tree-shaped structure is known as a dendrogram.
The working of the AHC algorithm can be explained using the below steps:
o Step-1: Treat each data point as a single cluster. Hence, if there are N
data points, there are N clusters to begin with.
o Step-2: Take the two closest data points or clusters and merge them to
form one cluster. So, there will now be N-1 clusters.
o Step-3: Again, take the two closest clusters and merge them together
to form one cluster. There will be N-2 clusters.
o Step-4: Repeat Step 3 until only one cluster is left. We will then get the
following clusters. Consider the below images:
As we have seen, the closest distance between the two clusters is crucial for
the hierarchical clustering. There are various ways to calculate the distance
between two clusters, and these ways decide the rule for clustering. These
measures are called Linkage methods. Some of the popular linkage methods
are given below:
From the above-given approaches, we can apply any of them according to the
type of problem or business requirement.
The dendrogram is a tree-like structure that is mainly used to store each step
as a memory that the HC algorithm performs. In the dendrogram plot, the Y-
axis shows the Euclidean distances between the data points, and the x-axis
shows all the data points of the given dataset.
The working of the dendrogram can be explained using the below diagram:
o Again, two new dendrograms are created that combine P1, P2, and P3
in one dendrogram, and P4, P5, and P6, in another dendrogram.
o At last, the final dendrogram is created that combines all the data
points together.
EXAMPLE :
THE DATASET
install.packages("dplyr")
library(dplyr)
head(mtcars)
Output:
# Finding the distance matrix (call assumed; the method is confirmed by the notes below)
distance_mat <- dist(mtcars, method = 'euclidean')
distance_mat
# Fitting the hierarchical clustering model (call assumed; "average" linkage assumed)
Hiear_cl <- hclust(distance_mat, method = "average")
Hiear_cl
# Plotting dendrogram
plot(Hiear_cl)
# Cutting the tree by number of clusters (calls assumed; k = 3 and the green border match the notes below)
fit <- cutree(Hiear_cl, k = 3)
fit
table(fit)
rect.hclust(Hiear_cl, k = 3, border = "green")
Output:
• Distance matrix:
• The values are shown as per the distance matrix calculation with the
method as euclidean.
• Plot dendrogram:
• The plot dendrogram is shown with x-axis as distance matrix and y-axis
as height.
• Cut tree:
• The plot shows the dendrogram after being cut. The green lines show the
number of clusters as per the thumb rule.
MODULE 9
9.1 Data Visualization
Data visualization converts large and small data sets into visuals, which are
easy for humans to understand and process.
5. To perform competitive analysis.
6. To improve insights.
1. Understanding
2. Efficiency
3. Location
Apps utilizing features such as geographic maps and GIS can be particularly
relevant to the wider business when location is a very relevant factor. We can
use maps to show business insights from various locations.
1) plotly
2) ggplot2
3) tidyquant
The tidyquant is a financial package that is used for carrying out quantitative
financial analysis. This package adds under tidyverse universe as a financial
package that is used for importing, analyzing, and visualizing the data.
4) taucharts
5) ggiraph
6) geofacets
7) googleVis
googleVis provides an interface between R and Google's charts tools. With the
help of this package, we can create web pages with interactive charts based
on R data frames.
8) RColorBrewer
This package provides color schemes for maps and other graphics, which are
designed by Cynthia Brewer.
9) dygraphs
10) shiny
Graphics play an important role in carrying out the important features of the
data. Graphics are used to examine marginal distributions, relationships
between variables, and summary of very large data. It is a very important
complement for many statistical and computational techniques.
Standard Graphics
o Piecharts
o Boxplots
o Barplots etc.
9.4.1 R BOXPLOT
The basic syntax is:

boxplot(x, data, notch, varwidth, names, main)

Here,
1. x: It is a vector or a formula.
2. data: It is the data frame.
3. notch: It is a logical value set as true to draw a notch.
4. varwidth: It is also a logical value set as true to draw the width of the box
proportionate to the sample size.
5. names: It is the group of labels that will be printed under each boxplot.
6. main: It is used to give a title to the graph.
Example
# Giving a name to the chart file.
png(file = "boxplot.png")
# Plotting the chart.
boxplot(mpg ~ cyl, data = mtcars, xlab = "Quantity of Cylinders",
ylab = "Miles Per Gallon", main = "R Boxplot Example")
Output:
In R, we can create a bar chart to visualize data in an efficient manner. For
this purpose, R provides the barplot() function, which has the following syntax:

barplot(h, x, y, main, names.arg, col)
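A minimal sketch (the values and labels are illustrative):

# Creating data for the chart
h <- c(7, 12, 28, 3, 41)
months <- c("Mar", "Apr", "May", "Jun", "Jul")

# Plotting the bar chart
barplot(h, names.arg = months, xlab = "Month",
        ylab = "Revenue", col = "blue", main = "Revenue Chart")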
A pie chart is created with the pie() function, whose basic syntax is:

pie(X, Labels, Radius, Main, Col, Clockwise)

Here,
1. X is a vector that contains the numeric values used in the pie chart.
There are two additional properties of the pie chart, i.e., slice percentage and
chart legend.
We can show the data as percentages, and we can add legends to plots in R
by using the legend() function.
Here,
o fill is the color to use for filling the boxes beside the legend text.
o col defines the color of line and points besides the legend text.
Example
# Creating data for the graph.
x <- c(20, 65, 15, 50)
labels <- c("India", "America", "Shri Lanka", "Nepal")
pie_percent<- round(100*x/sum(x), 1)
# Giving the chart file a name.
png(file = "per_pie.jpg")
# Plotting the chart.
pie(x, labels = pie_percent, main = "Country Pie Chart",col = rainbow(length(x)))
Output:
A histogram is a type of bar chart which shows the frequency of the number
of values which are compared with a set of values ranges.
The histogram is used for the distribution, whereas a bar chart is used for
comparing different entities. In the histogram, each bar represents the height
of the number of values present in the given range.
hist(v,main,xlab,ylab,xlim,ylim,breaks,col,border)
Example
# Creating data for the graph.
v <- c(12,24,16,38,21,13,55,17,39,10,60)
# Giving a name to the chart file.
png(file = "histogram_chart.png")
# Creating the histogram.
hist(v,xlab = "Weight",ylab="Frequency",col = "green",border = "red")
Output:
R – heatmap() Function
Syntax: heatmap(data)
Parameters:
• data: It represent matrix data, such as values of rows and columns
Return: This function draws a heatmap.
Example :-
set.seed(110)

# Creating a random matrix (dimensions assumed: 10 rows and 10 columns)
data <- matrix(rnorm(100, 0, 5), nrow = 10, ncol = 10)

# Draw a heatmap
heatmap(data)
Here, in the above example, the number of rows and columns is specified to
draw the heatmap.
• Data analysis & manipulation: Create new data sets, tools for data
analysis, categorize, club, merge and filter data sets.
• Visualization: This includes interactive graphics and reports.
• Statistics: To confirm and create relationships between variables in
the data.
• Hypothesis testing: Creating models, evaluating and choosing the
right models.
Various predictive models help make forecasts using machine
learning and data mining approaches. The most common ones are described
below.
1. Classification Model
The classification model is one of the most popular predictive analytics
models. These models perform categorical analysis on historical data.
Various industries adopt classification models because they can retrain these
models with current data and as a result, they obtain useful and detailed
insights that help them build appropriate solutions. Classification models are
customizable and are helpful across industries, including banking and retail.
2. Clustering Model
The clustering model gathers data and divides it into groups based on
common characteristics. Hard clustering facilitates data classification,
determining if each data point belongs to a cluster, and soft clustering
allocates a probability to each data point.
3. Outliers Model
Unlike the classification and forecast models, the outlier model deals with
anomalous data items within a dataset. It works by detecting anomalous data,
either on its own or with other categories and numbers. Outlier models are
essential in industries like retail and finance, where detecting abnormalities
can help prevent fraud and large losses.
4. Forecast Model
One of the most prominent predictive analytics models is the forecast model.
It manages metric value predictions by calculating new data values based on
historical data insights. Forecast models also generate numerical values in
historical data if none are present. One of the most powerful features of
forecast models is that they can manage multiple parameters at a time. As a
result, they're one of the most popular predictive models in the market.
Various industries can use a forecast model for different business purposes.
For example, a call center can use forecast analytics to predict how many
support calls they will receive in a day, or a retail store can forecast inventory
for the upcoming holiday sales periods, etc.
5. Time Series Model
Time Series predictive models are helpful if organizations need to know how
a specific variable changes over time. For example, if a small business owner
wishes to track sales over the last four quarters, they will need to use a Time
Series model. It can also look at external factors like seasons or periodical
variations that could influence future trends.
6. Linear Regression
One of the simplest machine learning techniques is linear regression. A
generalized linear model simulates the relationship between one or more
independent factors and the target response (dependent variable). Linear
regression is best suited to predicting a continuous outcome, as sketched below.
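A minimal sketch using R's built-in lm() function and the mtcars dataset:

# fit a simple linear regression: fuel efficiency vs. weight
model <- lm(mpg ~ wt, data = mtcars)
summary(model)

# predict mpg for a hypothetical 3000 lb car (wt is in 1000s of lbs)
predict(model, newdata = data.frame(wt = 3.0))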
7. Logistic Regression
Logistic regression is a statistical technique for describing and explaining
relationships between binary dependent variables and one or more nominal,
interval, or ratio-level independent variables. Logistic regression allows you
to predict the unknown values of a discrete target variable based on the
known values of other variables.
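The same idea for a binary target uses glm() with a binomial family; a minimal sketch on the built-in mtcars data:

# logistic regression: transmission type (0/1) vs. weight and horsepower
model <- glm(am ~ wt + hp, data = mtcars, family = binomial)
summary(model)

# predicted probability for a hypothetical car
predict(model, newdata = data.frame(wt = 2.5, hp = 110), type = "response")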
8. Decision Trees
A decision tree is an algorithm that displays the likely outcomes of various
actions by graphing structured or unstructured data into a tree-like
structure. Decision trees divide different decisions into branches and then list
alternative outcomes beneath each one. It examines the training data and
chooses the independent variable that separates it into the most diverse
logical categories. The popularity of decision trees stems from the fact that
they are simple to understand and interpret.