Data Analytics - Unit 5
Association Rules analysis attempts to find relationships between items. The most common
example of this is market basket analysis.
A rule can be described as: the antecedent itemset implies the consequent itemset. The antecedent and consequent are itemsets, that is, sets of items. In other words, the antecedent is a combination of items that is analyzed to determine what other items are implied by that combination. These implied items are the consequent of the analysis.
Association rule analysis is a robust data mining technique for identifying intriguing
connections and patterns between objects in a collection.
Association rule analysis is widely used in retail, healthcare, and finance industries. These rules
enable organisations to uncover hidden relationships and patterns in data that would otherwise
go unnoticed, providing valuable insights that can inform decision-making and drive
improvement.
Association rule analysis is commonly used for market basket analysis, product
recommendation, fraud detection, and other applications in various domains.
In other words, it helps to find the association between different events or items in a dataset.
Association rule analysis plays a vital role in data mining by providing insights into complex data
relationships that would be difficult to identify manually. It is an important tool for businesses to
understand customer behaviour, preferences, and trends.
For example, retail businesses use association rule analysis to determine which products are
frequently purchased together and to improve product placement and promotion strategies.
Association rule analysis can also be used in medical research to identify potential drug
interactions or adverse effects.
Data Preprocessing
Before performing association rule analysis, it is necessary to preprocess the data. This involves
data cleaning, transformation, and formatting to ensure that the data is in a suitable format for
analysis.
Association rule analysis generates a large number of potential rules, and it is important to
evaluate and select the most relevant rules.
Support:
o The fraction of transactions in the dataset that contain both the antecedent and the consequent. Rules with high support are more significant, as they occur more frequently in the dataset.
Confidence:
o The fraction of transactions containing the antecedent that also contain the consequent. Rules with high confidence are more reliable, as they have a higher probability of being true.
Lift:
o The ratio of the rule's confidence to the support of the consequent. Rules with lift greater than 1 indicate a strong association between the antecedent and consequent, as they occur together more frequently than expected by chance.
For example, if 10 of 100 transactions contain both bread and butter, the support of the rule bread -> butter is 0.10; if 40 transactions contain bread, its confidence is 10/40 = 0.25; and if butter appears in 20 transactions (support 0.20), its lift is 0.25/0.20 = 1.25.
An association rule mining algorithm is a tool used to find patterns and relationships in data.
Several algorithms are used in association rule mining, each with its own strengths and
weaknesses.
Apriori Algorithm
One of the most popular association rule mining algorithms is the Apriori algorithm. The Apriori
algorithm is based on the concept of frequent itemsets, which are sets of items that occur together
frequently in a dataset.
The algorithm works by first identifying all the frequent itemsets in a dataset, and then
generating association rules from those itemsets.
These association rules can then be used to make predictions or recommendations based on the
patterns and relationships discovered.
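As a rough illustration of how this looks in practice, the sketch below uses the apriori() function from the arules package (assumed to be installed); the Groceries transaction dataset ships with arules, and the support and confidence thresholds are illustrative only.
# Mine association rules with the Apriori algorithm (arules package)
library(arules)
data("Groceries")                                # example transaction data bundled with arules
rules <- apriori(Groceries,
                 parameter = list(supp = 0.01, conf = 0.5))
inspect(head(sort(rules, by = "lift"), 3))       # show the top rules ranked by lift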
FP-Growth Algorithm
In large datasets, FP-growth is a popular method for mining frequent item sets.
It generates frequent itemsets efficiently, without generating candidate itemsets, by using a tree-based data structure called the FP-tree. As a result, it is faster and more memory-efficient than the Apriori algorithm when dealing with large datasets.
First, the algorithm constructs an FP-tree from the input dataset, then recursively generates
frequent itemsets from it.
Eclat Algorithm
Equivalence Class Transformation, or Eclat, is another popular algorithm for Association Rule Mining.
Compared to Apriori, Eclat is designed to be more efficient at mining frequent itemsets. There
are a few key differences between the Eclat algorithm and the Apriori algorithm.
To mine the frequent itemsets, Eclat uses a depth-first search strategy instead of candidate
generation. Eclat is also designed to use less memory than the Apriori algorithm, which can be
important when working with large datasets.
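As a rough sketch, the arules package also provides an eclat() function for mining frequent itemsets; the thresholds below are illustrative only.
# Mine frequent itemsets with the Eclat algorithm (arules package)
library(arules)
data("Groceries")
itemsets <- eclat(Groceries, parameter = list(supp = 0.02, minlen = 2))
inspect(head(sort(itemsets, by = "support"), 3)) # most frequent itemsets first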
Decision Trees
Decision trees are a popular machine learning algorithm that can be used for both regression and classification tasks. They are easy to understand, interpret, and implement, making them an ideal choice for beginners in the field of machine learning.
A decision tree is a hierarchical model used in decision support that depicts decisions and their potential outcomes, incorporating chance events, resource expenses, and utility. This algorithmic model utilizes conditional control statements and is a non-parametric, supervised learning method useful for both classification and regression tasks. The tree structure is comprised of a root node, branches, internal nodes, and leaf nodes, forming a hierarchical, tree-like structure.
It is a tool that has applications spanning several different areas. Decision trees can be used for classification as well as regression problems. The name itself suggests that it uses a flowchart-like tree structure to show the predictions that result from a series of feature-based splits. It starts with a root node and ends with a decision made by the leaves.
Before learning more about decision trees let’s get familiar with some of the terminologies:
Root Node: The initial node at the beginning of a decision tree, where the entire
population or dataset starts dividing based on various features or conditions.
Decision Nodes: Nodes resulting from the splitting of root nodes are known as
decision nodes. These nodes represent intermediate decisions or conditions within the
tree.
Leaf Nodes: Nodes where further splitting is not possible, often indicating the final
classification or outcome. Leaf nodes are also referred to as terminal nodes.
Sub-Tree: Similar to a subsection of a graph being called a sub-graph, a sub-section
of a decision tree is referred to as a sub-tree. It represents a specific portion of the
decision tree.
Pruning: The process of removing or cutting down specific nodes in a decision tree to
prevent overfitting and simplify the model.
Branch / Sub-Tree: A subsection of the entire decision tree is referred to as a branch
or sub-tree. It represents a specific path of decisions and outcomes within the tree.
Parent and Child Node: In a decision tree, a node that is divided into sub-nodes is
known as a parent node, and the sub-nodes emerging from it are referred to as child
nodes. The parent node represents a decision or condition, while the child nodes
represent the potential outcomes or further decisions based on that condition.
Decision trees are drawn upside down, which means the root is at the top and is then split into several nodes. In layman's terms, decision trees are nothing but a bunch of if-else statements. The tree checks whether a condition is true, and if it is, it goes to the next node attached to that decision.
In the below diagram, the tree first asks: what is the weather? Is it sunny, cloudy, or rainy? Depending on the answer, it then checks the next feature, such as humidity or wind. It will again check whether the wind is strong or weak; if the wind is weak and it is rainy, the person may go and play.
Did you notice anything in the above flowchart? We see that if the weather is cloudy then we go to play. Why didn't it split further? Why did it stop there?
To answer this question, we need to know about a few more concepts like entropy, information gain, and the Gini index. But in simple terms, the output for the training dataset is always "yes" for cloudy weather; since there is no disorderliness there, we don't need to split the node further.
The goal of the algorithm is to decrease uncertainty, or disorder, in the dataset, and decision trees do this by repeatedly splitting the data.
Now you must be thinking: how do I know what the root node should be? What should the decision nodes be? When should I stop splitting? To decide this, there is a metric called "Entropy", which measures the amount of uncertainty in the dataset.
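As an informal sketch (the class labels below are hypothetical), entropy can be computed in R as follows:
# Entropy of a set of class labels: 0 for a pure node, higher for mixed nodes
entropy <- function(labels) {
  p <- table(labels) / length(labels)    # class proportions
  -sum(p * log2(p))                      # Shannon entropy in bits
}
entropy(c("yes", "yes", "no", "yes"))    # mixed node: about 0.81
entropy(c("yes", "yes", "yes"))          # pure node (like the cloudy branch): 0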
1. Starting at the Root: The algorithm begins at the top, called the “root node,”
representing the entire dataset.
2. Asking the Best Questions: It looks for the most important feature or question that
splits the data into the most distinct groups. This is like asking a question at a fork in
the tree.
3. Branching Out: Based on the answer to that question, it divides the data into smaller
subsets, creating new branches. Each branch represents a possible route through the
tree.
4. Repeating the Process: The algorithm continues asking questions and splitting the
data at each branch until it reaches the final “leaf nodes,” representing the predicted
outcomes or classifications.
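These steps are what tree-learning libraries automate. A minimal sketch in R using the rpart package (assumed available), with the built-in iris dataset standing in for a real problem:
# Grow a classification tree and use it for prediction
library(rpart)
fit <- rpart(Species ~ ., data = iris, method = "class")
print(fit)                                 # text view of the chosen splits
predict(fit, head(iris), type = "class")   # class predictions for a few rows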
Several assumptions are made to build effective models when creating decision trees. These
assumptions help guide the tree’s construction and impact its performance. Here are some
common assumptions and considerations when creating decision trees:
Binary Splits
Decision trees typically make binary splits, meaning each node divides the data into two
subsets based on a single feature or condition. This assumes that each decision can be
represented as a binary choice.
Recursive Partitioning
Decision trees use a recursive partitioning process, where each node is divided into child
nodes, and this process continues until a stopping criterion is met. This assumes that data can
be effectively subdivided into smaller, more manageable subsets.
Feature Independence
Decision trees often assume that the features used for splitting nodes are independent. In
practice, feature independence may not hold, but decision trees can still perform well if
features are correlated.
Homogeneity
Decision trees aim to create homogeneous subgroups in each node, meaning that the samples
within a node are as similar as possible regarding the target variable. This assumption helps
in achieving clear decision boundaries.
Top-Down Greedy Approach
Decision trees are constructed using a top-down, greedy approach, where each split is chosen to maximize information gain or minimize impurity at the current node. This may not always result in the globally optimal tree.
Categorical and Numerical Features
Decision trees can handle both categorical and numerical features. However, they may require different splitting strategies for each type.
Overfitting
Decision trees are prone to overfitting when they capture noise in the data. Pruning and setting appropriate stopping criteria are used to address this issue.
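For instance, the rpart package supports post-pruning via a complexity parameter; the following is a rough sketch, with an illustrative cp threshold:
# Grow a tree, inspect cross-validated error, then prune it
library(rpart)
fit <- rpart(Species ~ ., data = iris, method = "class")
printcp(fit)                      # cross-validated error for each candidate subtree
pruned <- prune(fit, cp = 0.05)   # drop splits that improve the fit by less than cp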
Impurity Measures
Decision trees use impurity measures such as Gini impurity or entropy to evaluate how well
a split separates classes. The choice of impurity measure can impact tree construction.
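As a rough illustration (the labels are hypothetical), Gini impurity can be computed as:
# Gini impurity: 0 for a pure node, larger for mixed nodes
gini <- function(labels) {
  p <- table(labels) / length(labels)   # class proportions
  1 - sum(p^2)
}
gini(c("yes", "no", "yes", "yes"))      # about 0.375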
No Missing Values
Decision trees assume that there are no missing values in the dataset or that missing values
have been appropriately handled through imputation or other methods.
Equal Importance of Features
Decision trees may assume equal importance for all features unless feature scaling or weighting is applied to emphasize certain features.
No Outliers
Decision trees are sensitive to outliers, and extreme values can influence their construction.
Preprocessing or robust methods may be needed to handle outliers effectively.
Sensitivity to Sample Size
Small datasets may lead to overfitting, and large datasets may result in overly complex trees. The sample size and tree depth should be balanced.
R is a powerful programming language and environment for data analysis. It is one of the most popular data science tools because it is designed from the ground up for statistics and data analysis. It is the programming language used throughout this book.
This chapter is primarily designed for readers who have little to no experience with
programming, and hence we devote quite a bit of space to topics like variables and data types.
If you have programming experience, you may quickly skim through this chapter to just learn
the basic R syntax, and how to use RStudio.
RStudio lets you edit your program, colors your code in a way that makes it easier to understand (syntax coloring), allows you to execute it with a simple keypress, and lets you explore data and workspace variables, browse your command history, install packages, and much more.
2.3 Basic R
Here we introduce the very basics of the R language. We start with typing simple commands on the console, and thereafter switch to scripts. If your task requires just 1-2 commands, then it is often easier to type those directly on the console (the lower-left pane in RStudio), while longer sequences are typically better written as a separate script (see below).
One can open a new script through the RStudio menus; the corresponding keyboard shortcut is visible there as well.
Next, let’s re-write these calculations as a script. The easiest way to write scripts is using the
RStudio script editor. Depending on your exact configuration, an “Untitled” script may already
be open, or you can choose from menu File -> New File -> R Script (or Ctrl - Shift - N). This
opens a new R script in a dedicated window (top left in RStudio).
Let's put the same R command in that window. Now the command (or more often, a collection of commands) is called a script or computer program. So the content of your script window will look like
300000*60*60*24*365
This is a script, a very simple, one-line computer program.
“Source” (Ctrl + Shift + S) will execute (source in R parlance) the program. It will not show
the code that you execute, nor any results that are not explicitly printed (see Section 2.6).
“Source with Echo” (Ctrl + Shift + Enter) will also execute the code, but will show both the
code and output, even if not explicitly printed.
2.3.3 Comments
One of the extremely handy and simple features of scripts (and computer programs in general) is comments. These are parts of the code that are ignored by the computer. They are just notes for the human reader (including you!) to make it easier to understand what the code does. Since programs can be opaque and difficult to understand, comments are widely used to add explanations. Even your own code may be quite incomprehensible a few months after writing it.
Comments should be clear, concise, and helpful—they should provide information that is not
otherwise present or “obvious” in the code itself.
In R, we mark text as a comment by putting it after the pound/hashtag symbol (#). Everything from the # until the end of the line is a comment. It is common to put descriptive comments immediately above the code they describe, and sometimes immediately afterwards. One can also put short notes at the end of a line of code:
So the commented light-year script might look like this:
## Length of light-year:
## c by seconds in minute by minutes in hour by
## .. by hours in day by days in year
300000*60*60*24*365
Note that these comments start with a double hash sign ##: only one is needed, but as the computer ignores everything after the first one, it will also ignore the second one. So any number of hash signs is fine!
See Section 7.5.2 for more about how to write good comments.
You can “execute” comments and enter those on the console, but it is not very useful as they
do not do anything.
Comments are also used for temporarily "deleting" parts of the code: if you add comment signs # in front of every line in some part of your code, these lines will be ignored by the computer. But you can easily get them back if you need them again.
In RStudio, you can turn highlighted lines into comments and back by pressing Ctrl - Shift - C.
See more in Section J.
From now on, you can write (or copy) the example code directly into the script window and
execute it using “Source” or “Run”.
2.4 Variables
Since computer programs involve working with lots of data, we need a way to store and refer
to this information. We do this using variables.
However, now our code stores the numbers "2" and "5" in memory under two separate labels (variable names), "x" and "y". You can think of variables as labeled "boxes" for data. You can use the label to refer to the data inside. The numbers can be stored into the boxes (variables) using a special assignment operator <-; it is like an arrow that puts number "2" into a box labelled "x" and number "5" into the box "y". This process is called assignment. Note that the variable name goes on the left and the value comes on the right. Later, we just use the box labels (variable names) to perform tasks with the data that is inside the boxes (variables).
In RStudio, use Alt-- (Alt-minus) to get the <- operator.
See Section J for more.
Now you can imagine that instead of x <- 2 and y <- 5, we may instead write code that asks x from the user, and reads y from a dataset. But the computation, adding x and y, will remain the same. This is the beauty of variables: as long as the computations are the same, we can use the same code.
But variables can also be used to remember and retrieve the values later. This requires a slightly
different code, for instance:
x <- 2
y <- 5
z <- x + y
z
## [1] 7
Note that we store the result of x + y in “z” in a fairly similar manner as how we stored numbers
into “x” and “y”. Just what goes into the box “z” is a result of a calculation, not a given number
as above. Now we have an additional “box” in memory, labeled as “z”. You can see your
variables in RStudio “Environment” pane. You can also see all the variables using
command ls():
ls()
## [1] "x" "y" "z"
This shows that we have defined three variables: "x", "y" and "z".
More specifically, we are talking here about workspace variables or environment variables.
These are the variables that are part of R workspace, and that you can see on the top-right
“Environment” tab in RStudio. These are what programming languages typically call
just variables. Later, in Section 11, we will encounter data variables, stored in the datasets and
not in the workspace.
A note about the last line–it is just “z” and nothing else. This is for printing the result. R console
normally only prints the result if it is not assigned to a variable. If we were writing the code
instead like
x <- 2
y <- 5
z <- x + y
then we do not see any result. The result is still computed, just not printed on screen. The last
lonely “z” prints it in a simple manner (see Section 2.6 for more about printing).
We can use any variable to do computations and store it in any variable. So we can also do like
this:
## to begin with, 'z' contains value '7'
z <- z + 1 # take z, add 1, and store result back in z
z # now it is '8'
## [1] 8
Here we take the number from the "box z", add "1" to it, and "put it back into the same box". This is perfectly valid computer code, and in fact it is widely used for various tasks, such as counting.
2.5.1 Numeric
The default computational data type in R is numeric data. It can represent real numbers (numbers that contain decimals). We can use mathematical operators (such as +, -, *, ^; see below) to do computations with numeric data. There are also numerous functions that work on numeric data (such as calculating sums, averages and square roots).
Numeric data is normally printed in a fairly obvious way, e.g.
1/2
## [1] 0.5
In case of non-finite fraction, only the first few digits are printed:
-1/7
## [1] -0.1428571
If numbers are too large, or too small, then they are printed in exponential form:
1000*2000*3000*4000/1.1
## [1] 2.181818e+13
1/1000/2000/3000
## [1] 1.666667e-10
The exponential form must be understood as 2.181818·10^13 in the former case, and as 1.666667·10^-10 in the latter case. Exponential form can also be used to enter numbers, e.g.
x <- -3e-2 # -0.03
x
## [1] -0.03
Naturally, there are various ways to adjust the way the numbers are printed.
There are also special mathematical constants: pi is π = 3.1415927, and Inf is infinity. You can get infinities when you do certain operations, e.g. divide by zero. You can also use infinity if you need a constant that is larger than any number.
One can use mathematical operators with numeric values. Mathematical operators are the common signs like + and - that allow you to do basic mathematics (to "operate"), plus a few others:
+: addition
-: subtraction
*: multiplication
/: division
%/% is integer division: e.g. 7 %/% 2 equals 3. This is a division that only returns the
integer part and ignores the remainder.
%% is modulo, e.g. 7 %% 2 equals 1–when you divide 7 by 2, then 1 is “left over”.
There are many more mathematical operators, such as matrix product or outer product.
We do not discuss these in this book.
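A quick illustration of these operators on the console (the values are arbitrary):
7 + 2     # addition: 9
7 - 2     # subtraction: 5
7 %/% 2   # integer division: 3
7 %% 2    # modulo (remainder): 1
2^10      # exponentiation: 1024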
2.5.2 Character
Another very common task we do is to perform simple text manipulations. Text data is
called character or string data in R. This may include simple tasks like storing a single letter in
a variable, or changing words to upper case; but it may also include quite complicated text
analysis.
You tell R that something is character data by putting it in quotes (both single quotes ' and double quotes " will do). For instance, we can store the name of a certain well-known playwright in a variable:
famous_poet <- "Bill Shakespeare"
Note that character data is still data, so it can be assigned to a variable just like numeric data! We can print its value by just typing its name on the console, or using dedicated printing functions (see Section 2.6). There are no special operators for character data, though there are many functions for working with strings.
Note that it is not the content but the type of the content that decides if the variable is numeric
or character:
x <- 1 # this is numeric
y <- "1" # this is character
Both variables contain "one", but in the case of "x" it is stored as a number, while in "y" it is stored as a string. This is because 1 (without quotes) is a number and "1" (with quotes) is a character, and the variable automatically "knows" what type of data you put in there. Hence we can do mathematical operations with "x" but not with "y", and text functions with "y" but not with "x":
x+1
## [1] 2
will work, but y + 1 will give an error. If you are unsure what type a particular variable is, you can query it with the function class(), e.g.
class(y)
## [1] "character"
There are no dedicated character operators but there is a plethora of functions dedicated to
manipulating text.
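A few of these string functions, as a quick illustration using the value stored above:
famous_poet <- "Bill Shakespeare"
toupper(famous_poet)             # "BILL SHAKESPEARE"
nchar(famous_poet)               # number of characters: 16
paste("Hello,", famous_poet)     # pastes strings together: "Hello, Bill Shakespeare"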
2.5.3 Logical
The third extremely important variable type is logical variables (a.k.a Boolean variables).
These can only store two values–“true” or “false”. In R, these two values are written
as TRUE and FALSE. Importantly, these are not the strings "TRUE" or "FALSE"; logical
values are a different type! If you write these values in RStudio script window, you see that it
has a special color for these “logical constants”.
logical values are called “booleans” after mathematician and logician George Boole.
But why do we need such "powerless" variables that can only contain two values? Wouldn't it be more useful to use numbers or strings, which can contain much more? It turns out that logical values are extremely important. Namely, most decision-making is logical. We either do this, or we do not do this. And there is a lot of decision-making in computer code. We have to check whether our results are correct (or not), whether the user input makes sense (or not), whether we are done with all inputs or not, and so forth. All these decisions involve only two values, and R has many decision-making tools that rely on such logical values.
You can create logical variables directly, like a <- TRUE but that is rarely useful. Most
commonly we see those as the result of applying comparison operators to data. These are
<: less-than
>: greater-than
<=: less-than-or-equal
>=: greater-than-or-equal
==: equal
!=: not-equal
Note that equality is tested with double equal signs ==, not with single equal sign! For instance
2 == 3
## [1] FALSE
gives you FALSE but you cannot use single equal sign for comparison, 2 = 3 gives an error
instead.
Comparison operators behave in many ways exactly as mathematical operators like + and *,
just they result in logical values:
3<4
## [1] TRUE
3.14 < 3
## [1] FALSE
We can store these values in variables exactly like in case of numbers or strings:
a <- 3
b <- 4
c <- a == b # does 3 equal 4?
c
## [1] FALSE
One can also compare strings. While equality is fairly obvious, ordering is less so; for instance
"cat" > "dog"
## [1] FALSE
turns out to be false. This has nothing to do with the size of the corresponding mammals: the fact that "cat" is "smaller" here means it is located before "dog" when written in alphabetical order.
Logical values also have additional operators, called logical operators or boolean operators. These work only with logical values and they produce logical values. This allows you to build more complex logical expressions. Although their behavior is very similar to that of mathematical operators, logical operators are often confusing for beginners. We are used to working with numbers, but not with logical values.
Logical operators include & (logical and), | (logical or), and ! (logical not). The meaning of
these logical operators corresponds rather closely (but not exactly!) to their meaning in
everyday language. In particular true AND true is true, for instance
x <- 3
y <- 5
x < 4 # TRUE
## [1] TRUE
y > 4 # TRUE
## [1] TRUE
x < 4 & y > 4 # TRUE and TRUE is TRUE
## [1] TRUE
But if any of the involved logical values is false, then logical AND will produce false:
x > 4 & y > 4 # FALSE and TRUE is FALSE
## [1] FALSE
However, you can use logical NOT, ! to reverse the condition:
!(x > 4) & y > 4 # not FALSE and TRUE is TRUE
## [1] TRUE
Note that we need to put x > 4 in parenthesis to tell R that ! applies to x > 4, not on x alone!
Logical OR behaves otherwise similarly, but it is true if at least one of the values involved is true:
pet <- "dog"
weather <- "rain"
pet == "dog" | weather == "sunny"   # example comparison: TRUE or FALSE is TRUE
## [1] TRUE
2.5.4 Integer
The final “atomic” data type we encounter in this book is integer. These are numbers like
“numeric”, but these can only hold integer values. Now again, one may ask why do we need
such limited numbers, but there are a few reasons for this.
First, and most importantly, integer arithmetic is precise. This is not guaranteed to be the case for floating-point "numerics": computers cannot represent an infinite number of decimals, and hence usually only produce results that are close to, but not exactly, right.
The other reason why integers are sometimes preferred is that integer arithmetic may be faster and consume less memory. However, for the computations we encounter in this class, storage and computation speed do not matter.
Integers are produced by certain operations, e.g. when creating sequences.
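For instance (a small sketch; the colon operator is one such operation):
s <- 1:5          # a sequence created with the colon operator
class(s)          # "integer"
is.integer(s)     # TRUE
class(1.5)        # "numeric"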
Base R has two additional "basic types" that we do not discuss in this book:
Complex: Complex (imaginary) numbers have their own data storage type in R; they are created using the i syntax: c <- 1 + 2i.
Raw: a sequence of "raw" data. It is good for storing a "raw" sequence of bytes, such as image data. R does not interpret raw data in any particular way.
2.6 Producing output: cat and print
When you just compute on R console, or even when you write small scripts, it is not necessary
to dedicate any extra effort to printing. The results are automatically printed. This is a common
behavior in R console: the last result will be printed. It is a handy but limited feature.
Output depends on the way the code is executed. If the same script is first "run", that produces the first lines of output on the console, including the result "1". Thereafter, when it is "sourced", only the source() command is printed, but no output.
First, it only prints the “last” value (unless assigned to a variable). Second, this only works in
certain environments, e.g. in RStudio console when running the program, but not when
“sourcing” it (see Section 2.3.2). Third, when writing longer programs, you may want to see
more results than the last one, and maybe also add some explanatory notes. Finally, the result depends on what exactly the "last" value means: the code can either be fed line-by-line, in which case every value is the last one, or all at once, in which case only the last line is the last one.
All this suggests that instead of relying on automatic printing, in more complex projects you may want to use dedicated printing functions. R has two printing commands: cat and print. cat is useful if you want to print simple objects, but potentially more than one object. These may be one or more numbers, strings, and explanatory text. print can output complex objects, but only one at a time.
Next, we illustrate the usage of cat:
## Compute length of light-year
ly <- 300000*60*60*24*365
cat("Length of light-year is", ly, "km\n")
R vectors are equivalent to arrays: they are used to hold multiple data values of the same type. One major key point is that in the R Programming Language the indexing of a vector starts from '1' and not from '0'. We can create numeric vectors and character vectors as well.
Creating a vector
A vector is a basic data structure that represents a one-dimensional array. To create a vector we use the c() function, which is the most common method used in the R Programming Language. Vectors can also be created with the seq() function or the colon operator.
# Creating a vector using the c() function
X <- c(61, 4, 21, 67, 89, 2)
cat('using c function', X, '\n')
# Creating a vector of evenly spaced values using the seq() function
Y <- seq(1, 10, length.out = 5)
cat('using seq() function', Y, '\n')
# Creating a vector of continuous values using the colon operator
Z <- 2:7
cat('using colon', Z)
Output:
using c function 61 4 21 67 89 2
using seq() function 1 3.25 5.5 7.75 10
using colon 2 3 4 5 6 7
Types of R vectors
Vectors are of different types which are used in R. Following are some of the types of vectors:
Numeric vectors
Numeric vectors are those which contain numeric values such as integer, float, etc.
# A numeric (double) vector
v1 <- c(4, 5, 6, 7)
typeof(v1)
# An integer vector: the L suffix marks integer values (values are illustrative)
v2 <- c(4L, 5L, 6L, 7L)
typeof(v2)
Output:
[1] "double"
[1] "integer"
Character vectors
Character vectors in R contain alphanumeric values and special characters.
# A character vector (values are illustrative)
v1 <- c("a", "b", "c")
typeof(v1)
Output:
[1] "character"
Logical vectors
Logical vectors in R contain Boolean values such as TRUE and FALSE, as well as NA for missing values.
# A logical vector (values are illustrative)
v1 <- c(TRUE, FALSE, TRUE, NA)
typeof(v1)
Output:
[1] "logical"
Length of R vector
In R, the length of a vector is determined by the number of elements it contains. We can use the length() function to retrieve the length of a vector.
x <- c(1, 2, 3, 4, 5)
length(x)
# y and z are further example vectors (values are illustrative)
y <- c("a", "b", "c")
length(y)
z <- c(TRUE, FALSE, TRUE, FALSE)
length(z)
Output:
> length(x)
[1] 5
> length(y)
[1] 3
> length(z)
[1] 4
Modifying an R vector
Modification of a vector is the process of applying some operation on an individual element of a vector to change its value in the vector. There are different ways through which we can modify a vector:
# Creating a vector
X <- c(2, 7, 9, 7, 8, 2)
# Modify specific elements using the subscript operator
X[3] <- 1
X[2] <- 9
cat('subscript operator', X, '\n')
# Modify a range of elements
X[1:5] <- 0
cat('Logical indexing', X, '\n')
# Modify by specifying the positions with the combine() function
X <- X[c(1, 2, 3)]
cat('combine() function', X)
Output:
subscript operator 2 9 1 7 8 2
Logical indexing 0 0 0 0 0 2
combine() function 0 0 0
Deleting a vector
Deletion of a vector is the process of deleting all of the elements of the vector. This can be done by assigning it the NULL value.
# Creating a vector (values are illustrative)
M <- c(8, 10, 2, 5)
# Deleting the vector by assigning NULL
M <- NULL
cat('Output vector', M)
Output:
Output vector NULL
Sorting elements of a vector
Elements of a vector can be arranged in ascending or descending order using the sort() function.
# Creating a vector (values are illustrative)
X <- c(8, 2, 7, 1, 11, 2)
# Sort in ascending order
A <- sort(X)
cat('ascending order', A, '\n')
# Sort in descending order
B <- sort(X, decreasing = TRUE)
cat('descending order', B)
Output:
ascending order 1 2 2 7 8 11
descending order 11 8 7 2 2 1
The measure of central tendency in R Language represents the whole set of data by a single
value. It gives us the location of the central points. There are three main measures of central
tendency:
Mean
Median
Mode
First, we read in the dataset and look at its first rows (the CardioGoodFitness.csv dataset is used in all of the following examples):
# R program to illustrate
# Descriptive Analysis
myData = read.csv("CardioGoodFitness.csv",
                  stringsAsFactors=F)
print(head(myData))
Output:
Product Age Gender Education MaritalStatus Usage Fitness Income Miles
Mean
The mean is the sum of the observations divided by the total number of observations; in R it is computed with the mean() function.
# R program to illustrate
# Descriptive Analysis
myData = read.csv("CardioGoodFitness.csv",
stringsAsFactors=F)
mean = mean(myData$Age)
print(mean)
Output:
[1] 28.78889
Median
The median is the middle value of the sorted data; in R it is computed with the median() function.
# R program to illustrate
# Descriptive Analysis
myData = read.csv("CardioGoodFitness.csv",
stringsAsFactors=F)
median = median(myData$Age)
print(median)
Output:
[1] 26
Mode
The mode is the most frequently occurring value. Base R does not provide a function for the statistical mode, so here it is derived from a frequency table:
# R program to illustrate
# Descriptive Analysis
myData = read.csv("CardioGoodFitness.csv",
                  stringsAsFactors=F)
mode = function() {
  return(sort(-table(myData$Age))[1])
}
mode()
Output:
 25 
-25
The name (25) is the most frequent age in the data; the value (-25) is its frequency with the sign flipped by the -table() trick.
Alternatively, we can use the modeest package in R. This package provides methods to find the mode of univariate data and the mode of common probability distributions.
Example:
# R program to illustrate
# Descriptive Analysis
library(modeest)
myData = read.csv("CardioGoodFitness.csv",
stringsAsFactors=F)
mode = mfv(myData$Age)
print(mode)
Output:
[1] 25
Standard Deviation
The standard deviation is a measure of the dispersion of the values. It can also be defined as the square root of the variance.
Formula of sample standard deviation:
s = sqrt( Σ (xᵢ - x̄)² / (N - 1) )
where,
s = sample standard deviation
N = Number of entities
x̄ = Mean of entities
Basically, there are two different ways to calculate the standard deviation in the R programming language; both of them are discussed below.
Method 1: Using the standard formula
In this method of calculating the standard deviation, we apply the above formula of the sample standard deviation directly in R.
Example 1:
v <- c(12,24,74,32,14,29,84,56,67,41)
s<-sqrt(sum((v-mean(v))^2/(length(v)-1)))
print(s)
Output:
[1] 25.53886
Example 2:
v <- c(1.8,3.7,9.2,4.7,6.1,2.8,6.1,2.2,1.4,7.9)
s<-sqrt(sum((v-mean(v))^2/(length(v)-1)))
print(s)
Output:
[1] 2.676004
Method 2: Using the built-in sd() function
R also provides the sd() function, which computes the sample standard deviation directly.
Example 1:
v <- c(12,24,74,32,14,29,84,56,67,41)
s<-sd(v)
print(s)
Output:
[1] 25.53886
Example 2:
v <- c(71,48,98,65,45,27,39,61,50,24,17)
s1<-sqrt(sum((v-mean(v))^2/(length(v)-1)))
print(s1)
s2<-sd(v)
print(s2)
Output:
[1] 23.52175
[1] 23.52175
Example 3:
v <- c(1.8,3.7,9.2,4.7,6.1,2.8,6.1,2.2,1.4,7.9)
s1<-sqrt(sum((v-mean(v))^2/(length(v)-1)))
print(s1)
s2<-sd(v)
print(s2)
Output:
[1] 2.676004
[1] 2.676004
Variance
The variance is a numerical measure of how the data values are dispersed around the mean. In particular, the sample variance is defined as:
s² = Σ (xᵢ - x̄)² / (n - 1)
Similarly, the population variance is defined in terms of the population mean μ and population size N:
σ² = Σ (xᵢ - μ)² / N
Example:
Let’s consider there are 8 data points,
2, 4, 4, 4, 5, 5, 7, 9
In R, the sample variance is computed with the var() function.
Parameters:
x: numeric vector
Example 1:
list = c(2, 4, 4, 4, 5, 5, 7, 9)
print(var(list))
Output:
[1] 4.571429
Example 2:
print(var(list))
Output:
[1] 22666.7
In statistics, Logistic Regression is a model that takes response variables (dependent variable) and features (independent variables) to determine the estimated probability of an event. A logistic model is used when the response variable has categorical values such as 0 or 1. For example, whether a student will pass or fail, whether a mail is spam or not, image classification, etc. In this article, we'll discuss regression analysis, types of regression, and the implementation of logistic regression in R programming.
Regression Analysis in R
Regression analysis is a group of statistical processes used in R programming and statistics to
determine the relationship between dataset variables. Generally, regression analysis is used to
determine the relationship between the dependent and independent variables of the dataset.
Regression analysis helps to understand how dependent variables change when one of the
independent variables changes and other independent variables are kept constant. This helps in
building a regression model and further, helps in forecasting the values with respect to a change in
one of the independent variables. On the basis of the type of dependent variable, the number of independent variables, and the shape of the regression line, there are 4 types of regression analysis techniques, i.e., Linear Regression, Logistic Regression, Multinomial Logistic Regression, and Ordinal Logistic Regression.
Types of Regression Analysis
1. Linear Regression
Linear Regression is one of the most widely used regression techniques to model the
relationship between two variables. It uses a linear relationship to model the regression line.
There are 2 variables used in the linear relationship equation i.e., predictor variable and the
response variable.
y = ax + b
where,
y is the response variable
x is the predictor variable
a and b are the coefficients
The regression line created using this technique is a straight line. The response variable is derived from the predictor variables, and the predictor variables are estimated using some statistical experiments. Linear regression is widely used, but this technique is not capable of predicting probabilities.
2. Logistic Regression
On the other hand, logistic regression has an advantage over linear regression as it is capable of predicting values within a fixed range. Logistic regression is used to predict values within a categorical range. For example, male or female, winner or loser, etc.
3. Multinomial Logistic Regression
Multinomial logistic regression is used when the response variable has more than two categories with no natural order.
4. Ordinal Logistic Regression
Ordinal logistic regression is used when the categories of the response variable follow a natural order.
Implementation of Logistic Regression in R
In R, a logistic regression model is fitted with the glm() (generalized linear model) function.
Parameters:
formula: represents an equation on the basis of which the model has to be fitted.
family: represents the type of function to be used, i.e., binomial for logistic regression.
To know about more optional parameters of the glm() function, use the below command in R:
help("glm")
Example:
Let us assume a vector IQ holding the IQ level of students in a class. Another vector, result, contains the result of the corresponding student, i.e., fail or pass (0 or 1) in an exam.
# Combine the two vectors into a data frame
df <- as.data.frame(cbind(IQ, result))
print(df)
Output:
IQ result
1 25.46872 0
2 26.72004 0
3 27.16163 0
4 27.55291 1
5 27.72577 0
6 28.00731 0
7 28.18095 0
8 28.28053 0
9 28.29086 0
10 28.34474 1
11 28.35581 1
12 28.40969 0
13 28.72583 0
14 28.81105 0
15 28.87337 1
16 29.00383 1
17 29.01762 0
18 29.03629 0
19 29.18109 1
20 29.39251 0
21 29.40852 0
22 29.78844 0
23 29.80456 1
24 29.81815 0
25 29.86478 0
26 29.91535 1
27 30.04204 1
28 30.09565 0
29 30.28495 1
30 30.39359 1
31 30.78886 1
32 30.79307 1
33 30.98601 1
34 31.14602 0
35 31.48225 1
36 31.74983 1
37 31.94705 1
38 31.94772 1
39 33.63058 0
40 35.35096 1
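The example above stops at building the data frame. A minimal sketch of actually fitting the logistic model on it with glm() (output not shown; the IQ value used for the prediction is illustrative) could look like this:
# Fit a logistic regression of result on IQ
fit <- glm(result ~ IQ, data = df, family = binomial)
summary(fit)                          # coefficients and their significance
# Predicted probability of passing for a student with IQ 30
predict(fit, newdata = data.frame(IQ = 30), type = "response")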
Correlation is a statistical measure that indicates how strongly two variables are related. It can involve the relationship between multiple variables as well. For instance, if one is interested in knowing whether there is a relationship between the heights of fathers and sons, a correlation coefficient can be calculated to answer this question. Generally, it lies between -1 and +1. It is a scaled version of covariance and provides the direction and strength of a relationship.
Pearson Correlation Testing in R
There are mainly two types of correlation:
1. Parametric Correlation – Pearson correlation (r): It measures the linear dependence between two variables (x and y) and is known as a parametric correlation test because it depends on the distribution of the data.
2. Non-Parametric Correlation – Kendall(tau) and Spearman(rho): They are rank-based
correlation coefficients, and are known as non-parametric correlation.
Pearson Correlation Coefficient Formula
r = Σ (xᵢ - x̄)(yᵢ - ȳ) / sqrt( Σ (xᵢ - x̄)² · Σ (yᵢ - ȳ)² )
Pearson correlation is a parametric correlation. The Pearson correlation coefficient is probably the most widely used measure for linear relationships between two normally distributed variables and thus is often just called the "correlation coefficient".
Note:
r takes a value between -1 (negative correlation) and 1 (positive correlation).
r = 0 means no correlation.
Cannot be applied to ordinal variables.
The sample size should be moderate (20-30) for good estimation.
Outliers can lead to misleading values, i.e., it is not robust to outliers.
Implementation in R
The R Programming Language provides two methods to calculate the Pearson correlation coefficient: the functions cor() and cor.test(). It can be noted that cor() computes the correlation coefficient, whereas cor.test() computes the test for association or correlation between paired samples. cor.test() returns both the correlation coefficient and the significance level (or p-value) of the correlation.
Syntax: cor(x, y, method) and cor.test(x, y, method)
Parameters:
x, y: numeric vectors with the same length
method: correlation method ("pearson", "kendall", or "spearman")
# R program to illustrate
# Pearson Correlation Testing
# Using cor()
# x and y are assumed to be numeric vectors of equal length defined earlier
# Calculating the correlation coefficient using the cor() method
result = cor(x, y, method = "pearson")
print(result)
# R program to illustrate
# Pearson Correlation Testing
# Using cor.test()
# Calculating the correlation coefficient and the significance of the correlation
result = cor.test(x, y, method = "pearson")
print(result)
Output:
    Pearson's product-moment correlation
data: x and y
t = 1.4186, df = 5, p-value = 0.2152
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.3643187 0.9183058
sample estimates:
      cor
0.5357143
In the output above:
T is the value of the test statistic (T = 1.4186)
p-value is the significance level of the test statistic (p-value = 0.2152).
alternative hypothesis is a character string describing the alternative hypothesis (true
correlation is not equal to 0).
sample estimates is the correlation coefficient. For Pearson correlation coefficient it’s
named as cor (Cor.coeff = 0.5357).