
Unit – 5 : Data Analysis Techniques – II

Association rules analysis

Association rule analysis attempts to find relationships between items. The most common
example of this is market basket analysis.

The rules are of the form: an antecedent itemset implies a consequent itemset. The
antecedent and consequent are itemsets, that is, sets of items. In other words, the antecedent is
a combination of items that is analyzed to determine what other items are implied by this
combination. These implied items are the consequent of the analysis.

Association rule analysis is a robust data mining technique for identifying intriguing
connections and patterns between objects in a collection.
Association rule analysis is widely used in retail, healthcare, and finance industries. These rules
enable organisations to uncover hidden relationships and patterns in data that would otherwise
go unnoticed, providing valuable insights that can inform decision-making and drive
improvement.

Association rule analysis is commonly used for market basket analysis, product
recommendation, fraud detection, and other applications in various domains.

In other words, it helps to find the association between different events or items in a dataset.

Importance of Association Rule Analysis In Data Mining

Association rule analysis plays a vital role in data mining by providing insights into complex data
relationships that would be difficult to identify manually. It is an important tool for businesses to
understand customer behaviour, preferences, and trends.

For example, retail businesses use association rule analysis to determine which products are
frequently purchased together and to improve product placement and promotion strategies.

Association rule analysis can also be used in medical research to identify potential drug
interactions or adverse effects.

Basic Concepts and Terminology

The following terms are commonly used in association rule analysis:

 Item: An element or attribute of interest in the dataset


 Transaction: A collection of items that occur together


 Support: The frequency with which an item or itemset appears in the dataset.
o Support(A → B) = (transactions containing both A and B) / (total transactions)
 Confidence: The likelihood that the consequent occurs given that the antecedent has
occurred.
o Confidence(A → B) = Support(A and B) / Support(A)
 Lift: A measure of how much more often the antecedent and consequent occur together
than would be expected if they were independent.
o Lift(A → B) = Confidence(A → B) / Support(B)
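
To make these measures concrete, here is a minimal base-R sketch (the five toy baskets, the item names, and the support() helper are all invented for illustration) that computes support, confidence, and lift for the rule {bread} → {butter}:

# Five hypothetical market baskets
transactions <- list(c("bread", "butter", "milk"),
                     c("bread", "butter"),
                     c("bread", "jam"),
                     c("butter", "milk"),
                     c("bread", "butter", "jam"))
# fraction of transactions that contain all the given items
support <- function(items) {
  mean(sapply(transactions, function(t) all(items %in% t)))
}
supp_AB <- support(c("bread", "butter"))  # 3 of 5 baskets: 0.6
conf <- supp_AB / support("bread")        # 0.6 / 0.8 = 0.75
lift <- conf / support("butter")          # 0.75 / 0.8 = 0.9375
cat("support", supp_AB, "confidence", conf, "lift", lift, "\n")

A lift below 1, as here, means the two items co-occur slightly less often than independence would predict.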

Data Preprocessing

Before performing association rule analysis, it is necessary to preprocess the data. This involves
data cleaning, transformation, and formatting to ensure that the data is in a suitable format for
analysis.

Data preprocessing steps may include:

 Removing duplicate or irrelevant data


 Handling missing or incomplete data
 Converting data to a suitable format (e.g., binary or numerical)
 Discretizing continuous variables into categorical variables
 Scaling or normalizing data

Measures For Evaluating Association Rules

Association rule analysis generates a large number of potential rules, and it is important to
evaluate and select the most relevant rules.

The following measures are commonly used to evaluate association rules:

 Support:
o Rules with high support are more significant as they occur more frequently in the dataset
 Confidence:
o Rules with high confidence are more reliable, as they have a higher probability of being true
 Lift:
o Rules with high lift indicate a strong association between the antecedent and consequent, as
they occur together more frequently than expected by chance

Association Rule Mining Algorithms


An association rule mining algorithm is a tool used to find patterns and relationships in data.
Several algorithms are used in association rule mining, each with its own strengths and
weaknesses.

Apriori Algorithm

One of the most popular association rule mining algorithms is the Apriori algorithm. The Apriori
algorithm is based on the concept of frequent itemsets, which are sets of items that occur together
frequently in a dataset.

The algorithm works by first identifying all the frequent itemsets in a dataset, and then
generating association rules from those itemsets.

These association rules can then be used to make predictions or recommendations based on the
patterns and relationships discovered.
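
In R, the Apriori algorithm is commonly used through the arules package; the sketch below is a hypothetical illustration (the support and confidence thresholds are arbitrary, and Groceries is an example dataset shipped with arules):

# install.packages("arules") if needed
library(arules)
data("Groceries")                # example transaction data from arules
rules <- apriori(Groceries,
                 parameter = list(supp = 0.01, conf = 0.5))
inspect(head(sort(rules, by = "lift"), 5))   # strongest rules by lift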

FP-Growth Algorithm

In large datasets, FP-growth is a popular method for mining frequent item sets.

It efficiently generates frequent itemsets without candidate generation, using a tree-based
data structure called the FP-tree. As a result, it is faster and more memory-efficient than
the Apriori algorithm when dealing with large datasets.


First, the algorithm constructs an FP-tree from the input dataset, then recursively generates
frequent itemsets from it.

Eclat Algorithm

Equivalence Class Transformation, or Eclat, is another popular algorithm for Association Rule
Mining.

Compared to Apriori, Eclat is designed to be more efficient at mining frequent itemsets. There
are a few key differences between the Eclat algorithm and the Apriori algorithm.

To mine the frequent itemsets, Eclat uses a depth-first search strategy instead of candidate
generation. Eclat is also designed to use less memory than the Apriori algorithm, which can be
important when working with large datasets.
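
The same arules package also provides an Eclat implementation; here is a minimal sketch (thresholds again chosen arbitrarily) for mining frequent itemsets:

library(arules)
data("Groceries")
itemsets <- eclat(Groceries,
                  parameter = list(supp = 0.02, minlen = 2))
inspect(head(sort(itemsets, by = "support"), 5))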


Decision Trees

Decision trees are a popular machine learning algorithm that can be used for both regression and
classification tasks. They are easy to understand, interpret, and implement, making them an ideal
choice for beginners in the field of machine learning.

What is a Decision Tree?

A decision tree is a non-parametric supervised learning algorithm for classification and
regression tasks. It has a hierarchical tree structure consisting of a root node, branches,
internal nodes, and leaf nodes, and it produces easy-to-understand models.

As a decision-support tool, a decision tree depicts decisions and their potential outcomes,
incorporating chance events, resource costs, and utility. The model works through conditional
control statements: each internal node tests a condition, and each branch represents an
outcome of that test.

It is a tool with applications spanning several different areas. Decision trees can be used
for classification as well as regression problems. The name itself suggests a flowchart-like
tree structure that shows the predictions resulting from a series of feature-based splits.
It starts with a root node and ends with a decision made by the leaves.


Decision Tree Terminologies

Before learning more about decision trees let’s get familiar with some of the terminologies:

 Root Node: The initial node at the beginning of a decision tree, where the entire
population or dataset starts dividing based on various features or conditions.
 Decision Nodes: Nodes resulting from the splitting of root nodes are known as
decision nodes. These nodes represent intermediate decisions or conditions within the
tree.
 Leaf Nodes: Nodes where further splitting is not possible, often indicating the final
classification or outcome. Leaf nodes are also referred to as terminal nodes.
 Sub-Tree: Similar to a subsection of a graph being called a sub-graph, a sub-section
of a decision tree is referred to as a sub-tree. It represents a specific portion of the
decision tree.
 Pruning: The process of removing or cutting down specific nodes in a decision tree to
prevent overfitting and simplify the model.
 Branch / Sub-Tree: A subsection of the entire decision tree is referred to as a branch
or sub-tree. It represents a specific path of decisions and outcomes within the tree.
 Parent and Child Node: In a decision tree, a node that is divided into sub-nodes is
known as a parent node, and the sub-nodes emerging from it are referred to as child
nodes. The parent node represents a decision or condition, while the child nodes
represent the potential outcomes or further decisions based on that condition.

Example of Decision Tree


Let’s understand decision trees with the help of an example:

Decision trees are drawn upside down: the root is at the top, and the root is then split into
several nodes. In layman’s terms, a decision tree is a bunch of if-else statements: it checks
whether a condition is true and, if so, moves on to the next node attached to that decision.

In the diagram below, the tree first asks: what is the weather? Is it sunny, cloudy, or rainy?
Depending on the answer, it moves to the next feature, humidity or wind. It then checks, for
example, whether the wind is strong or weak; if the wind is weak and it is rainy, the person
may go and play.


Did you notice anything in the above flowchart? We see that if the weather is cloudy, we
go to play. Why didn’t it split more? Why did it stop there?

To answer this question, we need to know about a few more concepts like entropy, information
gain, and the Gini index. But in simple terms, the output in the training dataset is always
yes for cloudy weather; since there is no disorderliness here, we don’t need to split the
node further.

The goal here is to decrease uncertainty or disorder in the dataset, and for this we use
decision trees.

Now you must be wondering: how do I know what the root node should be? What should the
decision nodes be? When should I stop splitting? To decide this, there is a metric called
“entropy”, which measures the amount of uncertainty in the dataset.
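
As a rough sketch (the entropy() helper and the toy labels below are made up for illustration), entropy for a set of class labels can be computed in R as -sum(p * log2(p)) over the class proportions p:

# entropy of a vector of class labels
entropy <- function(labels) {
  p <- table(labels) / length(labels)  # class proportions
  -sum(p * log2(p))
}
entropy(c("yes", "yes", "yes", "yes"))  # 0: pure node, nothing to split
entropy(c("yes", "yes", "no", "no"))    # 1: maximum disorder for two classes

A pure node, like the cloudy branch above, has entropy 0, which is why it is not split further.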

How do decision tree algorithms work?

The decision tree algorithm works in a few simple steps:

1. Starting at the Root: The algorithm begins at the top, called the “root node,”
representing the entire dataset.
2. Asking the Best Questions: It looks for the most important feature or question that
splits the data into the most distinct groups. This is like asking a question at a fork in
the tree.
3. Branching Out: Based on the answer to that question, it divides the data into smaller
subsets, creating new branches. Each branch represents a possible route through the
tree.


4. Repeating the Process: The algorithm continues asking questions and splitting the
data at each branch until it reaches the final “leaf nodes,” representing the predicted
outcomes or classifications.
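
These steps can be seen end to end with the rpart package; the following is a minimal sketch (the built-in iris dataset and the default settings are chosen only for illustration, not taken from the notes above):

# install.packages("rpart") if needed
library(rpart)
# Species is the target; all other columns are candidate features
fit <- rpart(Species ~ ., data = iris, method = "class")
print(fit)                                 # text view of the splits
predict(fit, head(iris), type = "class")   # predictions for a few rows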

Decision Tree Assumptions

Several assumptions are made to build effective models when creating decision trees. These
assumptions help guide the tree’s construction and impact its performance. Here are some
common assumptions and considerations when creating decision trees:

Binary Splits

Decision trees typically make binary splits, meaning each node divides the data into two
subsets based on a single feature or condition. This assumes that each decision can be
represented as a binary choice.

Recursive Partitioning

Decision trees use a recursive partitioning process, where each node is divided into child
nodes, and this process continues until a stopping criterion is met. This assumes that data can
be effectively subdivided into smaller, more manageable subsets.

Feature Independence

Decision trees often assume that the features used for splitting nodes are independent. In
practice, feature independence may not hold, but decision trees can still perform well if
features are correlated.

Homogeneity

Decision trees aim to create homogeneous subgroups in each node, meaning that the samples
within a node are as similar as possible regarding the target variable. This assumption helps
in achieving clear decision boundaries.

Top-Down Greedy Approach

Decision trees are constructed using a top-down, greedy approach, where each split is chosen
to maximize information gain or minimize impurity at the current node. This may not always
result in the globally optimal tree.


Categorical and Numerical Features

Decision trees can handle both categorical and numerical features. However, they may require
different splitting strategies for each type.

Overfitting

Decision trees are prone to overfitting when they capture noise in the data. Pruning and setting
appropriate stopping criteria are used to address this assumption.

Impurity Measures

Decision trees use impurity measures such as Gini impurity or entropy to evaluate how well
a split separates classes. The choice of impurity measure can impact tree construction.
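
For intuition, here is a minimal sketch of Gini impurity, 1 - sum(p^2) over the class proportions p (the gini() helper name is invented):

gini <- function(labels) {
  p <- table(labels) / length(labels)  # class proportions
  1 - sum(p^2)
}
gini(c("yes", "yes", "yes", "yes"))  # 0: perfectly pure node
gini(c("yes", "yes", "no", "no"))    # 0.5: maximally impure for two classes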

No Missing Values

Decision trees assume that there are no missing values in the dataset or that missing values
have been appropriately handled through imputation or other methods.

Equal Importance of Features

Decision trees may assume equal importance for all features unless feature scaling or
weighting is applied to emphasize certain features.

No Outliers

Decision trees are sensitive to outliers, and extreme values can influence their construction.
Preprocessing or robust methods may be needed to handle outliers effectively.

Sensitivity to Sample Size

Small datasets may lead to overfitting, and large datasets may result in overly complex trees.
The sample size and tree depth should be balanced.


Introduction to R

R is a powerful programming language and environment for data analysis. It is one of the most
popular data science tools because it is designed from the ground up for statistics and data
analysis. It is the programming language used throughout this book.
This chapter is primarily designed for readers who have little to no experience with
programming, and hence we devote quite a bit of space to topics like variables and data types.
If you have programming experience, you may quickly skim through this chapter to just learn
the basic R syntax and how to use RStudio.

2.1 What is R and why do you want to use it?


R is a programming language that allows you to write code to work with data. It is designed
from the ground up for this task–statistics and data processing.
R is called “R” because it was inspired by and comes after the language “S”, a language
for statistics developed by AT&T.
There are many other languages that are good for working with data. We have selected R
because of its simplicity–as a language designed for such tasks from the ground up, its tools
are rather simple. This is also a reason why R is very popular in areas like health and social
sciences–data processing in R is typically easier and requires less coding than in more general
languages.
Working with R (and other programming languages) means writing formal instructions for
your computer, which the computer will then execute. The instructions can be written in different
“languages”, more precisely programming languages, and the computer needs tools to
understand each of these. The R software you installed above (see Section 1.1) is one such tool.
As projects grow, it will become useful not to issue the instructions one-by-one, but to write
them all down in a single file, and then tell the computer to execute all of those instructions at
once. This list of instructions is called a script or program or code. Writing scripts is
called programming or coding. Executing or “running” a script will cause each instruction
(line of code) to be run in order, one after the other, just as if you had typed them in one by
one. Writing scripts allows you to save, share, and re-use your work. By saving instructions in
a file (or set of files), you can easily check, change, and re-execute the list of instructions as
you figure out how to use data to answer questions.
As you begin working with data in R, you will be writing multiple instructions (lines of code)
and saving them in files with the .R extension, representing R scripts. Through this course we
use RStudio for this task, but if you wish, you can use any text editor.

2.2 How to Run R


The primary way to use R in this course is through RStudio (see below). However, R can
also be used without RStudio.
RStudio is an open-source integrated development environment (IDE) that provides an
informative user interface for interacting with the R interpreter. If you haven’t done this already,
make sure to download and install the free version of RStudio (see Section 1.2 above). IDEs
are glorified text editors that provide various other handy tools for programming. For instance,


RStudio lets you edit your program, colors your code in a way that makes understanding it
easier (syntax coloring), allows you to execute it with a simple keypress, explore data and
workspace variables and your command history, install packages, and much more.

RStudio’s default user interface. Red texts are annotations.


When you open RStudio (either by searching for it or double-clicking a desktop icon), you’ll
see an interface that looks something like this. By default, the RStudio interface consists of
4 panes–small windows for different tasks (you can customize this layout if you wish):
 Console: The bottom-left pane is a console, the R command line for entering R commands. The
console will also show your code.
Normally you use the console for quick computations and short sequences of 1-2 lines of code.
Longer blocks of code are usually easier to handle as scripts.
 Script: The top-left pane is a text editor for writing R code, markdown, and other files. It
contains a plethora of tools for working with R (and some other) code, including syntax coloring
(coloring code according to its function), auto-completion, text formatting, and easy ways to
execute your code. Note that this pane is hidden if there are no open scripts; select File > New
File > R Script from the menu to create a new script file.
 Environment: The top-right pane displays information about the current R environment
(workspace)—specifically, the information that you have stored inside workspace variables (see
Section 2.4 below). In the example in the RStudio script window, the value 201 is stored in a
variable called x. You’ll often create dozens of variables within a script, and the Environment
pane helps you keep track of which values you have stored in what variables.
 Plots, packages, help, etc.: The bottom right pane contains multiple tabs for accessing various
information about your files and code. When you create visualizations, those plots will also be
in that pane. Most importantly, this is also where you can access the documentation. If you have
a question about how something in R works, this is a good place to start!
Note, you can use the small spaces between the panes to adjust the size of each area to your
liking. You can also use menu options to reorganize the panes if you wish. The most useful
tools are focusing and zooming. Focusing means moving your cursor and input into a particular
pane, e.g. Ctrl + 1 makes the script pane active and Ctrl + 2 makes the console pane active. Using
keyboard shortcuts to move your focus is much faster than grabbing the mouse.
Zooming is similar to focusing, except it also hides the other panes and makes the zoomed one
full size. Ctrl + Shift + 1 zooms to the script, Ctrl + Shift + 2 zooms to the console, and Ctrl + Shift +
0 restores the original 4-pane view. Zooming to individual panes is very useful if you are
working on a small screen. See the View > Panes menu and the options therein; the menus also
list the keyboard shortcuts.
See Section J for more information.

2.3 Basic R


Here we introduce the very basics of the R language. We start with typing simple commands on
the console, and thereafter switch to scripts. If your task requires just 1-2 commands, it is often
easier to type those directly on the console (the lower-left pane in RStudio), while longer
sequences are typically better written as a separate script (see below).

2.3.1 Entering commands on console


The R Console is a small window where you can type in R commands. We can start with simple
arithmetic. Write 1 + 1 at the R command prompt and hit Enter. R replies with [1] 2. Below we
write these steps as:
1+1
## [1] 2
The first block shows the commands you issue in the R console, and underneath is ## followed by
R’s reply (the answer). R’s reply contains the answer, 2, and a marker [1]. The marker
relates to the fact that one command may produce many answers, and this is the first of
those (see more in Section 4 below).
This is how we can use R as a powerful calculator. The other arithmetic operations are pretty
easy and intuitive: - for subtraction, * for multiplication, / for division and ^ for
exponentiation. Only exponentiation is somewhat non-standard; different programming
languages have different habits here. R knows that multiplication must be done before addition;
if you want the opposite, you need parentheses:
1 + 2*3
## [1] 7
(1 + 2)*3
## [1] 9
Let’s now compute something that is hard to do manually–namely the length of a light-year.
A light-year is the distance that light, moving 300,000 kilometers per second, covers in one year:
300000*60*60*24*365
## [1] 9.4608e+12
Here we take the speed of light and multiply it by the seconds in a minute (60), minutes in an hour (60),
hours in a day (24) and days in a year (365). R prints the answer in exponential form; it must be
understood as 9.4608×10^12, i.e. almost 10 trillion kilometers.
You cannot just click on the previously entered command and edit it. But in RStudio, you
can use the up arrow to retrieve the previously entered command, edit it, and re-run.
See more in Section J.

2.3.2 Writing scripts


One can open a new script through the RStudio menus; the corresponding keyboard shortcut is
visible there as well.
Next, let’s re-write this calculation as a script. The easiest way to write scripts is using the
RStudio script editor. Depending on your exact configuration, an “Untitled” script may already
be open, or you can choose from the menu File -> New File -> R Script (or Ctrl - Shift - N). This
opens a new R script in a dedicated window (top left in RStudio).
Let’s put the same R command in that window. Now the command (or more often, a collection
of commands) is called a script or computer program. So the content of your script window will
look like
300000*60*60*24*365
This is a script: a very simple, one-line computer program.

Location of “Source” button in R Script window in RStudio.


The next task is to run the script, which means executing all the commands there (or, in this case,
the only command we have there). RStudio offers several ways to do it:


 “Source” (Ctrl + Shift + S) will execute (source in R parlance) the program. It will not show
the code that you execute, nor any results that are not explicitly printed (see Section 2.6).
 “Source with Echo” (Ctrl + Shift + Enter) will also execute the code, but will show both the
code and output, even if not explicitly printed.

Location of “Run” button in R Script window in RStudio.


Another handy way to execute code is the “Run” button (Ctrl + Enter / ⌘ + Enter).
This executes either the region that is highlighted, or the command where the cursor is currently
located if there is no highlight. In the example figure at right, this will execute the line “1 + 2
+ 3”, and show both the code and the result in the “Console” window.
Finally, you may want to save your script using a better name than “Untitled”. Use the
menu: File -> Save As… to pick a good name.
Note: the normal R prompt is “>”; “+” is the continuation prompt shown when a command is incomplete.

2.3.3 Comments
One of the extremely handy and simple features of scripts (and computer programs in general)
is comments. These are parts of the code that are ignored by the computer. They are just notes for
the human reader (including you!) to make it easier to understand what the code does. Since
programs can be opaque and difficult to understand, comments are widely used to add
explanations. Even your own code may be quite incomprehensible a few months after writing
it.
Comments should be clear, concise, and helpful—they should provide information that is not
otherwise present or “obvious” in the code itself.
In R, we mark text as a comment by putting it after the pound/hashtag symbol (#). Everything
from the # until the end of the line is a comment. It is common to put descriptive comments
immediately above the code they describe, and sometimes immediately afterwards. One can also
put short notes at the end of a line of code.
So the commented light-year script might look like this:
## Length of light-year:
## c by seconds in minute by minutes in hour by
## .. by hours in day by days in year
300000*60*60*24*365


Note that these comments start with a double hash sign ##: only one is needed, but as the
computer ignores everything after the first one, it will also ignore the second one. So any
number of hash signs is fine!
See Section 7.5.2 for more about how to write good comments.
You can “execute” comments and enter them on the console, but it is not very useful as they
do not do anything.
Comments are also used for temporarily “deleting” parts of the code–if you add comment
signs # in front of every line in some part of your code, these lines will be ignored by the
computer. But you can easily get them back if you need them again.
In RStudio, you can turn highlighted lines into comments and back by pressing Ctrl - Shift - C.
See more in Section J.

From now on, you can write (or copy) the example code directly into the script window and
execute it using “Source” or “Run”.

2.4 Variables
Since computer programs involve working with lots of data, we need a way to store and refer
to this information. We do this using variables.

2.4.1 What are variables


For instance, if we want to add numbers, we can just write
2+5
## [1] 7
This is a good way to compute something where we know the inputs (numbers “2” and “5”) and
we just want to print the output. But quite often we want to do something similar when we do
not know what the numbers are. It may sound a bit counter-intuitive–how on earth can we
compute something if we do not know the inputs?–but there are many valid reasons for that.
For instance, we may ask for the input from the user. Or the input may be a date or time, and we do
not know when someone will run our program. Or the input is read from a dataset, and it may
be one of many datasets. In such cases we cannot “hardcode” our computations like 2 + 5.
We must keep the program open to learn the actual input values later. This can be done using
variables.
The same example above, just using variables, may look like
x <- 2
y <- 5
x+y
## [1] 7
So what is the difference? After all, we still got the same number?


However, now our code stores the numbers, “2” and “5”, in memory under two separate labels
(variable names) “x” and “y”. You can think of variables as labeled “boxes” for data. You can
use the label to refer to the data inside. The numbers are stored into the boxes (variables)
using a special assignment operator <-; it is like an arrow that puts the number “2” into a box
labelled “x” and the number “5” into the box “y”. This process is called assignment. Note
that the variable name goes on the left, and the value on the right. Later, we just use the box
labels (variable names) to perform tasks with the data inside the boxes (variables).
In RStudio, use Alt-- (Alt-minus) to get the <- operator.
See Section J for more.
Now you can imagine that instead of x <- 2 and y <- 5, we may instead write code that
asks x from the user, and reads y from a dataset. But the computation, adding x and y, will remain
the same. This is the beauty of variables: as long as the computations are the same, we can use
the same code.
But variables can also be used to remember and retrieve the values later. This requires a slightly
different code, for instance:
x <- 2
y <- 5
z <- x + y
z
## [1] 7
Note that we store the result of x + y in “z” in a fairly similar manner to how we stored the numbers
into “x” and “y”. Just what goes into the box “z” is the result of a calculation, not a given number
as above. Now we have an additional “box” in memory, labeled “z”. You can see your
variables in the RStudio “Environment” pane. You can also see all the variables using the
command ls():
ls()
## [1] "x" "y" "z"
This shows that we have defined three variables: “x”, “y” and “z”.
More specifically, we are talking here about workspace variables or environment variables.
These are the variables that are part of R workspace, and that you can see on the top-right
“Environment” tab in RStudio. These are what programming languages typically call
just variables. Later, in Section 11, we will encounter data variables, stored in the datasets and
not in the workspace.
A note about the last line–it is just “z” and nothing else. This is for printing the result. The R console
normally only prints a result if it is not assigned to a variable. If we were instead writing the code
like
x <- 2
y <- 5
z <- x + y


then we do not see any result. The result is still computed, just not printed on screen. The last
lonely “z” prints it in a simple manner (see Section 2.6 for more about printing).
We can use any variable in computations and store the result in any variable. So we can also do
this:
## to begin with, 'z' contains value '7'
z <- z + 1 # take z, add 1, and store result back in z
z # now it is '8'
## [1] 8
Here we take the number from the “box z”, add “1” to it, and “put it back into the same box”.
This is perfectly valid computer code, and in fact widely used for various tasks, such as
counting.

2.4.2 Variable names


In the example above, we used single-letter variable names. But they need not be single
letters only; they may be much longer. In fact, you are fairly free to choose any kind of names
you want, but there are some rules: variable names must begin with a letter and can contain any
combination of letters, numbers, periods (.), or underscores (_).
Here are a few examples of valid variable names:
x <- 1
xx <- 2
x1 <- 3
anotherX <- 4 # camelCase
one_more_x <- 5 # snake_case
beta.2 <- 6
All these styles have their advantages and disadvantages; in general, pick shorter names for
shorter scripts and long descriptive names for large complex projects. You can pick all kinds of
variable names, but they should be descriptive and informative about what the “boxes” contain.
Confusing or misleading variable names are a major problem in programming. See more in
Section 7.5.1.
A good example of how to use variables and choose variable names is here:
minutes_in_day <- 60*24
Variable names are case-sensitive, so “x” and “X” are two different variables. In the example
above, Minutes_in_day will not work:
Minutes_in_day
## Error in eval(expr, envir, enclos): object 'Minutes_in_day' not found
Here are some examples of invalid variable names:
1x <- 7 # starts with a number
new x <- 7 # contains space
price$ <- 8 # contains $
This code will not work and will produce errors.
You can see what value is inside any variable by typing that variable name as a line of code:


x
## [1] 1

2.5 Data Types


In the previous section, we were only working with numeric values. We did some computations
and stored those in variables. But there is data that is not numbers.
The two most important non-numeric data types are text (strings) and logical values. Using
other data types is very similar to using numbers. For instance,
greeting <- "Hi!" # text
answer <- TRUE # logical
R is intelligent enough to understand that if we have code x <- 7, then x will contain a numeric
value (and so we can do math with it!), and if you write y <- "blah-blah-blah", then it is text,
and we can convert it to upper case instead.
There are four “basic types” (called atomic data types) in R that we encounter in this book.

2.5.1 Numeric
The default computational data type in R is numeric data. It can represent real numbers
(numbers that contain decimals). We can use mathematical operators (such as +, -, *, ^,
see below) to do computations with numeric data. There are also numerous
functions that work on numeric data (such as calculating sums, averages and square roots).
Numeric data is normally printed in a fairly obvious way, e.g.
1/2
## [1] 0.5
In the case of a non-terminating fraction, only the first few digits are printed:
-1/7
## [1] -0.1428571
If numbers are too large, or too small, then they are printed in exponential form:
1000*2000*3000*4000/1.1
## [1] 2.181818e+13
1/1000/2000/3000
## [1] 1.666667e-10
The exponential form must be understood as 2.181818×10^13 in the former case,
and as 1.666667×10^-10 in the latter case. Exponential form can also be used to
enter numbers, e.g.
x <- -3e-2 # -0.03
x
## [1] -0.03
Naturally, there are various ways to adjust the way the numbers are printed.


There are also special mathematical constants: pi is π = 3.1415927…, and Inf is infinity. You can
get infinities when you do certain operations, e.g. divide by zero. You can also use infinity if
you need a constant that is larger than any number.
One can use mathematical operators with numeric values. Mathematical operators are the
common signs like + and - that allow you to do basic mathematics (to “operate”), plus a few others:

+: addition

-: subtraction

*: multiplication

/: division

^: exponentiation (i.e. 2^3 means 2*2*2).


These are defined for most numbers, except for a few corner cases, such as division by zero.
The other way to do math, besides operators, is with functions. We’ll talk more about those
below in Section 3.2.
Besides these well known mathematical operations, there are more, for instance

 %/% is integer division: e.g. 7 %/% 2 equals 3. This is a division that only returns the
integer part and ignores the remainder.
 %% is modulo, e.g. 7 %% 2 equals 1–when you divide 7 by 2, then 1 is “left over”.
 There are many more mathematical operators, such as matrix product or outer product.
We do not discuss these in this book.

2.5.2 Character
Another very common task is simple text manipulation. Text data is
called character or string data in R. This may include simple tasks like storing a single letter in
a variable, or changing words to upper case; but it may also include quite complicated text
analysis.
You tell R that something is character data by putting it in quotes (both single quotes ' and
double quotes " will do). For instance, we can store the name of a certain well-known playwright
in a variable: famous_poet <- "Bill Shakespeare". Note that character data is still data, so it
can be assigned to a variable just like numeric data! We can print its value by just typing its
name on the console, or using dedicated printing functions (see Section 2.6). There are no
special operators for character data, though there are many functions for working with strings.
Note that it is not the content but the type of the content that decides whether a variable is numeric
or character:
x <- 1 # this is numeric
y <- "1" # this is character


Both variables contain “one”, but in “x” it is stored as a number, while in “y” it is stored as a
string. This is because 1 (without quotes) is a number and "1" (with quotes) is a character, and
the variable automatically “knows” what type of data you put in there. Hence we can do
mathematical operations with “x” but not with “y”, and text functions with “y” but not with
“x”:
x+1
## [1] 2
will work, but y + 1 will give an error. If you are unsure what the type of a particular variable is,
you can query it with the function class(), e.g.
class(y)
## [1] "character"

There are no dedicated character operators, but there is a plethora of functions dedicated to
manipulating text.
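
A few of these base-R string functions, sketched briefly:

greeting <- "Hi there!"
nchar(greeting)                   # 9: number of characters
toupper(greeting)                 # "HI THERE!"
substr(greeting, 1, 2)            # "Hi": characters 1 through 2
paste(greeting, "How are you?")   # glue strings together with a space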

2.5.3 Logical
The third extremely important variable type is logical (a.k.a. Boolean) variables.
These can only store two values–“true” or “false”. In R, these two values are written
as TRUE and FALSE. Importantly, these are not the strings "TRUE" or "FALSE"; logical
values are a different type! If you write these values in the RStudio script window, you see that it
has a special color for these “logical constants”.
Logical values are called “Booleans” after the mathematician and logician George Boole.
But why do we need such “powerless” variables that can only contain two values? Wouldn’t it
be more useful to use numbers or strings that can contain much more? It turns out that logical
values are extremely important. Namely, most decision-making is logical. We either do this,
or we do not do this. And there is a lot of decision-making in computer code. We have to
check if our results are correct (or not), if the user input makes sense (or not), if we are done
with all inputs or not, and so forth. All these decisions involve only two values, and R has many
decision-making tools that rely on such logical values.
You can create logical variables directly, like a <- TRUE, but that is rarely useful. Most
commonly we see them as the result of applying comparison operators to data. These are

<: less than

>: greater than

<=: less-than-or-equal

>=: greater-than-or-equal

==: equal


!=: not-equal
Note that equality is tested with the double equal sign ==, not with a single equal sign! For instance
2 == 3
## [1] FALSE
gives you FALSE; you cannot use a single equal sign for comparison–2 = 3 gives an error instead.
instead.
Comparison operators behave in many ways exactly like mathematical operators such as + and *;
they just result in logical values:
3<4
## [1] TRUE
3.14 < 3
## [1] FALSE
We can store these values in variables exactly like in case of numbers or strings:
a <- 3
b <- 4
c <- a == b # does 3 equal 4?
c
## [1] FALSE

One can also compare strings. While equality is fairly obvious, ordering is less so. For instance
"cat" > "dog"
## [1] FALSE
turns out to be false. This has nothing to do with the size of the corresponding mammals–the
fact that cat is “smaller” here means it is located before dog in alphabetical order.

Logical values also have additional operators, called logical operators or Boolean operators.
These work only with logical values and they produce logical values. This allows you to build
more complex logical expressions. Although their behavior is very similar to that of
mathematical operators, logical operators are often confusing for beginners–we are used to
working with numbers, not with logical values.
Logical operators include & (logical and), | (logical or), and ! (logical not). The meaning of
these logical operators corresponds rather closely (but not exactly!) to their meaning in
everyday language. In particular, true AND true is true, for instance
x <- 3
y <- 5

x < 4 # TRUE
## [1] TRUE
y > 4 # TRUE


## [1] TRUE
x < 4 & y > 4 # TRUE and TRUE is TRUE
## [1] TRUE
But if any of the involved logical values is false, then logical AND will produce false:
x > 4 & y > 4 # FALSE and TRUE is FALSE
## [1] FALSE
However, you can use logical NOT, !, to reverse a condition:
!(x > 4) & y > 4 # not FALSE and TRUE is TRUE
## [1] TRUE
Note that we need to put x > 4 in parentheses to tell R that ! applies to x > 4, not to x alone!
Logical OR behaves similarly, but it is true if at least one of the values involved is
true:
pet <- "dog"
weather <- "rain"

# Check if pet is "cat" OR "dog"


pet == "cat" | pet == "dog"
## [1] TRUE
# Check if pet is "dog" OR weather is "sunny"
pet == "dog" | weather == "sunny"
## [1] TRUE
It’s easy to write complex expressions with logical operators. If you find yourself getting lost,
I recommend rethinking your question to see if there is a simpler way to express it!

2.5.4 Integer
The final “atomic” data type we encounter in this book is integer. These are numbers like
“numeric”, but they can only hold integer values. Again, one may ask why we need
such limited numbers, but there are a few reasons:

 First, and most importantly, integer arithmetic is precise. This is not guaranteed for
floating-point “numerics”–computers cannot represent an infinite number of decimals, and
hence usually produce results that are close to, but not exactly, right.
 The other reason why integers are sometimes preferred is that integer arithmetic may be faster and
consume less memory. However, for the computations we encounter in this class, storage and
computation speed do not matter.
Integers are produced by certain operations, e.g. when creating sequences.
Base R has two additional “basic types” that we do not discuss in this book:

 Complex: Complex (imaginary) numbers have their own data storage type in R; they are
created using the i syntax: c <- 1 + 2i.


 Raw: a sequence of “raw” data. It is good for storing a “raw” sequence of bytes, such as
image data. R does not interpret raw data in any particular way.
2.6 Producing output: cat and print
When you just compute on the R console, or even when you write small scripts, it is not necessary
to dedicate any extra effort to printing. The results are automatically printed. This is a common
behavior of the R console: the last result will be printed. It is a handy but limited feature.

Output depends on the way the code is executed. The same script is first “run”, which produces
the first lines of output on the console, including the result “1”. Thereafter it is “sourced”, which
only prints the source() command, but no output.
First, it only prints the “last” value (unless assigned to a variable). Second, this only works in
certain environments, e.g. in the RStudio console when running the program, but not when
“sourcing” it (see Section 2.3.2). Third, when writing longer programs, you may want to see
more results than the last one, and maybe also add some explanatory notes. Finally, the result
depends on what exactly the “last” value means–the code can either be fed line-by-line, in
which case every value is the last one, or all at once, in which case only the last line is the last
one…
All this suggests that instead of relying on automatic printing, in more complex projects you may
want to use dedicated printing functions. R has two printing commands: cat and print. cat is
useful if you want to print simple objects, but potentially more than one object. These may be
one or more numbers, strings, and explanatory text. print can output complex objects, but only
one at a time.
Next, we illustrate the usage of cat:
## Compute length of light-year
ly <- 300000*60*60*24*365
cat("Length of light-year is", ly, "km\n")


## Length of light-year is 9.4608e+12 km


This short script computes the length of a light-year and prints it with a small informative
message. Alternatively, we can just compute the number and let the R console automatically
print it:
ly <- 300000*60*60*24*365
ly
## [1] 9.4608e+12
Why should we use cat then? Automatic printing is good enough if you work interactively
on the console, or just run very short code snippets. But if the code is not run on the R console, the
number may not even be printed. Also, if a script computes and prints many
results, the user easily gets confused about what the numbers mean. So it is a good habit to output
your results together with a brief explanation.
The syntax of cat is pretty simple: it takes a list of arguments–texts, variables and numbers you
want to print. One very useful symbol you may want to add is the newline character "\n". (Note:
it uses a backslash, "\n", not a forward slash, "/n".) This forces printing to jump to the next line:
## output on single line:
cat("hi there\n")
## hi there
## output on multiple lines
cat("hi\n there\n") # jump to new line
## hi
## there
print is somewhat similar to cat, but designed to output more complex objects, such
as vectors, lists, and data frames. print may produce multi-line output, but it does not allow you to
add explanatory messages. You have to cat the message and print your complex object
thereafter.
Obviously, output does not have to be printed on the console; it may also be sent to a file,
uploaded to the internet, or played as audio instead. But whatever the exact format, it is important
to ensure the user has enough information to understand what the output is.
Finally, let’s use the tools we learned above, and re-write the light-year script in a way that
looks more like normal computer code:
## Compute the length of lightyear
c <- 300000 # speed of light (km/s)
lightMinute <- c*60
lightHour <- lightMinute*60
lightDay <- lightHour*24
lightYear <- lightDay*365
cat("Lightyear is", lightYear, "km\n")
## Lightyear is 9.4608e+12 km


R vectors are the equivalent of arrays in other languages: they hold multiple data values
of the same type. One major point is that in R the indexing of a vector starts
from ‘1’, not from ‘0’. We can create numeric vectors and character vectors
as well.

Creating a vector
A vector is a basic data structure that represents a one-dimensional array. To create a vector we
use the c() function, which is the most common method in R.

# R program to create Vectors

# we can use the c function

# to combine the values as a vector.

# By default the type will be double

X<- c(61, 4, 21, 67, 89, 2)

cat('using c function', X, '\n')


# seq() function for creating

# a sequence of continuous values.

# length.out defines the length of vector.

Y<- seq(1, 10, length.out = 5)

cat('using seq() function', Y, '\n')

# use':' to create a vector

# of continuous values.

Z<- 2:7

cat('using colon', Z)

Output:
using c function 61 4 21 67 89 2
using seq() function 1 3.25 5.5 7.75 10
using colon 2 3 4 5 6 7

Types of R vectors
Vectors come in different types. The following are some of the most common:
Numeric vectors
Numeric vectors are those which contain numeric values such as integer, float, etc.

# R program to create numeric Vectors


# creation of vectors using c() function.

v1<- c(4, 5, 6, 7)

# display type of vector

typeof(v1)

# by using 'L' we can specify that we want integer values.

v2<- c(1L, 4L, 2L, 5L)

# display type of vector

typeof(v2)

Output:
[1] "double"
[1] "integer"

Character vectors
Character vectors in R contain alphanumeric values and special characters.

# R program to create Character Vectors


# by default numeric values

# are converted into characters

v1<- c('geeks', '2', 'hello', 57)

# Displaying type of vector

typeof(v1)

Output:
[1] "character"

Logical vectors
Logical vectors in R contain the Boolean values TRUE and FALSE, plus NA for missing values.

# R program to create Logical Vectors

# Creating logical vector

# using c() function

v1<- c(TRUE, FALSE, TRUE, NA)

# Displaying type of vector

typeof(v1)

Output:


[1] "logical"

Length of R vector
In R, the length of a vector is determined by the number of elements it contains. We can use
the length() function to retrieve the length of a vector.

# Create a numeric vector

x <- c(1, 2, 3, 4, 5)

# Find the length of the vector

length(x)

# Create a character vector

y <- c("apple", "banana", "cherry")

# Find the length of the vector

length(y)

# Create a logical vector

z <- c(TRUE, FALSE, TRUE, TRUE)

# Find the length of the vector


length(z)

Output:
> length(x)
[1] 5

> length(y)
[1] 3

> length(z)
[1] 4

Accessing R vector elements


Accessing elements of a vector means operating on an individual element of the vector. There
are many ways to access the elements of a vector; the most common is the ‘[]’ subscript
operator.
Note: Vectors in R use 1-based indexing, unlike C, Python, etc.

# R program to access elements of a Vector

# accessing elements with an index number.

X<- c(2, 5, 18, 1, 12)

cat('Using Subscript operator', X[2], '\n')

# by passing a range of values

# inside the vector index.


Y<- c(4, 8, 2, 1, 17)

cat('Using combine() function', Y[c(4, 1)], '\n')

Output:
Using Subscript operator 5
Using combine() function 1 4

Modifying an R vector
Modification of a vector is the process of applying an operation to an individual element of
a vector to change its value. There are different ways to modify a vector:

# program to modify elements of a Vector

# Creating a vector

X<- c(2, 7, 9, 7, 8, 2)

# modify a specific element

X[3] <- 1

X[2] <- 9

cat('subscript operator', X, '\n')


# Modify using different logics.

X[1:5]<- 0

cat('Logical indexing', X, '\n')

# Modify by specifying

# the position of elements.

X<- X[c(3, 2, 1)]

cat('combine() function', X)

Output:
subscript operator 2 9 1 7 8 2
Logical indexing 0 0 0 0 0 2
combine() function 0 0 0

Deleting a vector

Deletion of a vector is the process of deleting all of the elements of the vector. This can be done
by assigning it the NULL value.

# program to delete a Vector

# Creating a Vector

M<- c(8, 10, 2, 5)


# set NULL to the vector

M<- NULL

cat('Output vector', M)

Output:
Output vector


Sorting elements of a Vector

The sort() function lets us sort the values of a vector in ascending or descending
order.

# program to sort elements of a Vector

# Creation of Vector

X<- c(8, 2, 7, 1, 11, 2)

# Sort in ascending order

A<- sort(X)

cat('ascending order', A, '\n')

# sort in descending order

# by setting decreasing as TRUE

B<- sort(X, decreasing = TRUE)

cat('descending order', B)

Output:
ascending order 1 2 2 7 8 11
descending order 11 8 7 2 2 1


A measure of central tendency in R represents a whole set of data by a single
value, giving us the location of its central point. There are three main measures of central
tendency:
 Mean
 Median
 Mode

Mean, Median and Mode in R Programming


Prerequisite:
Before doing any computation, we first need to prepare the data, saving it in an external .txt
or .csv file; it is best practice to save the file in the current working directory. After that,
import the data into R as follows:


# R program to import data into R

# Import the data using read.csv()

myData = read.csv("CardioGoodFitness.csv",

stringsAsFactors=F)

# Print the first 6 rows

print(head(myData))

Output:
Product Age Gender Education MaritalStatus Usage Fitness Income Miles


1 TM195 18 Male 14 Single 3 4 29562 112


2 TM195 19 Male 15 Single 2 3 31836 75
3 TM195 19 Female 14 Partnered 4 3 30699 66
4 TM195 19 Male 12 Single 3 3 32973 85
5 TM195 20 Male 13 Partnered 4 2 35247 47
6 TM195 20 Female 14 Partnered 3 3 32973 66


Mean in R Programming Language


It is the sum of the observations divided by the total number of observations, i.e. the average:

mean = (x1 + x2 + … + xn) / n

where n = number of terms.
Example:


# R program to illustrate

# Descriptive Analysis

# Import the data using read.csv()

myData = read.csv("CardioGoodFitness.csv",

stringsAsFactors=F)

# Compute the mean value

mean = mean(myData$Age)

print(mean)

Output:
[1] 28.78889


Median in R Programming Language


It is the middle value of the data set. It splits the data into two halves. If the number of elements
in the data set is odd then the center element is median and if it is even then the median would
be the average of two central elements.

For an odd number of terms n, the median is the middle value; for an even n it is the average of
the two central values, (x[n/2] + x[n/2 + 1]) / 2.
Syntax: median(x, na.rm = FALSE)
Where x is a vector and na.rm indicates whether missing values should be removed
Example:


# R program to illustrate

# Descriptive Analysis

# Import the data using read.csv()

myData = read.csv("CardioGoodFitness.csv",

stringsAsFactors=F)

# Compute the median value


median = median(myData$Age)

print(median)

Output:
[1] 26


Mode in R Programming Language


It is the value that has the highest frequency in the given data set. The data set may have no mode
if the frequency of all data points is the same. Also, we can have more than one mode if we
encounter two or more data points having the same frequency. There is no inbuilt function for
finding mode in R, so we can create our own function for finding the mode or we can use the
package called modeest.

Creating a user-defined function for finding Mode


There is no built-in function for finding the mode in R, so let’s create a user-defined function
that returns the mode of the data passed to it. We will use the table() method, as it creates
a categorical representation of the data with the variable values and their frequencies. We will
sort the table of the Age column in descending order and return the first value from
the sorted values.
Example: Finding mode by sorting the column of the data frame

# Import the data using read.csv()

myData = read.csv("CardioGoodFitness.csv",

stringsAsFactors=F)

mode = function(){

return(sort(-table(myData$Age))[1])

}

mode()

Output:
25: -25
Here the name (25) is the mode; the table was negated so that ascending sort puts the most
frequent value first, which is why its count appears as -25.

Using Modeest Package


We can use the modeest package in R. This package provides methods to find the mode of
univariate data and the mode of common probability distributions.
Example:

# R program to illustrate

# Descriptive Analysis

# Import the library

library(modeest)

# Import the data using read.csv()

myData = read.csv("CardioGoodFitness.csv",

stringsAsFactors=F)

# Compute the mode value

mode = mfv(myData$Age)

print(mode)

Output:
[1] 25


Standard deviation in R is a measure of the dispersion of the values. It can also be defined as
the square root of the variance.
Formula for the sample standard deviation:

s = sqrt( sum((x - x̄)^2) / (N - 1) )

where,
 s = sample standard deviation
 N = number of entities
 x̄ = mean of the entities
Basically, there are two different ways to calculate the standard deviation in R; both are
discussed below.

Method 1: Naive approach

In this method we compute the standard deviation by writing the sample standard deviation
formula above directly in R.
Example 1:

v <- c(12,24,74,32,14,29,84,56,67,41)

s<-sqrt(sum((v-mean(v))^2/(length(v)-1)))

print(s)

Output:
[1] 25.53886


Example 2:

v <- c(1.8,3.7,9.2,4.7,6.1,2.8,6.1,2.2,1.4,7.9)

s<-sqrt(sum((v-mean(v))^2/(length(v)-1)))

print(s)

Output:
[1] 2.676004

Method 2: Using sd()

The sd() function is used to return the standard deviation.


Syntax: sd(x, na.rm = FALSE)
Parameters:
 x: a numeric vector, matrix or data frame.
 na.rm: should missing values be removed?
Return: The sample standard deviation of x.

Example 1:

v <- c(12, 24, 74, 32, 14, 29, 84, 56, 67, 41)

s <- sd(v)
print(s)

Output:
[1] 25.53886
Example 2:

v <- c(71, 48, 98, 65, 45, 27, 39, 61, 50, 24, 17)

# Naive approach
s1 <- sqrt(sum((v - mean(v))^2) / (length(v) - 1))
print(s1)

# Using sd()
s2 <- sd(v)
print(s2)

Output:
[1] 23.52175
[1] 23.52175
Example 3:

v <- c(1.8, 3.7, 9.2, 4.7, 6.1, 2.8, 6.1, 2.2, 1.4, 7.9)

# Naive approach
s1 <- sqrt(sum((v - mean(v))^2) / (length(v) - 1))
print(s1)

# Using sd()
s2 <- sd(v)
print(s2)

Output:
[1] 2.676004
[1] 2.676004
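One practical note: if the vector contains missing values, sd() returns NA unless na.rm = TRUE is supplied. A short illustration on toy data:

v <- c(12, 24, NA, 32)

sd(v)                # NA, because of the missing value
sd(v, na.rm = TRUE)  # standard deviation of the remaining three values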

Variance

The variance is a numerical measure of how the data values are dispersed around the mean. In particular, the sample variance is defined as:

s² = Σ (xᵢ − x̄)² / (n − 1)

Similarly, the population variance is defined in terms of the population mean μ and the population size N:

σ² = Σ (xᵢ − μ)² / N

Example:
Let's consider 8 data points:
2, 4, 4, 4, 5, 5, 7, 9
Their mean is (2 + 4 + 4 + 4 + 5 + 5 + 7 + 9) / 8 = 5, and the sum of squared deviations from the mean is 9 + 1 + 1 + 1 + 0 + 0 + 4 + 16 = 32. The sample variance is therefore 32 / 7 ≈ 4.571429, while the population variance would be 32 / 8 = 4.

Computing Variance in R Programming

One can calculate the variance by using the var() function in R.


Syntax: var(x)

Parameters:
x: numeric vector

Example 1:

# R program to get the variance of a list

# Taking a list of elements
list <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Calculating variance using var()
print(var(list))

Output:
[1] 4.571429
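R's var() returns the sample variance (dividing by n − 1), which matches the 32 / 7 computed in the worked example above. To get the population variance (dividing by n) instead, rescale the result; a short sketch:

x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

var(x)                 # sample variance: 32 / 7 = 4.571429
var(x) * (n - 1) / n   # population variance: 32 / 8 = 4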
Example 2:

# R program to get the variance of a list

# Taking a list of elements
list <- c(212, 231, 234, 564, 235)

# Calculating variance using var()
print(var(list))

Output:
[1] 22666.7
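Since the standard deviation is the square root of the variance, sd() and var() can be cross-checked against each other:

v <- c(12, 24, 74, 32, 14, 29, 84, 56, 67, 41)

sqrt(var(v))   # 25.53886, the same value sd(v) returned earlier
sd(v)^2        # equals var(v)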

Regression Analysis in R Programming

In statistics, logistic regression is a model that takes a response variable (dependent variable) and features (independent variables) to estimate the probability of an event. A logistic model is used when the response variable takes categorical values such as 0 or 1: for example, whether a student will pass or fail, whether an email is spam or not, classifying images, etc. In this article, we'll discuss regression analysis, types of regression, and the implementation of logistic regression in R programming.
Regression Analysis in R
Regression analysis is a group of statistical processes used in R programming and statistics to
determine the relationship between dataset variables. Generally, regression analysis is used to
determine the relationship between the dependent and independent variables of the dataset.
Regression analysis helps to understand how dependent variables change when one of the
independent variables changes and other independent variables are kept constant. This helps in
building a regression model and further, helps in forecasting the values with respect to a change in
one of the independent variables. On the basis of types of dependent variables, a number of
independent variables, and the shape of the regression line, there are 4 types of regression analysis
techniques i.e., Linear Regression, Logistic Regression, Multinomial Logistic Regression, and
Ordinal Logistic Regression.
Types of Regression Analysis
1. Linear Regression
Linear Regression is one of the most widely used regression techniques to model the
relationship between two variables. It uses a linear relationship to model the regression line.
Two variables are used in the linear relationship equation, i.e., the predictor variable and the response variable.
y = ax + b
where,
 y is the response variable
 x is the predictor variable
 a and b are the coefficients
The regression line created using this technique is a straight line. The response variable is derived from the predictor variables, whose values are estimated through statistical experiments. Linear regression is widely used, but the technique is not capable of predicting probabilities (a minimal lm() sketch is given after this list).
2. Logistic Regression
On the other hand, logistic regression has an advantage over linear regression, as its predictions are probabilities bounded between 0 and 1. Logistic regression is used to predict values within a categorical range, for example, male or female, winner or loser, etc.
3. Multinomial Logistic Regression

Multinomial logistic regression is an extension of logistic regression in which the response variable can take more than 2 categories, unlike ordinary logistic regression, which handles only 2. For example, a biology researcher who finds a new species may determine its type from many factors such as size, shape, eye color, the environmental factors of its habitat, etc.
4. Ordinal Logistic Regression
Ordinal logistic regression is also an extension of logistic regression. It is used to predict values across ordered levels of a category; in simple words, it predicts a rank. For example, a restaurant may survey the taste quality of its food, and using ordinal logistic regression the survey response variable can be recorded on an ordered scale such as 1-10, which helps in determining the customers' response to its food items.
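As mentioned under linear regression above, a straight-line model is fitted in R with the lm() function. A minimal sketch with made-up data (the variable names and values here are purely illustrative):

# Hypothetical data: hours studied vs. marks scored
hours <- c(1, 2, 3, 4, 5, 6, 7, 8)
marks <- c(35, 42, 50, 55, 62, 68, 75, 80)

# Fit the linear model marks = a * hours + b
fit <- lm(marks ~ hours)

# The coefficients give the intercept (b) and slope (a)
print(coef(fit))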

Implementation of Logistic Regression in R programming


In R, a logistic regression model is created using the glm() function.
Syntax: glm(formula, family = binomial)

Parameters:
formula: represents the equation on the basis of which the model is to be fitted.
family: represents the error distribution and link function to be used, i.e., binomial for logistic regression.

To know about more optional parameters of the glm() function, use the below command in R:
help("glm")
Example:
Let us assume a vector of the IQ levels of students in a class. Another vector contains the result of each corresponding student, i.e., fail or pass (0 or 1), in an exam.

# Generate random IQ values with mean = 30 and sd = 2
# (rnorm() is random, so the values below will differ between runs
# unless set.seed() is called first)
IQ <- rnorm(40, 30, 2)

# Sort IQ levels in ascending order
IQ <- sort(IQ)

# Generate a vector with the pass (1) and fail (0) results of 40 students
result <- c(0, 0, 0, 1, 0, 0, 0, 0, 0, 1,
            1, 0, 0, 0, 1, 1, 0, 0, 1, 0,
            0, 0, 1, 0, 0, 1, 1, 0, 1, 1,
            1, 1, 1, 0, 1, 1, 1, 1, 0, 1)

Unit – 5 : Data Analytics III BCA


Page 51 of 56

# Data frame
df <- as.data.frame(cbind(IQ, result))

# Print data frame
print(df)

Output:
IQ result
1 25.46872 0
2 26.72004 0
3 27.16163 0
4 27.55291 1
5 27.72577 0
6 28.00731 0
7 28.18095 0
8 28.28053 0
9 28.29086 0
10 28.34474 1
11 28.35581 1
12 28.40969 0
13 28.72583 0
14 28.81105 0
15 28.87337 1
16 29.00383 1
17 29.01762 0
18 29.03629 0
19 29.18109 1
20 29.39251 0
21 29.40852 0

22 29.78844 0
23 29.80456 1
24 29.81815 0
25 29.86478 0
26 29.91535 1
27 30.04204 1
28 30.09565 0
29 30.28495 1
30 30.39359 1
31 30.78886 1
32 30.79307 1
33 30.98601 1
34 31.14602 0
35 31.48225 1
36 31.74983 1
37 31.94705 1
38 31.94772 1
39 33.63058 0
40 35.35096 1

# Plotting IQ on the x-axis and result on the y-axis
plot(IQ, result, xlab = "IQ Level",
     ylab = "Probability of Passing")

# Create a logistic model
g <- glm(result ~ IQ, family = binomial, df)

# Draw the fitted curve from the model's predictions
curve(predict(g, data.frame(IQ = x), type = "resp"), add = TRUE)
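Once the model g is fitted, predict() with type = "response" returns estimated pass probabilities for new data; a short follow-up sketch (the IQ values below are illustrative only):

# Estimated probability of passing for a few illustrative IQ levels
new_iq <- data.frame(IQ = c(27, 30, 33))
predict(g, new_iq, type = "response")

# Coefficients, deviance and significance of the fitted model
summary(g)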

Pearson Correlation Testing in R Programming

Correlation is a statistical measure that indicates how strongly two variables are related; it can involve the relationship between multiple variables as well. For instance, if one is interested in knowing whether there is a relationship between the heights of fathers and sons, a correlation coefficient can be calculated to answer this question. It generally lies between -1 and +1; it is a scaled version of covariance and provides both the direction and the strength of a relationship.
Pearson Correlation Testing in R
There are mainly two types of correlation:
1. Parametric Correlation – Pearson correlation (r): It measures the linear dependence between two variables (x and y) and is known as a parametric correlation test because it depends on the distribution of the data.
2. Non-Parametric Correlation – Kendall (tau) and Spearman (rho): These are rank-based correlation coefficients and are known as non-parametric correlations.
Pearson Correlation Coefficient Formula
Pearson correlation is a parametric correlation. The Pearson correlation coefficient is probably the most widely used measure of the linear relationship between two normally distributed variables and is thus often just called the "correlation coefficient". It is computed as:

r = Σ (xᵢ − x̄)(yᵢ − ȳ) / √( Σ (xᵢ − x̄)² · Σ (yᵢ − ȳ)² )

Note:
 r takes a value between -1 (negative correlation) and +1 (positive correlation).
 r = 0 means no correlation.
 It cannot be applied to ordinal variables.
 The sample size should be moderate (20-30) for a good estimate.
 Outliers can lead to misleading values; the measure is not robust to outliers.

Implementation in R
R provides two functions for calculating the Pearson correlation coefficient: cor() and cor.test(). Note that cor() computes only the correlation coefficient, whereas cor.test() computes a test for association or correlation between paired samples; it returns both the correlation coefficient and the significance level (or p-value) of the correlation.

Syntax: cor(x, y, method = "pearson")
cor.test(x, y, method = "pearson")

Parameters:
 x, y: numeric vectors with the same length
 method: correlation method

Correlation Coefficient Test In R Using cor() method

# R program to illustrate
# Pearson correlation testing
# Using cor()

# Taking two numeric vectors with the same length
x = c(1, 2, 3, 4, 5, 6, 7)
y = c(1, 3, 6, 2, 7, 4, 5)

# Calculating the correlation coefficient using cor()
result = cor(x, y, method = "pearson")

# Print the result
cat("Pearson correlation coefficient is:", result)
Output:
Pearson correlation coefficient is: 0.5357143
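The same value can be reproduced directly from the formula above, which is a useful sanity check:

x <- c(1, 2, 3, 4, 5, 6, 7)
y <- c(1, 3, 6, 2, 7, 4, 5)

# Numerator: sum of the products of deviations from the means
num <- sum((x - mean(x)) * (y - mean(y)))

# Denominator: square root of the product of the sums of squared deviations
den <- sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

num / den   # 0.5357143, matching cor()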

Correlation Coefficient Test In R Using cor.test() method

# R program to illustrate
# Pearson correlation testing
# Using cor.test()

# Taking two numeric vectors with the same length
x = c(1, 2, 3, 4, 5, 6, 7)
y = c(1, 3, 6, 2, 7, 4, 5)

# Calculating the correlation coefficient using cor.test()
result = cor.test(x, y, method = "pearson")

# Print the result
print(result)
Output:
Pearson's product-moment correlation

data: x and y
t = 1.4186, df = 5, p-value = 0.2152
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.3643187 0.9183058
sample estimates:
cor
0.5357143
In the output above:
 t is the value of the test statistic (t = 1.4186).
 p-value is the significance level of the test statistic (p-value = 0.2152).
 alternative hypothesis is a character string describing the alternative hypothesis (true correlation is not equal to 0).
 sample estimates is the correlation coefficient; for the Pearson correlation coefficient it is named cor (cor = 0.5357).
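The object returned by cor.test() is a list, so individual pieces of the output can also be extracted programmatically:

# Extract individual components from the test result
result$estimate    # correlation coefficient (cor = 0.5357143)
result$p.value     # significance level (0.2152)
result$conf.int    # 95 percent confidence interval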
