
JAWAHARLAL COLLEGE OF ENGINEERING AND TECHNOLOGY

(Approved by AICTE, Affiliated to APJ Abdul Kalam Technological University, Kerala)
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
(NBA Accredited)

COURSE MATERIAL

CST 322 DATA ANALYTICS

VISION OF THE INSTITUTION


Emerge as a centre of excellence for professional education to produce high quality engineers
and entrepreneurs for the development of the region and the Nation.

MISSION OF THE INSTITUTION

 To become an ultimate destination for acquiring latest and advanced knowledge in the
multidisciplinary domains.
 To provide high quality education in engineering and technology through innovative
teaching-learning practices, research and consultancy, embedded with professional
ethics.
 To promote intellectual curiosity and thirst for acquiring knowledge through outcome
based education.
 To have partnership with industry and reputed institutions to enhance the
employability skills of the students and pedagogical pursuits.
 To leverage technologies to solve the real life societal problems through community
services.

ABOUT THE DEPARTMENT

 Established in: 2008

 Courses offered: B.Tech in Computer Science and Engineering

 Affiliated to the A P J Abdul Kalam Technological University.

DEPARTMENT VISION
To produce competent professionals with research and innovative skills, by providing them
with the most conducive environment for quality academic and research oriented
undergraduate education along with moral values committed to build a vibrant nation.

DEPARTMENT MISSION

 Provide a learning environment to develop creativity and problem solving skills in a professional manner.
 Expose to latest technologies and tools used in the field of computer science.
 Provide a platform to explore the industries to understand the work culture and
expectation of an organization.
 Enhance Industry Institute Interaction program to develop the entrepreneurship skills.
 Develop research interest among students which will impart a better life for the
society and the nation.

PROGRAMME EDUCATIONAL OBJECTIVES


Graduates will be able to

 Provide high-quality knowledge in computer science and engineering required for a computer professional to identify and solve problems in various application domains.
 Persist with the ability in innovative ideas in computer support systems and transmit
the knowledge and skills for research and advanced learning.
 Manifest the motivational capabilities, and turn on a social and economic commitment
to community services.
PROGRAM OUTCOMES (POS)

Engineering Graduates will be able to:

1. Engineering knowledge: Apply the knowledge of mathematics, science, engineering fundamentals, and an engineering specialization to the solution of complex engineering problems.
2. Problem analysis: Identify, formulate, review research literature, and analyze complex
engineering problems reaching substantiated conclusions using first principles of mathematics,
natural sciences, and engineering sciences.
3. Design/development of solutions: Design solutions for complex engineering problems and
design system components or processes that meet the specified needs with appropriate
consideration for the public health and safety, and the cultural, societal, and environmental
considerations.
4. Conduct investigations of complex problems: Use research-based knowledge and research
methods including design of experiments, analysis and interpretation of data, and synthesis of the
information to provide valid conclusions.
5. Modern tool usage: Create, select, and apply appropriate techniques, resources, and modern
engineering and IT tools including prediction and modeling to complex engineering activities with
an understanding of the limitations.
6. The engineer and society: Apply reasoning informed by the contextual knowledge to assess
societal, health, safety, legal and cultural issues and the consequent responsibilities relevant to the
professional engineering practice.
7. Environment and sustainability: Understand the impact of the professional engineering
solutions in societal and environmental contexts, and demonstrate the knowledge of, and need for
sustainable development.
8. Ethics: Apply ethical principles and commit to professional ethics and responsibilities and
norms of the engineering practice.
9. Individual and team work: Function effectively as an individual, and as a member or leader
in diverse teams, and in multidisciplinary settings.
10. Communication: Communicate effectively on complex engineering activities with the
engineering community and with society at large, such as, being able to comprehend and write
effective reports and design documentation, make effective presentations, and give and receive
clear instructions.
11. Project management and finance: Demonstrate knowledge and understanding of the
engineering and management principles and apply these to one’s own work, as a member and leader
in a team, to manage projects and in multidisciplinary environments.
12. Life-long learning: Recognize the need for, and have the preparation and ability to engage in
independent and life-long learning in the broadest context of technological change.
COURSE OUTCOMES

SUBJECT CODE: C210


COURSE OUTCOMES
C312B.1 To identify the basic structure and functional units of a digital computer, and analyze the effect of addressing modes on the execution time of a program.
C312B.2 To design processing unit using the concepts of ALU and control logic design.
C312B.3 To select appropriate interfacing standards for I/O devices.
C312B.4 To identify the pros and cons of different types of Memory systems and
understand mapping functions.
C312B.5 To select appropriate interfacing standards for I/O devices.
C312B.6 To identify the roles of various functional units of a computer in instruction execution, and analyze the types of control logic design in processors.

PROGRAM SPECIFIC OUTCOMES (PSO)

The students will be able to

 Use fundamental knowledge of mathematics to solve problems using suitable analysis methods, data structures and algorithms.
 Interpret the basic concepts and methods of computer systems and technical specifications to provide accurate solutions.
 Apply theoretical and practical proficiency with a wide area of programming knowledge, and design new ideas and innovations towards research.

CO PO MAPPING

Note: H-Highly correlated = 3, M-Medium correlated = 2, L-Less correlated = 1

CO’S PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12
C312B.1 3 2 2 - - - - - - - 2
C312B.2 3 3 - - 2 - - - - - - 2
C312B.3 3 2 - - 2 - - - - - - 2
C312B.4 3 2 2 - - - - - - - 2
C312B.5 3 3 3 - - - - - - - - -
C312B.6 3 2 - - 2 - - - - - - 2
C312 3 2.3 2.3 2 2
CST 322 DATA ANALYTICS
Category: PEC | L-T-P: 2-1-0 | Credits: 3 | Year of Introduction: 2019

Preamble:
This course helps the learner to understand the basic concepts of data analytics. This course covers
mathematics for data analytics, predictive and descriptive analytics of data, Big data and its
applications, techniques for managing big data and data analysis & visualization using R
programming tool. It enables the learners to perform data analysis on a real world scenario using
appropriate tools.

Prerequisite: NIL

Course Outcomes: After the completion of the course the student will be able to

CO# Course Outcomes

CO1 Illustrate the mathematical concepts for data analytics (Cognitive Knowledge
Level: Apply)

CO2 Explain the basic concepts of data analytics (Cognitive Knowledge Level:
Understand)

CO3 Illustrate various predictive and descriptive analytics algorithms (Cognitive


Knowledge Level: Apply)

CO4 Describe the key concepts and applications of Big Data Analytics (Cognitive
Knowledge Level: Understand)

CO5 Demonstrate the usage of Map Reduce paradigm for Big Data Analytics
(Cognitive Knowledge Level: Apply)

CO6 Use R programming tool to perform data analysis and visualization (Cognitive
Knowledge Level: Apply)


Mapping of course outcomes with program outcomes

PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12

CO1

CO2

CO3

CO4

CO5

CO6

Abstract POs Defined by National Board of Accreditation

PO# Broad PO PO# Broad PO

PO1 Engineering Knowledge PO7 Environment and Sustainability

PO2 Problem Analysis PO8 Ethics

PO3 Design/Development of solutions PO9 Individual and team work

PO4 Conduct investigations of complex problems PO10 Communication

PO5 Modern tool usage PO11 Project Management and Finance

PO6 The Engineer and Society PO12 Lifelong learning


Assessment Pattern

Bloom's Category    Test 1 (%)    Test 2 (%)    End Semester Examination Marks (%)
Remember            30            30            30
Understand          40            40            40
Apply               30            30            30

Mark Distribution

Total Marks    CIE Marks    ESE Marks    ESE Duration
150            50           100          3 hours

Continuous Internal Evaluation Pattern:


Attendance 10 marks
Continuous Assessment Tests (Average of Series Tests 1& 2) 25 marks
Continuous Assessment Assignment 15 marks

Internal Examination Pattern:


Each of the two internal examinations has to be conducted out of 50 marks. The first series test
shall be preferably conducted after completing the first half of the syllabus and the second series
test shall be preferably conducted after completing the remaining part of the syllabus. There will
be two parts: Part A and Part B. Part A contains 5 questions (preferably, 2 questions each from the
completed modules and 1 question from the partly completed module), having 3 marks for each
question adding up to 15 marks for part A. Students should answer all questions from Part A. Part
B contains 7 questions (preferably, 3 questions each from the completed modules and 1 question


from the partly completed module), each with 7 marks. Out of the 7 questions, a student should
answer any 5.

End Semester Examination Pattern:

There will be two parts; Part A and Part B. Part A contains 10 questions with 2 questions from each
module, having 3 marks for each question. Students should answer all questions. Part B contains 2
full questions from each module of which students should answer any one. Each question can have
a maximum 2 sub-divisions and carries 14 marks.

Syllabus

Module – 1 (Mathematics for Data Analytics)

Descriptive statistics - Measures of central tendency and dispersion, Association of two variables -
Discrete variables, Ordinal and Continuous variable, Probability calculus - probability distributions,
Inductive statistics - Point estimation, Interval estimation, Hypothesis Testing - Basic definitions, t-
test
Module - 2 (Introduction to Data Analytics)

Introduction to Data Analysis - Analytics, Analytics Process Model, Analytical Model Requirements. Data Analytics Life Cycle overview. Basics of data collection, sampling, preprocessing and dimensionality reduction
Module - 3 (Predictive and Descriptive Analytics)

Supervised Learning - Classification, Naive Bayes, KNN, Linear Regression. Unsupervised Learning - Clustering, Hierarchical algorithms – Agglomerative algorithm, Partitional algorithms - K-Means. Association Rule Mining - Apriori algorithm
Module - 4 (Big Data Analytics)

Big Data Overview – State of the practice in analytics, Example Applications - Credit Risk Modeling, Business Process Analytics. Big Data Analytics using MapReduce and Apache Hadoop, Developing and Executing a Hadoop MapReduce Program.
Module - 5 (R programming for Data Analysis)

Overview of modern data analytic tools. Data Analysis Using R - Introduction to R - R Graphical User Interfaces, Data Import and Export, Attribute and Data Types, Descriptive Statistics, Exploratory Data Analysis - Visualization Before Analysis, Dirty Data, Visualizing a Single Variable, Examining Multiple Variables, Data Exploration Versus Presentation, Statistical Methods for Evaluation


Text Book

1. Bart Baesens, “Analytics in a Big Data World: The Essential Guide to Data Science and its Business Intelligence and Analytic Trends”, John Wiley & Sons, 2013.
2. David Dietrich, EMC Education Services, “Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data”, John Wiley & Sons, 2015.
3. Jiawei Han, Micheline Kamber, “Data Mining: Concepts and Techniques”, Elsevier, 2006.
4. Christian Heumann and Michael Schomaker, “Introduction to Statistics and Data Analysis”, Springer, 2016.

References
1. Margaret H. Dunham, Data Mining: Introductory and Advanced Topics. Pearson, 2012.
2. Michael Berthold, David J. Hand, Intelligent Data Analysis, Springer, 2007.

Course Level Assessment Questions

Course Outcome 1 (CO1):


1. Explain the measures of central tendency.
2. Derive the mean and variance of the normal distribution.
3. Collect sample data associated with a real world scenario, and identify central tendency
and dispersion measures. Explain your inferences.

Course Outcome 2 (CO2):

1. Explain the life cycle of Data Analytics.


2. Discuss in detail the relevance of data sampling.

Course Outcome 3 (CO3):


1. The following table shows the midterm and final exam marks obtained for students in a
database course.

X (Midterm exam) Y (Final exam)

72 84

50 63


81 77

74 78

94 90

86 75

59 49

83 79

65 77

33 52

88 74

81 90

a) Use the method of least squares to find an equation for the prediction of a
student’s final exam marks based on the student’s midterm grade in the
course.
b) Predict the final exam marks of a student who received an 86 on the
midterm exam.

2. Perform knn classification on the following dataset and predict the class for the data
point X (P1 = 3, P2 =7), assuming the value of k as 3.

P1 P2 Class

7 7 False

7 4 False

3 4 True

1 4 True

Course Outcome 4 (CO4):


1. List down the characteristics of Big Data.
2. Illustrate process discovery task in business analytics using the scenario of
insurance claim handling process. Draw the annotated process map.


Course Outcome 5 (CO5):

1. Explain how fault tolerance is achieved in HDFS.


2. Write down the pseudocode for Map and Reduce functions to solve any one data
analytic problem.

Course Outcome 6 (CO6):

1. Illustrate any three R functions used in data analytics.


2. Explain the different categories of attributes and data types in R.

Model Question Paper


QP CODE:
Reg No:______________
Name :______________ PAGES : 4
APJ ABDUL KALAM TECHNOLOGICAL UNIVERSITY
SIXTH SEMESTER B.TECH DEGREE EXAMINATION, MONTH & YEAR
Course Code: CST 322
Course Name: Data Analytics
Max.Marks :100 Duration: 3 Hrs
PART A
(Answer all Questions. Each question carries 3 Marks)

1. Outline the errors that arise in hypothesis testing.

2. The number of members of a millionaires’ club were as follows:

Year 2011 2012 2013 2014 2015 2016

Members 23 24 27 25 30 28

(a)What is the average growth rate of the membership?


(b)Based on the results of (a), how many members would one expect in 2018?


3. List and explain any two methods for dealing with missing values in a dataset.

4. Consider the following data (in increasing order) for the attribute age: 13, 15, 16, 16, 19, 20,
20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70. Sketch an
example for stratified sampling using samples of size 5 and the strata “youth,” “middle-aged,”
and “senior.”

5. Why is k nearest neighbor classifier called a lazy learner?

6. Find the absolute support, relative support and confidence of the rule (bread => jam) in the
following set of transactions
T1 {bread, butter}, T2{bread, jam, milk}
T3{Milk, curd}, T4{bread, jam}

7. Explain the 3 Vs of Big Data.

8. Discuss the application of big data analytics in credit risk modeling.

9. Why is Exploratory Data Analysis important in business application ?

10. Explain how box plots can be used for data summarization.


(10x3=30)

Part B
(Answer any one question from each module. Each question carries 14 Marks)

11. (a) Illustrate the Maximum Likelihood Estimation of Bernoulli distribution.


(8)

(b) A hiking enthusiast has a new app for his smartphone which summarizes his hikes by using a GPS device. Let us look at the distance hiked (in km) and maximum altitude (in m) for the last 10 hikes: (6)
Distance 12.5 29.9 14.8 18.7 7.6 16.2 16.5 27.4 12.1 17.5

Altitude 342 1245 502 555 398 670 796 912 238 466

Calculate the arithmetic mean and median for both distance and altitude.

OR


12. (a) Explain the steps in conducting a hypothesis test.


(8)

(b) A total of 150 customers of a petrol station are asked about their satisfaction with their car and motorbike insurance. The results are summarized below. Determine and interpret Pearson’s χ2 statistic and Cramer’s V. (6)
Satisfied Unsatisfied Total

Car 33 25 58
Car (Diesel engine) 29 31 60
Motor bike 12 20 32

Total 74 76 150

13. (a) Explain the data analytical process model. (8)

(b) Discuss the methods for handling noisy data. Consider the following sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34. Illustrate smoothing by bin means and bin boundaries. (6)

OR

14. (a) What is the need for sampling in data analytics? Discuss the different sampling techniques. (8)

(b) Use the following methods to normalize the group of data: 200, 300, 400, 600, 1000 (6)
(i) min-max normalization by setting min = 0 and max = 1
(ii) z-score normalization
(iii) normalization by decimal scaling

15. A database has five transactions. Let min_sup be 60% and min_conf be 80%.
TID items_bought

T100 {M, O, N, K, E, Y}

T200 {D, O, N, K, E, Y}

T300 {M, A, K, E}

T400 {M, U, C, K, Y}

T500 {C, O, O, K, I, E}


(a) Find all frequent itemsets using Apriori algorithm (10)

(b) Generate strong association rules from any one 3-itemset. (4)

OR

16. (a) Explain agglomerative hierarchical clustering with an example. (8)

(b) Suppose that the data mining task is to cluster points (with (x, y) representing location) into three clusters, where the points are A1(2,10), A2(2,5), A3(8,4), B1(5,8), B2(7,5), B3(6,4), C1(1,2), C2(4,9). The distance function is Euclidean distance. Suppose initially we assign A1, B1, and C1 as the center of each cluster, respectively. Use the k-means algorithm to show only (6)
(a) The three cluster centers after the first round of execution.
(b) The final three clusters.

17. (a) Illustrate the working of a Map Reduce program with example.
(8)

(b) Explain the data analytic architecture with a diagram. (6)

OR

18. (a) Discuss the architecture of HDFS and its features. (8)

(b) Illustrate the use of big data analytics in credit risk modeling. (6)

19. (a) List and explain the R functions used in descriptive statistics. (8)

(b) Explain hypothesis testing using ANOVA. (6)

OR

20. (a) Discuss the data visualization for multiple variables in R (8)

(b) Describe the R functions used for cleaning dirty data. (6)
(5 x 14 = 70)


Teaching Plan

No    Contents    No. of Lecture Hrs

Module – 1(Mathematics for Data Analytics ) (7 hrs)

1.1 Descriptive statistics - Measures of central tendency 1

1.2 Measures of dispersion 1

1.3 Association of two variables - Discrete Variables 1

1.4 Association of two variables - Ordinal and Continuous variable 1

1.5 Probability calculus - Probability distributions 1

1.6 Inductive statistics - Point estimation, Interval estimation 1

1.7 Hypothesis Testing - Basic definitions, t-test 1

Module – 2 (Introduction to Data Analytics) (6 hrs)

2.1 Introduction to Data Analysis –Analytics, Analytics process model 1

2.2 Analytical model requirements 1

2.3 Data Analytics Life Cycle overview 1

2.4 Basics of data collection 1

2.5 Basics of sampling and preprocessing 1

2.6 Dimensionality reduction 1

Module - 3 (Predictive and Descriptive Analytics) (8 hrs)

3.1 Supervised Learning, Naive Bayes classification 1

3.2 KNN algorithm 1


3.3 Linear Regression 1

3.4 Unsupervised Learning- Clustering 1

3.5 Hierarchical algorithms Agglomerative algorithm 1

3.6 Partitional algorithms -K- Means 1

3.7 Association Rule Mining 1

3.8 Apriori algorithm 1

Module - 4 (Big Data Analytics) (7 hrs)

4.1 Big Data Overview – State of the practice in analytics. 1

4.2 Example Applications - Credit Risk Modeling 1

4.3 Business Process Analytics. 1

4.4 Big Data Analytics using Map Reduce and Apache Hadoop 1

4.5 Big Data Analytics using Map Reduce and Apache Hadoop 1

4.6 Developing and Executing a Hadoop MapReduce Program 1

4.7 Developing and Executing a Hadoop MapReduce Program 1

Module - 5 (R programming for Data Analysis) (8 hrs)

5.1 Overview of modern data analytic tools, Introduction to R, R Graphical User Interfaces 1
5.2 Data Import and Export, Attribute and Data Types 1


5.3 Descriptive Statistics 1

5.4 Exploratory Data Analysis, Visualization Before Analysis 1

5.5 Dirty Data, Visualizing a Single Variable 1

5.6 Examining Multiple Variables 1

5.7 Data Exploration Versus Presentation 1

5.8 Statistical Methods for Evaluation 1

MODULE 1 – MATHEMATICS FOR DATA ANALYTICS
Descriptive statistics
Descriptive statistics are used to describe or summarize the characteristics of a sample or data set, such as a variable's mean, standard deviation, or frequency. Inferential statistics, in contrast, use such sample summaries to draw conclusions about the larger population from which the sample was taken.

• Measures of Frequency: count, percent, frequency.
• Measures of Central Tendency: mean, median, and mode.
• Measures of Dispersion or Variation: range, variance, standard deviation.
• Measures of Position: percentile ranks, quartile ranks.

Measures of central tendency and dispersion are common descriptive measures for summarising
numerical data.

Measures of central tendency are measures of the location of the middle or the center of a distribution.
The most frequently used measures of central tendency are the mean, median and mode.
A measure of dispersion is a numerical value describing the amount of variability present in a
data set.

The standard deviation (SD) is the most commonly used measure of dispersion; it measures the scatter of the values about their mean.

The range can also be used to describe the variability in a set of data and is defined as the difference between the maximum and minimum values. Because it depends only on the two extreme observations, the range is sensitive to outliers; for skewed distributions it is usually reported together with the median, or replaced by the more robust interquartile range.
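As a brief illustration (a minimal R sketch with made-up marks, not data from this course), the usual measures of central tendency and dispersion can be obtained with base R functions:

# Hypothetical sample of exam marks (illustrative values only)
marks <- c(33, 50, 59, 65, 72, 74, 81, 83, 86, 94)

mean(marks)          # arithmetic mean (central tendency)
median(marks)        # median (central tendency)
var(marks)           # sample variance (dispersion)
sd(marks)            # standard deviation (dispersion)
diff(range(marks))   # range = maximum - minimum
IQR(marks)           # interquartile range, a more robust measure of spread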

Association between two variables means that the values of one variable relate in some way to the values of the other. Tools for examining association include:
• Cross tabulations
• Scatter grams (scatter plots)

Correlation
There are several types of correlation measures that can be applied to different measurement scales of a variable (i.e. nominal, ordinal, or interval). One of these, the Pearson product-moment correlation coefficient, is based on interval-level data and on the concept of deviation from the mean for each of the variables. A related statistic, the covariance, is the sum of the products of the deviations of the observed values from their respective means, divided by the number of observations.
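For example, covariance and the Pearson correlation coefficient can be computed in R as follows (a minimal sketch with hypothetical population and crime-rate figures; note that R's cov() and cor() use the sample denominator n - 1):

# Hypothetical paired observations
population <- c(12, 25, 31, 48, 60)        # population size (thousands)
crime_rate <- c(2.1, 3.0, 3.4, 4.8, 5.9)   # offences per 1,000 inhabitants

cov(population, crime_rate)   # covariance: the sign indicates the direction of association
cor(population, crime_rate)   # Pearson product-moment correlation, between -1 and +1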
Regression
If the correlation between two variables is found to be significant and there is reason to suspect that one variable influences the other, then one might decide to calculate a regression line for the two variables. For example, if population size and crime rate are correlated, one might state that an increase in population results in an increase in the crime rate; the crime rate would then be considered the dependent variable and the population size the independent variable.
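A regression line for such a pair of variables can be fitted in R with lm(); the sketch below reuses the hypothetical vectors from the previous example:

# Crime rate as the dependent variable, population size as the independent variable
fit <- lm(crime_rate ~ population)
summary(fit)                                          # slope, intercept, R-squared, p-values
predict(fit, newdata = data.frame(population = 40))   # predicted crime rate for a population of 40 (thousand)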
PROBABILITY CALCULUS
From probability calculus we know that for two events A and B, the probability of B given A is
obtained by dividing the joint by the marginal: p(B∣A) = p(A and B)/p(A).

There are three major types of probabilities:

• Theoretical Probability.
• Experimental Probability.
• Axiomatic Probability.

A probability distribution is the mathematical function that gives the probabilities of occurrence of
different possible outcomes for an experiment. It is a mathematical description of a random
phenomenon in terms of its sample space and the probabilities of events (subsets of the sample space).
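A small numeric illustration (with hypothetical probabilities): if p(A) = 0.4 and p(A and B) = 0.1, then p(B|A) = 0.1 / 0.4 = 0.25. The same calculation, together with two standard probability distribution functions available in R, is sketched below:

p_A         <- 0.4               # hypothetical marginal probability of A
p_A_and_B   <- 0.1               # hypothetical joint probability of A and B
p_B_given_A <- p_A_and_B / p_A   # conditional probability p(B|A) = 0.25

pnorm(1.96)                       # P(Z <= 1.96) for a standard normal, about 0.975
dbinom(2, size = 10, prob = 0.3)  # P(X = 2) for a Binomial(n = 10, p = 0.3)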
Inductive statistics (also known as statistical inference or inferential statistics) is the branch of statistics concerned with the conditions under which conclusions about populations can be drawn from the analysis of particular samples. It is the logical process of drawing general conclusions - generalizations, predictions, estimations and decisions about a population - from data sampled from that population.
POINT ESTIMATE AND AN INTERVAL ESTIMATE

A point estimate is a single value estimate of a parameter. For instance, a sample mean is a
point estimate of a population mean.

An interval estimate gives you a range of values where the parameter is expected to lie. A
confidence interval is the most common type of interval estimate.

The main difference between point and interval estimation lies in the values that are used. Point estimation uses a single value (a statistic such as the sample mean), while interval estimation uses a range of values to infer information about the population.
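As an R sketch (reusing the hiking distances from question 11(b) of the model question paper earlier in this material as a convenient sample), the sample mean is a point estimate and t.test() returns the corresponding confidence interval as an interval estimate:

# Distances hiked (km), treated here as a sample
x <- c(12.5, 29.9, 14.8, 18.7, 7.6, 16.2, 16.5, 27.4, 12.1, 17.5)

mean(x)                                  # point estimate of the population mean
t.test(x)$conf.int                       # 95% confidence interval (interval estimate)
t.test(x, conf.level = 0.99)$conf.int    # wider 99% confidence interval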
Hypothesis Testing

Hypothesis testing is an act in statistics whereby an analyst tests an assumption regarding a population
parameter. The methodology employed by the analyst depends on the nature of the data used and the
reason for the analysis

4 Steps of Hypothesis Testing

All hypotheses are tested using a four-step process:

1. The first step is for the analyst to state the two hypotheses so that only one can be
right.
2. The next step is to formulate an analysis plan, which outlines how the data will be
evaluated.
3. The third step is to carry out the plan and physically analyse the sample data.
4. The fourth and final step is to analyse the results and either reject the null hypothesis,
or state that the null hypothesis is plausible, given the data.
STEPS FOR CONSTRUCTING A HYPOTHESIS

• The first step before constructing a hypothesis is a thorough review of existing


literature on the topic of research.
• After the literature review, identify gaps in the literature. Then narrow down the
research problem to fulfil the gap.
• The research problem needs to be stated in terms of research objectives or research
questions.
• Following the research question, identify the dependent and the independent
variables.
• Frame statements or hypotheses that reflect a prediction and are testable.
• The results of hypothesis testing directly help to answer the research questions and
draw conclusions for the study.

T-TEST
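A minimal R sketch of the t-test (using the same hypothetical hiking distances as above and an assumed null value of 15 km; the second group y is made up for illustration):

x <- c(12.5, 29.9, 14.8, 18.7, 7.6, 16.2, 16.5, 27.4, 12.1, 17.5)

# One-sample t-test: H0: the mean distance equals 15 km
t.test(x, mu = 15)

# Two-sample (Welch) t-test comparing x with a second, hypothetical group y
y <- c(10.2, 13.4, 9.8, 15.1, 11.7, 12.9)
t.test(x, y, var.equal = FALSE)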
MODULE 2 – INTRODUCTION TO DATA ANALYTICS
Introduction to Data Analysis - Analytics, Analytics Process Model, Analytical Model Requirements, Data Analytics Life Cycle overview, Basics of data collection, sampling, preprocessing and dimensionality reduction.

Data Analytics
Data analytics is the science of analysing raw data sets in order to derive conclusions about the information they hold. It enables us to discover patterns in the raw data and draw valuable information from it. These findings are interpreted and used to help organisations understand their clients better, analyse their promotional campaigns, customize content, create content strategies and develop products. Data analytics helps organisations maximize market efficiency and improve their earnings.

Process of Data Analytics
Below are the common steps involved in the data analytics method.

STEP 1: Determine the criteria for grouping the data. Data can be divided by a range of different criteria such as age, population, income, or sex. The data values can be numerical or categorical.
STEP 2: Collecting the data. Data can be collected through several sources, including online sources, computers, personnel, and sources from the community.
STEP 3: Organizing the data. The data must be organized after it is collected so that it can be examined. Organization can take place on a spreadsheet or other type of software that is capable of handling statistical data.
STEP 4: Cleaning the data. The data is first cleaned up to ensure that there is no overlap or mistake. Then it is reviewed to make sure that it is not incomplete. Cleaning the data helps to fix or eliminate any mistakes before the data goes to a data analyst for analysis.

Types of Data Analytics

1. Descriptive Analytics: describes what has happened over time, such as whether the number of views increased or decreased, and whether the current month's sales are better than the last month's.
2. Diagnostic Analytics: focuses on the reason or cause of an event. It requires more diverse data inputs and involves examining the data to answer questions such as: did the weather impact the sales of beer, or did the latest marketing campaign affect sales?
3. Predictive Analytics: focuses on the events that are expected to occur in the immediate future. Predictive analytics tries to find answers to questions like: what happened to sales during the last hot summer, and how many weather models expect a hot summer this year?
4. Prescriptive Analytics: indicates a plan of action. For example, if the chance of a hot summer, calculated as an average of the five weather models, is above 58%, an evening shift can be added to the brewery and an additional tank can be rented to maximize production.

Benefits of Data Analytics

1. Decision making improves: companies can use the insights they obtain from data analytics to guide their decisions, leading to improved results.
2. Marketing becomes more effective: when businesses understand their customers better, they are able to target them more efficiently.
3. Customer service improves: data analytics provides businesses with deeper insight into their clients, helping them to customize customer experiences to their needs, offer more customization and create better relationships with them.
4. The efficiency of operations increases: data analytics helps businesses streamline their operations, save resources and improve the bottom line.
Overview: Analytics Process Model
The analytics process model runs from identifying the business problem, through identifying, selecting, cleaning and transforming the data (pre-processing), to analysing the data and then interpreting, evaluating and deploying the model (post-processing).

Step 1: A thorough definition of the business problem to be addressed is needed, e.g. retention modelling for a postpaid telco subscription or fraud detection for credit cards. Defining the perimeter of the analytical modelling exercise requires close collaboration between the data scientist and the business expert. Both parties need to agree on a set of key concepts; these may include how we define a customer, a transaction, churn or fraud.
Step 2: Next, all source data that could be of potential interest needs to be identified. The golden rule here is: the more data, the better. The analytical model itself will later decide which data are relevant and which are not for the task at hand.
Step 3: An analytical model is estimated on the pre-processed and transformed data. Depending on the business objective and the exact task at hand, a particular analytical technique will be selected and implemented by the data scientist.
Step 4: Once the results are obtained, they will be interpreted and evaluated by the business expert. Results may be clusters, rules, patterns or relations, among others, all of which are called the analytical model resulting from applying analytics. Trivial patterns that may be detected by the analytical model are interesting as they help to validate the model, but the key issue is to find the unknown yet interesting and actionable patterns that can provide new insights into your data, which can then be translated into new profit opportunities.
Step 5: Once the analytical model has been appropriately validated and approved, it can be put into production as an analytics application.
Data Analytics Lifecycle
The Data Analytics Lifecycle is designed for Big Data problems and data science projects. The lifecycle has six phases, and project work can occur in several phases at once. For most phases in the lifecycle, the movement can be either forward or backward.
Key Roles for a Successful Analytics Project

1. Business User: someone who understands the domain and benefits from the results. This person can consult and advise the project team on the context of the project, the value of the results, and how the outputs will be operationalized.
2. Project Sponsor: responsible for the genesis of the project; provides the impetus and requirements for the project and defines the core business problem. Generally provides the funding and gauges the degree of value from the final outputs of the working team.
3. Project Manager: ensures that key milestones and objectives are met on time and at the expected quality.
4. Business Intelligence Analyst: provides business domain expertise based on a deep understanding of the data, key performance indicators (KPIs), key metrics, and business intelligence from a reporting perspective. Business Intelligence Analysts generally create dashboards and reports and have knowledge of the data feeds and sources.
5. Database Administrator (DBA): provisions and configures the database environment to support the analytics needs of the working team. These responsibilities may include providing access to key databases or tables and ensuring the appropriate security levels are in place related to the data repositories.
6. Data Engineer: leverages deep technical skills to assist with tuning SQL queries for data management and data extraction, and provides support for data ingestion into the analytic sandbox.
7. Data Scientist: provides subject matter expertise for analytical techniques, data modelling, and applying valid analytical techniques to given business problems; ensures that the overall analytics objectives are met.
Data Analytics Lifecycle
[Figure: the six phases of the Data Analytics Lifecycle - Discovery, Data Preparation, Model Planning, Model Building, Communicate Results and Operationalize - together with the decision questions that move the project between phases, such as "Do I have enough information to draft an analytic plan?", "Do I have enough good quality data to start building the model?", "Do I have a good idea about the type of model to try?" and "Is the model robust enough? Have we failed for sure?"]
Phase 1 - Discovery: The team learns the business domain, including relevant history, such as whether the organization or business unit has attempted similar projects in the past from which it can learn. The team assesses the resources available to support the project in terms of people, technology, time and data.
Phase 2 - Data Preparation: Requires the presence of an analytic sandbox, in which the team can work with data and perform analytics for the duration of the project. The team needs to execute extract, load and transform (ELT) or extract, transform and load (ETL) to get the data into the sandbox.
Phase 3 - Model Planning: The team determines the methods, techniques and workflow it intends to follow for the subsequent model building phase. The team explores the data to learn about the relationships between variables and subsequently selects the key variables and the most suitable models.
Phase 4 - Model Building: The team develops datasets for testing, training and production purposes. In addition, in this phase the team builds and executes models based on the work done in the model planning phase.
Phase 5 - Communicate Results: The team, in collaboration with major stakeholders, determines if the results of the project are a success or a failure based on the criteria developed in Phase 1.
Phase 6 - Operationalize: The team delivers final reports, briefings, code and technical documents. In addition, the team may run a pilot project to implement the models in a production environment.

Analytical Model Requirements
A good analytical model should satisfy several requirements, depending on the application area:
1. Business relevance: the analytical model should solve the business problem for which it was developed. The business problem to be solved should be appropriately defined, qualified and agreed upon by all parties involved.
2. Statistical performance: the model should have statistical significance and predictive power.
3. The analytical model should be interpretable and justifiable.
4. The model should be operationally efficient.
5. The model should be economical: the cost of developing, running and maintaining it should be justified.
6. The model should comply with regulation and legislation.
Basics of Data Collection, Sampling and Preprocessing

Types of Data Sources
1. Transactions: transactional data consists of structured, low-level, detailed information capturing the key characteristics of a customer transaction. This type of data is usually stored in massive online transaction processing (OLTP) relational databases.
2. Text documents: text documents or multimedia content can also be interesting to analyze. However, these sources typically require extensive preprocessing before they can be successfully included in an analytical exercise.
3. Qualitative, expert-based data: an expert is a person with a substantial amount of subject matter expertise within a particular setting. The expertise stems from both common sense and business experience, and it is important to elicit as much of it as possible before the analytics is run.
4. Data poolers: examples are Dun & Bradstreet, Bureau van Dijk and Thomson Reuters. The core business of these companies is to gather data in particular settings, build models with it, and sell the output of the models, possibly together with the underlying raw data, to interested customers.
5. Publicly available data: macroeconomic data about gross domestic product (GDP), inflation, unemployment and so on.

Sampling
The aim of sampling is to take a subset of past customer data and use that to build an analytical model. A key requirement for a good sample is that it should be representative of the future customers on which the analytical model will be run. The timing aspect therefore becomes important, because the customers of today are more similar to the customers of tomorrow than the customers of yesterday are. Choosing the optimal time window for the sample involves a trade-off between lots of data and recent data. It is also important that sampling bias is avoided as much as possible.

Example: the through-the-door (TTD) population in credit scoring. Assume one wants to build an application scorecard to score mortgage applications. The future population then consists of all customers who come to the bank and apply for a mortgage - the so-called through-the-door (TTD) population, which splits into accepts and rejects, goods and bads. One needs a subset of the historical TTD population to build the analytical model. However, the bank has already been applying a credit policy, so the historical TTD population has two subsets: the customers that were accepted with the old policy and the ones that were rejected. When building a sample, we can only make use of those that were accepted. Another possibility in credit scoring is withdrawals: these are customers who were offered credit but decided not to take it.
Types of Data Elements
1. Continuous: these are data elements that are defined on an interval, which can be limited or unlimited.
2. Categorical:
   • Nominal: these are data elements that can only take on a limited set of values, with no meaningful ordering in between, e.g. marital status, profession.
   • Ordinal: these are data elements that can only take on a limited set of values with a meaningful ordering in between, e.g. age coded as young, middle-aged and old.
   • Binary: these are data elements that can only take on two values, e.g. gender, employment status.
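In R (covered in Module 5), these data element types map naturally onto numeric vectors, factors, ordered factors and logical vectors; a minimal sketch with hypothetical values:

income   <- c(1250.5, 980.0, 1543.2)                       # continuous: numeric vector
status   <- factor(c("married", "single", "married"))       # nominal: unordered factor
age_band <- factor(c("young", "old", "middle-aged"),
                   levels = c("young", "middle-aged", "old"),
                   ordered = TRUE)                           # ordinal: ordered factor
employed <- c(TRUE, FALSE, TRUE)                             # binary: logical vector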
Missing Values
Missing values can occur for various reasons. The information can be non-applicable, or the customer may be unwilling to disclose certain data.

Schemes to deal with missing values:
1. Replace (impute): this implies replacing the missing value with a known value. One could, for example, impute missing credit bureau scores with the average or median of the known values. One could also apply regression-based imputation, whereby a regression model is estimated for the target variable based on the other information available.
2. Delete: this is the most straightforward option and consists of deleting observations or variables with lots of missing values.
3. Keep: missing values can be meaningful, e.g. a customer did not disclose his or her income because he or she is currently unemployed.
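A minimal R sketch of these schemes, using a hypothetical vector of credit bureau scores:

score <- c(610, NA, 725, 580, NA, 660)     # hypothetical scores with missing values

is.na(score)                               # locate the missing values
imputed_mean <- score
imputed_mean[is.na(imputed_mean)] <- mean(score, na.rm = TRUE)        # replace by the mean
imputed_median <- score
imputed_median[is.na(imputed_median)] <- median(score, na.rm = TRUE)  # or by the median
na.omit(score)                             # the delete option: drop the missing observations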
Outlier Detection and Treatment
Outliers are extreme observations that are very dissimilar to the rest of the population. Two types of outliers are:
1. Valid observations (e.g. a salary of 1 million)
2. Invalid observations (e.g. an age of 300 years)
Both are univariate outliers in the sense that they are outlying on one dimension.

Detection and treatment: a first obvious check for outliers is to calculate the minimum and maximum values of each of the data elements. Various graphical tools can also be used to detect outliers; histograms and box plots are two popular visual mechanisms.
[Figure: histogram of age with outlying observations, used for outlier detection]

Box plot: a box plot represents three key quartiles of the data: the first quartile (25% of the observations have a lower value), the median (50% of the observations have a lower value) and the third quartile (75% of the observations have a lower value). All three quartiles are represented as a box. The minimum and maximum values are then added, unless they are too far away from the edges of the box; "too far away" is defined as more than 1.5 times the interquartile range (IQR = Q3 − Q1).

Another option is to calculate z-scores, which measure how many standard deviations an observation lies away from the mean:
z = (x − μ) / σ
where μ is the average of the variable and σ its standard deviation. Note that by definition the z-scores have zero mean and unit standard deviation. A practical rule of thumb then flags an observation as an outlier when the absolute value of its z-score |z| is bigger than 3.

Example (with μ = 40 and σ = 10):
ID    Age    z-score
1     30     (30 − 40)/10 = −1
2     50     (50 − 40)/10 = +1
3     10     (10 − 40)/10 = −3
4     40     (40 − 40)/10 = 0
5     60     (60 − 40)/10 = +2
6     80     (80 − 40)/10 = +4
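The same checks can be sketched in R (the ages below are the ones from the example; μ = 40 and σ = 10 are taken as given, as in the text):

age <- c(30, 50, 10, 40, 60, 80)

(age - 40) / 10            # z-scores using the given mu = 40 and sigma = 10
scale(age)                 # z-scores estimated from the data itself: (x - mean(x)) / sd(x)
abs((age - 40) / 10) > 3   # rule of thumb: |z| > 3 flags an outlier
boxplot(age)               # box plot: points beyond 1.5 * IQR from the box are drawn as outliers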
Standardizing Data
Standardizing data is a preprocessing activity aimed at scaling variables to a similar range. Consider, for example, two variables: gender (coded as 0/1) and income (ranging between 0 and 1 million). When building logistic regression models using both these elements, the coefficient for income might become very small. Hence, it could make sense to bring them back to a similar scale. The following standardization procedures could be adopted:
1. Min/max standardization:
   X_new = ((X_old − min(X_old)) / (max(X_old) − min(X_old))) × (new_max − new_min) + new_min
   where new_max and new_min are the newly imposed maximum and minimum.
2. Z-score standardization: calculate the z-scores as discussed above.
3. Decimal scaling: dividing by a power of 10, X_new = X_old / 10^n, with n the number of digits of the maximum absolute value.
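A minimal R sketch of the three procedures, applied to the hypothetical income values also used in the binning example below:

income <- c(1000, 1200, 1300, 2000, 1800, 1400)

# Min/max standardization to the new range [0, 1]
(income - min(income)) / (max(income) - min(income))

# Z-score standardization
scale(income)

# Decimal scaling: divide by 10^n, n = number of digits of the maximum absolute value
income / 10^nchar(max(abs(income)))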


-olro kncwo an c laibco clarins, onoupins Camt be

don Vaniuw uaouns


Wih caegon2át" onL wouldl cveade cokguies v a l u Such
ho wen ponomlun ill have to be cohmtid and more o oU

mudlul m oblainud.

i nms VGniablos cak gori2oion má alo


be ve ben i

hu vaniable in to Tanse pa ok
mon
munDEC ra
COe g0i zins
eKa
Can be taun in b onepn actont
in a Te7unitn
no-inOn
vanicble can be uweh to rnocla
CCUE Go12aicn c onts
eCte to in to tihean modals

2 moyt ommoniy woed mehodo kx calegoni2ahon

m a l inkavad binning

equa re quimty binnins

Bqal Inkava binnins

wih ths Sam


1n nhseasaHas.
numbea s ohseaaia
H woald Creats two lbins

+3e0
Bin 4: , 0o0, t, 200,

1500, 8ee,
200e
Bin 2:

bEoma hregeo bining


+Ould cveatu hwo bins wh Sam humben e obkvaionu.
Bin 1. Jo0o, 12 00, 1300 Bio 2 400, 1 Soo, 2ooo
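In R, both schemes can be sketched with cut() (a minimal illustration with the same six income values):

income <- c(1000, 1200, 1300, 2000, 1800, 1400)

# Equal interval binning: two bins with the same width
cut(income, breaks = 2)

# Equal frequency binning: two bins with (roughly) the same number of observations,
# using the median as the cut point
cut(income, breaks = quantile(income, probs = c(0, 0.5, 1)), include.lowest = TRUE)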
A more sophisticated way to do the categorization is chi-squared analysis. Consider, for example, the following attribute (residential status) for classifying good and bad credit customers:

Attribute        Owner    Rent unfurnished    Rent furnished    With parents    Other    No answer    Total
Goods            6,000    1,600               350               950             90       10           9,000
Bads             300      400                 140               100             50       10           1,000
Good:bad odds    20:1     4:1                 2.5:1             9.5:1           1.8:1    1:1          9:1

Suppose we want three categories and consider the following options:
Option 1: Owner, Renters, Others
Option 2: Owner, With parents, Others
Both options can be investigated using chi-squared analysis, comparing the observed (empirical) frequencies with the frequencies expected under independence.

Empirical frequencies for option 1 (coarse classifying residential status):
Attribute    Owner    Renters    Others    Total
Good         6,000    1,950      1,050     9,000
Bad          300      540        160       1,000
Total        6,300    2,490      1,210     10,000

The expected (independence) frequencies are obtained by assuming that the odds in each category are the same as in the whole population. For example, the expected number of goods among Owners is 6,300 × 9,000 / 10,000 = 5,670.

Independence frequencies for option 1:
Attribute    Owner    Renters    Others    Total
Good         5,670    2,241      1,089     9,000
Bad          630      249        121       1,000
Total        6,300    2,490      1,210     10,000

The chi-squared distance between the observed and expected frequencies for option 1 is:
χ2 = (6,000 − 5,670)²/5,670 + (300 − 630)²/630 + (1,950 − 2,241)²/2,241 + (540 − 249)²/249 + (1,050 − 1,089)²/1,089 + (160 − 121)²/121 ≈ 583

Likewise for option 2 (Owner: 6,000/300; With parents: 950/100; Others: 2,050/600, with expected frequencies 5,670/630, 945/105 and 2,385/265):
χ2 = (6,000 − 5,670)²/5,670 + (300 − 630)²/630 + (950 − 945)²/945 + (100 − 105)²/105 + (2,050 − 2,385)²/2,385 + (600 − 265)²/265 ≈ 662

So, based on the chi-squared values, option 2 is the better categorization.
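The two chi-squared values can be reproduced in R with chisq.test() applied to the observed good/bad counts of each option (a minimal sketch; the expected frequencies are computed internally as row total × column total / grand total):

opt1 <- matrix(c(6000, 300,    # Owner:        goods, bads
                 1950, 540,    # Renters
                 1050, 160),   # Others
               ncol = 2, byrow = TRUE)
opt2 <- matrix(c(6000, 300,    # Owner
                  950, 100,    # With parents
                 2050, 600),   # Others
               ncol = 2, byrow = TRUE)

chisq.test(opt1)$statistic   # about 583
chisq.test(opt2)$statistic   # about 662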
CST322 – Data Analytics
Module – III
Syllabus
Module – III (Predictive and Descriptive Analytics)

Supervised Learning - Classification, Naive Bayes, KNN, Linear Regression. Unsupervised Learning - Clustering, Hierarchical algorithms – Agglomerative algorithm, Partitional algorithms - K-Means. Association Rule Mining - Apriori algorithm
Predictive Analytics
• Statistics research develops tools for prediction and forecasting using
data and statistical models
• Statistical methods can be used to summarize or describe a collection of
data.



• Statistics is useful for mining various patterns from data as well as for
understanding the underlying mechanisms generating and affecting the
patterns.
• Inferential statistics (or predictive statistics) models data in a way
that accounts for randomness and uncertainty in the observations and is
used to draw inferences about the process or population under
investigation.

Predictive Analytics
• Statistical methods can also be used to verify data mining results.
• For example, after a classification or prediction model is mined, the
model should be verified by statistical hypothesis testing.
A statistical hypothesis test (sometimes called confirmatory data analysis) makes statistical decisions using experimental data.
• A result is called statistically significant if it is unlikely to have
occurred by chance.
• If the classification or prediction model holds true, then the descriptive
statistics of the model increases the soundness of the model.

Classic problems in machine learning
• Supervised learning
• Unsupervised learning
• Semi-supervised learning



• Active learning

Supervised learning
• The supervision in the learning comes from the labeled examples in the
training data set.
• Basically, supervised learning is when we teach or train the machine using data that is well labelled, which means some data is already tagged with the correct answer.
• After that, the machine is provided with a new set of examples(data) so
that the supervised learning algorithm analyses the training data(set of
training examples) and produces a correct outcome from labelled data.
• For example, in the postal code recognition problem, a set of
handwritten postal code images and their corresponding machine-
readable translations are used as the training examples, which
supervise the learning of the classification model

Supervised learning
• For instance, suppose you are given a basket filled with different kinds
of fruits. Now the first step is to train the machine with all the different
fruits one by one like this:



🞄 If the shape of the object is rounded and has a depression at the top, is red in color,
then it will be labeled as –Apple.
🞄 If the shape of the object is a long curving cylinder having Green-Yellow color, then it
will be labeled as –Banana
Supervised learning
• Now suppose that, after training, the machine is given a new, separate fruit, say a banana, from the basket and asked to identify it.
Since the machine has already learned from the previous data, it has to use that knowledge wisely. It will first classify the fruit by its shape and color, confirm the fruit name as BANANA, and put it in the Banana category. Thus the machine learns from the training data (the basket containing fruits) and then applies that knowledge to the test data (the new fruit).

Supervised learning
• Supervised learning is classified into two categories of algorithms:
🞄 Classification: A classification problem is when the output variable is
a category, such as “Red” or “blue” , “disease” or “no disease”.
🞄 Regression: A regression problem is when the output variable is a
real value, such as “dollars” or “weight”.



• Types:-
🞄 Regression
🞄 Logistic Regression
🞄 Classification
🞄 Naive Bayes Classifiers
🞄 K-NN (k nearest neighbours)
🞄 Decision Trees
🞄 Support Vector Machine
Supervised learning
• Advantages:-
🞄 Supervised learning allows collecting data and produces data output
from previous experiences.
🞄 Helps to optimize performance criteria with the help of experience.
🞄 Supervised machine learning helps to solve various types of real-world computation problems.
• Disadvantages:-
🞄 Classifying big data can be challenging.
🞄 Training for supervised learning needs a lot of computation time. So, it
requires a lot of time.

Unsupervised learning
• Unsupervised learning is the training of a machine using information
that is neither classified nor labeled and allowing the algorithm to act
on that information without guidance.
• Here the task of the machine is to group unsorted information according to similarities, patterns, and differences without any prior training on the data.
• Unlike supervised learning, no teacher is provided that means no
training will be given to the machine.
• Therefore the machine is restricted to find the hidden structure in
unlabeled data by itself.

Unsupervised learning
• For instance, suppose it is given an image having both dogs and cats
which it has never seen.



• Thus the machine has no idea about the features of dogs and cats so we
can’t categorize it as ‘dogs and cats ‘.
• But it can categorize them according to their similarities, patterns, and
differences, i.e., we can easily categorize the above picture into two
parts.
Unsupervised learning
• The first may contain all pics having dogs in them and the second part
may contain all pics having cats in them.
• Here you didn’t learn anything before, which means no training data or
examples.



• It allows the model to work on its own to discover patterns and
information that was previously undetected.
• It mainly deals with unlabeled data.

Unsupervised learning
• Unsupervised learning is classified into two categories of algorithms:
🞄 Clustering: A clustering problem is where you want to discover the
inherent groupings in the data, such as grouping customers by
purchasing behavior.
🞄 Association: An association rule learning problem is where you want to discover rules that describe large portions of your data, such as people that buy X also tend to buy Y.

Unsupervised learning
• Types of Unsupervised Learning:-
🞄 Clustering
🞄 Exclusive (partitioning)
🞄 Agglomerative



🞄 Overlapping
🞄 Probabilistic
🞄 Clustering Types:-
🞄 Hierarchical clustering
🞄 K-means clustering
🞄 Principal Component Analysis
🞄 Singular Value Decomposition
🞄 Independent Component Analysis

Supervised vs. Unsupervised Machine Learning

Parameters                  Supervised machine learning            Unsupervised machine learning
Input Data                  Algorithms are trained using           Algorithms are used against data
                            labeled data.                          that is not labeled.
Computational Complexity    Simpler method                         Computationally complex
Accuracy                    Highly accurate                        Less accurate
Naïve Bayes Classifier Algorithm
Naïve Bayes algorithm
• Naïve Bayes algorithm is a supervised learning algorithm, which is
based on Bayes theorem and used for solving classification problems.
• It is mainly used in text classification that includes a high-dimensional
training dataset.



• Naïve Bayes Classifier is one of the simplest and most effective classification algorithms; it helps in building fast machine learning models that can make quick predictions.
• It is a probabilistic classifier, which means it predicts on the
basis of the probability of an object.
• Some popular examples of Naïve Bayes Algorithm are spam filtration,
Sentimental analysis, and classifying articles.

Why is it called Naïve Bayes?
• The Naïve Bayes algorithm is comprised of two words Naïve and Bayes,
Which can be described as:
🞄 Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is independent of the occurrence of other features. For example, if a fruit is identified on the basis of color, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple. Hence each feature individually contributes to identifying it as an apple, without depending on the others.
🞄 Bayes: It is called Bayes because it depends on the principle of Bayes'
Theorem.

Bayes' Theorem:
• Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is
used to determine the probability of a hypothesis with prior knowledge. It depends on the conditional probability.
• The formula for Bayes' theorem is given as: P(A|B) = P(B|A) · P(A) / P(B)



• Where,
🞄 P(A|B) is Posterior probability: Probability of hypothesis A on the observed event B.

🞄 P(B|A) is Likelihood probability: Probability of the evidence given that the


probability of a hypothesis is true.

🞄 P(A) is Prior Probability: Probability of hypothesis before observing the evidence.

🞄 P(B) is Marginal Probability: Probability of Evidence.

Working of Naïve Bayes' Classifier: - Example
• Suppose we have a dataset of weather conditions and a corresponding target variable "Play". Using this dataset, we need to decide whether we should play or not on a particular day according to the weather conditions. To solve this problem, we need to follow the steps below:
🞄 Convert the given dataset into frequency tables.
🞄 Generate Likelihood table by finding the probabilities of given
features.
🞄 Now, use Bayes theorem to calculate the posterior probability.

Working of Naïve Bayes' Classifier: Example 1
• Problem: If the weather is sunny, should the player play or not?
• Solution: To solve this, first consider the dataset below:

      Outlook     Play
0     Rainy       Yes
1     Sunny       Yes
2     Overcast    Yes
3     Overcast    Yes
4     Sunny       No
5     Rainy       Yes
6     Sunny       Yes
7     Overcast    Yes
8     Rainy       No
9     Sunny       No
10    Sunny       Yes
11    Rainy       No
12    Overcast    Yes
13    Overcast    Yes
Frequency table for the weather conditions:

Weather     Yes    No
Overcast    5      0
Rainy       2      2
Sunny       3      2
Total       10     4

Likelihood table for the weather conditions:

Weather     No     Yes
Overcast    0      5      5/14 = 0.35
Rainy       2      2      4/14 = 0.29
Sunny       2      3      5/14 = 0.35
All         4/14 = 0.29   10/14 = 0.71
Applying Bayes' theorem:
• P(Yes|Sunny)= P(Sunny|Yes)*P(Yes)/P(Sunny)
🞄 P(Sunny|Yes)= 3/10= 0.3
🞄 P(Sunny)= 0.35
🞄 P(Yes)=0.71
🞄 So P(Yes|Sunny) = 0.3*0.71/0.35= 0.60



• P(No|Sunny)= P(Sunny|No)*P(No)/P(Sunny)
🞄 P(Sunny|NO)= 2/4=0.5
🞄 P(No)= 0.29
🞄 P(Sunny)= 0.35
🞄 So P(No|Sunny)= 0.5*0.29/0.35 = 0.41
• As we can see from the above calculation, P(Yes|Sunny) > P(No|Sunny).
• Hence on a Sunny day, Player can play the game.
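The same posterior comparison can be reproduced in base R (a minimal sketch using the probabilities read from the likelihood table above):

p_yes <- 0.71; p_no <- 0.29; p_sunny <- 0.35    # from the likelihood table
p_sunny_given_yes <- 3/10                       # P(Sunny|Yes)
p_sunny_given_no  <- 2/4                        # P(Sunny|No)

p_yes_given_sunny <- p_sunny_given_yes * p_yes / p_sunny   # about 0.60
p_no_given_sunny  <- p_sunny_given_no  * p_no  / p_sunny   # about 0.41
p_yes_given_sunny > p_no_given_sunny                       # TRUE, so the player can play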

Advantages of Naïve Bayes Classifier:
• Naïve Bayes is one of the fastest and easiest ML algorithms for predicting the class of a dataset.
• It can be used for binary as well as multi-class classification.
• It performs well in multi-class predictions compared to many other algorithms.
• It is a popular choice for text classification problems.
Working of Naïve Bayes' Classifier: - Example 2

Given a new instance,
x' = (Outlook = Sunny, Temperature = Cool, Humidity = High, Wind = Strong)
predict whether a person can play tennis or not.
1. Calculate the prior probabilities:
P(Play = Yes) = 9/14
P(Play = No)  = 5/14

2. Calculate the conditional probabilities of the individual attributes:

P(Outlook = o | Play = b)
Outlook       Play=Yes   Play=No
Sunny         2/9        3/5
Overcast      4/9        0/5
Rain          3/9        2/5

P(Temperature = t | Play = b)
Temperature   Play=Yes   Play=No
Hot           2/9        2/5
Mild          4/9        2/5
Cool          3/9        1/5
P(Humidity = h | Play = b)
Humidity   Play=Yes   Play=No
High       3/9        4/5
Normal     6/9        1/5

P(Wind = w | Play = b)
Wind     Play=Yes   Play=No
Strong   3/9        3/5
Weak     6/9        2/5
v_NB = argmax over v_j ∈ {Yes, No} of
       P(v_j) · P(Outlook = Sunny | v_j) · P(Temperature = Cool | v_j) · P(Humidity = High | v_j) · P(Wind = Strong | v_j)

v_NB(Yes) = P(Yes) · P(Sunny|Yes) · P(Cool|Yes) · P(High|Yes) · P(Strong|Yes) = 0.0053
v_NB(No)  = P(No) · P(Sunny|No) · P(Cool|No) · P(High|No) · P(Strong|No)      = 0.0206

Normalization:
v_NB(Yes) = 0.0053 / (0.0053 + 0.0206) = 0.205
v_NB(No)  = 0.0206 / (0.0053 + 0.0206) = 0.795

Hence, on a sunny day with a cool temperature, high humidity, and strong wind, the player cannot play the game.
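The same scores can be checked with a few lines of R (an illustrative sketch, using the conditional probabilities from the tables above):

v_yes <- (9/14) * (2/9) * (3/9) * (3/9) * (3/9)        # P(Yes) * product of its likelihoods
v_no  <- (5/14) * (3/5) * (1/5) * (4/5) * (3/5)        # P(No)  * product of its likelihoods

round(c(Yes = v_yes, No = v_no), 4)                    # 0.0053  0.0206
round(c(Yes = v_yes, No = v_no) / (v_yes + v_no), 3)   # normalized: 0.205  0.795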
Try it out - 1

X = <rain, hot, high, false>

Try it out - 2
Attributes are Color, Type, and Origin; the subject attribute, Stolen, can be either Yes or No.

Classify a Red Domestic SUV.
Disadvantages of Naïve Bayes Classifier:
• Naive Bayes assumes that all features are independent or unrelated, so it cannot learn relationships between features.

Applications of Naïve Bayes Classifier:
• It is used for credit scoring.
• It is used in medical data classification.
• It can be used for real-time predictions because the Naïve Bayes classifier is an eager learner.
• It is used in text classification, such as spam filtering and sentiment analysis.

KNN algorithm
• K-Nearest Neighbour (K-NN) is one of the simplest machine learning algorithms, based on the supervised learning technique.
• The K-NN algorithm assumes similarity between the new case/data and the available cases, and puts the new case into the category that is most similar to the available categories.
• The K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can be easily classified into a well-suited category using K-NN.
• K-NN can be used for regression as well as classification, but it is mostly used for classification problems.
KNN algorithm
• K-NN is a non-parametric algorithm, which means it does not make any assumption about the underlying data.
• It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead, it stores the dataset and performs an action on it at the time of classification.
• At the training phase, the KNN algorithm just stores the dataset, and when it gets new data, it classifies that data into the category that is most similar to the new data.
Example
Suppose we have an image of a creature that looks similar to both a cat and a dog, and we want to know whether it is a cat or a dog. For this identification we can use the KNN algorithm, since it works on a similarity measure. Our KNN model will find the features of the new image that are most similar to the cat and dog images and, based on the most similar features, will put it in either the cat or the dog category.
Why do we need a K-NN Algorithm?
Suppose there are two categories, Category A and Category B, and we have a new data point x1. In which of these categories will the data point lie? The K-NN algorithm helps answer this question.
How does K-NN work?
The working of K-NN can be explained with the following algorithm:
• Step-1: Select the number K of neighbours.
• Step-2: Calculate the Euclidean distance from the new point to the data points.
• Step-3: Take the K nearest neighbours as per the calculated Euclidean distance.
• Step-4: Among these K neighbours, count the number of data points in each category.
• Step-5: Assign the new data point to the category for which the number of neighbours is maximum.
• Step-6: Our model is ready.
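As a hedged sketch of these steps, the knn() function from R's class package classifies a new point by a majority vote of its K nearest neighbours; the toy training data below is an assumption used only for illustration.

library(class)

set.seed(1)
train  <- data.frame(x = c(rnorm(10, 2), rnorm(10, 6)),
                     y = c(rnorm(10, 2), rnorm(10, 6)))   # two toy clusters
labels <- factor(rep(c("A", "B"), each = 10))

new_point <- data.frame(x = 4.5, y = 5.0)                 # the point to classify

knn(train, new_point, cl = labels, k = 5)                 # the 5 nearest neighbours vote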

Suppose we have a new data point that we need to put in one of the categories. Firstly, we choose the number of neighbours: we will choose k = 5.
Next, we calculate the Euclidean distance between the data points. The Euclidean distance is the distance between two points, which we have already studied in geometry:

d = √((x2 − x1)² + (y2 − y1)²)
By calculating the Euclidean distances we obtain the nearest neighbours: three nearest neighbours in Category A and two in Category B. Since the majority (3 of the 5) of the nearest neighbours are from Category A, the new data point must belong to Category A.
How to select the value of K in the K-NN Algorithm?
• There is no particular way to determine the best value for K, so we need to try several values to find the best among them. The most commonly preferred value for K is 5.
• A very low value of K, such as K = 1 or K = 2, can be noisy and makes the model sensitive to outliers.
• Large values of K smooth out noise, but may cause the model to overlook small but genuine patterns and increase computation.
Advantages of KNN Algorithm:
• It is simple to implement.
• It is robust to noisy training data.
• It can be effective when the training data is large.

Disadvantages of KNN Algorithm:
• The value of K always needs to be determined, which may sometimes be complex.
• The computation cost is high because the distance between the new point and all training samples must be calculated.
Workout Example
NAME      AGE   GENDER   CLASS OF SPORTS
Ajay      32    0        Football
Mark      40    0        Neither
Sara      16    1        Cricket
Zaira     34    1        Cricket
Sachin    55    0        Neither
Rahul     40    0        Cricket
Pooja     20    1        Neither
Smith     15    0        Cricket
Laxmi     55    1        Football
Michael   15    0        Football

Here male is denoted with the numeric value 0 and female with 1.
Let's find in which class of sports "Kiran" (age 5, gender value 1) will lie, taking the k factor as 3.
We have to find the distances using

d = √((x2 − x1)² + (y2 − y1)²)

to find the distance between any two points. Let's find the distance between Ajay and Kiran using the formula:

d = √((age2 − age1)² + (gender2 − gender1)²)
d = √((5 − 32)² + (1 − 0)²)
d = √(729 + 1)
d = 27.02

Similarly, we find all the other distances one by one.
Distance between Kiran and:

Ajay      27.02
Mark      35.01
Sara      11.00
Zaira     29.00
Sachin    50.01
Rahul     35.01
Pooja     15.00
Smith     10.05
Laxmi     50.00
Michael   10.05

The value of the k factor is 3 for Kiran. The three smallest distances are 10.05, 10.05, and 11.00, so the closest to Kiran are Smith, Michael, and Sara.

Smith     10.05   Cricket
Michael   10.05   Football
Sara      11.00   Cricket

So, according to the KNN algorithm, Kiran will be in the class of people who like cricket.
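A short R sketch (illustrative, not from the slides) reproduces this worked example end to end.

people <- data.frame(
  name   = c("Ajay","Mark","Sara","Zaira","Sachin","Rahul","Pooja","Smith","Laxmi","Michael"),
  age    = c(32, 40, 16, 34, 55, 40, 20, 15, 55, 15),
  gender = c(0, 0, 1, 1, 0, 0, 1, 0, 1, 0),
  sport  = c("Football","Neither","Cricket","Cricket","Neither",
             "Cricket","Neither","Cricket","Football","Football"))

kiran <- c(age = 5, gender = 1)

# Euclidean distance from Kiran to every person
people$distance <- sqrt((people$age - kiran["age"])^2 + (people$gender - kiran["gender"])^2)

nearest <- people[order(people$distance), ][1:3, ]    # the 3 nearest neighbours
nearest[, c("name", "distance", "sport")]

names(which.max(table(nearest$sport)))                # majority vote: "Cricket"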

Linear Regression
Regression
• A technique used for the modeling and analysis of numerical data.
• It exploits the relationship between two or more variables so that we can gain information about one of them by knowing the values of the other(s).
• Regression can be used for prediction, estimation, hypothesis testing, and modeling causal relationships.
Why Linear Regression?
• The linear regression algorithm shows a linear relationship between a dependent variable (y) and one or more independent variables (x), hence the name linear regression.
• Since linear regression shows a linear relationship, it finds how the value of the dependent variable changes according to the value of the independent variable.
• The linear regression model provides a sloped straight line representing the relationship between the variables.
Linear Regression is a Probabilistic Model
• Much of mathematics is devoted to studying variables that are deterministically related to one another.
• Here, however, we are interested in understanding the relationship between variables that are related in a nondeterministic fashion.
A Linear Probabilistic Model
• Mathematically, we can represent a simple linear regression as:

y = β0 + β1·x + ε

Here,
y  = dependent variable (target variable)
x  = independent variable (predictor variable)
β0 = intercept of the line (gives an additional degree of freedom)
β1 = linear regression coefficient (scale factor applied to the input value)
ε  = random error

The values of the x and y variables form the training dataset for the linear regression model representation.
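As an illustrative sketch, such a model can be fitted in R with lm(); the toy data and the true coefficient values below are assumptions.

set.seed(42)
x <- 1:50
y <- 3 + 2 * x + rnorm(50, sd = 5)   # true intercept 3, true slope 2, plus random error

fit <- lm(y ~ x)                     # estimate the intercept and slope by least squares
coef(fit)                            # fitted values close to 3 and 2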
Types of Linear Regression
Linear regression can be further divided into two types of the algorithm:
• Simple Linear Regression:
If a single independent variable is used to predict the value of a
numerical dependent variable, then such a Linear Regression algorithm

is called Simple Linear Regression.
• Multiple Linear regression:
If more than one independent variable is used to predict the value of a
numerical dependent variable, then such a Linear Regression algorithm
is called Multiple Linear Regression.

Linear Regression Line
A linear line showing the relationship between the dependent and
independent variables is called a regression line. A regression line can
show two types of relationship:
• Positive Linear Relationship:

If the dependent variable increases on the Y-axis and independent
variable increases on X-axis, then such a relationship is termed as a
Positive linear relationship.

Linear Regression Line
• Negative Linear Relationship:
If the dependent variable decreases on the Y-axis and independent
variable increases on the X-axis, then such a relationship is called a
negative linear relationship.

Multiple Linear Regression
• Extension of the simple linear regression model to two or more independent variables:

Y = β0 + β1·X1 + β2·X2 + … + βk·Xk + ε

where β0 is the Y-intercept, β1 … βk are the population slopes, and ε is the random error.
Example

Predict the value of Y given X1 and X2.

Subject   Y      X1   X2
1         -3.7   3    8
2         3.5    4    5
3         2.5    5    7
4         11.5   6    3
5         5.7    2    1
6         ?      3    2

Population model:   Y = β0 + β1·x1 + β2·x2 + ε
Fitted equation:    Ŷ = a + b1·x1 + b2·x2

Y is the dependent variable; X1 and X2 are the independent variables.

a = Ȳ − b1·x̄1 − b2·x̄2
The coefficients b1 and b2 are obtained from the least-squares normal equations, which use the sums of squares and cross-products computed in the table below (N is the number of examples).
Subject   Y      X1   X2   X1·X1   X2·X2   X1·X2   X1·Y    X2·Y
1         -3.7   3    8    9       64      24      -11.1   -29.6
2         3.5    4    5    16      25      20      14.0    17.5
3         2.5    5    7    25      49      35      12.5    17.5
4         11.5   6    3    36      9       18      69.0    34.5
5         5.7    2    1    4       1       2       11.4    5.7
Σ         19.5   20   24   90      148     99      95.8    45.6
From the table, the deviation sums of squares and cross-products are:

Σx1²  = 90 − (20)²/5          = 10
Σx2²  = 148 − (24)²/5         = 32.8
Σx1y  = 95.8 − (20)(19.5)/5   = 17.8
Σx2y  = 45.6 − (24)(19.5)/5   = −48
Σx1x2 = 99 − (20)(24)/5       = 3

b1 = (Σx2²·Σx1y − Σx1x2·Σx2y) / (Σx1²·Σx2² − (Σx1x2)²) = (32.8 × 17.8 − 3 × (−48)) / (10 × 32.8 − 3²) = 2.28
b2 = (Σx1²·Σx2y − Σx1x2·Σx1y) / (Σx1²·Σx2² − (Σx1x2)²) = (10 × (−48) − 3 × 17.8) / (10 × 32.8 − 3²) = −1.67

a = 19.5/5 − 2.28 × (20/5) + 1.67 × (24/5) = 2.796

The final regression equation/model is:
Y = 2.796 + 2.28·x1 − 1.67·x2

For x1 = 3, x2 = 2:  Y = 2.796 + 2.28(3) − 1.67(2) = 6.296
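A quick R check of this worked example (an illustrative sketch) confirms the coefficients and the prediction for subject 6.

df <- data.frame(y  = c(-3.7, 3.5, 2.5, 11.5, 5.7),
                 x1 = c(3, 4, 5, 6, 2),
                 x2 = c(8, 5, 7, 3, 1))

fit <- lm(y ~ x1 + x2, data = df)
coef(fit)                                            # intercept ~2.80, b1 ~2.28, b2 ~ -1.67

predict(fit, newdata = data.frame(x1 = 3, x2 = 2))   # ~6.3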
Unsupervised Learning - Clustering
Clustering
• It is basically a type of unsupervised learning method.
• Generally, it is used as a process to find meaningful structure, explanatory underlying processes, generative features, and groupings inherent in a set of examples.
• Clustering is the task of dividing the population or data points into a number of groups such that data points in the same group are more similar to each other than to data points in other groups.
• It is basically a collection of objects grouped on the basis of similarity and dissimilarity between them.
Clustering – Ex.
• In a scatter plot, data points that lie close together can be classified into one single group. We can distinguish the clusters visually; for example, a plot may clearly show 3 clusters.
Why Clustering?
• Clustering is very important as it determines the intrinsic grouping among the unlabelled data present.
• There are no universal criteria for a good clustering; it depends on the user and on what criteria satisfy their need.
• For instance, we could be interested in finding representatives for homogeneous groups (data reduction), in finding "natural clusters" and describing their unknown properties ("natural" data types), in finding useful and suitable groupings ("useful" data classes), or in finding unusual data objects (outlier detection).
Clustering Methods :
• Density-Based Methods: These methods consider the clusters as dense regions having some similarity, separated from the lower-density regions of the space. They have good accuracy and the ability to merge two clusters. Examples: DBSCAN (Density-Based Spatial Clustering of Applications with Noise), OPTICS (Ordering Points To Identify the Clustering Structure), etc.
• Hierarchical Based Methods: The clusters formed in this method form a tree-type structure based on the hierarchy. New clusters are formed using the previously formed ones. It is divided into two categories:
🞄 Agglomerative (bottom-up approach)
🞄 Divisive (top-down approach)
Clustering Methods : (Contd…)
• Partitioning Methods: These methods partition the objects into k clusters, and each partition forms one cluster. They optimize an objective criterion similarity function, such as one where distance is the major parameter. Examples: K-means, CLARANS (Clustering Large Applications based upon Randomized Search), etc.
• Grid-based Methods: In these methods, the data space is formulated into a finite number of cells that form a grid-like structure. All the clustering operations done on these grids are fast and independent of the number of data objects. Examples: STING (Statistical Information Grid), WaveCluster, CLIQUE (CLustering In QUEst), etc.
Applications of Clustering in different fields
• Marketing: It can be used to characterize and discover customer segments for marketing purposes.
• Biology: It can be used for classification among different species of plants and animals.
• Libraries: It is used to cluster different books on the basis of topics and information.
• Insurance: It is used to acknowledge customers and their policies, and to identify fraud.

Hierarchical algorithms
• Hierarchical clustering is another unsupervised machine learning algorithm, used to group unlabelled datasets into clusters; it is also known as hierarchical cluster analysis or HCA.
• In this algorithm, we develop the hierarchy of clusters in the form of a tree, and this tree-shaped structure is known as a dendrogram.
• Sometimes the results of K-means clustering and hierarchical clustering may look similar, but they differ in how they work.
Approaches
• The hierarchical clustering technique has two approaches:
🞄 Agglomerative: Agglomerative is a bottom-up approach, in which
the algorithm starts with taking all data points as single clusters and
merging them until one cluster is left.
🞄 Divisive: Divisive algorithm is the reverse of the agglomerative

algorithm as it is a top-down approach.

Agglomerative Hierarchical clustering
• The agglomerative hierarchical clustering algorithm is a popular example of HCA.
• To group the data points into clusters, it follows the bottom-up approach: the algorithm considers each data point as a single cluster at the beginning and then starts combining the closest pairs of clusters.
• It does this until all the clusters are merged into a single cluster that contains all the data points.
• This hierarchy of clusters is represented in the form of a dendrogram.
How does Agglomerative Hierarchical Clustering work?

• Step-1: Treat each data point as a single cluster. If there are N data points, the number of clusters will also be N.
• Step-2: Take the two closest data points or clusters and merge them to form one cluster. There will now be N-1 clusters.
How does Agglomerative Hierarchical Clustering work? (continued)

• Step-3: Again, take the two closest clusters and merge them together to form one cluster. There will be N-2 clusters.
• Step-4: Repeat Step 3 until only one cluster is left.
• Step-5: Once all the clusters are combined into one big cluster, develop the dendrogram and cut it to divide the clusters as per the problem.
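A minimal R sketch of these steps (the ten random points are an assumption for illustration); dist(), hclust(), cutree(), and plot() are base R functions.

set.seed(7)
pts <- matrix(rnorm(20), ncol = 2)       # ten toy points in two dimensions

d  <- dist(pts, method = "euclidean")    # pairwise Euclidean distances (Steps 1-2)
hc <- hclust(d, method = "single")       # repeatedly merge the two closest clusters

plot(hc, main = "Dendrogram")            # Step-5: the dendrogram
cutree(hc, k = 3)                        # cut the tree into, say, 3 clusters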
Measure for the distance between two clusters
• There are various ways to calculate the distance between two clusters, and these decide the rules for clustering. These measures are called linkage methods.
• Some of the popular linkage methods are given below:
🞄 Single Linkage: the shortest distance between the closest points of the two clusters.
Measure for the distance between two clusters
• Complete Linkage: the farthest distance between two points of two different clusters. It is one of the popular linkage methods, as it forms tighter clusters than single linkage.
Measure for the distance between two clusters
• Average Linkage: the linkage method in which the distances between all pairs of points (one from each cluster) are added up and then divided by the total number of pairs, giving the average distance between the two clusters. It is also one of the most popular linkage methods.
Measure for the distance between two clusters
• Centroid Linkage: the linkage method in which the distance between the centroids of the two clusters is calculated.
Working of Dendrogram in Hierarchical clustering
• The dendrogram is a tree-like structure that is mainly used to record each merge step that the hierarchical clustering algorithm performs.
• In a dendrogram plot, the Y-axis shows the Euclidean distances between the data points (or clusters), and the X-axis shows all the data points of the given dataset.
• In such a diagram, one part shows how the clusters are created in agglomerative clustering, and the other shows the corresponding dendrogram.
• Firstly, the data points P2 and P3 combine together and form a cluster; correspondingly, a dendrogram is created that connects P2 and P3 with a rectangular shape. The height is decided according to the Euclidean distance between the data points.
• In the next step, P5 and P6 form a cluster, and the corresponding dendrogram is created. It is higher than the previous one, as the Euclidean distance between P5 and P6 is a little greater than that between P2 and P3.
• Again, two new dendrograms are created that combine P1, P2, and P3 in one dendrogram, and P4, P5, and P6 in another.
• At last, the final dendrogram is created that combines all the data points together.
• We can cut the dendrogram tree structure at any level, as per our requirement.
Clusters using a Single Link Technique
Problem Definition:
For the given dataset of points P1–P6, find the clusters using the single link technique. Use Euclidean distance and draw the dendrogram.
Clusters using a Single Link Technique
Step – 1: Compute the distance matrix
Find the Euclidean distance between each and every pair of points, for example:
d(P1, P2) = 0.23
d(P1, P3) = 0.22
d(P2, P3) = 0.14
These pairwise distances fill the distance matrix.
Clusters using a Single Link Technique
Step – 2: Merge the two closest members
Form a cluster from the minimum value in the distance matrix and update the distance matrix. With single linkage, the distance from the merged cluster to any other point is the minimum of the individual distances.
The merges proceed in the following order:
1. (P3, P6)
2. {(P3, P6), P4}
3. (P2, P5)
4. [{(P3, P6), P4}, (P2, P5)]
5. [[{(P3, P6), P4}, (P2, P5)], P1], at which point all points form a single cluster
Partitional algorithms - K-Means
Partitioning Method
• This clustering method classifies the information into multiple groups based on the characteristics and similarity of the data.
• It is up to the data analyst to specify the number of clusters to be generated for the clustering method.
• In the partitioning method, given a database (D) that contains multiple (N) objects, the method constructs a user-specified number (K) of partitions of the data, in which each partition represents a cluster and a particular region.
• There are many algorithms that come under the partitioning method; some of the popular ones are K-Means, PAM (K-Medoids), the CLARA algorithm (Clustering Large Applications), etc.
K-Mean (A centroid based Technique):
• The K-means algorithm takes the input parameter K from the user and partitions the dataset containing N objects into K clusters so that the resulting similarity among the data objects inside a group (intra-cluster) is high, while the similarity of data objects with objects outside the cluster (inter-cluster) is low.
• The similarity of a cluster is determined with respect to the mean value of the cluster.
• It is a type of squared-error algorithm. At the start, K objects are chosen randomly from the dataset, each representing a cluster mean (centre).
• The rest of the data objects are assigned to the nearest cluster based on their distance from the cluster mean.
• The new mean of each cluster is then calculated from the assigned data objects.
Input:
• K: The number of clusters in which the dataset has to be divided
• D: A dataset containing N number of objects

Output:
• A dataset of K clusters

Method
1. Randomly assign K objects from the dataset (D) as cluster centres (C).
2. (Re)assign each object to the cluster it is most similar to, based upon the cluster mean values.
3. Update the cluster means, i.e., recalculate the mean of each cluster with the updated assignments.
4. Repeat Steps 2 and 3 until no change occurs.
Advantages
• Simple, easy to understand, and easy to implement.
• It is also efficient: the time taken to cluster with K-means rises linearly with the number of data points.
• No other clustering algorithm performs better than K-means, in general.

Disadvantages
• The user needs to specify an initial value of K.
• The process of finding the clusters may not converge.
• It is not suitable for discovering all types of clusters.
Example: apply K-means to the following ten points.

X:  4  6  5  5  6  8  4  5  2  2
Y:  4  3  7  2  6  3  7  6  6  4
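These points can be clustered with R's built-in kmeans() function, which carries out the assign/update steps listed in the Method above; taking K = 2 here is an assumption, since the number of clusters used in the original worked figures is not stated.

x <- c(4, 6, 5, 5, 6, 8, 4, 5, 2, 2)
y <- c(4, 3, 7, 2, 6, 3, 7, 6, 6, 4)

set.seed(1)
km <- kmeans(cbind(x, y), centers = 2)

km$cluster    # cluster assignment of each point
km$centers    # final cluster means (centroids)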
Association Rule Mining
• It is a popular unsupervised learning technique, used in business to help identify shopping patterns.
• Also known as Market Basket Analysis.
• It helps to find interesting relationships (affinities) between variables (items or events).
• Thus, it can help cross-sell related items and increase the size of a sale.
• There is no dependent variable.
• All data used in this technique is categorical.
Association Rule Mining
• Market basket analysis is one of the key techniques used by large retailers to show associations between items.
• It allows retailers to identify relationships between the items that people buy together frequently.
• Given a set of transactions, we can find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.
Basic Definitions
• Support Count (σ) – the frequency of occurrence of an item set. In the example transaction data used below, σ({Milk, Bread, Diaper}) = 2.
• Frequent Item set – an item set whose support is greater than or equal to the minsup threshold.
• Association Rule – an implication expression of the form X → Y, where X and Y are any two item sets.
  Example: {Milk, Diaper} → {Beer}
Rule Evaluation Metrics
• Support (s)
🞄 The number of transactions that include the items in both the {X} and {Y} parts of the rule, as a percentage of the total number of transactions.
🞄 It is a measure of how frequently the collection of items occurs together, as a percentage of all transactions.
🞄 Support(X => Y) = σ(X ∪ Y) ÷ |T|, interpreted as the fraction of transactions that contain both X and Y.
• Confidence (c)
🞄 The ratio of the number of transactions that include all items in {X} as well as all items in {Y} to the number of transactions that include all items in {X}.
🞄 Conf(X => Y) = Supp(X ∪ Y) ÷ Supp(X); it measures how often the items in Y appear in transactions that also contain the items in X.
• Lift(l) –
🞄 The lift of the rule X=>Y is the confidence of the rule divided by the expected
confidence, assuming that the item sets X and Y are independent of each other.
🞄 The expected confidence is the confidence divided by the frequency of {Y}.

CST322 - DA | Mod-3 | Sahrdaya CET


• Lift(X=>Y) = 𝑪𝒐𝒏𝒇 𝑿 => 𝒀 ÷ 𝑺𝒖𝒑𝒑(𝒀) –
🞄 Lift value near 1 indicates X and Y almost often appear together as expected, greater
than 1 means they appear together more than expected and less than 1 means they
appear less than expected.
🞄 Greater lift values indicate stronger association.

113
Illustration
From the transaction table, consider the rule {Milk, Diaper} => {Beer}:

• s = σ({Milk, Diaper, Beer}) ÷ |T| = 2/5 = 0.4
• c = σ({Milk, Diaper, Beer}) ÷ σ({Milk, Diaper}) = 2/3 = 0.67
• l = Supp({Milk, Diaper, Beer}) ÷ (Supp({Milk, Diaper}) × Supp({Beer})) = 0.4 / (0.6 × 0.6) = 1.11
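The three metrics can be computed in a few lines of base R; the five-transaction basket below is an illustrative assumption chosen to be consistent with the counts used above.

transactions <- list(
  c("Bread", "Milk"),
  c("Bread", "Diaper", "Beer", "Eggs"),
  c("Milk", "Diaper", "Beer", "Coke"),
  c("Bread", "Milk", "Diaper", "Beer"),
  c("Bread", "Milk", "Diaper", "Coke"))

contains <- function(items) sapply(transactions, function(t) all(items %in% t))

supp_xy <- mean(contains(c("Milk", "Diaper", "Beer")))   # 2/5 = 0.4
supp_x  <- mean(contains(c("Milk", "Diaper")))           # 3/5 = 0.6
supp_y  <- mean(contains("Beer"))                        # 3/5 = 0.6

c(support    = supp_xy,
  confidence = supp_xy / supp_x,                         # 0.67
  lift       = supp_xy / (supp_x * supp_y))              # 1.11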
Apriori Algorithm
• Most popular algorithm used for Association Rule Mining
• A frequent item set is an item set whose support is greater than or
equal to minimum support threshold.
The Apriori property is a downward closure property, means that any

CST322 - DA | Mod-3 | Sahrdaya CET



subsets of a frequent item set are also frequent item sets.
• Thus, if (A,B,C,D) is a frequent item set, then any subset such as
(A,B,C) or (B,D) are also frequent item set
• This uses bottom up approach; and size of the frequent subsets is
gradually increased, from one-item subsets to two-item subsets, then
three-item subsets, and so on.
• Groups of candidates at each level are tested against the data for
minimum support
116
117
Example:
The objective is to use transaction data to find affinities between products, that is, which products sell together often. The support level will be set at 33 percent and the confidence level at 50 percent.
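As a hedged sketch, the arules package (assumed to be installed) implements Apriori directly; the baskets below are illustrative, since the original transaction table was shown only as a figure.

library(arules)

baskets <- list(
  c("bread", "milk"),
  c("bread", "butter"),
  c("milk", "butter", "bread"),
  c("milk", "sugar"),
  c("bread", "milk", "butter", "sugar"),
  c("sugar", "butter"))

trans <- as(baskets, "transactions")

# Apriori with 33 percent minimum support and 50 percent minimum confidence, as in the example
rules <- apriori(trans, parameter = list(supp = 0.33, conf = 0.50))
inspect(rules)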

Association Rule Mining – Subset Creation
Example-2

Consider the following dataset and


we will find frequent item sets and
generate association rules for them.

CST322 - DA | Mod-3 | Sahrdaya CET


minimum support count is 2
minimum confidence is 60%

126
Limitations of Apriori Algorithm
• The Apriori algorithm can be slow.
• The main limitation is the time and memory required to hold a vast number of candidate sets when there are many frequent item sets, a low minimum support, or large item sets.
• In other words, it is not an efficient approach for very large datasets.
Thank you…

CST322- DATA ANALYTICS

MODULE 4
Data is one of the prime factors of any business. Business enterprises are data-driven, and without data no one can have a competitive advantage. Big Data has several definitions, but broadly a huge amount of data can be considered Big Data. It is one of the most widely used technologies these days in almost every business vertical.

Big Data Definition

Data can be defined as figures or facts that can be stored in or can be used by a computer.

Big Data is a term that is used for denoting a collection of datasets that is large and complex,
making it very difficult to process using legacy data processing applications.

Types of Big Data

Big Data is essentially classified into three types:

Structured Data

Unstructured Data

Semi-structured Data

The above three types of Big Data are technically applicable at all levels of analytics. It is
critical to understand the source of raw data and its treatment before analysis while working
with large volumes of big data. Because there is so much data, extraction of information needs
to be done efficiently to get the most out of the data.

Structured Data

Structured data is highly organized and thus, is the easiest to work with. Its dimensions are
defined by set parameters. Every piece of information is grouped into rows and columns like
spreadsheets. Structured data has quantitative data such as age, contact, address, billing,
expenses, debit or credit card numbers, etc.

Due to structured data’s quantitative nature, it is easy for programs to sort through and collect
data. It requires little to no preparation to process structured data. The data only needs to be
cleaned and pared down to the relevant points. The data does not need to be converted or
interpreted too deeply to perform a proper inquiry.

Structured data follow road maps to specific data points or schemas for outlining the location
of each datum and its meaning.

The streamlined process of merging enterprise data with relational data is one of the perks of
structured data. Due to the pertinent data dimensions being defined and being in a uniform
format, very little preparation is required to have all sources be compatible.

The ETL process, for structured data, stores the finished product in a data warehouse. The
initial data is harvested for a specific analytics purpose, and for this, the databases are highly
structured and filtered. However, there is only a limited amount of structured data available,
and it falls under a slim minority of all existing data. Consensus says that structured data makes
up only 20 percent or less of all data.

Unstructured Data

Not all data is structured and well-sorted with instructions on how to use it. All unorganized
data is known as unstructured data.

Almost everything generated by a computer is unstructured data. The time and effort required
to make unstructured data readable can be cumbersome. To yield real value from data, datasets
need to be interpretable. But the process to make that happen can be much more rewarding.

The challenging part about unstructured data analysis is teaching an application to understand
the information it’s extracting. Oftentimes, translation into structured form is required, which
is not easy and varies with different formats and end goals. Some methods to achieve the
translation are by using text parsing, NLP, and developing content hierarchies through
taxonomy. Complex algorithms are involved to blend the processes of scanning, interpreting,
and contextualizing.

Unlike structured data, which is stored in data warehouses, unstructured data is stored in data
lakes. Data lakes preserve the raw format of data as well as all of its information. Data lakes
make data more malleable, unlike data warehouses where data is limited to its defined schema.

Semi-structured Data

Semi-structured data falls somewhere between structured data and unstructured data. It mostly
translates to unstructured data that has metadata attached to it. Semi-structured data can be
inherited such as location, time, email address, or device ID stamp. It can even be a semantic
tag attached to the data later.

Consider the example of an email. The time an email was sent, the email addresses of the sender
and the recipient, the IP address of the device that the email was sent from, and other relevant
information are linked to the content of the email. While the actual content itself is not
structured, these components enable the data to be grouped in a structured manner.

Using the right datasets can make semi-structured data into a significant asset. It can aid
machine learning and AI training by associating patterns with metadata.

Semi-structured data’s no set schema can be a benefit as well as a challenge. It can be a


challenge to put in all that effort to tell an application the meaning of each data point. But at
the same time, there are no limits in structured data ETL in terms of definition.

Subtypes of Data

Apart from the three above-mentioned types, there are subtypes of data that are not formally considered Big Data but are still pertinent to analytics. Most often these are defined by the origin of the data, such as social media, machine (operational logging), event-triggered, or geospatial data. They can also be defined by access level: open (open source), linked (web data transmitted via APIs and other connection methods), or dark/lost (siloed within systems and inaccessible to outsiders, such as CCTV footage).

Characteristics of Big Data

Volume: This refers to tremendously large amounts of data. The volume of data is rising exponentially: in 2016 the data created was only about 8 ZB, and it was expected to rise to around 40 ZB by 2020, which is extremely large.

Variety: A reason for this rapid growth in data volume is that data comes from different sources in various formats. We have already discussed how data is categorized into different types.

Velocity: The speed of data accumulation also plays a role in determining whether the data is big data or normal data.

Value: How will value be extracted from the data? This V deals with the mechanism for bringing out the correct meaning of data. First, you need to mine the data, i.e., turn raw data into useful data. Then an analysis is done on the data you have cleaned or retrieved from the raw data. Finally, you need to make sure the analysis benefits your business, for example by finding insights and results in a way that was not possible earlier.

Veracity: Data may be lost or corrupted during collection and processing, in which case the work of mining raw data into valuable data has to start again, and this process goes on. There will also be uncertainties and inconsistencies in the data. Veracity refers to the trustworthiness and quality of data, and the veracity of data must be maintained.

Major Sectors Using Big Data Every Day


The applications of Big Data provide solutions to every sector, such as banking, government, education, healthcare, and so on.

Banking

Since there is a massive amount of data that is gushing in from innumerable sources, banks
need to find uncommon and unconventional ways to manage big data. It’s also essential to
examine customer requirements, render services according to their specifications, and reduce
risks while sustaining regulatory compliance. Financial institutions have to deal with Big Data
Analytics to solve this problem.

• NYSE (New York Stock Exchange): NYSE generates about one terabyte of new trade data every
single day. So imagine, if one terabyte of data is generated every day, in a whole year how
much data there would be to process. This is what Big Data is used for.

Government

Government agencies utilize Big Data for running agencies, managing utilities, dealing with traffic jams, and limiting the effects of crime. However, apart from its benefits, the government also has to address concerns of transparency and privacy.

• Aadhaar Card: The Indian government has a record of all 1.21 billion citizens. This huge dataset is stored and analyzed to find out several things, such as the number of youth in the country, according to which several schemes are designed to target the maximum population. All this big data cannot be stored in a traditional database, so it is stored and analyzed using Big Data Analytics tools.

Education

Big Data produces a vital impact on students, school systems, and curricula in education. By interpreting big data, educators can ensure students' growth, identify at-risk students, and achieve an improved system for the evaluation and assistance of principals and teachers.

• Example: The education sector holds a lot of information concerning curriculum, students, and faculty. This information is analyzed to get insights that can enhance the operational adequacy of the educational organization. Collecting and analyzing information about a student, such as attendance, test scores, grades, and other issues, takes up a lot of data. Big Data therefore provides a progressive framework in which this data can be stored and analyzed, making it easier for institutes to work with.

Big Data in Healthcare

Big Data is used enormously in healthcare. It includes collecting data, analyzing it, and leveraging it for patients and customers. Patients' clinical data is too complex to be handled or understood by traditional systems. Since big data is processed by machine learning algorithms and data scientists, tackling such huge data becomes manageable.

• Example: Nowadays, doctors rely mostly on patients’ clinical records, which means that a lot
of data needs to be gathered, that too for different patients. It is not possible for old or
traditional data storage methods to store this data. Since there is a large amount of data
coming from different sources, in various formats, the need to handle this large amount of
data is increased, and that is why the Big Data approach is needed.

E-commerce

Maintaining customer relationships is most important in the e-commerce industry. E-commerce websites use different marketing ideas to retail merchandise to their customers, manage transactions, and implement better tactics using innovative Big Data ideas to improve their businesses.

• Flipkart: Flipkart is a huge e-commerce website dealing with lots of traffic daily. But when there is a pre-announced sale on Flipkart, traffic grows exponentially and can crash the website. To handle this kind of traffic and data, Flipkart uses Big Data, which can help in organizing and analyzing the data for further use.

Social Media

Social media in the current scenario is considered the largest data generator. The stats have
shown that around 500+ terabytes of new data get generated into the databases of social media
every day, particularly in the case of Facebook. The data generated mainly consist of videos,
photos, message exchanges, etc. A single activity on any social media site generates a lot of
data which is again stored and gets processed whenever required. Since the data stored is in
terabytes, it would take a lot of time for processing if it is done by our legacy systems. Big
Data is a solution to this problem.

Tools for Big Data Analytics

Apache Hadoop
Big Data Hadoop is a framework that allows you to store big data in a distributed environment for
parallel processing.
Apache Pig
Apache Pig is a platform that is used for analyzing large datasets by representing them as data flows.
Pig is designed to provide an abstraction over MapReduce which reduces the complexities of writing a
MapReduce program.
Apache HBase
Apache HBase is a multidimensional, distributed, open-source, and NoSQL database written in Java.
It runs on top of HDFS providing Bigtable-like capabilities for Hadoop.
Apache Spark
Apache Spark is an open-source general-purpose cluster-computing framework. It provides an
interface for programming all clusters with implicit data parallelism and fault tolerance.
Talend
Talend is an open-source data integration platform. It provides many services for enterprise
application integration, data integration, data management, cloud storage, data quality, and Big Data.
Splunk
Splunk is an American company that produces software for monitoring, searching, and analyzing
machine-generated data using a Web-style interface.
Apache Hive
Apache Hive is a data warehouse system developed on top of Hadoop and is used for interpreting
structured and semi-structured data.
Kafka
Apache Kafka is a distributed messaging system that was initially developed at LinkedIn and later
became part of the Apache project. Kafka is agile, fast, scalable, and distributed by design.
REVIEW OF BASIC DATA ANALYTIC METHODS USING R

The previous chapter presented the six phases of the Data Analytics Lifecycle.

• Phase 1: Discovery

• Phase 2: Data Preparation

• Phase 3: Model Planning

• Phase 4: Model Building

• Phase 5: Communicate Results

• Phase 6: Operationalize

The first three phases involve various aspects of data exploration. In general, the success of a data analysis project requires a deep understanding of the data. It also requires a toolbox for mining and presenting the data. These activities include the study of the data in terms of basic statistical measures and the creation of graphs and plots to visualize and identify relationships and patterns. Several free or commercial tools are available for exploring, conditioning, modeling, and presenting data. Because of its popularity and versatility, the open-source programming language R is used to illustrate many of the presented analytical tasks and models in this book.
This chapter introduces the basic functionality of the R programming language and environment. The first section gives an overview of how to use R to acquire, parse, and filter the data as well as how to obtain some basic descriptive statistics on a dataset. The second section examines using R to perform exploratory data analysis tasks using visualization. The final section focuses on statistical inference, such as hypothesis testing and analysis of variance in R.

3.1 Introduction to R


R is a programming language and software framework for statistical analysis and graphics. Available for use under the GNU General Public License [1], R software and installation instructions can be obtained via the Comprehensive R Archive Network (CRAN) [2]. This section provides an overview of the basic functionality of R. In later chapters, this foundation in R is utilized to demonstrate many of the presented analytical techniques.
Before delving into specific operations and functions of R later in this chapter, it is important to understand the flow of a basic R script to address an analytical problem. The following R code illustrates a typical analytical situation in which a dataset is imported, the contents of the dataset are examined, and some model building tasks are executed. Although the reader may not yet be familiar with the R syntax, the code can be followed by reading the embedded comments, denoted by #. In the following scenario, the annual sales in U.S. dollars for 10,000 retail customers have been provided in the form of a comma-separated-value (CSV) file. The read.csv() function is used to import the CSV file. This dataset is stored to the R variable sales using the assignment operator <-.

# import a CSV file of the total annual sales for each customer
sales <- read.csv("c:/data/yearly_sales.csv")

# examine the imported dataset
head(sales)
summary(sales)

# plot num_of_orders vs. sales
plot(sales$num_of_orders, sales$sales_total,
     main="Number of Orders vs. Sales")

# perform a statistical analysis (fit a linear regression model)
results <- lm(sales$sales_total ~ sales$num_of_orders)
summary(results)

# perform some diagnostics on the fitted model
# plot histogram of the residuals
hist(results$residuals, breaks = 800)

In this example, the data file is imported using the read.csv() function. Once the file has been imported, it is useful to examine the contents to ensure that the data was loaded properly as well as to become familiar with the data. In the example, the head() function, by default, displays the first six records of sales.

# examine the imported dataset
head(sales)
  cust_id  sales_total  num_of_orders  gender
1  100001       800.64
2  100002       217.53
3  100003        74.58              2       M
4  100004       498.60                      M
5  100005       723.11              4       F
6  100006        69.43              2       F

The summary() function provides some descriptive statistics, such as the mean and median, for each data column. Additionally, the minimum and maximum values as well as the 1st and 3rd quartiles are provided. Because the gender column contains two possible characters, an "F" (female) or "M" (male), the summary() function provides the count of each character's occurrence.

summary(sales)
    cust_id        sales_total      num_of_orders    gender
 Min.   :100001   Min.   :  30.02   Min.   : 1.000   F:5035
 1st Qu.:102501   1st Qu.:  80.29   1st Qu.: 2.000   M:4965
 Median :105001   Median : 151.65   Median : 2.000
 Mean   :105001   Mean   : 249.46   Mean   : 2.428
 3rd Qu.:107500   3rd Qu.: 295.50   3rd Qu.: 3.000
 Max.   :110000   Max.   :7606.09   Max.   :22.000

Plotting a dataset's contents can provide information about the relationships between the various columns. In this example, the plot() function generates a scatterplot of the number of orders (sales$num_of_orders) against the annual sales (sales$sales_total). The $ is used to reference a specific column in the dataset sales. The resulting plot is shown in Figure 3-1.

# plot num_of_orders vs. sales
plot(sales$num_of_orders, sales$sales_total,
     main="Number of Orders vs. Sales")
FIGURE 3-1 Graphically examining the data (scatterplot of number of orders vs. total sales)

Each point corresponds to the number of orders and the total sales for each customer. The plot indicates that the annual sales are proportional to the number of orders placed. Although the observed relationship between these two variables is not purely linear, the analyst decided to apply linear regression using the lm() function as a first step in the modeling process.

results <- lm(sales$sales_total ~ sales$num_of_orders)
results

Call:
lm(formula = sales$sales_total ~ sales$num_of_orders)

Coefficients:
        (Intercept)  sales$num_of_orders
             -154.1                166.2

The resulting intercept and slope values are -154.1 and 166.2, respectively, for the fitted linear equation. However, results stores considerably more information that can be examined with the summary() function. Details on the contents of results are examined by applying the attributes() function. Because regression analysis is presented in more detail later in the book, the reader should not overly focus on interpreting the following output.

summary(results)

Call:
lm(formula = sales$sales_total ~ sales$num_of_orders)

Residuals:
    Min      1Q  Median      3Q     Max
 -666.5  -125.5   -26.7    86.6  4103.4

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)
(Intercept)          -154.128      4.129  -37.33   <2e-16 ***
sales$num_of_orders   166.221      1.462  112.66   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 210.1 on 9998 degrees of freedom
Multiple R-squared: 0.5617,  Adjusted R-squared: 0.5616

The summary() function is an example of a generic function. A generic function is a group of functions sharing the same name but behaving differently depending on the number and the type of arguments they receive. Utilized previously, plot() is another example of a generic function; the plot is determined by the passed variables. Generic functions are used throughout this chapter and the book. In the final portion of the example, the following R code uses the generic function hist() to generate a histogram (Figure 3-2) of the residuals stored in results. The function call illustrates that optional parameter values can be passed. In this case, the number of breaks is specified to observe the large residuals.

# perform some diagnostics on the fitted model
# plot histogram of the residuals
hist(results$residuals, breaks = 800)

FIGURE 3-2 Evidence of large residuals (histogram of results$residuals)

This simple example illustrates a few of the basic model planning and building tasks that may occur
in Phases 3 and 4 of the Data Analytics Lifecycle. Throughout this chapter, it is useful to envision how the
presented R functionality will be used in a more comprehensive analysis.

3.1.1 R Graphical User Interfaces


R software uses a command-line interface (CLI) that is similar to the BASH shell in Li nux or the interactive
versionsof scripting languages such as Python. UNIX and Linux users can enter command Rat the terminal
prompt to use the CU. For Windows installations, Rcomes with RGui.exe, which provides a basic graphica l
user interface (GUI). However, to im prove the ease of writing, executing, and debugging Rcode, several
additional GUis have been written for R. Popular GUis include the Rcommander [3]. Rattle [4], and RStudio
[5). This section presents a brief overview of RStudio, which was used to build the Rexamples in th is book.
Figure 3-3 provides a screenshot of the previous Rcode example executed in RStudio.
FIGURE 3-3 RStudio GUI (showing the Scripts, Workspace, Plots, and Console panes)

The four highlighted window panes follow.

• Scripts: Serves as an area to write and save R code
• Workspace: Lists the datasets and variables in the R environment
• Plots: Displays the plots generated by the R code and provides a straightforward mechanism to export the plots
• Console: Provides a history of the executed R code and the output

Additionally, the console pane can be used to obtain help information on R. Figure 3-4 illustrates that by entering ?lm at the console prompt, the help details of the lm() function are provided on the right. Alternatively, help(lm) could have been entered at the console prompt.
Functions such as edit() and fix() allow the user to update the contents of an R variable. Alternatively, such changes can be implemented with RStudio by selecting the appropriate variable from the workspace pane.
R allows one to save the workspace environment, including variables and loaded libraries, into an .Rdata file using the save.image() function. An existing .Rdata file can be loaded using the load() function. Tools such as RStudio prompt the user as to whether the developer wants to save the workspace contents prior to exiting the GUI.
The reader is encouraged to install R and a preferred GUI to try out the R examples provided in the book and utilize the help functionality to access more details about the discussed topics.
FIGURE 3-4 Accessing help in RStudio (the ?lm help page, "Fitting Linear Models", shown beside the console)

3.1.2 Data Import and Export


In the annual retail sales example, the dataset was imported into R using the read . csv () function as
in the following code.

sales <- read . csv("c : /data/yearly_ sales . csv" )

R uses a forward slash (/) as the separator character in the directory and file paths. This convention
makes script files somewhat more portable at the expense of some initial confusion on the part of Windows
users, who may be accustomed to using a backslash (\) as a separator. To simplify the import of multiple files
with long path names, the setwd() function can be used to set the working directory for the subsequent
import and export operations, as shown in the following R code.

setwd ( "c: / data/ ")


sales <- read.csv("yearly_sales.csv")

Other import functions include read.table() and read.delim(), which are intended to import
other common file types such as TXT. These functions can also be used to import the yearly_sales
.csv file, as the following code illustrates.

sales_table <- read.table("yearly_sales.csv", header=TRUE, sep=",")


sales_delim <- read.delim("yearly_sales.csv", sep=",")

The main difference between these import functions is the default values. For example, the
read.delim() function expects the column separator to be a tab ("\t"). In the event that the numerical data
in a data file uses a comma for the decimal, R also provides two additional functions-read.csv2() and
read.delim2()-to import such data. Table 3-1 includes the expected defaults for headers, column
separators, and decimal point notations.

TABLE 3-1 Import Function Defaults

Function         Headers   Separator   Decimal Point
read.table()     FALSE     ""          "."
read.csv()       TRUE      ","         "."
read.csv2()      TRUE      ";"         ","
read.delim()     TRUE      "\t"        "."
read.delim2()    TRUE      "\t"        ","
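For instance, a file that uses semicolons as column separators and commas as decimal marks (a common convention in some European locales) could be imported with read.csv2(); the file name yearly_sales_eu.csv is hypothetical and used only to illustrate the defaults.

# semicolon-separated file with comma decimal marks (hypothetical file)
sales_eu <- read.csv2("yearly_sales_eu.csv")

# a roughly equivalent call that states the defaults explicitly
sales_eu <- read.table("yearly_sales_eu.csv", header=TRUE, sep=";", dec=",")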


The analogous R functions such as write.table(), write.csv(), and write.csv2() enable
exporting of R datasets to an external file. For example, the following R code adds an additional column
to the sales dataset and exports the modified dataset to an external file.

# add a column for the average sales per order
sales$per_order <- sales$sales_total/sales$num_of_orders

# export data as tab-delimited without the row names
write.table(sales, "sales_modified.txt", sep="\t", row.names=FALSE)

Sometimes it is necessary to read data from a database management system (DBMS). R packages such
as DBI [6] and RODBC [7] are available for this purpose. These packages provide database interfaces
for communication between R and DBMSs such as MySQL, Oracle, SQL Server, PostgreSQL, and Pivotal
Greenplum. The following R code demonstrates how to install the RODBC package with the install
.packages() function. The library() function loads the package into the R workspace. Finally, a
connector (conn) is initialized for connecting to a Pivotal Greenplum database training2 via open
database connectivity (ODBC) with user user. The training2 database must be defined either in the
/etc/ODBC.ini configuration file or using the Administrative Tools under the Windows Control Panel.
install.packages("RODBC")
library(RODBC)
conn <- odbcConnect("training2", uid="user", pwd="password")

The connector needs to be present to submit a SQL query to an ODBC database by using the
sqlQuery() function from the RODBC package. The following R code retrieves specific columns from
the housing table in which household income (hinc) is greater than $1,000,000.

housing_data <- sqlQuery(conn, "select serialno, state, persons, rooms
    from housing
    where hinc > 1000000")
head(housing_data)

Although plots can be saved using the RStudio GUI, plots can also be saved using R code by specifying
the appropriate graphic devices. Using the jpeg() function, the following R code creates a new JPEG
file, adds a histogram plot to the file, and then closes the file. Such techniques are useful when automating
standard reports. Other functions, such as png(), bmp(), pdf(), and postscript(), are available
in R to save plots in the desired format.

jpeg(file="c:/data/sales_hist.jpeg")   # create a new jpeg file
hist(sales$num_of_orders)              # export histogram to jpeg
dev.off()                              # shut off the graphic device
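The same pattern applies to the other graphic devices; for example, the following sketch writes the same histogram to a PDF file (the file name is arbitrary).

pdf(file="c:/data/sales_hist.pdf")   # open a PDF graphic device
hist(sales$num_of_orders)            # write the histogram to the PDF file
dev.off()                            # close the device and finalize the file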

More information on data imports and exports can be found at http://cran.r-project.org/
doc/manuals/r-release/R-data.html, such as how to import datasets from statistical software
packages including Minitab, SAS, and SPSS.

3.1.3 Attribute and Data Types


In the earlier example, the sales variable contained a record for each customer. Several characteristics,
such as total annual sales, number of orders, and gender, were provided for each customer. In general,
these characteristics or attributes provide the qualitative and quantitative measures for each item or subject
of interest. Attributes can be categorized into four types: nominal, ordinal, interval, and ratio (NOIR) [8].
Table 3-2 distinguishes these four attribute types and shows the operations they support. Nominal and
ordinal attributes are considered categorical attributes, whereas interval and ratio attributes are considered
numeric attributes.

TABLE 3-2 NOIR Attribute Types

Nominal (Categorical/Qualitative)
  Definition: The values represent labels that distinguish one from another.
  Examples: ZIP codes, nationality, street names, gender, employee ID numbers, TRUE or FALSE
  Operations: =, ≠

Ordinal (Categorical/Qualitative)
  Definition: Attributes imply a sequence.
  Examples: Quality of diamonds, academic grades, magnitude of earthquakes
  Operations: =, ≠, <, ≤, >, ≥

Interval (Numeric/Quantitative)
  Definition: The difference between two values is meaningful.
  Examples: Temperature in Celsius or Fahrenheit, calendar dates, latitudes
  Operations: =, ≠, <, ≤, >, ≥, +, -

Ratio (Numeric/Quantitative)
  Definition: Both the difference and the ratio of two values are meaningful.
  Examples: Age, temperature in Kelvin, counts, length, weight
  Operations: =, ≠, <, ≤, >, ≥, +, -, ×, ÷

Data of one attribute type may be converted to another. For example, the quality of diamonds {Fair,
Good, Very Good, Premium, Ideal} is considered ordinal but can be converted to nominal {Good, Excellent}
with a defined mapping. Similarly, a ratio attribute like Age can be converted into an ordinal attribute such
as {Infant, Adolescent, Adult, Senior}. Understanding the attribute types in a given dataset is important
to ensure that the appropriate descriptive statistics and analytic methods are applied and properly
interpreted. For example, the mean and standard deviation of U.S. postal ZIP codes are not very meaningful or
appropriate. Proper handling of categorical variables will be addressed in subsequent chapters. Also, it is
useful to consider these attribute types during the following discussion on R data types.
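As a brief illustration of such a conversion, the following sketch bins a hypothetical vector of ages into an ordered factor using the cut() function; the age values and cut points are made up for the example.

# hypothetical ages in years
age <- c(1, 14, 35, 70, 42, 8)

# convert the ratio attribute into an ordered (ordinal) attribute
age_group <- cut(age, breaks=c(0, 2, 17, 64, Inf),
                 labels=c("Infant", "Adolescent", "Adult", "Senior"),
                 ordered_result=TRUE)
age_group   # returns Infant Adolescent Adult Senior Adult Adolescent
            # Levels: Infant < Adolescent < Adult < Senior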

Numeric, Character, and Logical Data Types


Like other programming languages, R supports the use of numeric, character, and logical (Boolean) values.
Examples of such variables are given in the following R code.
i <- 1 # create a numeric variable
sport <- "football" # create a character variable
flag <- TRUE # create a logical variable

R provides several functions, such as class() and typeof(), to examine the characteristics of a
given variable. The class() function represents the abstract class of an object. The typeof() function
determines the way an object is stored in memory. Although i appears to be an integer, i is internally
stored using double precision. To improve the readability of the code segments in this section, the inline
R comments are used to explain the code or to provide the returned values.
class(i) # returns "numeric"
typeof(i) # returns "double"

class(sport) # returns "character"


typeof(sport) # returns "character"

class(flag)   # returns "logical"
typeof(flag)  # returns "logical"

Additional R functions exist that can test the variables and coerce a variable into a specific type. The
following R code illustrates how to test if i is an integer using the is.integer() function and to coerce
i into a new integer variable, j, using the as.integer() function. Similar functions can be applied
for double, character, and logical types.
is.integer(i) # returns FALSE
j <- as.integer(i) # coerces contents of i into an integer
is.integer(j) # returns TRUE

The application of the length() function reveals that the created variables each have a length of 1.
One might have expected the returned length of sport to have been 8 for each of the characters in the
string "football". However, these three variables are actually one-element vectors.
length(i)      # returns 1
length(flag)   # returns 1
length(sport)  # returns 1 (not 8 for "football")

Vectors
Vectors are a basic building block for data in R. As seen previously, simple R variables are actually vectors.
A vector can only consist of values in the same class. The tests for vectors can be conducted using the
is.vector() function.
is.vector(i)      # returns TRUE
is.vector(flag)   # returns TRUE
is.vector(sport)  # returns TRUE

R provides functionality that enables the easy creation and manipulation of vectors. The following R
code illustrates how a vector can be created using the combine function, c(), or the colon operator, :,
to build a vector from the sequence of integers from 1 to 5. Furthermore, the code shows how the values
of an existing vector can be easily modified or accessed. The code, related to the z vector, indicates how
logical comparisons can be built to extract certain elements of a given vector.
u <- c("red", "yellow", "blue") " create a vector "red" "yello•d" "blue"
u ±; t·eturns "red" "yellow'' "blue"
u[l] returns "red" 1st element in u)
v <- 1:5 # create a vector 1 2 3 4 5
v # returns 1 2 3 4 5
sum(v) It returns 15
w <- v * 2 It create a vector 2 4 6 8 10
w # returns 2 4 6 8 10
w[3] returns 6 (the 3rd element of w)
z <- v + w # sums two vectors element by element
z # returns 6 9 12 15
z > 8 # returns FALSE FALSE TRUE TRUE TRUE
z [z > 8] # returns 9 12 15
z[z > 8 I z < 5] returns 9 12 15 ("!"denotes "or")

Sometimes it is necessary to initialize a vector of a specific length and then populate the content of
the vector later. The vector() function, by default, creates a logical vector. A vector of a different type
can be specified by using the mode parameter. The vector c, an integer vector of length 0, may be useful
when the number of elements is not initially known and the new elements will later be added to the end
of the vector as the values become available, as shown after the following code.
a <- vector(length=3)            # create a logical vector of length 3
a                                # returns FALSE FALSE FALSE
b <- vector(mode="numeric", 3)   # create a numeric vector of length 3
typeof(b)                        # returns "double"
b[2] <- 3.1                      # assign 3.1 to the 2nd element
b                                # returns 0.0 3.1 0.0
c <- vector(mode="integer", 0)   # create an integer vector of length 0
c                                # returns integer(0)
length(c)                        # returns 0
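For instance, new elements can be appended to the empty integer vector c with the combine function as values become available; the appended values below are arbitrary.

c <- c(c, 5L)    # append the integer 5 to the end of c
c <- c(c, 9L)    # append another value
c                # returns 5 9
length(c)        # returns 2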

Although vectors may appear to be analogous to arrays of one dimension, they are technically
dimensionless, as seen in the following R code. The concept of arrays and matrices is addressed in the following
discussion.

length(b)   # returns 3
dim(b)      # returns NULL (an undefined value)

Arrays and Matrices


The array() function can be used to restructure a vector as an array. For example, the following R code
builds a three-dimensional array to hold the quarterly sales for three regions over a two-year period and
then assigns the sales amount of $158,000 to the second region for the first quarter of the first year.

# the dimensions are 3 regions, 4 quarters, and 2 years


quarterly_sales <- array(0, dim=c(3,4,2))
quarterly_sales[2,1,1] <- 158000
quarterly_sales

, , 1

       [,1] [,2] [,3] [,4]
[1,]      0    0    0    0
[2,] 158000    0    0    0
[3,]      0    0    0    0

, , 2

     [,1] [,2] [,3] [,4]
[1,]    0    0    0    0
[2,]    0    0    0    0
[3,]    0    0    0    0

A two-dimensional array is known as a matrix. The following code initializes a matrix to hold the quarterly
sales for the three regions. The parameters nrow and ncol define the number of rows and columns,
respectively, for the sales_matrix.

sales_matrix <- matrix(0, nrow = 3, ncol = 4)

sales_matrix

     [,1] [,2] [,3] [,4]
[1,]    0    0    0    0
[2,]    0    0    0    0
[3,]    0    0    0    0

R provides the standard matrix operations such as addition, subtraction, and multiplication, as well
as the transpose function t() and the inverse matrix function matrix.inverse() included in the
matrixcalc package. The following R code builds a 3 x 3 matrix, M, and multiplies it by its inverse to
obtain the identity matrix.

library(matrixcalc)
M <- matrix(c(1,3,3,5,0,4,3,3,3), nrow = 3, ncol = 3)   # build a 3x3 matrix

M %*% matrix.inverse(M)                                 # multiply M by inverse(M)

     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    1    0
[3,]    0    0    1
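The other matrix operations mentioned above use the usual R operators; the following short sketch, reusing the matrix M, shows the transpose and the difference between element-wise and true matrix multiplication.

t(M)       # transpose of M
M + M      # element-wise addition
M * M      # element-wise multiplication
M %*% M    # matrix multiplication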

Data Frames
Similar to the concept of matrices, data frames provide a structure for storing and accessing several variables
of possibly different data types. In fact, as the is.data.frame() function indicates, a data frame was
created by the read.csv() function at the beginning of the chapter.

# import a CSV file of the total annual sales for each customer
sales <- read.csv("c:/data/yearly_sales.csv")
is.data.frame(sales)   # returns TRUE

As seen earlier, the variables stored in the data frame can be easily accessed using the $ notation. The
following R code illustrates that in this example, each variable is a vector with the exception of gender,
which was, by a read.csv() default, imported as a factor. Discussed in detail later in this section, a factor
denotes a categorical variable, typically with a few finite levels such as "F" and "M" in the case of gender.

length(sales$num_of_orders)      # returns 10000 (number of customers)

is.vector(sales$cust_id)         # returns TRUE
is.vector(sales$sales_total)     # returns TRUE
is.vector(sales$num_of_orders)   # returns TRUE
is.vector(sales$gender)          # returns FALSE

is.factor(sales$gender)          # returns TRUE

Because of their flexibility to handle many data types, data frames are the preferred input format for
many of the modeling functions available in R. The following use of the str() function provides the
structure of the sales data frame. This function identifies the integer and numeric (double) data types,
the factor variables and levels, as well as the first few values for each variable.

str(sales)   # display structure of the data frame object

'data.frame':  10000 obs. of 4 variables:
 $ cust_id       : int  100001 100002 100003 100004 100005 100006 ...
 $ sales_total   : num  800.6 217.5 74.6 498.6 723.1 ...
 $ num_of_orders : int  3 3 2 3 4 2 2 2 2 2 ...
 $ gender        : Factor w/ 2 levels "F","M": 1 1 2 2 1 1 2 2 1 2 ...

In the simplest sense, data frames are lists of variables of the same length. A subset of the data frame
can be retrieved through subsetting operators. R's subsetting operators are powerful in that they allow
one to express complex operations in a succinct fashion and easily retrieve a subset of the dataset.

# extract the fourth column of the sales data frame
sales[, 4]
# extract the gender column of the sales data frame

sales$gender
# retrieve the first two rows of the data frame
sales[1:2,]
# retrieve the first, third, and fourth columns
sales[,c(1,3,4)]
# retrieve both the cust_id and the sales_total columns
sales[,c("cust_id", "sales_total")]
# retrieve all the records whose gender is female
sales[sales$gender=="F",]

The following R code shows that the class of the sales variable is a data frame. However, the type of
the sales variable is a list. A list is a collection of objects that can be of various types, including other lists.
class(sales)
"data.frame"
typeof(sales)
"list"

Lists
Lists can contain any type of objects, including other lists. Using the vector v and the matrix M created in
earlier examples, the following R code creates assortment, a list of different object types.
# build an assorted list of a string, a numeric, a list, a vector,
# and a matrix
housing <- list("own", "rent")
assortment <- list("football", 7.5, housing, v, M)
assortment

[[1]]
[1] "football"

[[2]]
[1] 7.5

[[3]]
[[3]][[1]]
[1] "own"

[[3]][[2]]
[1] "rent"

[[4]]
[1] 1 2 3 4 5

[[5]]
     [,1] [,2] [,3]
[1,]    1    5    3
[2,]    3    0    3
[3,]    3    4    3

In displaying the contents of assortment, the use of the double brackets, [[]], is of particular
importance. As the following R code illustrates, the use of the single set of brackets only accesses an item
in the list, not its content.
# examine the fifth object, M, in the list
class(assortment[5])       # returns "list"
length(assortment[5])      # returns 1

class(assortment[[5]])     # returns "matrix"
length(assortment[[5]])    # returns 9 (for the 3x3 matrix)

As presented earlier in the data frame discussion, the str() function offers details about the structure
of a list.
str(assortment)
List of 5
 $ : chr "football"
 $ : num 7.5
 $ :List of 2
  ..$ : chr "own"
  ..$ : chr "rent"
 $ : int [1:5] 1 2 3 4 5
 $ : num [1:3, 1:3] 1 3 3 5 0 4 3 3 3

Factors
Factors were briefly introduced during the discussion of the gender variable in the data frame sales.
In this case, gender could assume one of two levels: F or M. Factors can be ordered or not ordered. In the
case of gender, the levels are not ordered.
class(sales$gender) # returns "factor"
is.ordered(sales$gender) # returns FALSE

Included with the ggplot2 package, the diamonds data frame contains three ordered factors.
Examining the cut factor, there are five levels in order of improving cut: Fair, Good, Very Good, Premium,
and Ideal. Thus, sales$gender contains nominal data, and diamonds$cut contains ordinal data.
head(sales$gender) # display first six values and the levels

F F M M F F
Levels: F M

library(ggplot2)
data(diamonds) # load the data frame into the R workspace

str(diamonds)
'data.frame':  53940 obs. of 10 variables:
 $ carat  : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 ...
 $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 ...
 $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 ...
 $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 ...
 $ depth  : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
 $ table  : num  55 61 65 58 58 57 57 55 61 61 ...
 $ price  : int  326 326 327 334 335 336 336 337 337 338 ...
 $ x      : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
 $ y      : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
 $ z      : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...

head(diamonds$cut) # display first six values and the levels


Ideal Premium Good Premium Good Very Good
Levels: Fair < Good < Very Good < Premium < Ideal

Suppose it is decided to categorize sales$sales_total into three groups-small, medium,
and big-according to the amount of the sales with the following code. These groupings are the basis for
the new ordinal factor, spender, with levels {small, medium, big}.
# build an empty character vector of the same length as sales
sales_group <- vector(mode="character",
                      length=length(sales$sales_total))

# group the customers according to the sales amount


sales_group[sales$sales_total<100] <- "small"
sales_group[sales$sales_total>=100 & sales$sales_total<500] <- "medium"
sales_group[sales$sales_total>=500] <- "big"

# create and add the ordered factor to the sales data frame
spender<- factor(sales_group,levels=c("small", "medium", "big"),
ordered = TRUE)
sales <- cbind(sales,spender)

str(sales$spender)
Ord.factor w/ 3 levels "small"<"medium"<..: 3 2 1 2 3 1 1 1 2 1 ...

head(sales$spender)
big medium small medium big small
Levels: small < medium < big

The cbind() function is used to combine variables column-wise. The rbind() function is used
to combine datasets row-wise. The use of factors is important in several R statistical modeling functions,
such as analysis of variance, aov(), presented later in this chapter, and the use of contingency tables,
discussed next.
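As a minimal illustration of the difference between the two functions, the following sketch combines two small vectors column-wise and row-wise; the vectors are made up for the example.

a <- c(1, 2, 3)
b <- c(4, 5, 6)

cbind(a, b)   # 3x2 matrix: a and b become columns
rbind(a, b)   # 2x3 matrix: a and b become rows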

Contingency Tables
In R, table refers to a class of objects used to store the observed counts across the factors for a given dataset.
Such a table is commonly referred to as a contingency table and is the basis for performing a statistical
test on the independence of the factors used to build the table. The following R code builds a contingency
table based on the sales$gender and sales$spender factors.
# build a contingency table based on the gender and spender factors
sales_table <- table(sales$gender, sales$spender)
sales_table
small medium big
F 1726 2746 563
M 1656 2723 586

class(sales_table)    # returns "table"
typeof(sales_table)   # returns "integer"
dim(sales_table)      # returns 2 3

# performs a chi-squared test


summary(sales_table)
Number of cases in table: 10000
Number of factors: 2
Test for independence of all factors:
Chisq = 1.516, df = 2, p-value = 0.4686

Based on the observed counts in the table, the summary() function performs a chi-squared test
on the independence of the two factors. Because the reported p-value is greater than 0.05, the assumed
independence of the two factors is not rejected. Hypothesis testing and p-values are covered in more detail
later in this chapter. Next, applying descriptive statistics in R is examined.

3.1.4 Descriptive Statistics


It has already been shown that the summary() function provides several descriptive statistics, such as
the mean and median, about a variable such as the sales data frame. The results now include the counts
for the three levels of the spender variable based on the earlier examples involving factors.
summary(sales)
    cust_id        sales_total      num_of_orders    gender    spender
 Min.   :100001   Min.   :  30.02   Min.   : 1.000   F:5035   small :3382
 1st Qu.:102501   1st Qu.:  80.29   1st Qu.: 2.000   M:4965   medium:5469
 Median :105001   Median : 151.65   Median : 2.000            big   :1149
 Mean   :105001   Mean   : 249.46   Mean   : 2.428
 3rd Qu.:107500   3rd Qu.: 295.50   3rd Qu.: 3.000
 Max.   :110000   Max.   :7606.09   Max.   :22.000

The following code provides some common R functions that include descriptive statistics. In parentheses,
the comments describe the functions.
# to simplify the function calls, assign
x <- sales$sales_total
y <- sales$num_of_orders

cor(x,y)      # returns 0.7508015 (correlation)
cov(x,y)      # returns 345.2111 (covariance)
IQR(x)        # returns 215.21 (interquartile range)
mean(x)       # returns 249.4577 (mean)
median(x)     # returns 151.65 (median)
range(x)      # returns 30.02 7606.09 (min max)
sd(x)         # returns 319.0508 (std. dev.)
var(x)        # returns 101793.4 (variance)

The IQR() function provides the difference between the third and the first quartiles. The other functions
are fairly self-explanatory by their names. The reader is encouraged to review the available help files
for acceptable inputs and possible options.
The function apply() is useful when the same function is to be applied to several variables in a data
frame. For example, the following R code calculates the standard deviation for the first three variables in
sales. In the code, setting MARGIN=2 specifies that the sd() function is applied over the columns.
Other functions, such as lapply() and sapply(), apply a function to a list or vector; a brief sketch
follows the apply() example below. Readers can refer to the R help files to learn how to use these functions.

apply(sales[,c(1:3)], MARGIN=2, FUN=sd)
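As a short sketch of the related functions, the following code applies sd() to the same three columns with lapply() and sapply(); lapply() returns a list, whereas sapply() simplifies the result to a named vector when possible.

# lapply() returns a list with one standard deviation per column
lapply(sales[,c(1:3)], FUN=sd)

# sapply() simplifies the result into a named numeric vector
sapply(sales[,c(1:3)], FUN=sd)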

Additional descriptive statistics can be applied with user-defined functions. The following R code
defines a function, my_range(), to compute the difference between the maximum and minimum values
returned by the range() function. In general, user-defined functions are useful for any task or operation
that needs to be frequently repeated. More information on user-defined functions is available by entering
help("function") in the console.
# build a function to provide the difference between
# the maximum and the minimum values
my_range <- function(v) {range(v)[2] - range(v)[1]}
my_range(x)

3.2 Exploratory Data Analysis


So far, this chapter has addressed importing and exporting data in R, basic data types and operations, and
generating descriptive statistics. Functions such as summary() can help analysts easily get an idea of
the magnitude and range of the data, but other aspects such as linear relationships and distributions are
more difficult to see from descriptive statistics. For example, the following code shows a summary view of
a data frame data with two columns x and y. The output shows the range of x and y, but it's not clear
what the relationship may be between these two variables.

summary(data)
(summary output listing the Min., 1st Qu., Median, Mean, 3rd Qu., and Max. of x and y; the exact values vary because data is generated from random draws in the code that follows)

A useful way to detect patterns and anomalies in the data is through exploratory data analysis with
visualization. Visualization gives a succinct, holistic view of the data that may be difficult to grasp from the
numbers and summaries alone. Variables x and y of the data frame data can instead be visualized in a
scatterplot (Figure 3-5), which easily depicts the relationship between two variables. An important facet
of the initial data exploration, visualization assesses data cleanliness and suggests potentially important
relationships in the data prior to the model planning and building phases.

[Figure 3-5: a scatterplot titled "Scatterplot of X and Y" with x on the horizontal axis and y on the vertical axis.]

FIGURE 3-5 A scatterplot can easily show if x and y share a relation

The code to generate data as well as Figure 3-5 is shown next.

x <- rnorm(50)

y <- x + rnorm(50, mean=0, sd=0.5)

data <- as.data.frame(cbind(x, y))



summary(data)

library(ggplot2)
ggplot(data, aes(x=x, y=y)) +
  geom_point(size=2) +
  ggtitle("Scatterplot of X and Y") +
  theme(axis.text=element_text(size=12),
        axis.title = element_text(size=14),
        plot.title = element_text(size=20, face="bold"))

Exploratory data analysis [9] is a data analysis approach to reveal the important characteristics of a
dataset, mainly through visualization. This section discusses how to use some basic visualization techniques
and the plotting feature in R to perform exploratory data analysis.

3.2.1 Visualization Before Analysis


To illustrate the importance of visualizing data, consider Anscombe's quartet. Anscombe's quartet consists
of four datasets, as shown in Figure 3-6. It was constructed by statistician Francis Anscombe [10] in 1973
to demonstrate the importance of graphs in statistical analyses.

       #1          #2          #3          #4
  x    y       x    y      x    y       x    y
  4    4.26    4    3.10   4    5.39    8    5.25
  5    5.68    5    4.74   5    5.73    8    5.56
  6    7.24    6    6.13   6    6.08    8    5.76
  7    4.82    7    7.26   7    6.42    8    6.58
  8    6.95    8    8.14   8    6.77    8    6.89
  9    8.81    9    8.77   9    7.11    8    7.04
 10    8.04   10    9.14  10    7.46    8    7.71
 11    8.33   11    9.26  11    7.81    8    7.91
 12   10.84   12    9.13  12    8.15    8    8.47
 13    7.58   13    8.74  13   12.74    8    8.84
 14    9.96   14    8.10  14    8.84   19   12.50

FIGURE 3-6 Anscombe's quartet

The four datasets in Anscombe's quartet have nearly identical statistical properties, as shown in Table 3-3.

TABLE 3-3 Statistical Properties of Anscombe's Quartet

Statistical Property              Value
Mean of x                         9
Variance of x                     11
Mean of y                         7.50 (to 2 decimal places)
Variance of y                     4.12 or 4.13 (to 2 decimal places)
Correlation between x and y       0.816
Linear regression line            y = 3.00 + 0.50x (to 2 decimal places)

Based on the nearly identical statistical properties across each dataset, one might conclude that these
four datasets are quite similar. However, the scatterplots in Figure 3-7 tell a different story. Each dataset is
plotted as a scatterplot, and the fitted lines are the result of applying linear regression models. The estimated
regression line fits Dataset 1 reasonably well. Dataset 2 is definitely nonlinear. Dataset 3 exhibits a linear
trend, with one apparent outlier at x = 13. For Dataset 4, the regression line fits the dataset quite well.
However, with only points at two x values, it is not possible to determine that the linearity assumption is
proper.

[Figure 3-7: four scatterplots, one per dataset in Anscombe's quartet, each with its fitted linear regression line.]

FIGURE 3-7 Anscombe's quartet visualized as scatterplots

The R code for generating Figure 3-7 is shown next. It requires the R package ggplot2 [11], which can
be installed simply by running the command install.packages("ggplot2"). The anscombe
dataset for the plot is included in the standard R distribution. Enter data() for a list of datasets included
in the R base distribution. Enter data(DatasetName) to make a dataset available in the current
workspace.
In the code that follows, variable levels is created using the gl() function, which generates
factors of four levels (1, 2, 3, and 4), each repeating 11 times. Variable mydata is created using the
with(data, expression) function, which evaluates an expression in an environment constructed
from data. In this example, the data is the anscombe dataset, which includes eight attributes:
x1, x2, x3, x4, y1, y2, y3, and y4. The expression part in the code creates a data frame from the
anscombe dataset, and it only includes three attributes: x, y, and the group each data point belongs
to (mygroup).
install.packages(''ggplot2") # not required i f package has been installed

data (anscombe) It load the anscombe dataset into the current \'iOrkspace
anscombe
x1 x2 x3 x4 y1 y2 y3 y4
1 10 10 10 8 8. O·l 9.14 7.-16 6.58
2 8 8 8 8 6.95 8.14 6.77 5.76
13 13 13 7.58 8.74 12.74 7.71
4 9 9 8.81 8.77 7.11 8.84
5 11 11 11 8.33 9.26 7.81 8.·±7
6 14 14 14 8 9. 9G 8.10 8.34 7.04
7 6 6 6 8 7.24 6.13 6. •J8 5.25
8 4 4 4 19 ·l. 26 3.10 5. 3 9 12.50
9 12 12 12 8 10. 8•1 9.13 8.15 5.56
10 7 7 7 8 4.82 7.26 6.-12 7.91
11 5 5 5 8 5.68 4.74 5.73 6.89

nrow(anscombe)   # number of rows
[1] 11

# generate levels to indicate which group each data point belongs to
levels <- gl(4, nrow(anscombe))
levels
 [1] 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3
[34] 4 4 4 4 4 4 4 4 4 4 4
Levels: 1 2 3 4

# group anscombe into a data frame
mydata <- with(anscombe, data.frame(x=c(x1,x2,x3,x4), y=c(y1,y2,y3,y4),
                                    mygroup=levels))

mydata
    x    y mygroup
1  10 8.04       1
2   8 6.95       1
3  13 7.58       1
4   9 8.81       1
...
42  8 5.56       4
43  8 7.91       4
44  8 6.89       4

# make scatterplots using the ggplot2 package
library(ggplot2)
theme_set(theme_bw())   # set plot color theme

# create the four plots of Figure 3-7
ggplot(mydata, aes(x, y)) +
  geom_point(size=4) +
  geom_smooth(method="lm", fill=NA, fullrange=TRUE) +
  facet_wrap(~mygroup)

3.2.2 Dirty Data


This section addresses how dirty data can be detected in the data exploration phase with visualizations. In
general, analysts should look for anomalies, verify the data with domain knowledge, and decide the most
appropriate approach to clean the data.
Consider a scenario in which a bank is conducting data analyses of its account holders to gauge customer
retention. Figure 3-8 shows the age distribution of the account holders.

[Figure 3-8: a histogram of account holder ages, with Age on the horizontal axis and Frequency on the vertical axis.]

FIGURE 3-8 Age distribution of bank account holders

If the age data is in a vector called age, the graph can be created with the following R script:

hist(age, breaks=100, main="Age Distribution of Account Holders",
     xlab="Age", ylab="Frequency", col="gray")

The figure shows that the median age of the account holders is around 40. A few accounts with account
holder age less than 10 are unusual but plausible. These could be custodial accounts or college savings
accounts set up by the parents of young children. These accounts should be retained for future analyses.

However, the left side of the graph shows a huge spike of customers who are zero years old or have
negative ages. This is likely to be evidence of missing data. One possible explanation is that the null age
values could have been replaced by 0 or negative values during the data input. Such an occurrence may
be caused by entering age in a text box that only allows numbers and does not accept empty values. Or it
might be caused by transferring data among several systems that have different definitions for null values
(such as NULL, NA, 0, -1, or -2). Therefore, data cleansing needs to be performed over the accounts with
abnormal age values. Analysts should take a closer look at the records to decide if the missing data should
be eliminated or if an appropriate age value can be determined using other available information for each
of the accounts.
In R, the is.na() function provides tests for missing values. The following example creates a vector
x where the fourth value is not available (NA). The is.na() function returns TRUE at each NA value
and FALSE otherwise.
x <- c(1, 2, 3, NA, 4)
is.na(x)
[1] FALSE FALSE FALSE  TRUE FALSE

Some arithmetic functions, such as mean(), applied to data containing missing values can yield an
NA result. To prevent this, set the na.rm parameter to TRUE to remove the missing value during the
function's execution.
mean(x)
[1] NA
mean(x, na.rm=TRUE)
[1] 2.5

The na.exclude() function returns the object with incomplete cases removed.
DF <- data.frame(x = c(1, 2, 3), y = c(10, 20, NA))
DF
  x  y
1 1 10
2 2 20
3 3 NA

DF1 <- na.exclude(DF)
DF1
  x  y
1 1 10
2 2 20

Account holders older than 100 may be due to bad data caused by typos. Another possibility is that these
accounts may have been passed down to the heirs of the original account holders without being updated.
In this case, one needs to further examine the data and conduct data cleansing if necessary. The dirty data
could be simply removed or filtered out with an age threshold for future analyses. If removing records is
not an option, the analysts can look for patterns within the data and develop a set of heuristics to attack
the problem of dirty data. For example, wrong age values could be replaced with approximation based
on the nearest neighbor-the record that is the most similar to the record in question based on analyzing
the differences in all the other variables besides age.
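As a minimal sketch of such cleansing, assuming the ages are stored in a vector age, implausible values could be flagged as missing and then either dropped or imputed. The thresholds and the median imputation below are illustrative simplifications, not the nearest-neighbor approach itself.

# flag non-positive or implausibly large ages as missing
age[!is.na(age) & (age <= 0 | age > 100)] <- NA

# option 1: drop the records with missing ages
age_clean <- age[!is.na(age)]

# option 2: impute missing ages with the median of the valid values
# (a simple stand-in for a nearest-neighbor imputation)
age_imputed <- age
age_imputed[is.na(age_imputed)] <- median(age, na.rm=TRUE)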

Figure 3-9 presents another example of dirty data. The distribution shown here corresponds to the age
of mortgages in a bank's home loan portfolio. The mortgage age is calculated by subtracting the origina-
tion date of the loan from the current date. The vertical axis corresponds to the number of mortgages at
each mortgage age.

[Figure 3-9: a histogram titled "Portfolio Distribution, Years Since Origination" with Mortgage Age (0 to 10) on the horizontal axis and Frequency on the vertical axis.]

FIGURE 3-9 Distribution of mortgage in years since origination from a bank's home loan portfolio

If the data is in a vector called mortgage, Figure 3-9 can be produced by the following R script.

hist(mortgage, breaks=10, xlab="Mortgage Age", col="gray",
     main="Portfolio Distribution, Years Since Origination")

Figure 3-9 shows that the loans are no more than 10 years old, and these 10-year-old loans have a
disproportionate frequency compared to the rest of the population. One possible explanation is that the
10-year-old loans do not only include loans originated 10 years ago, but also those originated earlier than
that. In other words, the 10 in the x-axis actually means ≥ 10. This sometimes happens when data is ported
from one system to another or because the data provider decided, for some reason, not to distinguish loans
that are more than 10 years old. Analysts need to study the data further and decide the most appropriate
way to perform data cleansing.
Data analysts should perform sanity checks against domain knowledge and decide if the dirty data
needs to be eliminated. Consider the task to find out the probability of mortgage loan default. If the
past observations suggest that most defaults occur before about the 4th year and 10-year-old mortgages
rarely default, it may be safe to eliminate the dirty data and assume that the defaulted loans are less than
10 years old. For other analyses, it may become necessary to track down the source and find out the true
origination dates.
Dirty data can occur due to acts of omission. In the sales data used at the beginning of this chapter,
it was seen that the minimum number of orders was 1 and the minimum annual sales amount was $30.02.
Thus, there is a strong possibility that the provided dataset did not include the sales data on all customers,
just the customers who purchased something during the past year.

3.2.3 Visualizing a Single Variable


Using visual representations of data is a hallmark of exploratory data analyses: letting the data speak to
its audience rather than imposing an interpretation on the data a priori. Sections 3.2.3 and 3.2.4 examine
ways of displaying data to help explain the underlying distributions of a single variable or the relationships
of two or more variables.
R has many functions available to examine a single variable. Some of these functions are listed in
Table 3-4.

TABLE 3-4 Example Functions for Visualizing a Single Variable

Function                 Purpose
plot(data)               Scatterplot where x is the index and y is the value; suitable for low-volume data
barplot(data)            Barplot with vertical or horizontal bars
dotchart(data)           Cleveland dot plot [12]
hist(data)               Histogram
plot(density(data))      Density plot (a continuous histogram)
stem(data)               Stem-and-leaf plot
rug(data)                Add a rug representation (1-d plot) of the data to an existing plot

Dotchart and Barplot


Dotchart and barplot portray continuous values with labels from a discrete variable. A dotchart can be
created in R with the function dotchart(x, labels=...), where x is a numeric vector and labels
is a vector of categorical labels for x. A barplot can be created with the barplot(height) function,
where height represents a vector or matrix. Figure 3-10 shows (a) a dotchart and (b) a barplot based
on the mtcars dataset, which includes the fuel consumption and 10 aspects of automobile design and
performance of 32 automobiles. This dataset comes with the standard R distribution.
The plots in Figure 3-10 can be produced with the following R code.

data(mtcars)
dotchart(mtcars$mpg, labels=row.names(mtcars), cex=.7,
         main="Miles Per Gallon (MPG) of Car Models",
         xlab="MPG")
barplot(table(mtcars$cyl), main="Distribution of Car Cylinder Counts",
        xlab="Number of Cylinders")

Histogram and Density Plot


Figure 3-11(a) includes a histogram of household income. The histogram shows a clear concentration of
low household incomes on the left and the long tail of the higher incomes on the right.
[Figure 3-10: panel (a) is a Cleveland dotchart titled "Miles Per Gallon (MPG) of Car Models" with one point per car model and MPG on the horizontal axis; panel (b) is a barplot titled "Distribution of Car Cylinder Counts" with the number of cylinders on the horizontal axis.]

FIGURE 3-10 (a) Dotchart on the miles per gallon of cars and (b) Barplot on the distribution of car cylinder counts

[Figure 3-11: panel (a) is a histogram titled "Histogram of Income" with Income on the horizontal axis; panel (b) is a density plot titled "Distribution of Income (log10 scale)" with a rug along the horizontal axis.]

FIGURE 3-11 (a) Histogram and (b) Density plot of household income

Figure 3-11(b) shows a density plot of the logarithm of household income values, which emphasizes
the distribution. The income distribution is concentrated in the center portion of the graph. The code to
generate the two plots in Figure 3-11 is provided next. The rug() function creates a one-dimensional
density plot on the bottom of the graph to emphasize the distribution of the observation.
# randomly generate 4000 observations from the log normal distribution
income <- rlnorm(4000, meanlog = 4, sdlog = 0.7)
summary(income)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  4.301  33.720  54.970  70.320  88.800 659.800
income <- 1000*income
summary(income)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   4301   33720   54970   70320   88800  659800
# plot the histogram
hist(income, breaks=500, xlab="Income", main="Histogram of Income")
# density plot
plot(density(log10(income), adjust=0.5),
     main="Distribution of Income (log10 scale)")
# add rug to the density plot
rug(log10(income))

In the data preparation phase of the Data Analytics Lifecycle, the data range and distribution can be
obtained. If the data is skewed, viewing the logarithm of the data (if it's all positive) can help detect
structures that might otherwise be overlooked in a graph with a regular, nonlogarithmic scale.
When preparing the data, one should look for signs of dirty data, as explained in the previous section.
Examining if the data is unimodal or multimodal will give an idea of how many distinct populations with
different behavior patterns might be mixed into the overall population. Many modeling techniques assume
that the data follows a normal distribution. Therefore, it is important to know if the available dataset can
match that assumption before applying any of those modeling techniques.
Consider a density plot of diamond prices (in USD). Figure 3-12(a) contains two density plots for pre-
mium and ideal cuts of diamonds. The group of premium cuts is shown in red, and the group of ideal cuts
is shown in blue. The range of diamond prices is wide-in this case ranging from around $300 to almost
$20,000. Extreme values are typical of monetary data such as income, customer value, tax liabilities, and
bank account sizes.
Figure 3-12(b) shows more detail of the diamond prices than Figure 3-12(a) by taking the logarithm. The
two humps in the premium cut represent two distinct groups of diamond prices: One group centers around
log10 price= 2.9 (where the price is about $794), and the other centers around log 10 price= 3.7 (where the
price is about $5,012). The ideal cut contains three humps, centering around 2.9, 3.3, and 3.7 respectively.
The R script to generate the plots in Figure 3-12 is shown next. The diamonds dataset comes with
the ggplot2 package.
library("ggplot2")
data(diamonds) # load the diamonds dataset from ggplot2

# Only keep the premium and ideal cuts of diamonds

niceDiamonds <- diamonds[diamonds$cut=="Premium" |
                         diamonds$cut=="Ideal",]

summary(niceDiamonds$cut)
     Fair      Good Very Good   Premium     Ideal
        0         0         0     13791     21551

# plot density plot of diamond prices
ggplot(niceDiamonds, aes(x=price, fill=cut)) +
  geom_density(alpha = .3, color=NA)

# plot density plot of the log10 of diamond prices
ggplot(niceDiamonds, aes(x=log10(price), fill=cut)) +
  geom_density(alpha = .3, color=NA)

As an alternative to ggplot2, the lattice package provides a function called densityplot()
for making simple density plots.
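For instance, a rough equivalent of Figure 3-12(b) with lattice might look like the following sketch; the grouping and legend settings are illustrative choices.

library(lattice)

# density plot of the log10 of diamond prices, grouped by cut
densityplot(~log10(price), data=niceDiamonds, groups=cut,
            plot.points=FALSE, auto.key=TRUE)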

[Figure 3-12: two density plots with fill color distinguishing the Premium and Ideal cuts; panel (a) uses price on the horizontal axis, panel (b) uses log10(price).]

FIGURE 3-12 Density plots of (a) diamond prices and (b) the logarithm of diamond prices

3.2.4 Examining Multiple Variables


A scatterplot (shown previously in Figure 3-1 and Figure 3-5) is a simple and widely used visualization
for finding the relationship among multiple variables. A scatterplot can represent data with up to five
variables using x-axis, y-axis, size, color, and shape. But usually only two to four variables are portrayed
in a scatterplot to minimize confusion. When examining a scatterplot, one needs to pay close attention
to the possible relationship between the variables. If the functional relationship between the variables is
somewhat pronounced, the data may roughly lie along a straight line, a parabola, or an exponential curve.
If variable y is related exponentially to x, then the plot of x versus log(y) is approximately linear. If the
plot looks more like a cluster without a pattern, the corresponding variables may have a weak relationship.
The scatterplot in Figure 3-13 portrays the relationship of two variables: x and y. The red line shown
on the graph is the fitted line from the linear regression. Linear regression will be revisited in Chapter 6,
"Advanced Analytical Theory and Methods: Regression." Figure 3-13 shows that the regression line does
not fit the data well. This is a case in which linear regression cannot model the relationship between the
variables. Alternative methods such as the loess() function can be used to fit a nonlinear line to the
data. The blue curve shown on the graph represents the LOESS curve, which fits the data better than linear
regression.

[Figure 3-13: a scatterplot of x and y with a straight linear regression line and a curved LOESS fit overlaid.]

FIGURE 3-13 Examining two variables with regression

The R code to produce Figure 3-13 is as follows. The runif(75, 0, 10) generates 75 numbers
between 0 and 10 with random deviates, and the numbers conform to the uniform distribution. The
rnorm(75, 0, 20) generates 75 numbers that conform to the normal distribution, with the mean equal
to 0 and the standard deviation equal to 20. The points() function is a generic function that draws a
sequence of points at the specified coordinates. Parameter type="l" tells the function to draw a solid
line. The col parameter sets the color of the line, where 2 represents the red color and 4 represents the
blue color.

# 75 numbers between 0 and 10 of uniform distribution
x <- runif(75, 0, 10)
x <- sort(x)
y <- 200 + x^3 - 10 * x^2 + x + rnorm(75, 0, 20)

lr <- loess(y ~ x)      # LOESS fit assigned below; first the linear regression
lr <- lm(y ~ x)         # linear regression
poly <- loess(y ~ x)    # LOESS

fit <- predict(poly)    # fit a nonlinear line

plot(x, y)

# draw the fitted line for the linear regression
points(x, lr$coefficients[1] + lr$coefficients[2] * x,
       type = "l", col = 2)

points(x, fit, type = "l", col = 4)

Dotchart and Barplot


Dotchart and barplot from the previous section can visualize multiple variables. Both of them use color as
an additional dimension for visualizing the data.
For the same mtcars dataset, Figure 3-14 shows a dotchart that groups vehicle cylinders at the y-axis
and uses colors to distinguish different cylinders. The vehicles are sorted according to their MPG values.
The code to generate Figure 3-14 is shown next.

[Figure 3-14: a dotchart titled "Miles Per Gallon (MPG) of Car Models, Grouped by Cylinder" with car models grouped by cylinder count (4, 6, 8) and Miles Per Gallon on the horizontal axis.]

FIGURE 3-14 Dotplot to visualize multiple variables



# sort by mpg
cars <- mtcars[order(mtcars$mpg),]

# grouping variable must be a factor
cars$cyl <- factor(cars$cyl)

cars$color[cars$cyl==4] <- "red"
cars$color[cars$cyl==6] <- "blue"
cars$color[cars$cyl==8] <- "darkgreen"

dotchart(cars$mpg, labels=row.names(cars), cex=.7, groups=cars$cyl,
         main="Miles Per Gallon (MPG) of Car Models\nGrouped by Cylinder",
         xlab="Miles Per Gallon", color=cars$color, gcolor="black")

The barplot in Figure 3-15 visualizes the distribution of car cylinder counts and number of gears. The
x-axis represents the number of cylinders, and the color represents the number of gears. The code to
generate Figure 3-15 is shown next.

[Figure 3-15: a grouped barplot titled "Distribution of Car Cylinder Counts and Gears" with the number of cylinders (4, 6, 8) on the horizontal axis, counts on the vertical axis, and bar colors indicating the number of gears (3, 4, 5).]

FIGURE 3-15 Barplot to visualize multiple variables

counts <- table(mtcars$gear, mtcars$cyl)

barplot(counts, main="Distribution of Car Cylinder Counts and Gears",
        xlab="Number of Cylinders", ylab="Counts",
        col=c("#0000FFFF", "#0080FFFF", "#00FFFFFF"),
        legend = rownames(counts), beside=TRUE,
        args.legend = list(x="top", title="Number of Gears"))

Box-and-Whisker Plot
Box-and-whisker plots show the distribution of a continuous variable for each value of a discrete variable.
The box-and-whisker plot in Figure 3-16 visualizes mean household incomes as a function of region in
the United States. The first digit of the U.S. postal ("ZIP") code corresponds to a geographical region
in the United States. In Figure 3-16, each data point corresponds to the mean household income from a
particular zip code. The horizontal axis represents the first digit of a zip code, ranging from 0 to 9, where
0 corresponds to the northeast region of the United States (such as Maine, Vermont, and Massachusetts),
and 9 corresponds to the southwest region (such as California and Hawaii). The vertical axis represents
the logarithm of mean household incomes. The logarithm is taken to better visualize the distribution
of the mean household incomes.

[Figure 3-16: a box-and-whisker plot titled "Mean Household Income by Zip Code" with the first digit of the zip code (0 to 9) on the horizontal axis and log10 of mean household income on the vertical axis, overlaid on a jittered scatterplot of the underlying points.]

FIGURE 3-16 A box-and-whisker plot of mean household income and geographical region

In this figure, the scatterplot is displayed beneath the box-and-whisker plot, with some jittering for the
overlap points so that each line of points widens into a strip. The "box" of the box-and-whisker shows the
range that contains the central 50% of the data, and the line inside the box is the location of the median
value. The upper and lower hinges of the boxes correspond to the first and third quartiles of the data. The
upper whisker extends from the hinge to the highest value that is within 1.5 * IQR of the hinge. The lower
whisker extends from the hinge to the lowest value within 1.5 * IQR of the hinge. IQR is the inter-quartile
range, as discussed in Section 3.1.4. The points outside the whiskers can be considered possible outliers.

The graph shows how household income varies by region. The highest median incomes are in region
0 and region 9. Region 0 is slightly higher, but the boxes for the two regions overlap enough that the
difference between the two regions probably is not significant. The lowest household incomes tend to be in
region 7, which includes states such as Louisiana, Arkansas, and Oklahoma.
Assuming a data frame called DF contains two columns (MeanHouseholdIncome and Zip1), the
following R script uses the ggplot2 library [11] to plot a graph that is similar to Figure 3-16.

library(ggplot2)
# plot the jittered scatterplot w/ boxplot
# color-code points with zip codes
# the outlier.size prevents the boxplot from plotting the outlier

ggplot(data=DF, aes(x=as.factor(Zip1), y=log10(MeanHouseholdIncome))) +
  geom_point(aes(color=factor(Zip1)), alpha=0.2, position="jitter") +
  geom_boxplot(outlier.size=0, alpha=0.1) +
  guides(colour=FALSE) +
  ggtitle("Mean Household Income by Zip Code")

Alternatively, one can create a simple box-and-whisker plot with the boxplot() function provided
by the R base package.
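A minimal sketch of that base R alternative, assuming the same DF data frame, could look like the following.

# base R box-and-whisker plot of log10 income by the first zip digit
boxplot(log10(MeanHouseholdIncome) ~ Zip1, data=DF,
        xlab="First Digit of Zip Code",
        ylab="log10(Mean Household Income)",
        main="Mean Household Income by Zip Code")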

Hexbinplot for Large Datasets


This chapter has shown that scatterplot as a popular visualization can visualize data containing one or
more variables. But one should be careful about using it on high-volume data. If there is too much data, the
structure of the data may become difficult to see in a scatterplot. Consider a case to compare the logarithm
of household income against the years of education, as shown in Figure 3-17. The cluster in the scatterplot
on the left (a) suggests a somewhat linear relationship of the two variables. However, one cannot really see
the structure of how the data is distributed inside the cluster. This is a Big Data type of problem. Millions
or billions of data points would require different approaches for exploration, visualization, and analysis.

[Figure 3-17: panel (a) is a scatterplot and panel (b) a hexbinplot of log10 of mean household income against mean years of education; the hexbin shading legend shows the counts per bin.]

FIGURE 3-17 (a) Scatterplot and (b) Hexbinplot of household income against years of education

Although color and transparency can be used in a scatterplot to address this issue, a hexbinplot is
sometimes a better alternative. A hexbinplot combines the ideas of scatterplot and histogram. Similar to
a scatterplot, a hexbinplot visualizes data in the x-axis and y-axis. Data is placed into hexbins, and the third
dimension uses shading to represent the concentration of data in each hexbin.
In Figure 3-17(b), the same data is plotted using a hexbinplot. The hexbinplot shows that the data is
more densely clustered in a streak that runs through the center of the cluster, roughly along the regression
line. The biggest concentration is around 12 years of education, extending to about 15 years.
In Figure 3-17, note the outlier data at MeanEducation=0. These data points may correspond to
some missing data that needs further cleansing.
Assuming the two variables MeanHouseholdIncome and MeanEducation are from a data
frame named zcta, the scatterplot of Figure 3-17(a) is plotted by the following R code.
# plot the data points
plot(log10(MeanHouseholdIncome) ~ MeanEducation, data=zcta)
# add a straight fitted line of the linear regression
abline(lm(log10(MeanHouseholdIncome) ~ MeanEducation, data=zcta), col='red')

Using the zcta data frame, the hexbinplot of Figure 3-17(b) is plotted by the following R code.
Running the code requires the use of the hexbin package, which can be installed by running
install.packages("hexbin").
library(hexbin)
# "g" adds the grid, "r" adds the regression line
# sqrt transform on the count gives more dynamic range to the shading
# inv provides the inverse transformation function of trans
hexbinplot(log10(MeanHouseholdIncome) ~ MeanEducation,
           data=zcta, trans = sqrt, inv = function(x) x^2, type=c("g", "r"))

Scatterplot Matrix
A scatterplot matrix shows many scatterplots in a compact, side-by-side fashion. The scatterplot matrix,
therefore, can visually represent multiple attributes of a dataset to explore their relationships, magnify
differences, and disclose hidden patterns.
Fisher's iris dataset [13] includes the measurements in centimeters of the sepal length, sepal width,
petal length, and petal width for 50 flowers from three species of iris. The three species are setosa, versicolor,
and virginica. The iris dataset comes with the standard R distribution.
In Figure 3-18, all the variables of Fisher's iris dataset (sepal length, sepal width, petal length, and
petal width) are compared in a scatterplot matrix. The three different colors represent three species of iris
flowers. The scatterplot matrix in Figure 3-18 allows its viewers to compare the differences across the iris
species for any pairs of attributes.

[Figure 3-18: a scatterplot matrix titled "Fisher's Iris Dataset" comparing Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width, with point colors indicating the species setosa, versicolor, and virginica.]

FIGURE 3-18 Scatterplot matrix of Fisher's [13] iris dataset

Consider the scatterplot from the first row and third column of Figure 3-18, where sepal length is compared
against petal length. The horizontal axis is the petal length, and the vertical axis is the sepal length.
The scatterplot shows that versicolor and virginica share similar sepal and petal lengths, although the latter
has longer petals. The petal lengths of all setosa are about the same, and the petal lengths are remarkably
shorter than the other two species. The scatterplot shows that for versicolor and virginica, sepal length
grows linearly with the petal length.
The R code for generating the scatterplot matrix is provided next.

# define the colors
colors <- c("red", "green", "blue")

# draw the plot matrix
pairs(iris[1:4], main = "Fisher's Iris Dataset",
      pch = 21, bg = colors[unclass(iris$Species)])

# set graphical parameter to clip plotting to the figure region
par(xpd = TRUE)

# add legend
legend(0.2, 0.02, horiz = TRUE, as.vector(unique(iris$Species)),
       fill = colors, bty = "n")

The vector colors defines the color scheme for the plot. It could be changed to something like
colors <- c("gray50", "white", "black") to make the scatterplots grayscale.
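The visual impression from the scatterplot matrix can be cross-checked numerically. The following is a minimal sketch (not part of the original code) that computes the pairwise correlations of the same four attributes using only the built-in iris data.

# correlation matrix of the four numeric iris attributes
round(cor(iris[, 1:4]), 2)

# the same correlations restricted to a single species, for example versicolor
round(cor(iris[iris$Species == "versicolor", 1:4]), 2)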

Analyzing a Variable over Time


Visualizing a variable over time is the same as visualizing any pair of variables, but in this case the goal is
to identify time-specific patterns.
Figure 3-19 plots the monthly total numbers of international airline passengers (in thousands) from
January 1949 to December 1960. Enter plot(AirPassengers) in the R console to obtain a similar
graph. The plot shows that, for each year, a large peak occurs mid-year around July and August, and a small
peak happens around the end of the year, possibly due to the holidays. Such a phenomenon is referred to
as a seasonality effect.

FIGURE 3-19 Airline passenger counts from 1949 to 1960

Additionally, the overall trend is that the number of air passengers steadily increased from 1949 to
1960. Chapter 8, "Advanced Analytical Theory and Methods: Time Series Analysis," discusses the analysis
of such datasets in greater detail.
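One way to separate the seasonal pattern from the long-term trend is to decompose the series. The sketch below uses the decompose() and monthplot() functions from base R's stats package; it is an illustrative addition, not part of the original example.

# AirPassengers is a monthly time series (ts object) shipped with R
# decompose() splits it into trend, seasonal, and random components
components <- decompose(AirPassengers)
plot(components)

# monthplot() groups the observations by calendar month,
# which makes the July/August peak easy to see
monthplot(AirPassengers)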

3.2.5 Data Exploration Versus Presentation


Using visualization for data exploration is different from presenting results to stakeholders. Not every type
of plot is suitable for all audiences. Most of the plots presented earlier try to detail the data as clearly as
possible for data scientists to identify structures and relationships. These graphs are more technical in nature
and are better suited to technical audiences such as data scientists. Nontechnical stakeholders, however,
generally prefer simple, clear graphics that focus on the message rather than the data.
Figure 3-20 shows the density plot on the distribution of account values from a bank. The data has been
converted to the log10 scale. The plot includes a rug on the bottom to show the distribution of the variable.
This graph is more suitable for data scientists and business analysts because it provides information that

can be relevant to the downstream analysis. The graph shows that the transformed account values follow
an approximate normal distribution, in the range from $100 to $10,000,000. The median account value is
approximately $30,000 (10^4.5), with the majority of the accounts between $1,000 (10^3) and $1,000,000 (10^6).

[Density plot titled "Distribution of Account Values (log10 scale)"; x-axis: log10 of account value (2 to 7); y-axis: Density; N = 5000, Bandwidth = 0.05759]


FIGURE 3-20 Density plots are better to show to data scientists

Density plots are fairly technical, and they contain so much information that they would be difficult to
explain to less technical stakeholders. For example, it would be challenging to explain why the account
values are in the log10 scale, and such information is not relevant to stakeholders. The same message can
be conveyed by partitioning the data into log-like bins and presenting it as a histogram. As can be seen in
Figure 3-21, the bulk of the accounts are in the $1,000-$1,000,000 range, with the peak concentration in the
$10-50K range, extending to $500K. This portrayal gives the stakeholders a better sense of the customer
base than the density plot shown in Figure 3-20.
Note that the bin sizes should be carefully chosen to avoid distortion of the data. In this example, the bins
in Figure 3-21 are chosen based on observations from the density plot in Figure 3-20. Without the density
plot, the peak concentration might be just due to the somewhat arbitrary-appearing choices for the bin sizes.
This simple example addresses the different needs of two groups of audience: analysts and stakeholders.
Chapter 12, "The Endgame, or Putting It All Together," further discusses the best practices of delivering
presentations to these two groups.
Following is the R code to generate the plots in Figure 3-20 and Figure 3-21.
# Generate random log normal income data
income = rlnorm(5000, meanlog=log(40000), sdlog=log(5))

# Part I: Create the density plot

plot(density(log10(income), adjust=0.5),
     main="Distribution of Account Values (log10 scale)")
# Add rug to the density plot

rug(log10(income))

# Part II: Make the histogram-like bar plot of binned account values
# create the log-like bins
breaks = c(0, 1000, 5000, 10000, 50000, 100000, 5e5, 1e6, 2e7)
# assign each income value to a bin
bins = cut(income, breaks, include.lowest=T,
           labels = c("< 1K", "1-5K", "5-10K", "10-50K",
                      "50-100K", "100-500K", "500K-1M", "> 1M"))
# plot the binned counts
plot(bins, main = "Distribution of Account Values",
     xlab = "Account value ($ USD)",
     ylab = "Number of Accounts", col="blue")

[Bar chart titled "Distribution of Account Values"; x-axis: Account value ($ USD) in bins <1K, 1-5K, 5-10K, 10-50K, 50-100K, 100-500K, 500K-1M, >1M; y-axis: Number of Accounts]

FIGURE 3-21 Histograms are better to show to stakeholders

3.3 Statistical Methods for Evaluation


Visualization is useful for data exploration and presentation, but statistics is crucial because it may be
applied throughout the entire Data Analytics Lifecycle. Statistical techniques are used during the initial data
exploration and data preparation, model building, evaluation of the final models, and assessment of how the
new models improve the situation when deployed in the field. In particular, statistics can help answer the
following questions for data analytics:

• Model Building and Planning

• What are the best input variables for the model?

• Can the model predict the outcome given the input?



• Model Evaluation

• Is the model accurate?

• Does the model perform better than an obvious guess?

• Does the model perform better than another candidate model?

• Model Deployment

• Is the prediction sound?

• Does the model have the desired effect (such as reducing the cost)?

This section discusses some useful statistical tools that may answer these questions.

3.3.1 Hypothesis Testing


When comparing populations, such as testing or evaluating the difference of the means from two samples
of data (Figure 3-22), a common technique to assess the difference or the significance of the difference is
hypothesis testing.

FIGURE 3-22 Distributions of two samples of data

The basic concept of hypothesis testing is to form an assertion and test it with data. When performing
hypothesis tests, the common assumption is that there is no difference between two samples. This
assumption is used as the default position for building the test or conducting a scientific experiment.
Statisticians refer to this as the null hypothesis (H0). The alternative hypothesis (HA) is that there is a

difference between two samples. For example, if the task is to identify the effect of drug A compared to
drug B on patients, the null hypothesis and alternative hypothesis would be this.

• H0: Drug A and drug B have the same effect on patients.

• HA: Drug A has a greater effect than drug B on patients.

If the task is to identify whether advertising Campaign C is effective on reducing customer churn, the
null hypothesis and alternative hypothesis would be as follows.

• H0: Campaign C does not reduce customer churn better than the current campaign method.
• HA: Campaign C does reduce customer churn better than the current campaign.

It is important to state the null hypothesis and alternative hypothesis, because misstating them is likely
to undermine the subsequent steps of the hypothesis testing process. A hypothesis test leads to either
rejecting the null hypothesis in favor of the alternative or not rejecting the null hypothesis.
Table 3-5 includes some examples of null and alternative hypotheses that should be answered during
the analytic lifecycle.

TABLE 3-5 Example Null Hypotheses and Alternative Hypotheses

Accuracy Forecast
  Null Hypothesis: Model X does not predict better than the existing model.
  Alternative Hypothesis: Model X predicts better than the existing model.

Recommendation Engine
  Null Hypothesis: Algorithm Y does not produce better recommendations than the current algorithm being used.
  Alternative Hypothesis: Algorithm Y produces better recommendations than the current algorithm being used.

Regression Modeling
  Null Hypothesis: This variable does not affect the outcome because its coefficient is zero.
  Alternative Hypothesis: This variable affects the outcome because its coefficient is not zero.

Once a model is built over the training data, it needs to be evaluated over the testing data to see if the
proposed model predicts better than the existing model currently being used. The null hypothesis is that
the proposed model does not predict better than the existing model. The alternative hypothesis is that
the proposed model indeed predicts better than the existing model. In accuracy forecast, the null model
could be that the sales of the next month are the same as the prior month. The hypothesis test needs to
evaluate if the proposed model provides a better prediction. Take a recommendation engine as an example.
The null hypothesis could be that the new algorithm does not produce better recommendations than the
current algorithm being deployed. The alternative hypothesis is that the new algorithm produces better
recommendations than the old algorithm.
When evaluating a model, sometimes it needs to be determined if a given input variable improves the
model. In regression analysis (Chapter 6), for example, this is the same as asking if the regression coefficient
for a variable is zero. The null hypothesis is that the coefficient is zero, which means the variable does not
have an impact on the outcome. The alternative hypothesis is that the coefficient is nonzero, which means
the variable does have an impact on the outcome.
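As an illustration of this coefficient test, the summary of a fitted linear model reports, for every coefficient, a t-statistic and a p-value for the null hypothesis that the coefficient is zero. The data below is simulated purely for demonstration; the variable names are not from the text.

set.seed(1)
x1 <- rnorm(100)                       # a single input variable
y1 <- 3 + 1.5 * x1 + rnorm(100, sd=2)  # outcome with a nonzero coefficient on x1

# the Pr(>|t|) column tests H0: coefficient = 0 for each row
summary(lm(y1 ~ x1))$coefficients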

A common hypothesis test is to compare the means of two populations. Two such hypothesis tests are
discussed in Section 3.3.2.

3.3.2 Difference of Means


Hypothesis testing is a common approach to draw inferences on whether or not the two populations,
denoted pop1 and pop2, are different from each other. This section provides two hypothesis tests to com-
pare the means of the respective populations based on samples randomly drawn from each population.
Specifically, the two hypothesis tests in this section consider the following null and alternative hypotheses.

• H0: μ1 = μ2
• HA: μ1 ≠ μ2

The μ1 and μ2 denote the population means of pop1 and pop2, respectively.
The basic testing approach is to compare the observed sample means, X̄1 and X̄2, corresponding to each
population. If the values of X̄1 and X̄2 are approximately equal to each other, the distributions of X̄1 and
X̄2 overlap substantially (Figure 3-23), and the null hypothesis is supported. A large observed difference
between the sample means indicates that the null hypothesis should be rejected. Formally, the difference
in means can be tested using Student's t-test or Welch's t-test.

FIGURE 3-23 Overlap of the two distributions is large if X̄1 ≈ X̄2

Student's t-test
Student's t-test assumes that distributions of the two populations have equal but unknown
variances. Suppose n1 and n2 samples are randomly and independently selected from two populations,
pop1 and pop2, respectively. If each population is normally distributed with the same mean (μ1 = μ2) and
with the same variance, then T (the t-statistic), given in Equation 3-1, follows a t-distribution with
n1 + n2 - 2 degrees of freedom (df).

T = \frac{\bar{X}_1 - \bar{X}_2}{S_p \sqrt{\tfrac{1}{n_1} + \tfrac{1}{n_2}}}, \quad
\text{where} \quad S_p^2 = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}    (3-1)
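Equation 3-1 can be written as a small R function. This is only a sketch for illustration; in practice, t.test() with var.equal=TRUE performs the same computation.

# pooled two-sample t-statistic, assuming equal population variances
t_pooled <- function(x, y) {
  n1 <- length(x); n2 <- length(y)
  # pooled sample variance
  sp2 <- ((n1 - 1) * var(x) + (n2 - 1) * var(y)) / (n1 + n2 - 2)
  (mean(x) - mean(y)) / sqrt(sp2 * (1/n1 + 1/n2))
}

Applied to the x and y samples generated later in this section, t_pooled(x, y) should reproduce the t value reported by t.test(x, y, var.equal=TRUE).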

The shape of the t-distribution is similar to the normal distribution. In fact, as the degrees of freedom
approaches 30 or more, the t-distribution is nearly identical to the normal distribution. Because the
numerator of T is the difference of the sample means, if the observed value of T is far enough from zero such that
the probability of observing such a value of T is unlikely, one would reject the null hypothesis that the
population means are equal. Thus, for a small probability, say α = 0.05, T* is determined such that
P(|T| ≥ T*) = 0.05. After the samples are collected and the observed value of T is calculated according to
Equation 3-1, the null hypothesis (μ1 = μ2) is rejected if |T| ≥ T*.
In hypothesis testing, in general, the small probability, α, is known as the significance level of the test.
The significance level of the test is the probability of rejecting the null hypothesis when the null hypothesis
is actually TRUE. In other words, for α = 0.05, if the means from the two populations are truly equal, then
in repeated random sampling, the observed magnitude of T would only exceed T* 5% of the time.
In the following R code example, 10 and 20 observations are randomly selected from two normally distributed
populations and assigned to the variables x and y, respectively. The two populations have a mean of 100 and 105,
respectively, and a standard deviation equal to 5. Student's t-test is then conducted to determine if the
obtained random samples support the rejection of the null hypothesis.
# generate random observations from the two populations
x <- rnorm(10, mean=100, sd=5)   # normal distribution centered at 100
y <- rnorm(20, mean=105, sd=5)   # normal distribution centered at 105

t.test(x, y, var.equal=TRUE)     # run the Student's t-test


Two Sample t-test

data: x and y
t = -1.7828, df = 28, p-value = 0.08547
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-6.1611557 0.4271393
sample estimates:
mean of x mean of y
102.2136 105.0806

From the R output, the observed value of T is t = -1.7828. The negative sign is due to the fact that the
sample mean of x is less than the sample mean of y. Using the qt() function in R, a T value of 2.0484
corresponds to a 0.05 significance level.
# obtain t value for a two-sided test at a 0.05 significance level
qt(p=0.05/2, df=28, lower.tail=FALSE)
2.048407

Because the magnitude of the observed T statistic is less than the T value corresponding to the 0.05
significance level (|-1.7828| < 2.0484), the null hypothesis is not rejected. Because the alternative hypothesis
is that the means are not equal (μ1 ≠ μ2), the possibilities of both μ1 > μ2 and μ1 < μ2 need to be considered.
This form of Student's t-test is known as a two-sided hypothesis test, and it is necessary for the sum of the
probabilities under both tails of the t-distribution to equal the significance level. It is customary to evenly

divide the significance level between both tails. So, p = 0.05/2 = 0.025 was used in the qt() function to
obtain the appropriate t-value.
To simplify the comparison of the t-test results to the significance level, the R output includes a quantity
known as the p-value. In the preceding example, the p-value is 0.08547, which is the sum of P(T ≤ -1.7828)
and P(T ≥ 1.7828). Figure 3-24 illustrates the t-statistic for the area under the tail of a t-distribution. The -t
and t are the observed values of the t-statistic. In this example, |t| = 1.7828. The left shaded area corresponds
to P(T ≤ -1.7828), and the right shaded area corresponds to P(T ≥ 1.7828).

FIGURE 3-24 Area under the tails (shaded) of a Student's t-distribution

In the R output, for a significance level of 0.05, the null hypothesis would not be rejected because the
likelihood of a T value of magnitude 1.7828 or greater would occur at a probability higher than 0.05. However,
based on the p-value, if the significance level was chosen to be 0.10 instead of 0.05, the null hypothesis
would be rejected. In general, the p-value offers the probability of observing such a sample result given
that the null hypothesis is TRUE.
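The reported p-value can be reproduced directly from the t-distribution with the pt() function; this quick check is not part of the original code.

# two-sided p-value: sum of both tail probabilities for |t| = 1.7828 with df = 28
2 * pt(-abs(-1.7828), df=28)   # approximately 0.0855, matching the R output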
A key assumption in using Student's t-test is that the population variances are equal. In the previous
example, the t.test() function call includes var.equal=TRUE to specify that equality of the variances
should be assumed. If that assumption is not appropriate, then Welch's t-test should be used.

Welch's t-test

When the equal population variance assumption is not justified in performing Student's t-test for the
difference of means, Welch's t-test [14] can be used based on T expressed in Equation 3-2.

T = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\tfrac{S_1^2}{n_1} + \tfrac{S_2^2}{n_2}}}    (3-2)

where X̄i, Si², and ni correspond to the i-th sample mean, sample variance, and sample size. Notice that
Welch's t-test uses the sample variance (Si²) for each population instead of the pooled sample variance.
In Welch's test, under the remaining assumptions of random samples from two normal populations with
the same mean, the distribution of T is approximated by the t-distribution. The following R code performs
the Welch's t-test on the same set of data analyzed in the earlier Student's t-test example.

t.test(x, y, var.equal=FALSE) # run the Welch's t-test

        Welch Two Sample t-test

data: x and y
t = -1.6596, df = 15.118, p-value = 0.1176
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -6.546629  0.812663
sample estimates:
mean of x mean of y
 102.2136  105.0806

In this particular example of using Welch's t-test, the p-value is 0.1176, which is greater than the p-value
of 0.08547 observed in the Student's t-test example. In this case, the null hypothesis would not be rejected
at a 0.10 or 0.05 significance level.
It should be noted that the degrees of freedom calculation is not as straightforward as in the Student's
t-test. In fact, the degrees of freedom calculation often results in a non-integer value, as in this example.
The degrees of freedom for Welch's t-test is defined in Equation 3-3.

df = \frac{\left(\tfrac{S_1^2}{n_1} + \tfrac{S_2^2}{n_2}\right)^2}
          {\tfrac{(S_1^2/n_1)^2}{n_1 - 1} + \tfrac{(S_2^2/n_2)^2}{n_2 - 1}}    (3-3)
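Equation 3-3 can be verified directly in R. The sketch below is an illustrative addition and uses the x and y samples from the earlier examples.

# Welch-Satterthwaite degrees of freedom
welch_df <- function(x, y) {
  v1 <- var(x) / length(x)
  v2 <- var(y) / length(y)
  (v1 + v2)^2 / (v1^2 / (length(x) - 1) + v2^2 / (length(y) - 1))
}
# welch_df(x, y) should match the non-integer df (15.118 here)
# reported by t.test(x, y, var.equal=FALSE)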
In both the Student's and Welch's t-test examples, the R output provides 95% confidence intervals on
the difference of the means. In both examples, the confidence intervals straddle zero. Regardless of the
result of the hypothesis test, the confidence interval provides an interval estimate of the difference of the
population means, not just a point estimate.
A confidence interval is an interval estimate of a population parameter or characteristic based on
sample data. A confidence interval is used to indicate the uncertainty of a point estimate. If x̄ is the estimate
of some unknown population mean μ, the confidence interval provides an idea of how close x̄ is to the
unknown μ. For example, a 95% confidence interval for a population mean straddles the TRUE, but
unknown, mean 95% of the time. Consider Figure 3-25 as an example. Assume the confidence level is 95%.
If the task is to estimate the mean of an unknown value μ in a normal distribution with known standard
deviation σ and the estimate based on n observations is x̄, then the interval x̄ ± 2σ/√n straddles the unknown
value of μ with about a 95% chance. If one takes 100 different samples and computes the 95% confidence
interval for the mean, 95 of the 100 confidence intervals will be expected to straddle the population
mean μ.

FIGURE 3-25 A 95% confidence interval straddling the unknown population mean μ
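For a single sample with unknown population variance, a 95% confidence interval for the mean is commonly built from the t-distribution rather than the known-σ formula above. A minimal sketch using the sample x from the earlier examples (illustrative only):

n <- length(x)
margin <- qt(0.975, df = n - 1) * sd(x) / sqrt(n)
c(lower = mean(x) - margin, upper = mean(x) + margin)

# t.test() reports the same interval as part of its output
t.test(x)$conf.int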

Confidence intervals appear again in Section 3.3.6 on ANOVA. Returning to the discussion of hypothesis
testing, a key assumption in both the Student's and Welch's t-test is that the relevant population
attribute is normally distributed. For non-normally distributed data, it is sometimes possible to transform
the collected data to approximate a normal distribution. For example, taking the logarithm of a dataset
can often transform skewed data to a dataset that is at least symmetric around its mean. However, if such
transformations are ineffective, there are tests like the Wilcoxon rank-sum test that can be applied to see
if two population distributions are different.

3.3.3 Wilcoxon Rank-Sum Test


A t-test represents a parametric test in that it makes assumptions about the population distributions from
which the samples are drawn. If the populations cannot be assumed or transformed to follow a normal
distribution, a nonparametric test can be used. The Wilcoxon rank-sum test [15] is a nonparametric
hypothesis test that checks whether two populations are identically distributed. Assuming the two populations
are identically distributed, one would expect that the ordering of any sampled observations would
be evenly intermixed among themselves. For example, in ordering the observations, one would not expect
to see a large number of observations from one population grouped together, especially at the beginning
or the end of the ordering.
Let the two populations again be pop1 and pop2, with independently random samples of size n1 and
n2 respectively. The total number of observations is then N = n1 + n2. The first step of the Wilcoxon test is
to rank the set of observations from the two groups as if they came from one large group. The smallest
observation receives a rank of 1, the second smallest observation receives a rank of 2, and so on with the
largest observation being assigned the rank of N. Ties among the observations receive a rank equal to
the average of the ranks they span. The test uses ranks instead of numerical outcomes to avoid specific
assumptions about the shape of the distribution.
After ranking all the observations, the assigned ranks are summed for at least one population's sample.
If the distribution of pop1 is shifted to the right of the other distribution, the rank-sum corresponding to
pop1's sample should be larger than the rank-sum of pop2. The Wilcoxon rank-sum test determines the

significance of the observed rank-sums. The following R code performs the test on the same dataset used
for the previous t-test.

wilcox.test(x, y, conf.int = TRUE)

(For this dataset, the wilcox.test() output reports a p-value of 0.04903.)

The wilcox.test() function ranks the observations, determines the respective rank-sums corresponding
to each population's sample, and then determines the probability of rank-sums of such
magnitude being observed assuming that the population distributions are identical. In this example, the
probability is given by the p-value of 0.04903. Thus, the null hypothesis would be rejected at a 0.05
significance level. The reader is cautioned against interpreting that one hypothesis test is clearly better than
another test based solely on the examples given in this section.
Because the Wilcoxon test does not assume anything about the population distribution, it is generally
considered more robust than the t-test. In other words, there are fewer assumptions to violate. However,
when it is reasonable to assume that the data is normally distributed, Student's or Welch's t-test is an
appropriate hypothesis test to consider.
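The ranking step described above is easy to reproduce by hand; wilcox.test() performs it internally. A sketch on the same x and y samples:

# pool the two samples and rank them as one large group
pooled <- c(x, y)
ranks <- rank(pooled)        # ties receive the average of the ranks they span

# rank-sum corresponding to the sample drawn from the first population (x)
sum(ranks[seq_along(x)])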

3.3.4 Type I and Type II Errors


A hypothesis test may result in two types of errors, depending on whether the test accepts or rejects the
null hypothesis. These two errors are known as type I and type II errors.

• A type I error is the rejection of the null hypothesis when the null hypothesis is TRUE. The probability
of the type I error is denoted by the Greek letter α.

• A type II error is the acceptance of the null hypothesis when the null hypothesis is FALSE. The probability
of the type II error is denoted by the Greek letter β.

Table 3-6 lists the four possible states of a hypothesis test, including the two types of errors.

TABLE 3-6 Type I and Type II Error

                    H0 is true          H0 is false
H0 is accepted      Correct outcome     Type II error
H0 is rejected      Type I error        Correct outcome



The significance level, as mentioned in the Student's t-test discussion, is equivalent to the type I error.
For a significance level such as α = 0.05, if the null hypothesis (μ1 = μ2) is TRUE, there is a 5% chance that
the observed T value based on the sample data will be large enough to reject the null hypothesis. By selecting
an appropriate significance level, the probability of committing a type I error can be defined before
any data is collected or analyzed.
The probability of committing a type II error is somewhat more difficult to determine. If two population
means are truly not equal, the probability of committing a type II error will depend on how far apart the
means truly are. To reduce the probability of a type II error to a reasonable level, it is often necessary to
increase the sample size. This topic is addressed in the next section.
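The connection between the significance level and the type I error rate can be seen by simulation. The sketch below (an illustrative addition) repeatedly tests two samples drawn from the same population, so the null hypothesis is TRUE and roughly 5% of the tests should reject it at α = 0.05.

set.seed(42)
# both samples come from the same N(100, 5) population, so H0 is true
pvals <- replicate(10000,
                   t.test(rnorm(10, mean=100, sd=5),
                          rnorm(10, mean=100, sd=5),
                          var.equal=TRUE)$p.value)
mean(pvals < 0.05)   # observed type I error rate, close to 0.05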

3.3.5 Power and Sample Size


The power of a test is the probability of correctly rejecting the null hypothesis. It is denoted by 1 - β, where
β is the probability of a type II error. Because the power of a test improves as the sample size increases,
power is used to determine the necessary sample size. In the difference of means, the power of a hypothesis
test depends on the true difference of the population means. In other words, for a fixed significance level,
a larger sample size is required to detect a smaller difference in the means. In general, the magnitude of
the difference is known as the effect size. As the sample size becomes larger, it is easier to detect a given
effect size, δ, as illustrated in Figure 3-26.

[Two panels comparing sampling distributions for a fixed effect size δ: Moderate Sample Size vs. Larger Sample Size]

FIGURE 3-26 A larger sample size better identifies a fixed effect size

With a large enough sample size, almost any effect size can appear statistically significant. However, a
very small effect size may be useless in a practical sense. It is important to consider an appropriate effect
size for the problem at hand.
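Base R's power.t.test() function relates power, sample size, significance level, and effect size, and can solve for whichever one is omitted. The numbers below are illustrative assumptions, not values from the text.

# sample size per group needed to detect a difference of 5 units between
# two means when the standard deviation is 5, at alpha = 0.05 and 80% power
power.t.test(delta=5, sd=5, sig.level=0.05, power=0.80)

# conversely, the power achieved with only 10 observations per group
power.t.test(n=10, delta=5, sd=5, sig.level=0.05)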

3.3.6 ANOVA
The hypothesis tests presented in the previous sections are good for analyzing means between two populations.
But what if there are more than two populations? Consider an example of testing the impact of

nutrition and exercise on 60 candidates between age 18 and 50. The candidates are randomly split into six
groups, each assigned with a different weight loss strategy, and the goal is to determine which strategy
is the most effective.

• Group 1 only eats junk food.
• Group 2 only eats healthy food.
• Group 3 eats junk food and does cardio exercise every other day.
• Group 4 eats healthy food and does cardio exercise every other day.
• Group 5 eats junk food and does both cardio and strength training every other day.
• Group 6 eats healthy food and does both cardio and strength training every other day.

Multiple t-tests could be applied to each pair of weight loss strategies. In this example, the weight loss
of Group 1 is compared with the weight loss of Group 2, 3, 4, 5, or 6. Similarly, the weight loss of Group 2 is
compared with that of the next four groups. Therefore, a total of 15 t-tests would be performed.
However, multiple t-tests may not perform well on several populations for two reasons. First, because the
number of t-tests increases as the number of groups increases, analysis using the multiple t-tests becomes
cognitively more difficult. Second, by doing a greater number of analyses, the probability of committing
at least one type I error somewhere in the analysis greatly increases.
Analysis of Variance (ANOVA) is designed to address these issues. ANOVA is a generalization of the
hypothesis testing of the difference of two population means. ANOVA tests if any of the population means
differ from the other population means. The null hypothesis of ANOVA is that all the population means are
equal. The alternative hypothesis is that at least one pair of the population means is not equal. In other
words,

• H0: μ1 = μ2 = ... = μn
• HA: μi ≠ μj for at least one pair of i, j
As seen in Section 3.3.2, "Difference of Means," each population is assumed to be normally distributed
with the same variance.
The first thing to calculate for the ANOVA is the test statistic. Essentially, the goal is to test whether the
clusters formed by each population are more tightly grouped than the spread across all the populations.
Let the total number of populations be k. The total number of samples N is randomly split into the k
groups. The number of samples in the i-th group is denoted as ni, and the mean of the group is X̄i where
i ∈ [1, k]. The mean of all the samples is denoted as X̄0.
The between-groups mean sum of squares, S_B², is an estimate of the between-groups variance. It
measures how the population means vary with respect to the grand mean, or the mean spread across all
the populations. Formally, this is presented as shown in Equation 3-4.

S_B^2 = \frac{1}{k-1} \sum_{i=1}^{k} n_i \cdot (\bar{x}_i - \bar{x}_0)^2    (3-4)

The within-group mean sum of squares, S_W², is an estimate of the within-group variance. It quantifies
the spread of values within groups. Formally, this is presented as shown in Equation 3-5.

S_W^2 = \frac{1}{N-k} \sum_{i=1}^{k} (n_i - 1) \cdot S_i^2    (3-5)

where Si² is the sample variance of the i-th group.

If S_B² is much larger than S_W², then some of the population means are different from each other.
The F-test statistic is defined as the ratio of the between-groups mean sum of squares and the within-group
mean sum of squares. Formally, this is presented as shown in Equation 3-6.

F = \frac{S_B^2}{S_W^2}    (3-6)

The F-test statistic in ANOVA can be thought of as a measure of how different the means are relative to
the variability within each group. The larger the observed F-test statistic, the greater the likelihood that
the differences between the means are due to something other than chance alone. The F-test statistic
is used to test the hypothesis that the observed effects are not due to chance, that is, whether the means are
significantly different from one another.
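Equations 3-4 through 3-6 can be computed by hand for a one-way layout, which helps connect them to the aov() output shown later. The function below is only a sketch; values is a numeric vector and groups a factor (hypothetical argument names).

# hand-rolled one-way F statistic
f_statistic <- function(values, groups) {
  k  <- nlevels(groups)
  N  <- length(values)
  ni <- tapply(values, groups, length)
  xi <- tapply(values, groups, mean)
  x0 <- mean(values)

  sb2 <- sum(ni * (xi - x0)^2) / (k - 1)                        # Equation 3-4
  sw2 <- sum((ni - 1) * tapply(values, groups, var)) / (N - k)  # Equation 3-5
  sb2 / sw2                                                     # Equation 3-6
}

Applied to the offertest data frame constructed below, f_statistic(offertest$purchase_amt, offertest$offer) should agree with the F value reported by summary() of the fitted aov model.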
Consider an example in which every customer who visits a retail website gets one of two promotional offers
or gets no promotion at all. The goal is to see if making the promotional offers makes a difference. ANOVA
could be used, and the null hypothesis is that neither promotion makes a difference. The code that follows
randomly generates a total of 500 observations of purchase sizes on three different offer options.
offers <- sample(c("offer1", "offer2", "nopromo"), size=500, replace=T)

# Simulated 500 observations of purchase sizes on the 3 offer options

purchasesize <- ifelse(offers=="offer1", rnorm(500, mean=80, sd=30),
                ifelse(offers=="offer2", rnorm(500, mean=85, sd=30),
                       rnorm(500, mean=40, sd=30)))

# create a data frame of offer option and purchase size


offertest <- data.frame(offer=as.factor(offers),
purchase_amt=purchasesize)

The summary of the offertest data frame shows that 170 offer1, 161 offer2, and 169
nopromo (no promotion) offers have been made. It also shows the range of purchase size (purchase_amt)
for each of the three offer options.
# display a summary of offertest where offer="offer1"
summary(offertest[offertest$offer=="offer1",])
     offer     purchase_amt
 nopromo:  0   Min.   :  4.521
 offer1 :170   1st Qu.: 58.158
 offer2 :  0   Median : 76.944
               Mean   : 81.936
               3rd Qu.:104.959
               Max.   :130.507

# display a summary of offertest where offer="offer2"

summary(offertest[offertest$offer=="offer2",])
     offer     purchase_amt
 nopromo:  0   Min.   : 14.04
 offer1 :  0   1st Qu.: 69.46
 offer2 :161   Median : 90.20
               Mean   : 89.09
               3rd Qu.:107.48
               Max.   :154.33

# display a summary of offertest where offer="nopromo"

summary(offertest[offertest$offer=="nopromo",])
     offer     purchase_amt
 nopromo:169   Min.   :-27.00
 offer1 :  0   1st Qu.: 20.22
 offer2 :  0   Median : 42.44
               Mean   : 40.97
               3rd Qu.: 58.96
               Max.   :164.04
The aov() function performs the ANOVA on purchase size and offer options.
# fit ANOVA test
model <- aov(purchase_amt ~ offers, data=offertest)

The summary() function shows a summary of the model. The degrees of freedom for offers is 2,
which corresponds to the k - 1 in the denominator of Equation 3-4. The degrees of freedom for residuals
is 497, which corresponds to the N - k in the denominator of Equation 3-5.
summary(model)
             Df Sum Sq Mean Sq F value Pr(>F)
offers        2 225222  112611   130.6 <2e-16 ***
Residuals   497 428470     862
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The output also includes S_B² (112,611), S_W² (862), the F-test statistic (130.6), and the p-value (< 2e-16).
The F-test statistic is much greater than 1, with a p-value far below the 0.05 significance level. Thus, the null
hypothesis that the means are equal should be rejected.
However, the result does not show whether offer1 is different from offer2, which requires additional
tests. The TukeyHSD() function implements Tukey's Honest Significant Difference (HSD) on all
pair-wise tests for difference of means.
TukeyHSD(model)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = purchase_amt ~ offers, data = offertest)

$offers
                    diff        lwr      upr     p adj
offer1-nopromo 40.961437 33.4638483 48.45903 0.0000000
offer2-nopromo 48.120286 40.5189446 55.72163 0.0000000
offer2-offer1   7.158849 -0.4315769 14.74928 0.0692895

The result includes p-values of pair-wise comparisons of the three offer options. The p-values for
offer1-nopromo and offer2-nopromo are equal to 0, smaller than the significance level 0.05.
This suggests that both offer1 and offer2 are significantly different from nopromo. A p-value of
0.0692895 for offer2 against offer1 is greater than the significance level 0.05. This suggests that
offer2 is not significantly different from offer1.
Because only the influence of one factor (offers) was analyzed, the presented ANOVA is known as one-way
ANOVA. If the goal is to analyze two factors, such as offers and day of week, that would be a two-way
ANOVA [16]. If the goal is to model more than one outcome variable, then multivariate ANOVA (or MANOVA)
could be used.
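A two-way ANOVA follows the same pattern in R: a second factor is added to the right-hand side of the formula. The day_of_week factor below is hypothetical and created only for illustration; it is not part of the simulated offertest data.

# hypothetical second factor, added only for illustration
offertest$day_of_week <- factor(sample(c("Mon", "Tue", "Wed", "Thu", "Fri"),
                                       size=nrow(offertest), replace=TRUE))

# two factors on the right-hand side of the formula give a two-way ANOVA
two_way_model <- aov(purchase_amt ~ offer + day_of_week, data=offertest)
summary(two_way_model)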

Summary
R is a popular package and programming language for data exploration, analytics, and visualization. As an
introduction to R, this chapter covers the R GUI, data I/O, attribute and data types, and descriptive statistics.
This chapter also discusses how to use R to perform exploratory data analysis, including the discovery of
dirty data, visualization of one or more variables, and customization of visualization for different audiences.
Finally, the chapter introduces some basic statistical methods. The first statistical method presented in the
chapter is hypothesis testing. The Student's t-test and Welch's t-test are included as two example hypothesis
tests designed for testing the difference of means. Other statistical methods and tools presented in this
chapter include confidence intervals, the Wilcoxon rank-sum test, type I and II errors, effect size, and ANOVA.

Exercises
1. How many levels does fdata contain in the following R code?

data = c(1,2,2,3,1,2,3,3,1,2,3,3,1)
fdata = factor(data)

2. Two vectors, v1 and v2, are created with the following R code:

v1 <- 1:5
v2 <- 6:2

What are the results of cbind(v1, v2) and rbind(v1, v2)?
3. What R command(s) would you use to remove null values from a dataset?

4. What R command can be used to install an additional R package?

5. What R function is used to encode a vector as a category?

6. What is a rug plot used for in a density plot?

7. An online retailer wants to study the purchase behaviors of its customers. Figure 3-27 shows the
density plot of the purchase sizes (in dollars). What would be your recommendation to enhance the plot
to detect more structures that otherwise might be missed?

[Density plot: x-axis purchase size (dollars), 0 to 10000; y-axis Density]

FIGURE 3-27 Density plot of purchase size

8. How many sections does a box-and-whisker plot divide the data into? What are these sections?

9. What attributes are correlated according to Figure 3-18? How would you describe their relationships?

10. What function can be used to fit a nonlinear line to the data?

11. If a graph of data is skewed and all the data is positive, what mathematical technique may be used to
help detect structures that might otherwise be overlooked?

12. What is a type I error? What is a type II error? Is one always more serious than the other? Why?

13. Suppose everyone who visits a retail website gets one promotional offer or no promotion at all. We
want to see if making a promotional offer makes a difference. What statistical method would you
recommend for this analysis?

14. You are analyzing two normally distributed populations, and your null hypothesis is that the mean μ1
of the first population is equal to the mean μ2 of the second. Assume the significance level is set at
0.05. If the observed p-value is 4.33e-05, what will be your decision regarding the null hypothesis?

Bibliography
[1] The R Project for Statistical Computing, "R Licenses." [Online]. Available: http://www.r-project.org/Licenses/. [Accessed 10 December 2013].
[2] The R Project for Statistical Computing, "The Comprehensive R Archive Network." [Online]. Available: http://cran.r-project.org/. [Accessed 10 December 2013].
