Eda PDF

The document covers core concepts of exploratory data analysis for data science and machine learning: 1) Types of variables (numerical and categorical; discrete, continuous, ordinal and nominal) and their roles as response or explanatory variables. 2) Relationships between variables: association (correlation) versus independence. 3) Visualizing data with scatterplots, histograms, box plots and other charts to understand patterns, distributions, outliers and relationships. 4) Key summary statistics: mean, median, mode, variance, standard deviation, range and interquartile range.

Uploaded by

La Magnifico

DATA SCIENCE & MACHINE LEARNING COURSE
https://www.facebook.com/diceanalytics/
Sensitivity: Internal
Data Organization
Data is stored in the form of a Data Matrix:
• Variable Names: the column headers
• Observation: a row
• Variable: a column

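As a minimal sketch of a data matrix, the snippet below stores a few observations as rows and reads one variable as a column. The column names (`age`, `income`, `sport`) and all values are invented for illustration; they are not taken from the course data.

```python
# A tiny data matrix: each dict is one observation (row),
# each key is one variable (column). All values are invented.
data_matrix = [
    {"age": 23, "income": 31000, "sport": "Cricket"},
    {"age": 35, "income": 52000, "sport": "Hockey"},
    {"age": 41, "income": 61000, "sport": "Football"},
]

variable_names = list(data_matrix[0].keys())       # the column headers
n_observations = len(data_matrix)                  # number of rows
age_column = [row["age"] for row in data_matrix]   # one variable across all rows

print(variable_names)   # ['age', 'income', 'sport']
print(n_observations)   # 3
print(age_column)       # [23, 35, 41]
```
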
Types of Variables
Two main types of variables:
• Numerical: arithmetic operations can be performed
   • Discrete: distinct set of values
   • Continuous: infinite values within a range
• Categorical: qualitative; limited number of distinct categories
   • Ordinal: levels with inherent ordering
   • Nominal: no order; levels cannot be compared

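The four variable types can be illustrated with a short sketch. The column names and values below are hypothetical, and the choice of summary for each type (mean, median level, counts) is one reasonable reading of the definitions above rather than a rule from the slides.

```python
from collections import Counter
from statistics import mean

# Hypothetical columns, one per variable type (values are invented).
goals = [0, 2, 1, 3, 2]                                          # numerical, discrete
height_cm = [170.2, 181.5, 165.0, 177.8, 172.1]                  # numerical, continuous
interest = ["low", "high", "medium", "high", "low"]              # categorical, ordinal
sport = ["Cricket", "Hockey", "Cricket", "Squash", "Football"]   # categorical, nominal

# Numerical variables support arithmetic, so an average is meaningful.
print(mean(goals), mean(height_cm))

# Ordinal levels have an inherent order, so they can be ranked and a
# middle level can be picked, but differences between levels are undefined.
order = {"low": 0, "medium": 1, "high": 2}
ranked = sorted(interest, key=order.get)
median_level = ranked[len(ranked) // 2]

# Nominal levels have no order; counting is the natural summary.
sport_counts = Counter(sport)
print(median_level, sport_counts)
```
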
Types of Variables (further reading)

http://www.statisticshowto.com/types-variables/
https://statistics.laerd.com/statistical-guides/types-of-variable.php

Types of Variables
• Response Variable: The focus of a question in a study or experiment. It is the variable we want to predict or observe, also called the dependent variable.

• Explanatory Variable: The variable on which the response variable depends, or the variable which 'explains' the response variable. It is treated as the independent variable.

Relationship between Variables

• Two variables that show a connection with each other are called Associated/Correlated (Dependent).

• Two variables that show no connection with each other are called Independent.

• An observation that lies far from the majority of the data is called an Outlier.

Data Visualisation

Visualising Numerical Data

Scatterplot

[Figure: scatterplot of Income (response variable, y-axis) against Age (explanatory variable, x-axis)]

Characteristics of Relationship
• Direction: positive (+ve) or negative (-ve)
• Shape: linear or curved
• Strength: strong or weak
• Outliers

Correlation (example)
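
One common way to put a number on the direction and strength of a linear relationship is the Pearson correlation coefficient. The slides do not name a specific coefficient, so treat this choice as an assumption; the `age` and `income` values are invented.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(ss_x * ss_y)

age = [20, 30, 40, 50, 60]
income_up = [25, 38, 52, 61, 75]      # rises with age: positive direction
income_down = [75, 61, 52, 38, 25]    # falls with age: negative direction

r_up = pearson_r(age, income_up)
r_down = pearson_r(age, income_down)
print(round(r_up, 3), round(r_down, 3))
```

The sign of the result gives the direction, and a magnitude close to 1 indicates a strong linear relationship. A curved relationship can be strong and still give a small coefficient, so the scatterplot should be checked as well.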

Histograms
• Help to view data density

• Help to see the shape of a distribution:

1) Skewness
2) Modality

[Figure: histogram of frequency (0 to 100) by Age Group (20 to 60)]

Skewness
• Left Skewed: -ve skewness (longer tail on the left)
• Symmetric: zero skewness
• Right Skewed: +ve skewness (longer tail on the right)

• Draw a smooth curve to see skewness

• Don't rely on jagged edges

Modality

[Figure: four histograms showing unimodal, bimodal, uniform and multimodal distributions]

Modality (Example)

[Figure: three example histograms labelled Normal Distribution, Two separate groups and No trend]

Binwidth

[Figure: two histograms over the range 20 to 50, one with axis ticks every 10 and one with axis ticks every 5]

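The effect of bin width can be sketched by counting the same values with two different widths. The helper `histogram_counts` and the `ages` list are hypothetical; only the idea of comparing a wide and a narrow bin width comes from the slide.

```python
from collections import Counter

def histogram_counts(values, start, width):
    """Count values per bin; each bin is labelled by its left edge [edge, edge + width)."""
    counts = Counter(start + ((v - start) // width) * width for v in values)
    return dict(sorted(counts.items()))

ages = [21, 23, 24, 28, 31, 33, 34, 36, 38, 42, 44, 47]   # invented ages

wide = histogram_counts(ages, start=20, width=10)
narrow = histogram_counts(ages, start=20, width=5)
print(wide)    # {20: 4, 30: 5, 40: 3}
print(narrow)  # {20: 3, 25: 1, 30: 3, 35: 2, 40: 2, 45: 1}
```

Wider bins give a smoother but coarser picture; narrower bins show more detail but also more jagged edges.
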
Measures of Center
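Data: 56, 87, 34, 65, 77, 62, 90, 45, 77, 79

• Mean: the arithmetic average
$\text{Mean} = \frac{56 + 87 + 34 + 65 + 77 + 62 + 90 + 45 + 77 + 79}{10} = 67.2$

• Mode: the most frequent value/observation
Mode = 77

• Median: the midpoint of the distribution (50th percentile). Sort the data first: 34, 45, 56, 62, 65, 77, 77, 79, 87, 90. With an even number of observations, the median is the average of the two middle values:
$\text{Median} = \frac{65 + 77}{2} = 71$
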
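The three measures of center can be checked with Python's standard `statistics` module. This is only a sketch on the slide's ten data values; note that the median is taken from the sorted data, not from the order in which the values are listed.

```python
from statistics import mean, median, mode

data = [56, 87, 34, 65, 77, 62, 90, 45, 77, 79]

print(mean(data))    # 67.2
print(mode(data))    # 77
print(median(data))  # 71.0

# With 10 values, the median is the average of the 5th and 6th smallest.
ordered = sorted(data)
middle_pair = (ordered[4], ordered[5])
print(middle_pair)   # (65, 77)
```
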
Box Plots
[Figure: box plot. The box spans Q1 to Q3 (the IQR) with a line at Q2; whiskers extend to Min. Value and Max. Value; outliers are drawn as separate points]

Min. Value : Lower Extreme (smallest observation that is not an outlier)
Q1 : Lower Quartile (25% of observations lie below it)
Q2 : Median (50% of observations lie below it)
Q3 : Upper Quartile (75% of observations lie below it)
Max. Value : Upper Extreme (largest observation that is not an outlier)
IQR : Inter-Quartile Range = Q3 - Q1 (spans the middle 50% of observations)

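The box plot quantities can be sketched with `statistics.quantiles`, here on the same ten values used in the Measures of Center example. Two assumptions are made that the slides do not state: the quartiles use the function's default "exclusive" method (other tools may give slightly different Q1 and Q3), and outliers are flagged with the common 1.5 × IQR rule.

```python
from statistics import quantiles

data = [56, 87, 34, 65, 77, 62, 90, 45, 77, 79]

# quantiles(..., n=4) returns the three cut points Q1, Q2, Q3.
q1, q2, q3 = quantiles(data, n=4)
iqr = q3 - q1

# Assumed convention: points more than 1.5 * IQR beyond the box are outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]
inside = [x for x in data if lower_fence <= x <= upper_fence]

print(q1, q2, q3, iqr)            # 53.25 71.0 81.0 27.75
print(min(inside), max(inside))   # whisker ends: 34 90
print(outliers)                   # []
```
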
Box Plots & Skewness

[Figure: box plots of Left Skewed, Symmetric and Right Skewed distributions]

Skewness vs Measures of Center
For a typical single-peaked distribution:
• Left Skewed: Mean < Median < Mode
• Symmetric: Mean = Median = Mode
• Right Skewed: Mean > Median > Mode

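The ordering of mean, median and mode under skew can be checked on a small example. The data below is invented to have a long right tail, and the `skewness` helper uses the moment-based (population) definition, which is one of several skewness measures and is an assumption here.

```python
from statistics import mean, median, mode

def skewness(values):
    """Moment-based skewness: mean cubed deviation divided by SD cubed."""
    m = mean(values)
    n = len(values)
    var = sum((v - m) ** 2 for v in values) / n
    return sum((v - m) ** 3 for v in values) / n / var ** 1.5

right_skewed = [1, 2, 2, 2, 3, 4, 4, 5, 9, 18]   # invented: long right tail
left_skewed = [-v for v in right_skewed]          # mirror image: long left tail

for name, values in [("right", right_skewed), ("left", left_skewed)]:
    print(name, mean(values), median(values), mode(values),
          round(skewness(values), 2))
```

For the right-skewed list the mean (5) exceeds the median (3.5), which exceeds the mode (2), and the skewness is positive; the mirrored list reverses every inequality.
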
Intensity/Heat Maps

Time Plots
[Figure: time plot of monthly values, with months from Jan to Dec on the x-axis and values from 0 to 225 on the y-axis]

Measures of Spread

• Range
• Variance
• Standard Deviation
• Inter-quartile Range

Range

• Range = Max. Value - Min. Value

• Data : 56, 87, 34, 65, 77, 62, 90, 45, 77, 79

• Range = 90 - 34 = 56

Variance
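• A measure of how much data (a variable) varies, i.e. how spread out a data set is about the mean.
• It is the average squared deviation from the mean, so it has the squared units of the variable.

• Sample Variance: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$

• Population Variance: $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
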
Variance (Example)
• Data: 56, 87, 34, 65, 77, 62, 90, 45, 77, 79 (mean = 67.2)

$s^2 = \frac{(56 - 67.2)^2 + (87 - 67.2)^2 + \dots + (79 - 67.2)^2}{10 - 1} = \frac{2995.6}{9} \approx 332.8$

The numerator, 2995.6, is the Sum of Squares.

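The worked example can be reproduced step by step. The sketch below computes the sum of squares by hand and then divides by n - 1 (sample) or n (population); the variable names are illustrative, and the results can be cross-checked against `statistics.variance` and `statistics.pvariance`.

```python
from statistics import pvariance, variance

data = [56, 87, 34, 65, 77, 62, 90, 45, 77, 79]
n = len(data)
mean = sum(data) / n                                  # 67.2

sum_of_squares = sum((x - mean) ** 2 for x in data)   # squared deviations from the mean
sample_var = sum_of_squares / (n - 1)                 # divide by n - 1
population_var = sum_of_squares / n                   # divide by n

print(round(sum_of_squares, 1))   # 2995.6
print(round(sample_var, 1))       # 332.8
print(round(population_var, 2))   # 299.56
```
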
Why Square The Differences?

• Get rid of negatives, so that the negatives and positives do not cancel each other during addition. For example, for the points 2 and 6 with mean 4: (2 - 4) + (6 - 4) = -2 + 2 = 0.

• Increase larger deviations more than smaller ones, so that they are weighted more heavily.

[Figure: number line from 0 to 8]

Standard Deviation (SD)
• Square root of the Variance
• It has the same units as the variable, which makes it useful in comparisons and calculations

• Sample SD: $s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$

• Population SD: $\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$

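As a brief sketch, both standard deviations are available in Python's `statistics` module. The data is the ten-value example used for the variance; `stdev` uses the n - 1 divisor and `pstdev` uses n.

```python
from statistics import pstdev, stdev

data = [56, 87, 34, 65, 77, 62, 90, 45, 77, 79]

sample_sd = stdev(data)        # square root of the sample variance (n - 1)
population_sd = pstdev(data)   # square root of the population variance (n)

print(round(sample_sd, 2))      # 18.24
print(round(population_sd, 2))  # 17.31
```

Because the SD is in the same units as the data, a value of about 18 can be read directly against the observations, unlike the variance of about 333 in squared units.
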
Spread

[Figure: two distributions drawn on the same 0 to 8 axis]

• Less Spread: low variance, low deviation
• More Spread: high variance, high deviation

Robust Statistics
• Measures on which extreme observations or outliers have little effect

                Robust    Non-Robust
Spread          IQR       SD, Range
Center          Median    Mean
Suited to       Skewed    Symmetric

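Robustness can be seen by adding one extreme value and comparing how much each measure moves. This is a sketch only: the value 500 is an invented outlier, `summary` is a hypothetical helper, and the IQR uses the default quartile method of `statistics.quantiles`.

```python
from statistics import mean, median, quantiles, stdev

def summary(values):
    """Center and spread measures for one list of numbers."""
    q1, _, q3 = quantiles(values, n=4)
    return {
        "mean": mean(values), "median": median(values),
        "sd": stdev(values), "iqr": q3 - q1,
        "range": max(values) - min(values),
    }

data = [56, 87, 34, 65, 77, 62, 90, 45, 77, 79]
with_outlier = data + [500]   # one invented extreme observation

before = summary(data)
after = summary(with_outlier)
for key in before:
    print(f"{key:>6}: {before[key]:8.2f} -> {after[key]:8.2f}")
```

The median and IQR shift only slightly, while the mean, SD and range change substantially.
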
Data Transformations
• Applying a function f(x) to adjust the scale of the data.
• Usually done when data is skewed, so that modelling becomes easier.
• Also done to convert a non-linear relationship into a linear relationship.

Data → f(x) → Transformed Data

(Natural) Log Transformation
• To transform data that is positively (right) skewed
• Usually done when data is concentrated near zero (relative to the few large values in the data)

[Figure: a Right Skewed histogram becomes Symmetric after the Natural Log is applied]

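A small sketch of the effect: the invented data below is concentrated near zero with a few large values, and was deliberately constructed so that its logs are exactly symmetric. The helper `mean_median_gap` is a hypothetical, rough indicator of skew (mean minus median, scaled by the range), not a standard statistic.

```python
from math import log
from statistics import mean, median

# Invented right-skewed data; chosen so the logged values are symmetric.
raw = [1, 2, 3, 4, 6, 9, 12, 18, 36]
logged = [log(x) for x in raw]   # natural log; requires every x > 0

def mean_median_gap(values):
    """Gap between mean and median, in units of the data's range."""
    return (mean(values) - median(values)) / (max(values) - min(values))

print(round(mean_median_gap(raw), 3))      # clearly positive: right skewed
print(abs(mean_median_gap(logged)) < 1e-9) # True: mean and median coincide
```

Real data will rarely become perfectly symmetric, and the natural log cannot be applied to zero or negative values.
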
Log Transformation
• To make the relationship between two variables more linear
• Most simple modelling methods work only when the relationship is linear

[Figure: a curved scatterplot becomes linear after the Log is applied]

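As an illustration, an exponential relationship becomes a straight line once the response is logged. The curve y = 3·exp(0.5·x) and the helper `least_squares_line` are assumptions made for this sketch; the fit is ordinary least squares written out by hand.

```python
from math import exp, log

def least_squares_line(xs, ys):
    """Slope and intercept of the ordinary least-squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

# Invented curved relationship: y = 3 * exp(0.5 * x).
xs = [0, 1, 2, 3, 4, 5]
ys = [3 * exp(0.5 * x) for x in xs]

# Taking logs gives log(y) = log(3) + 0.5 * x, a straight line in x.
slope, intercept = least_squares_line(xs, [log(y) for y in ys])
print(round(slope, 6), round(exp(intercept), 6))   # 0.5 3.0
```

The fitted slope recovers the growth rate 0.5, and exponentiating the intercept recovers the multiplier 3.
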
Other Transformations

• You may use other transformations or create your own

• For instance: Square Root, Square, Inverse

Visualising Categorical Data

Bar Plot
[Figure: bar plot with Frequency on the y-axis (0 to 300)]

Bar Plot vs Histogram

• Use a Bar Plot for categorical variables and a Histogram for numerical variables

• The x-axis of a Histogram must be a number line

• The bars of a Histogram cannot be reordered, whereas the bars of a Bar Plot can

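The data behind a bar plot is just a frequency count per category, which is why the bars may be drawn in any order. A minimal text-based sketch, with invented responses:

```python
from collections import Counter

sports = ["Cricket", "Hockey", "Cricket", "Squash", "Football",
          "Cricket", "Football", "Hockey", "Cricket"]   # invented responses

# One bar per category; either ordering below is a valid bar plot.
frequency = Counter(sports)
by_name = sorted(frequency.items())     # alphabetical order
by_count = frequency.most_common()      # tallest bar first

for category, count in by_count:
    print(f"{category:<9}{'#' * count} {count}")
```

A histogram has no such freedom: its bins sit on a number line, so reordering them would misrepresent the distribution.
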
Pie Chart
[Figure: pie charts with legend entries Cricket, Football, Hockey, Squash and Very, Somewhat, Not Very, Not At All, Not Sure; visible slice labels 12%, 46%, 39%]

• Use a Bar Plot instead

Segmented Bar Plot
• For visualising conditional frequency distributions
• For comparing relative frequencies to explore the relationship between variables

[Figure: segmented bar plot with frequency on the y-axis (0 to 225) and Cricket, Football, Hockey and Squash in the legend]

Relative Frequency Segmented Bar Plot

[Figure: relative frequency segmented bar plot with one bar each for Cricket, Football, Hockey and Squash; the y-axis shows relative frequency]

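The numbers behind a relative frequency segmented bar plot can be sketched as conditional relative frequencies: count the second variable within each category, then divide by that category's total. The `(sport, interest)` pairs are invented, and `interest` is a hypothetical second variable.

```python
from collections import Counter, defaultdict

# Invented (sport, interest) pairs; both variables are categorical.
responses = [
    ("Cricket", "Very"), ("Cricket", "Very"), ("Cricket", "Somewhat"),
    ("Cricket", "Not Very"), ("Hockey", "Very"), ("Hockey", "Not Very"),
    ("Squash", "Somewhat"), ("Squash", "Somewhat"),
]

# Frequencies per sport: the segment heights of a segmented bar plot.
counts = defaultdict(Counter)
for sport, interest in responses:
    counts[sport][interest] += 1

# Relative frequencies: divide each segment by its bar total so that
# every bar has height 1 and bars of different sizes can be compared.
relative = {
    sport: {level: c / sum(segments.values()) for level, c in segments.items()}
    for sport, segments in counts.items()
}
print(relative["Cricket"])   # {'Very': 0.5, 'Somewhat': 0.25, 'Not Very': 0.25}
```

If the segment proportions differ noticeably between bars, the two variables appear to be associated; if every bar looks alike, they appear independent.
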
Side-by-Side Box Plots

[Figure: side-by-side box plots of AGE (y-axis) for Cricket, Football, Hockey, Squash, Badminton and Other Sports]

Bubble Plot
[Figure: bubble plot of Weight (kg) against Height (4'8 to 7'0), with separate series for Bowler and Batsman]

Outliers

Why do EDA
• To understand data properties
• To find patterns in data
• To suggest modelling strategies
• To "debug" analyses
• To communicate results

(From JHU)
Why do EDA

https://www.youtube.com/watch?v=jbkSRLYSojo
