
CS 620 / DASC 600
Introduction to Data Science & Analytics

Lecture 4 - Data Wrangling

Dr. Sampath Jayarathna
Old Dominion University
Exploring Your Data

• Working with data is both an art and a science. We've mostly been talking about the science part, getting your feet wet with Python tools for Data Science. Let's look at some of the art now.
• After you've identified the questions you're trying to answer and have gotten your hands on some data, you might be tempted to dive in and immediately start building models and getting answers. But you should resist this urge. Your first step should be to explore your data.
Data Wrangling

• The process of transforming "raw" data into data that can be analyzed to generate valid, actionable insights
• Data Wrangling, aka:
  • Data preprocessing
  • Data preparation
  • Data Cleansing
  • Data Scrubbing
  • Data Munging
  • Data Transformation
  • Data Fold, Spindle, Mutilate……
Data Wrangling Steps

• Iterative process of
• Obtain
• Understand
• Explore
• Transform
• Augment
• Visualize
Exploring Your Data

• The simplest case is when you have a one-dimensional data set, which is just a collection of numbers. For example,
  • daily average number of minutes each user spends on your site,
  • the number of times each of a collection of data science tutorial videos was watched,
  • the number of pages of each of the data science books in your data science library.
• An obvious first step is to compute a few summary statistics.
  • You'd like to know how many data points you have, the smallest, the largest, the mean, and the standard deviation.
  • But even these don't necessarily give you a great understanding.
Summary statistics of a single data set

• Information (numbers) that give a quick and simple description of the data
  • Maximum value
  • Minimum value
  • Range (dispersion): max – min
  • Mean
  • Median
  • Mode
  • Quantile
  • Standard deviation
  • Etc.

Quartiles, quantiles, and percentiles:
  0th quartile = 0 quantile = 0th percentile
  1st quartile = 0.25 quantile = 25th percentile
  2nd quartile = 0.5 quantile = 50th percentile (median)
  3rd quartile = 0.75 quantile = 75th percentile
  4th quartile = 1 quantile = 100th percentile
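All of these statistics are one-liners in pandas; a minimal sketch, assuming the brfss.csv file and its weight2 column introduced later in this lecture:

import pandas as pd

df = pd.read_csv('brfss.csv', index_col=0)
col = df['weight2']
print(col.count(), col.min(), col.max())    # size and range
print(col.mean(), col.median(), col.std())  # center and spread
print(col.quantile([0, .25, .5, .75, 1]))   # the five quartiles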
Mean vs average vs median vs mode

• (Arithmetic) Mean: the "average" value of the data (two equivalent definitions, assuming Python 3)

def mean(a): return sum(a) / len(a)

from functools import reduce
def mean(a): return reduce(lambda x, y: x + y, a) / len(a)

• Average: can be ambiguous
  • The average household income in this community is $60,000
    • The average (mean) income for households in this community is $60,000
    • The income for an average household in this community is $60,000
  • What if most households are earning below $30,000 but one household is earning $1M?
• Median: the "middlest" value, or the mean of the two middle values
  • Can be obtained by sorting the data first
  • Does not depend on all values in the data
  • More robust to outliers
• Quantile: a generalization of the median. E.g., the 75th percentile is the value which 75% of values are less than or equal to.
• Mode: the most-common value in the data
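In the same spirit as the mean functions above, a minimal median sketch (again assuming Python 3):

def median(a):
    s = sorted(a)                  # the median requires sorted data
    n, mid = len(s), len(s) // 2
    # odd length: the middle value; even length: mean of the two middle values
    return s[mid] if n % 2 == 1 else (s[mid - 1] + s[mid]) / 2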
Variance and standard deviation

• Describes the spread of the data from the mean
• Variance is the mean of the squared deviations from the mean: σ² = (1/n) Σ (xᵢ − μ)²
• Standard deviation σ (square root of the variance):
  • Easier to understand than variance
  • Has the same unit as the measurement
  • Say the data measures height of people in inches; the unit of σ is also inches. The unit for σ² is square inches…
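A quick illustration with NumPy (the heights below are made-up values; ddof=1 gives the sample rather than the population statistics):

import numpy as np

heights = np.array([63, 67, 70, 64, 68])  # hypothetical heights in inches
print(np.var(heights, ddof=1))  # sample variance, in square inches
print(np.std(heights, ddof=1))  # sample standard deviation, in inches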
CDC BRFSS Dataset
• The Behavioral Risk Factor Surveillance System (BRFSS) is
the nation's premier system of health-related telephone
surveys that collect state data about U.S. residents regarding
their health-related risk behaviors, chronic health conditions,
and use of preventive services. 
• https://www.cs.odu.edu/~sampath/courses/f19/cs620/files/data/brfss.csv
Activity 8

• Download the brfss.csv file and load it in your Python module.
  • https://www.cs.odu.edu/~sampath/courses/f19/cs620/files/data/brfss.csv
• Display the content and observe the data
• Create a function cleanBRFSSFrame() to clean the dataset (one possible sketch follows below)
  • Drop the sex column from the dataframe
  • Drop the rows with NaN values (every single NaN)
• Use the describe() method to display the count, mean, std, min, and quantile data for column weight2.
• Find the median (median()) and mode (mode()) of the age
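One possible solution sketch, assuming brfss.csv has columns named sex, weight2, and age:

import pandas as pd

def cleanBRFSSFrame(df):
    df = df.drop(columns=['sex'])  # drop the sex column
    df = df.dropna()               # drop every row containing any NaN
    return df

df = cleanBRFSSFrame(pd.read_csv('brfss.csv', index_col=0))
print(df['weight2'].describe())                 # count, mean, std, min, quartiles
print(df['age'].median(), df['age'].mode()[0])  # median and mode of age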
Population vs sample

Sampling is a process used in statistical analysis in which a predetermined number of observations are taken from a larger population.
Population vs sample

• Population: all members of a group in a study
  • The average height of men
  • The average height of living males ≥ 18yr in the USA between 2001 and 2010
  • The average height of all male students ≥ 18yr registered in Fall'17
• Sample: a subset of the members in the population
  • Most studies choose to sample the population due to cost/time or other factors
  • Each sample is only one of many possible subsets of the population
  • May or may not be representative of the whole population
  • Sample size and sampling procedure are important

df = pd.read_csv('brfss.csv')
print(df.sample(100))  # random sample of 100 rows (without replacement by default)
Why do we sample?

• Enables research/surveys to be done more quickly/timely
• Less expensive and often more accurate than a large CENSUS (a survey of the entire population, which gives a real estimate rather than a sample estimate)
• Given limited research budgets and large population sizes, there is no alternative to sampling
• Sampling also allows for minimal damage or loss
• Sample data can also be used to validate census data
Simple Random Sampling

• In Simple Random Sampling, each element of the larger population is assigned a unique ID number, and a table of random numbers or a lottery technique is used to select elements, one at a time, until the desired sample size is reached.
• Simple random sampling is usually reserved for use with relatively small populations with an easy-to-use sampling frame (it is very tedious when drawing large samples).
• Bias is avoided because the person drawing the sample does not manipulate the lottery or random number table to select certain individuals.
Random Selection

• np.random.choice selects at random
  • With replacement
  • From any array
  • A specified number of times

np.random.choice(some_array, sample_size)

Example:
import numpy as np
d = np.arange(6) + 1           # the array [1, 2, 3, 4, 5, 6], like a die
s = np.random.choice(d, 1000)  # 1000 draws, with replacement
print(s)
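Note that np.random.choice samples with replacement by default; pass replace=False to draw a simple random sample without replacement.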
Systematic Sampling

• Systematic sampling is a type of probability sampling method in which sample members from a larger population are selected according to a random starting point and a fixed periodic interval.
• In this approach, the estimated number of elements in the larger population is divided by the desired sample size to yield a SAMPLING INTERVAL. The sample is then drawn by listing the population in an arbitrary order and selecting every nth case, starting with a randomly selected element.
• This is less time consuming and easier to implement.
• Systematic sampling is useful when the units in your sampling frame are not numbered or when the sampling frame consists of a very long list.
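A minimal sketch of the interval logic (the helper name and NumPy-based approach are illustrative, not from the slides):

import numpy as np

def systematic_sample_indices(population_size, sample_size):
    interval = population_size // sample_size  # the sampling interval
    start = np.random.randint(0, interval)     # random starting point
    return np.arange(start, population_size, interval)[:sample_size]

# e.g. pick 100 of the BRFSS rows: df.iloc[systematic_sample_indices(len(df), 100)]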
Stratified Sampling

• Populations often consist of strata, or groups, that are different from each other and that can be of very different sizes.
• Stratified Sampling ensures that all relevant strata of the population are represented in the sample.
• Stratification treats each stratum as a separate population, arranging the sampling frame first in strata before either a simple random technique or a systematic approach is used to draw the sample.
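With pandas this can be sketched as a group-wise random sample; here the BRFSS sex column is assumed to define the strata:

import pandas as pd

df = pd.read_csv('brfss.csv', index_col=0)
# simple random sample of 10% within each stratum
stratified = df.groupby('sex', group_keys=False).apply(lambda g: g.sample(frac=0.1))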
Convenience Sampling

• Convenience sampling is where subjects are selected because of their convenient accessibility and proximity to the researcher.
• Convenience Sampling involves the selection of samples from whatever cases/subjects or respondents happen to be available at a given place or time.
• Also known as Incidental/Accidental, Opportunity or Grab Sampling. Snowball Sampling is a special type of convenience sampling where individuals or persons that have agreed or showed up to be interviewed in the study serially recommend their acquaintances.
Other Sampling

• In Cluster Sampling, samples are selected in two or more stages.
• Non-probability sampling involves a technique where samples are gathered in a process that does not give all the individuals in the population equal chances of being selected.
  • Non-probability sampling procedures are not valid for obtaining a sample that is truly representative of a larger population.
Exploring Your Data

• A good next step is to create a histogram, in which you group your data into discrete buckets and count how many points fall into each bucket:

df = pd.read_csv('brfss.csv', index_col=0)
df['weight2'].hist(bins=100)

A histogram is a plot that lets you discover, and show, the underlying frequency distribution (shape) of a set of continuous data. This allows the inspection of the data for its underlying distribution (e.g., normal distribution), outliers, skewness, etc.
Regression vs Correlation

Regression – estimation of the relationship between variables
• Linear regression
• Assessing the assumptions
• Non-linear regression

Correlation
• Correlation coefficient quantifies the association strength
• Sensitivity to the distribution

[Figure: example scatterplots showing a strong linear relationship, a weak linear relationship, a non-linear relationship, and no relationship, each with its residuals.]
Regression vs Correlation

Correlation quantifies the degree to which two variables are related.
• Correlation does not fit a line through the data points. You simply are computing a correlation coefficient (r) that tells you how much one variable tends to change when the other one does.
• When r is 0.0, there is no relationship. When r is positive, there is a trend that one variable goes up as the other one goes up. When r is negative, there is a trend that one variable goes up as the other one goes down.

Linear regression finds the best line that predicts Y from X.
• Correlation is almost always used when you measure both variables. It rarely is appropriate when one variable is something you experimentally manipulate.
• Linear regression is usually used when X is a variable you manipulate.
• Correlation only measures linear relationships.
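As an illustration, pandas can compute Pearson's r directly; a sketch using the BRFSS weight columns that appear on the next slide:

import pandas as pd

df = pd.read_csv('brfss.csv', index_col=0)
r = df['weight2'].corr(df['wtyrago'])  # Pearson r; rows with NaN in either column are ignored
print(r)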
Feature Matrix

• We can review the relationships between attributes by looking at the distribution of the interactions of each pair of attributes.

from pandas.plotting import scatter_matrix
scatter_matrix(df[['weight2', 'wtyrago', 'htm3']])

This is a powerful plot from which a lot of inspiration about the data can be drawn. For example, we can see a possible correlation between weight and weight a year ago.
Types of data

There are two basic types of data: numerical and categorical data.

• Numerical data: data to which a number is assigned as a quantitative value.
  • age, weight, shoe size…
• Categorical data: data defined by the classes or categories into which an individual member falls.
  • eye color, gender, blood type, ethnicity
Continuous or Non-continuous data

• A continuous variable is one that can theoretically assume any value between the lowest and highest point on the scale on which it is being measured
  • (e.g. weight, speed, price, time, height)
• Non-continuous variables, also known as discrete variables, can only take on a finite number of values
  • Discrete data can be numeric -- like numbers of apples -- but it can also be categorical -- like red or blue, or male or female, or good or bad.
Qualitative vs. Quantitative Data

• Qualitative data are those in which the "true" or naturally occurring levels or categories taken by a variable are not described as numbers but rather by verbal groupings
  • Open ended answers
• Quantitative data, on the other hand, are those in which the natural levels take on certain quantities (e.g. price, travel time)
  • That is, quantitative variables are measurable in some numerical unit (e.g. pesos, minutes, inches, etc.)
  • Likert scales, semantic scales, yes/no, check box
Data transformation

• Transform data to obtain a certain distribution
• Transform data so different columns become comparable/compatible
• Typical transformation approaches:
  • Z-score transformation
  • Scale to between 0 and 1
  • Mean normalization
Rescaling

• Many techniques are sensitive to the scale of your data. For example, imagine that you have a data set consisting of the heights and weights of hundreds of data scientists, and that you are trying to identify clusters of body sizes.

import pandas as pd

data = {"height_inch": {'A': 63, 'B': 67, 'C': 70},
        "height_cm": {'A': 160, 'B': 170.2, 'C': 177.8},
        "weight": {'A': 150, 'B': 160, 'C': 171}}
df2 = pd.DataFrame(data)
print(df2)
Why normalization (re-scaling)

   height_inch  height_cm  weight
A           63      160.0     150
B           67      170.2     160
C           70      177.8     171

from scipy.spatial import distance

a = df2.iloc[0, [0, 2]]  # A's (height_inch, weight)
b = df2.iloc[1, [0, 2]]  # B's (height_inch, weight)
c = df2.iloc[2, [0, 2]]  # C's (height_inch, weight)
print("%.2f" % distance.euclidean(a, b))  # 10.77
print("%.2f" % distance.euclidean(a, c))  # 22.14
print("%.2f" % distance.euclidean(b, c))  # 11.40

Because weight spans a wider numeric range than height in inches, it dominates these distances; measuring height in centimeters instead would change the distances, and potentially the clusters. This is why the columns should be rescaled first.
Boxplot
The box plot (a.k.a. box and whisker diagram) is a standardized way of displaying
the distribution of data based on the five number summary: minimum, first quartile,
median, third quartile, and maximum. In the simplest box plot the central rectangle
spans the first quartile to the third quartile (the interquartile range or IQR). A
segment inside the rectangle shows the median and "whiskers" above and below the
box show the locations of the minimum and maximum.
Boxplot example

import numpy as np
import pandas as pd
from pandas import DataFrame

df = DataFrame({'a': np.random.rand(1000),
                'b': np.random.randn(1000),
                'c': np.random.lognormal(size=1000)})
print(df.head())
df.boxplot()

          a         b         c
0  0.316825 -1.418293  2.090594
1  0.451174  0.901202  0.735789
2  0.208511 -0.710432  1.409085
3  0.254617 -0.637264  2.398320
4  0.256281 -0.564593  1.821763
Boxplot example 2
df2 = pd.read_csv('brfss.csv', index_col=0)
df2.boxplot()
Activity 9

• Use the brfss.csv file and load it in your Python module.
  • https://www.cs.odu.edu/~sampath/courses/f19/cs620/files/data/brfss.csv
• Use the min-max algorithm to re-scale the data. Remember to drop the column 'sex' from the dataframe before the rescaling (Activity 8). A possible sketch follows below.
  • (series - series.min()) / (series.max() - series.min())
• Create a boxplot (DataFrame.boxplot()) of the dataset.
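One possible solution sketch, reusing the cleaning step from Activity 8:

import pandas as pd

def minMaxScaling(series):
    return (series - series.min()) / (series.max() - series.min())

df = pd.read_csv('brfss.csv', index_col=0).drop(columns=['sex'])
df.apply(minMaxScaling).boxplot()  # every column now lies in [0, 1]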
Z-score transformation

• Z scores, or standard scores, indicate how many standard deviations an observation is above or below the mean. These scores are a useful way of putting data from different sources onto the same scale.
• The z-score linearly transforms the data in such a way that the mean value of the transformed data equals 0 while their standard deviation equals 1. The transformed values themselves do not lie in a particular interval like [0,1] or so.

Z score: Z = (x - sample mean) / sample standard deviation
Z-score transformation

def zscore(series):
    return (series - series.mean(skipna=True)) / series.std(skipna=True)

df3 = df2.apply(zscore)
df3.boxplot()
Mean-based scaling

def meanScaling(series):
    return series / series.mean()

# df4 is assumed to be a cleaned BRFSS frame from the earlier activities
df8 = df4.apply(meanScaling) * 100
df8.boxplot()
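With this scaling, each value is expressed as a percentage of its column's mean (hence the * 100): a value of 100 sits exactly at the mean. Unlike the z-score, this does not standardize the spread of each column.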
