Lecture 4 - Data Wrangling
Lecture 4 - Data Wrangling
• Iterative process of
• Obtain
• Understand
• Explore
• Transform
• Augment
• Visualize
Data Wrangling Steps
Data Wrangling Steps
Exploring Your Data
df = pd.read_csv('brfss.csv')
print(df.sample(100)) # random sample of 100 values
Why do we sample?
• Enables research/ surveys to be done more quickly/ timely
• Less expensive and often more accurate than large CENSUS
( survey of the entire population)
• Given limited research budgets and large population sizes, there
is no alternative to sampling.
• Sampling also allows for minimal damage or lost
• Sample data can also be used to validate census data
• A survey of the entire universe (gives real estimate not sample estimate)
Simple Random Sampling
• In Simple Random Sampling, each element of the larger
population is assigned a unique ID number, and a table of
random numbers or a lottery technique is used to select
elements, one at a time, until the desired sample size is reached.
• Simple random sampling is usually reserved for use with
relatively small populations with an easy-to-use sampling frame
( very tedious when drawing large samples).
• Bias is avoided because the person drawing the sample does not
manipulate the lottery or random number table to select certain
individuals.
Random Selection
Example:
import numpy as np
d = np.arange(6) + 1
s = np.random.choice(d, 1000)
print(s)
Systematic Sampling
• Systematic sampling is a type of probability sampling method in which
sample members from a larger population are selected according to a random
starting point and a fixed periodic interval.
• In this approach, the estimated number of elements in the larger population is
divided by the desired sample size to yield a SAMPLNG INTERVAL. The
sample is then drawn by listing the population in an arbitrary order and
selecting every nth case, starting with a randomly selected.
• This is less time consuming and easier to implement.
• Systematic sampling is useful when the units in your sampling frame are not
numbered or when the sampling frame consists of very long list.
Stratified Sampling
• Populations often consist of strata or groups that are different from each
other and that consist of very different sizes.
• Stratified Sampling ensures that all relevant strata of the population are
represented in the sample.
• Stratification treats each stratum as a separate population- arranging the
sampling frame first in strata before either a simple random technique or a
systematic approach is used to draw the sample.
Convenience Sampling
• Convenience sampling is where subjects are selected because of their
convenient accessibility and proximity to the researcher.
• Convenience Sampling involves the selection of samples from whatever
cases/subjects or respondents that happens to be available at a given place or
time.
• Also known as Incidental/Accidental, Opportunity or Grab Sampling.
Snow- ball Sampling is a special type of convenience sampling where
individuals or persons that have agreed or showed up to be interviewed in the
study serially recommend their acquaintances.
Other Sampling
• In Cluster Sampling, samples are selected in two or more stages
Correlation
• Correlation coefficient quantifies the association strength
• Sensitivity to the distribution
Relationship No Relationship
Relationship
Linear, Strong Linear, Weak Non-Linear
Residuals
Residuals
Residuals
Regression vs Correlation
Correlation quantifies the degree to which two variables are
related.
• Correlation does not fit a line through the data points. You simply are
computing a correlation coefficient (r) that tells you how much one
variable tends to change when the other one does.
• When r is 0.0, there is no relationship. When r is positive, there is a trend
that one variable goes up as the other one goes up. When r is negative,
there is a trend that one variable goes up as the other one goes down.
3 - 29
Continuous or Non-continuous data
a b c
0 0.316825 -1.418293 2.090594
1 0.451174 0.901202 0.735789
2 0.208511 -0.710432 1.409085
3 0.254617 -0.637264 2.398320
4 0.256281 -0.564593 1.821763
Boxplot example 2
df2 = pd.read_csv('brfss.csv', index_col=0)
df2.boxplot()
Activity 9
• Use the brfss.csv file and load it to your python module.
• https://www.cs.odu.edu/~sampath/courses/f19/cs620/files/data/brfss.csv
• Use the min-max algorithm to re-scale the data. Remember to
drop the column ‘sex’ from the dataframe before the rescaling.
(Activity 8)
• (series – series.min())/(series.max() – series.min())
• Create a boxplot (DataFrame.boxplot()) of the dataset.
Z-score transformation
def zscore(series):
return (series - series.mean(skipna=True)) /
series.std(skipna=True);
df3 = df2.apply(zscore)
df4.boxplot() df3.boxplot()
Mean-based scaling
def meanScaling(series):
return series / series.mean()
df8 = df4.apply(meanScaling) * 100
df8.boxplot()