lecture3
of Big Data
Lecture 3
Things to Remember
‣ 2.17 - 2.21 (next week), 2.24 - 2.28, 3.24 - 3.28, 3.31 - 4.4
Weekday  Start time  End time  Venue
THU      10:30       11:20     HW311
THU      11:30       12:20     HW311
THU      13:30       14:20     MB226
THU      14:30       15:20     MB226
THU      15:30       16:20     MB226
THU      16:30       17:20     MB226
THU      17:30       18:20     MB226
FRI      10:30       11:20     MB154
FRI      11:30       12:20     MB127
FRI      13:30       14:20     MB127
If you haven’t registered, please do it asap!
‣ The detailed description of the final project will be released by this evening.
Recap of Previous Lecture
Volume
‣ Daily applications
‣ Web data
‣ Sensor data
‣ Key points:
Then, after collecting the data, how do we organize it and do some simple analysis?
Things That Will Be Covered
‣ Data transformations
‣ Data visualizations
‣ Simple statistics
Data Table and Type
Data is typically organized as a table.
import pandas as pd
# Load the state-level COVID-19 table into a DataFrame
url = 'https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv'
Data = pd.read_csv(url)
Rows: each row is one observation.
‣ Different features
‣ …
‣ …
https://en.wikipedia.org/wiki/Moore%27s_law
One Single Observation
‣ Covid data
• Observation
• Each observation is a collection of feature values.
• Different observations refer to different states and the corresponding information.
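As a minimal sketch of what "one observation" means in code, the snippet below builds a tiny table by hand and picks out a single row. The rows are illustrative only, and the column names (`date`, `state`, `fips`, `cases`, `deaths`) are an assumption about the layout of the Covid table.

```python
import pandas as pd

# Illustrative rows only; column names are assumed, not taken from the slides.
Data = pd.DataFrame({
    "date": ["2020-01-21", "2020-01-22", "2020-01-24"],
    "state": ["Washington", "Washington", "Illinois"],
    "fips": [53, 53, 17],
    "cases": [1, 1, 1],
    "deaths": [0, 0, 0],
})

observation = Data.iloc[0]      # one row = one observation
features = list(Data.columns)   # the feature names shared by all observations

print(observation)
print(features)
```

Each entry of `observation` is the value of one feature for that state on that date.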
Managing Data Types
‣ To make better use of the date feature, we need to convert it from a string to a datetime.
String → Datetime
A single datetime string can be used to derive a number of new features, e.g., year, month, day, …
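The conversion can be sketched with `pd.to_datetime` and the `.dt` accessor. The three date strings below are made up for illustration, and the new column names (`year`, `month`, `day`, `weekday`) are our own choice.

```python
import pandas as pd

# Hypothetical date strings, stored as plain text at first.
Data = pd.DataFrame({"date": ["2020-01-21", "2020-03-15", "2021-12-31"]})

# Convert the string column to a datetime column.
Data["date"] = pd.to_datetime(Data["date"])

# Derive new features from the single datetime column.
Data["year"] = Data["date"].dt.year
Data["month"] = Data["date"].dt.month
Data["day"] = Data["date"].dt.day
Data["weekday"] = Data["date"].dt.day_name()

print(Data)
```

Once the column is a datetime, it can also be sorted and compared chronologically rather than alphabetically.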
Managing Data Types: Categorical
‣ The number of groups may be large; merge some values into one group.
California, US → California | US
Arizona, US → Arizona | US
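One possible way to do this split, sketched with pandas string methods. The column names `location`, `state` and `country` are hypothetical; the two values are the ones shown above.

```python
import pandas as pd

df = pd.DataFrame({"location": ["California, US", "Arizona, US"]})

# Split "State, Country" into two separate categorical features.
parts = df["location"].str.split(", ", expand=True)
df["state"] = parts[0]
df["country"] = parts[1]

# Grouping by the coarser feature merges many values into one group.
print(df)
print(df["country"].value_counts())
```

Here two distinct `location` values collapse into a single `country` group.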
One-hot encoding
Correct?      One-hot vector
TRUE          [1, 0, 0]
FALSE         [0, 1, 0]
Don’t know    [0, 0, 1]
• Each value of the variable is represented as a one-hot vector.
• The dimension (number of entries) of the vector is the number of categorical groups.
• One entry is 1 and the remaining entries are 0.
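A short sketch of this encoding using `pd.get_dummies`. The answers are the ones from the example above; fixing the category order explicitly is our own choice so that the columns come out as TRUE, FALSE, Don't know.

```python
import pandas as pd

categories = ["TRUE", "FALSE", "Don't know"]
answers = pd.Series(
    pd.Categorical(["TRUE", "TRUE", "FALSE", "FALSE", "Don't know"],
                   categories=categories),
    name="correct",
)

# One column per categorical group; each row has exactly one entry equal to 1.
one_hot = pd.get_dummies(answers, dtype=int)
print(one_hot)
```

The number of columns equals the number of groups, so a feature with many groups produces a wide table, which is one reason to merge groups first.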
Handling Missing Values
Missing entries appear as NaN in the table.
‣ Predict the missing value using prior knowledge (e.g., too large, too small)
‣ Predict the missing value using simple rules (mean, mode, etc.)
‣ There are many more powerful methods for handling missing values that use more data points.
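The simple rules can be sketched with `fillna`. The small table below is hypothetical: the numeric column is filled with its mean and the categorical column with its mode.

```python
import numpy as np
import pandas as pd

# Hypothetical table with one missing number and one missing category.
df = pd.DataFrame({
    "cases": [10.0, np.nan, 30.0, 20.0],
    "state": ["Arizona", "Arizona", None, "California"],
})

filled = df.copy()
# Numeric feature: replace NaN with the mean of the observed values.
filled["cases"] = df["cases"].fillna(df["cases"].mean())
# Categorical feature: replace the missing entry with the most frequent value.
filled["state"] = df["state"].fillna(df["state"].mode()[0])

print(filled)
```

Both rules look only at the column itself; the more powerful methods mentioned above would also use the other features of each observation.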
‣ Visualization can help you get a qualitative understanding of the data
‣ Visualization can also help you understand the data from different perspectives
Daily cases comparison at different dates
What can we conclude and reason about?
‣ mean, median
‣ Covariance
‣ Correlation
Statistics for Single Variables: Location Measures
[ 121 -5 93 54 39 13 45 10 51 -357 52 63 34 31 23 43
89 92 112 62 45 33 68 140 115 122 71 70 47 70 124]
Given a set of values $x_1, x_2, \dots, x_n$, the mean is calculated using the following formula
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
Given a set of values $x_1, x_2, \dots, x_n$, the median is the middle value of the sorted sequence (the average of the two middle values when $n$ is even).
[ 121 -5 93 54 39 13 45 10 51 -357 52 63 34 31 23 43 89
92 112 62 45 33 68 140 115 122 71 70 47 70 124]
Sorted sequence:
[-357 -5 10 13 23 31 33 34 39 43 45 45 47 51 52 54
62 63 68 70 70 71 89 92 93 112 115 121 122 124 140]
Sorted sequence after removing the outliers -357 and -5:
[10 13 23 31 33 34 39 43 45 45 47 51 52 54 62 63
68 70 70 71 89 92 93 112 115 121 122 124 140]
Mean = 66.6, median = 62
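The numbers above can be checked with NumPy. This is a small sketch using the 31 values from the slide; treating -357 and -5 as the outliers to drop is our reading of the second sorted sequence.

```python
import numpy as np

x = np.array([121, -5, 93, 54, 39, 13, 45, 10, 51, -357, 52, 63, 34, 31, 23, 43,
              89, 92, 112, 62, 45, 33, 68, 140, 115, 122, 71, 70, 47, 70, 124])

# Location measures on the full data.
mean_all = x.mean()
median_all = np.median(x)

# Same measures after dropping the two outliers.
trimmed = x[(x != -357) & (x != -5)]
mean_trim = trimmed.mean()
median_trim = np.median(trimmed)

print(f"all data: mean = {mean_all:.1f}, median = {median_all:.0f}")
print(f"trimmed:  mean = {mean_trim:.1f}, median = {median_trim:.0f}")
```

On the full data the mean is about 50.6 and the median is 54. Removing two values shifts the mean by about 16 but the median by only 8, which illustrates that the mean is more sensitive to outliers.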
Mathematical Description of Mean and Median
‣ Definition of centers
‣ How to find them
$$\bar{x} = \arg\min_{x} \frac{1}{n} \sum_{i=1}^{n} \mathrm{dist}(x, x_i)$$
As the data distribution becomes more balanced, the mean and median get closer.
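A numerical sketch of the "center as a minimizer" idea. It assumes the standard choices of distance, squared distance $(x - x_i)^2$ for the mean and absolute distance $|x - x_i|$ for the median, and searches an integer grid of candidate centers instead of solving the problem analytically.

```python
import numpy as np

x = np.array([121, -5, 93, 54, 39, 13, 45, 10, 51, -357, 52, 63, 34, 31, 23, 43,
              89, 92, 112, 62, 45, 33, 68, 140, 115, 122, 71, 70, 47, 70, 124])

# Candidate centers: every integer between the smallest and largest value.
candidates = np.arange(x.min(), x.max() + 1)
diffs = candidates[:, None] - x[None, :]   # shape (num_candidates, n)

sq_cost = (diffs ** 2).mean(axis=1)        # average squared distance
abs_cost = np.abs(diffs).mean(axis=1)      # average absolute distance

best_sq = candidates[sq_cost.argmin()]
best_abs = candidates[abs_cost.argmin()]

print("minimizer of squared distance: ", best_sq, " mean   =", x.mean())
print("minimizer of absolute distance:", best_abs, " median =", np.median(x))
```

Because the grid contains only integers, the squared-distance minimizer is the integer closest to the mean rather than the mean itself.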
Spread of the data
The mean is the center of the data; spread characterizes the average deviation of the data points from that center.
Statistics for Single Variables: Spread Measures
[ 121 -5 93 54 39 13 45 10 51 -357 52 63 34 31 23 43
89 92 112 62 45 33 68 140 115 122 71 70 47 70 124]
Variance and Standard Deviation
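‣ Recall
$$\bar{x} = \arg\min_{x} \frac{1}{n} \sum_{i=1}^{n} (x - x_i)^2$$
‣ Variance is then defined as
$$\mathrm{Var} = \frac{1}{n} \sum_{i=1}^{n} (\bar{x} - x_i)^2$$
When $\bar{x}$ is estimated, we typically use the following formula
$$\mathrm{Var} = \frac{1}{n-1} \sum_{i=1}^{n} (\bar{x} - x_i)^2$$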
‣ Calculations
[ 121 -5 93 54 39 13 45 10 51 -357 52 63 34 31 23 43
89 92 112 62 45 33 68 140 115 122 71 70 47 70 124]
Variance and standard deviation are even more sensitive to outliers than the mean.
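A sketch of the calculation with NumPy, where `ddof=0` corresponds to the $1/n$ formula and `ddof=1` to the $1/(n-1)$ formula. Dropping -357 and -5 to show the effect of outliers is our own choice of comparison.

```python
import numpy as np

x = np.array([121, -5, 93, 54, 39, 13, 45, 10, 51, -357, 52, 63, 34, 31, 23, 43,
              89, 92, 112, 62, 45, 33, 68, 140, 115, 122, 71, 70, 47, 70, 124])
n = len(x)

# Variance by hand, following the 1/(n-1) formula.
x_bar = x.mean()
var_manual = ((x_bar - x) ** 2).sum() / (n - 1)

var_n = x.var(ddof=0)        # divide by n
var_n1 = x.var(ddof=1)       # divide by n - 1
std_n1 = x.std(ddof=1)       # standard deviation = square root of the variance

# Same spread measure without the two outliers.
trimmed = x[(x != -357) & (x != -5)]
std_trim = trimmed.std(ddof=1)

print(f"variance (1/n): {var_n:.1f}, variance (1/(n-1)): {var_n1:.1f}")
print(f"std with outliers: {std_n1:.1f}, std without outliers: {std_trim:.1f}")
```

Since deviations are squared, a single value far from the center such as -357 contributes a very large term to the sum.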
Statistics for Two Variables
‣ Covariance: what happens if xi = −yi?
‣ Correlation: what happens if xi = −yi?
‣ Calculations
[ 144 91 93 15 73 153 176 164 87 50 113 110 225 178 123 86 72 52
86 158 130 178 30 16 161 68 148 145 109 18]
[12022 9727 10551 7054 11141 9061 11053 9861 9154 7392 16056 8746 8760
12292 10053 8263 7068 8696 6669 7557 9726 7495 1555 1179 19950 5305 7143
9241 8033 1363]
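A sketch of the calculation for the two sequences above. It assumes the usual sample covariance with the $1/(n-1)$ factor and the Pearson correlation, which are what `np.cov` and `np.corrcoef` compute. The names `x` and `y` are ours.

```python
import numpy as np

x = np.array([144, 91, 93, 15, 73, 153, 176, 164, 87, 50, 113, 110, 225, 178, 123,
              86, 72, 52, 86, 158, 130, 178, 30, 16, 161, 68, 148, 145, 109, 18])
y = np.array([12022, 9727, 10551, 7054, 11141, 9061, 11053, 9861, 9154, 7392,
              16056, 8746, 8760, 12292, 10053, 8263, 7068, 8696, 6669, 7557,
              9726, 7495, 1555, 1179, 19950, 5305, 7143, 9241, 8033, 1363])

# Off-diagonal entries of the 2x2 matrices are the pairwise statistics.
cov_xy = np.cov(x, y)[0, 1]
corr_xy = np.corrcoef(x, y)[0, 1]

# Special case x_i = -y_i: compare x with its own negation.
cov_neg = np.cov(x, -x)[0, 1]
corr_neg = np.corrcoef(x, -x)[0, 1]

print(f"cov(x, y) = {cov_xy:.1f}, corr(x, y) = {corr_xy:.3f}")
print(f"cov(x, -x) = {cov_neg:.1f}, corr(x, -x) = {corr_neg:.3f}")
```

In the special case the covariance equals minus the variance of `x` and the correlation is -1. Correlation is the covariance divided by both standard deviations, so it always lies between -1 and 1 regardless of the units of `x` and `y`.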
‣ Data transformations
‣ Data visualizations
‣ Simple Statistics