lecture3

The document outlines the schedule and key topics for Lecture 3 of CCST9047: The Age of Big Data, including tutorials and the upcoming final project details. It recaps previous lectures on data types and introduces concepts of data transformation, visualization, and statistics. The lecture emphasizes the importance of data preparation, handling missing values, and various visualization methods to analyze and interpret data effectively.

Uploaded by

lokhimtam11
Copyright
© © All Rights Reserved

CCST9047: The Age of Big Data
Lecture 3
Reminders

‣ We only have 4 tutorials in the weeks of:

‣ 2.17 - 2.21 (next week), 2.24 - 2.28, 3.24 - 3.28, 3.31 - 4.4
Weekday  Start time  End time  Venue
THU      10:30       11:20     HW311
THU      11:30       12:20     HW311
THU      13:30       14:20     MB226
THU      14:30       15:20     MB226
THU      15:30       16:20     MB226
THU      16:30       17:20     MB226
THU      17:30       18:20     MB226
FRI      10:30       11:20     MB154
FRI      11:30       12:20     MB127
FRI      13:30       14:20     MB127

If you haven’t registered, please do it asap!
Reminders

‣ The detailed description of the final project will be released by this evening.

Recap of Previous Lecture

‣ Daily applications

‣ Web data

‣ Social Network data

‣ Image and Video

‣ Text data

‣ Sensor data

‣ Medical data

Big Data: Volume, Velocity, Variety


Recap of Previous Lecture

‣ Key points:

‣ How data is generated and collected in daily applications.

‣ How these data can enable better services.

‣ How data is formalized and organized with various data types.

‣ Properties of different data.

‣ Simple calculations of the data size.

Then, after collecting data, how do we organize them and do some simple analysis?
Things That Will Be Covered

‣ Data transformations

‣ Data visualizations

‣ Simple data statistics


Data Transformation

‣ Data are the results of deliberate human intervention

‣ Data vary across domains

‣ Different domains have different forms of data

‣ Data vary within domains

‣ Data within the same domain have different values.

The raw data is complex, and we need some amount of preparation!

Data Preparation

‣ Deliverables from big data

‣ Building prediction tools

‣ Building decision-making models

‣ Generating figures and reports

A lot of things need to be done before preparing the deliverables.

Data Preparation

‣ After collecting the data, we should

‣ Understand what the variables are

‣ Manage column types

‣ Handle missing values

Data is typically organized as a table, with columns and rows.

Python:

import pandas as pd

url = 'https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv'
data = pd.read_csv(url)

Data Table and Type

‣ What do tables mean?

‣ Collection of different variables

‣ Collections of multiple observations

[Figure: data table; the columns are different variables and each row is an observation]
Data Table and Type

‣ What do columns mean?

‣ Different features

‣ Different aspects of the data point

[Figure: data table with the columns marked as different features]
Data Table and Type

‣ How were the data collected?

‣ Province_State and Country_Region are collected by humans

‣ Last_update time is collected by the clock

‣ Lat and Long_ are collected by GPS

‣ Confirmed and Death counts are collected by humans

Column          Data type  Description
Province_State  String     State name
Country_Region  String     Country name
Last_update     Datetime   Time of update
Lat             Float      Value of latitude
Long_           Float      Value of longitude
Confirmed       Int        Number of confirmed cases
Death           Int        Number of deaths
Features and Observations

‣ Features represent the values collected in different manners

‣ Collected by different sensors

‣ Collected at different locations

‣ Collected for different objects

‣ Collected at different times

‣ …

‣ Observations represent the data point containing the corresponding features

‣ Can be different at different times

‣ Can be different for different individuals

‣ …
https://en.wikipedia.org/wiki/Moore%27s_law
One Single Observation

‣ Covid data

[Figure: Covid data table with one observation (row) highlighted]

Observation
• Each observation is a collection of feature values.
• Different observations refer to different states and the corresponding information.
Managing Data Types

‣ Data has different “types”

‣ Numerical, categorical, dates, integers, strings

‣ We need to manage some of them to make their information clearer

‣ Numerical methods cannot directly handle string-type data

Column          Data type  Description
Province_State  String     State name
Country_Region  String     Country name
Last_update     String     Time of update
Lat             Float      Value of latitude
Long_           Float      Value of longitude
Confirmed       Int        Number of confirmed cases
Death           Int        Number of deaths
Managing Data Types: Dates

‣ The update time is stored as a string

‣ To make better use of it, we need to convert the string to a datetime

String → Datetime

“2023-01-23 04:31:38” → datetime(2023, 1, 23, 4, 31, 38)

A single datetime string can be used to derive a number of new features, e.g., year, month, day, …
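
The conversion above can be sketched with Python's standard library. This is only an illustration: the format string is an assumption about how the update time is written, and the names `raw`, `stamp`, and `features` are made up for this example.

```python
from datetime import datetime

# The update time as it is stored in the table: a plain string.
raw = "2023-01-23 04:31:38"

# Parse the string into a datetime object using an explicit format.
stamp = datetime.strptime(raw, "%Y-%m-%d %H:%M:%S")

# Derive several new features from the single datetime value.
features = {
    "year": stamp.year,
    "month": stamp.month,
    "day": stamp.day,
    "hour": stamp.hour,
    "weekday": stamp.weekday(),  # 0 = Monday, 6 = Sunday
}

print(stamp)
print(features)
```

Once the value is a datetime rather than a string, comparisons and differences between update times also become possible.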
Managing Data Types: Categorical

‣ Categorical types are common in many datasets

‣ A categorical variable is a variable that can take on one of a limited, and usually fixed, number of possible values/groups

Province_State is a categorical variable, as it can only be one of the states or small islands in the US.
Managing Data Types: Categorical

‣ Managing categorical variables

‣ The number of groups may be large; merge some values into one group.

{Alabama, Alaska, American Samoa, Arizona, Arkansas, California} → {Arizona, California, and others}.

6 groups reduce to 3 groups.
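
A minimal pandas sketch of this merge, assuming the states sit in a column called Province_State. The list `keep` and the label "Others" are hypothetical choices made for this illustration.

```python
import pandas as pd

# Toy column using the values from the example above.
states = pd.Series(
    ["Alabama", "Alaska", "American Samoa", "Arizona", "Arkansas", "California"],
    name="Province_State",
)

# Hypothetical choice: the groups we want to keep as they are.
keep = ["Arizona", "California"]

# Keep the chosen values and replace everything else with "Others".
merged = states.where(states.isin(keep), "Others")

print(merged.tolist())
print(merged.nunique())  # number of groups after merging
```

Which groups to keep is a modelling decision; a common rule is to keep the most frequent values and merge the rest.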


Managing Data Types: Categorical

‣ Managing categorical variables

‣ A single categorical variable might encode multiple pieces of information; we need to decompose it to better organize the information.

Location                  Province_State    Country
California, US            California        US
Arizona, US               Arizona           US
Guangdong, China          Guangdong         China
British Columbia, Canada  British Columbia  Canada

Decompose a categorical variable by using more features.
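
One possible way to do this decomposition in pandas is sketched below. It assumes every Location value has the form "province, country" with a single ", " separator, which holds for the four rows shown but may not hold for real data.

```python
import pandas as pd

# The combined Location column from the example above.
df = pd.DataFrame({
    "Location": [
        "California, US",
        "Arizona, US",
        "Guangdong, China",
        "British Columbia, Canada",
    ]
})

# Split each string once at ", " and store the two parts as new columns.
parts = df["Location"].str.split(", ", n=1, expand=True)
df["Province_State"] = parts[0]
df["Country"] = parts[1]

print(df)
```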


Managing Data Types: Categorical

‣ Managing categorical variables

‣ Converting categorical variables to numerical values

Using integers: the FIPS code maps each state to a number
Managing Data Types: Categorical

‣ Managing categorical variables

‣ Converting categorical variables to numerical values

One-hot encoding

Correct?    One-hot vector
TRUE        [1, 0, 0]
FALSE       [0, 1, 0]
Don’t know  [0, 0, 1]

• Each value of the variable is represented as a one-hot vector.
• Dimension/number of entries of the vector is the number of categorical groups.
• One entry is 1 and the remaining entries are 0.
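
A short sketch of one-hot encoding with pandas, using the three groups from the example above. The sample `answers` list is made up, and fixing the category order is a choice made here so that the columns line up with the vectors shown.

```python
import pandas as pd

# Made-up answers to the "Correct?" question, one per observation.
answers = ["TRUE", "FALSE", "Don't know", "TRUE"]

# Fix the group order so the columns match the vectors in the example.
groups = ["TRUE", "FALSE", "Don't know"]
column = pd.Series(pd.Categorical(answers, categories=groups), name="Correct?")

# One column per group; exactly one entry in each row is 1.
one_hot = pd.get_dummies(column, dtype=int)

print(one_hot)
```

Unlike integer codes, the one-hot vectors do not suggest any ordering between the groups.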
Handling Missing Values

‣ Missing values occur frequently

‣ Caused by human error, sensor failures, data collection constraints

‣ If we can choose not to use a missing value (typically marked as NA or NaN), we can consider removing the entire rows with missing values

[Figure: data table with missing entries shown as NaN]
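
Removing rows with missing values can be sketched in pandas as follows. The table is a toy example with made-up numbers, not the real Covid data.

```python
import pandas as pd

# Toy table with made-up numbers; None marks a value that was not recorded.
df = pd.DataFrame({
    "state": ["Alabama", "Alaska", "American Samoa", "Arizona"],
    "cases": [100, 20, 5, 80],
    "deaths": [3.0, None, None, 2.0],
})

# Count how many values are missing in each column.
print(df.isna().sum())

# Remove every row that has at least one missing value.
complete = df.dropna()
print(complete)
```

Dropping rows is simple but discards the other, valid entries in those rows, so it is best suited to cases where only a few rows are affected.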
Handling Missing Values

‣ Missing values occur frequently

‣ If we need to use it for building models, we may need imputation

‣ Using the average from the observed cases

‣ Predict the missing value by prior knowledge (e.g., too large, too small)

‣ Predict the missing value using simple rules (mean, mode, etc.)

Example: the missing (NaN) number of deaths for American Samoa is very likely smaller than 100.
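
A minimal sketch of rule-based imputation with pandas. The numbers are made up, and using the median as a second rule is a choice made for this illustration alongside the mean mentioned above.

```python
import pandas as pd

# Toy deaths column with made-up numbers and two missing values.
deaths = pd.Series([2.0, None, 5.0, None, 11.0], name="deaths")

# Rule 1: fill with the mean of the observed cases (mean() skips NaN).
filled_mean = deaths.fillna(deaths.mean())

# Rule 2: fill with the median of the observed cases.
filled_median = deaths.fillna(deaths.median())

print(filled_mean.tolist())
print(filled_median.tolist())

# Prior knowledge can serve as a sanity check on the imputed values.
print((filled_mean < 100).all())
```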
Handling Missing Values

‣ Missing values occur frequently

‣ If we need to use it for building models, we may need imputation

‣ There are many more powerful methods for handling missing values by using more data points.

‣ Many modern machine learning models allow masking the missing values and implicitly complete them during model training.
Data Visualization

‣ Visualization can help you get a qualitative understanding of the data

‣ Visualization can help you know what happens, good or bad

‣ See if the visualization matches your understanding

‣ Visualization can also help you understand the data from different perspectives

‣ Correlations between different features

‣ Trend of the data

Visualization Encodings

‣ Visualization represents data using graphical marks

‣ Different attributes of the marks encode data variables

‣ Marks allow us to make comparisons

‣ We can further reason about the data based on the visualization

Mark attributes: x, y positions; different size; different color; different shape

Reasoning: the value will keep increasing in the future

Visualization Encodings

‣ Visualization represents data using graphical marks

‣ Some encodings are better for some variable types

‣ Size is better for continuous, not good for categorical

‣ For categorical, it would be better to use position

‣ Some encodings are easier to perceive

‣ Color is better than shape

‣ But do not use too many colors


Different Visualization

‣ Given a dataset, we have different visualization methods

‣ Using a curve to see the trend

import numpy as np
import matplotlib.pyplot as plt

# data is the table loaded with pd.read_csv
data_ca = data[data['state']=='California']
# differences of consecutive case counts give the daily cases
daily_cases = np.diff(data_ca['cases'].to_numpy())
plt.plot(daily_cases)

Figure: Daily covid cases in California. Each point can be viewed as a data pair (x, y).

What can we conclude and reason about?

• There are 4 waves.
• The second wave is stronger than the first one.
• The daily cases will likely decrease in the coming days.
Different Visualization

‣ Given a dataset, we have different visualization methods

‣ Using curves to make comparisons

Figure: Daily covid cumulative cases in California and Texas

What can we conclude and reason about?

• California has more cases than Texas.
• The difference is becoming larger.
• The cumulative cases are still increasing at a steady rate.
Different Visualization

‣ Given a dataset, we have different visualization methods

‣ Using a pie chart to highlight the ratio of different components

Figure: Cumulative cases pie chart

What can we conclude and reason about?

• California accounts for 12% of all cases.
• There are 4 states that each account for over 5% of all cases.
• Can help us understand which component takes the largest ratio.
Different Visualization

‣ Given a dataset, we have different visualization methods

‣ Using a bar chart to highlight the comparison

Figure: Daily cases comparison at different dates

What can we conclude and reason about?

• 2020-08-12 has significantly more cases than 2020-03-15.
• 2021-02-28 has fewer cases than 2020-08-12.
• Cases in Aug. 2020 will be more than those in May 2020.
Different Visualization

‣ Given a large number of observations, we can

‣ Use histogram to visualize the distribution density

‣ Count within bin ranges


Figure: Histogram of the daily cases

What can we see and reason about?
• Most daily case counts are less than 20000.
• There are some days with cases greater than 30000.
• Daily cases are mostly concentrated around 2000-4000.
Different Visualization

‣ Given a large number of observations, we can

‣ Use histogram to visualize the distribution density

‣ Count within bin ranges

‣ Histogram of the daily cases


Normalizing to a distribution: make the total area equal to 1.

• Only the y-axis changes


• The shape is the same!
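
The normalization can be checked numerically with NumPy. This sketch uses the Sept. 2021 daily case counts for California listed later in this lecture as sample data; the choice of 5 bins is arbitrary and only for illustration.

```python
import numpy as np

# Daily Covid cases in Sept. 2021 in California (from this lecture).
cases = np.array([
    12022, 9727, 10551, 7054, 11141, 9061, 11053, 9861, 9154, 7392,
    16056, 8746, 8760, 12292, 10053, 8263, 7068, 8696, 6669, 7557,
    9726, 7495, 1555, 1179, 19950, 5305, 7143, 9241, 8033, 1363,
])

# Plain histogram: count the observations that fall into each bin.
counts, edges = np.histogram(cases, bins=5)

# Normalized histogram: density=True rescales the bars so the area is 1.
density, _ = np.histogram(cases, bins=5, density=True)

widths = np.diff(edges)
total_area = (density * widths).sum()
print(counts)
print(total_area)

# Same shape: each density value is count / (n * bin width).
same_shape = np.allclose(density, counts / (counts.sum() * widths))
print(same_shape)
```

Because all bins have the same width here, the normalized bars are the counts divided by one constant, which is why only the y-axis changes.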
Different Visualization

‣ Given a large number of observations, we can

‣ Use scatter plot to visualize the distribution

‣ Also infer the relationships between different features

Figure: Distribution of cases and deaths

What can we see and reason about?

• There are some outliers.
• Larger daily cases typically imply larger daily deaths.
Different Visualization

‣ Visualizations are now easy to generate using Python

Some details will be discussed in the tutorial!


Simple Data Stats

‣ Simple data statistics help summarize the properties of the data

‣ Statistics provide a quantitative understanding

‣ Statistics for a single variable

‣ mean, median

‣ variance, standard deviation

‣ Statistics for two variables

‣ Covariance

‣ Correlation
Statistics for Single Variables: Location Measures

‣ Given a large number of observations, we aim to understand

‣ What’s the typical value of them

‣ What’s the center of them

Daily Covid deaths in Aug. 2021 in California

[ 121 -5 93 54 39 13 45 10 51 -357 52 63 34 31 23 43
89 92 112 62 45 33 68 140 115 122 71 70 47 70 124]
Statistics for Single Variables: Location Measures

‣ Location measures: mean and median

‣ Mean denotes the average of samples

‣ Median denotes the sample at the center of the distribution, i.e., # samples < median = # samples > median

Given a set of values $x_1, x_2, \ldots, x_n$, the mean is calculated using the following formula:

$$\bar{x} = \operatorname{mean}(x) = \frac{1}{n}\sum_{i=1}^{n} x_i$$

Given a set of values $x_1, x_2, \ldots, x_n$, the median is calculated as follows:

• First sort the values from smallest to largest: $x_{(1)}, x_{(2)}, \ldots, x_{(n)}$
• Then pick the middle value:
• If n is odd, pick $x_{((n+1)/2)}$
• If n is even, pick $[x_{(n/2)} + x_{(n/2+1)}]/2$.
Statistics for Single Variables: Location Measures

‣ Location measures: mean and median

‣ Mean denotes the average of samples

‣ Median denotes the sample at the center of the distribution, i.e., # samples < median = # samples > median

Daily Covid deaths in Aug. 2021 in California

[ 121 -5 93 54 39 13 45 10 51 -357 52 63 34 31 23 43 89
92 112 62 45 33 68 140 115 122 71 70 47 70 124]

Sorted sequence:

[-357 -5 10 13 23 31 33 34 39 43 45 45 47 51 52 54
62 63 68 70 70 71 89 92 93 112 115 121 122 124 140]

Mean = 50.6, median = 54


Statistics for Single Variables: Location Measures

‣ Location measures: mean and median

‣ Changing the samples changes the mean and median

‣ Median is more stable after removing a small number of outliers

Daily Covid deaths in Aug. 2021 in California

[ 121 -5 93 54 39 13 45 10 51 -357 52 63 34 31 23 43
89 92 112 62 45 33 68 140 115 122 71 70 47 70 124]

Removing unreasonable samples (-5 and -357), sorted sequence:

[10 13 23 31 33 34 39 43 45 45 47 51 52 54 62 63
68 70 70 71 89 92 93 112 115 121 122 124 140]

Mean = 66.6, median = 62
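
The numbers above can be reproduced with Python's standard `statistics` module. One assumption is labeled in the code: a negative daily death count is treated as an unreasonable sample.

```python
import statistics

# Daily Covid deaths in Aug. 2021 in California (from this lecture).
deaths = [121, -5, 93, 54, 39, 13, 45, 10, 51, -357, 52, 63, 34, 31, 23, 43,
          89, 92, 112, 62, 45, 33, 68, 140, 115, 122, 71, 70, 47, 70, 124]

# Assumption: negative daily death counts are unreasonable and are removed.
cleaned = [d for d in deaths if d >= 0]

for label, values in [("all samples", deaths), ("outliers removed", cleaned)]:
    mean = round(statistics.mean(values), 1)
    median = statistics.median(values)
    print(label, "mean =", mean, "median =", median)
```

Removing two samples moves the mean by about 16 but the median by only 8, which illustrates why the median is considered more stable.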
Mathematical Description of Mean and Median

‣ Find the center of a number of samples

‣ Definition of centers

‣ How to find them

Mathematically, the center is defined as the point such that the average distance to this point is minimized.

$$\bar{x} = \arg\min_{x} \frac{1}{n}\sum_{i=1}^{n} \operatorname{dist}(x, x_i)$$
Mathematical Description of Mean and Median

‣ Find the center of a number of samples

‣ Definition of centers

‣ Distance metric can be absolute deviation: $|x_i - x_j|$

‣ Distance metric can be squared deviation: $(x_i - x_j)^2$

Mean minimizes the average distance using squared deviation:

$$\bar{x} = \arg\min_{x} \frac{1}{n}\sum_{i=1}^{n} (x - x_i)^2$$

Median minimizes the average distance using absolute deviation:

$$x_{\text{median}} = \arg\min_{x} \frac{1}{n}\sum_{i=1}^{n} |x - x_i|$$
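
These two minimization statements can be checked by brute force on the Aug. 2021 deaths data. The sketch below only tries integer candidates, so the squared-deviation minimizer comes out as the integer nearest to the mean rather than the mean itself; the helper names are made up for this illustration.

```python
import statistics

# Daily Covid deaths in Aug. 2021 in California (from this lecture).
deaths = [121, -5, 93, 54, 39, 13, 45, 10, 51, -357, 52, 63, 34, 31, 23, 43,
          89, 92, 112, 62, 45, 33, 68, 140, 115, 122, 71, 70, 47, 70, 124]


def squared(a, b):
    """Squared deviation between two values."""
    return (a - b) ** 2


def absolute(a, b):
    """Absolute deviation between two values."""
    return abs(a - b)


def average_distance(center, samples, dist):
    """Average distance from a candidate center to every sample."""
    return sum(dist(center, x) for x in samples) / len(samples)


# Brute force: try every integer between the smallest and largest sample.
candidates = range(min(deaths), max(deaths) + 1)
best_squared = min(candidates, key=lambda c: average_distance(c, deaths, squared))
best_absolute = min(candidates, key=lambda c: average_distance(c, deaths, absolute))

print(best_squared, statistics.mean(deaths))
print(best_absolute, statistics.median(deaths))
```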
Visualization of the mean and median

‣ For different distributions, the median and mean behave differently

As the data distribution becomes more balanced, the mean and median get closer.
Spread of the data

‣ How to characterize the spread of the data

The mean is the center of the data; the spread characterizes the average deviation of the data points from that center.
Statistics for Single Variables: Spread Measures

‣ Given a large number of observations, the spread also shows

‣ The uncertainty among them

‣ The variation of the values around the mean

Daily Covid deaths in Aug. 2021 in California

[ 121 -5 93 54 39 13 45 10 51 -357 52 63 34 31 23 43
89 92 112 62 45 33 68 140 115 122 71 70 47 70 124]
Variance and Standard Deviation

‣ Spread measures: variance and standard deviation

‣ The average distance to the mean

‣ Recall

$$\bar{x} = \arg\min_{x} \frac{1}{n}\sum_{i=1}^{n} (x - x_i)^2$$

‣ Variance is then denoted as

$$\mathrm{Var} = \frac{1}{n}\sum_{i=1}^{n} (\bar{x} - x_i)^2$$

‣ Standard deviation is the square root of the variance

$$\mathrm{std} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (\bar{x} - x_i)^2}$$

Std has the same unit as the sample.

When $\bar{x}$ is estimated, we typically use the following formulas:

$$\mathrm{Var} = \frac{1}{n-1}\sum_{i=1}^{n} (\bar{x} - x_i)^2, \qquad \mathrm{std} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (\bar{x} - x_i)^2}$$
Variance and Standard Deviation

‣ Calculations

[ 121 -5 93 54 39 13 45 10 51 -357 52 63 34 31 23 43
89 92 112 62 45 33 68 140 115 122 71 70 47 70 124]

Variance = 6852, std = 83

Removing unreasonable samples (-5 and -357):

[ 121 93 54 39 13 45 10 51 52 63 34 31 23 43
89 92 112 62 45 33 68 140 115 122 71 70 47 70 124]

Variance = 1232, std = 35

Variance and standard deviation are even more sensitive to the outliers.
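
The calculations above can be reproduced with the standard `statistics` module. The reported values correspond to the 1/n formulas (`pvariance`, `pstdev`); the 1/(n-1) version (`variance`) is printed as well for comparison. Treating negative counts as the outliers is an assumption of this sketch.

```python
import statistics

# Daily Covid deaths in Aug. 2021 in California (from this lecture).
deaths = [121, -5, 93, 54, 39, 13, 45, 10, 51, -357, 52, 63, 34, 31, 23, 43,
          89, 92, 112, 62, 45, 33, 68, 140, 115, 122, 71, 70, 47, 70, 124]

# Assumption: negative daily death counts are the outliers to remove.
cleaned = [d for d in deaths if d >= 0]

for label, values in [("all samples", deaths), ("outliers removed", cleaned)]:
    var_n = statistics.pvariance(values)   # divides by n
    std_n = statistics.pstdev(values)      # square root of var_n
    var_n1 = statistics.variance(values)   # divides by n - 1
    print(label, round(var_n), round(std_n), round(var_n1))
```

Two samples out of 31 account for most of the variance here, because deviations are squared before they are averaged.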
Statistics for Two Variables

‣ Given a large number of observations of two features, we aim to understand

‣ What’s the correlation between them

Daily Covid deaths in Sept. 2021 in California

[ 144 91 93 15 73 153 176 164 87 50 113 110 225 178 123 86 72 52


86 158 130 178 30 16 161 68 148 145 109 18]

Daily Covid cases in Sept. 2021 in California


[12022 9727 10551 7054 11141 9061 11053 9861 9154 7392 16056 8746 8760
12292 10053 8263 7068 8696 6669 7557 9726 7495 1555 1179 19950 5305 7143
9241 8033 1363]
Statistics for Two Variables

• Deaths and cases are positively correlated.
• How do we quantify the correlation?
Covariance

‣ Covariance: mean of the product of deviations

$$\operatorname{cov}(x, y) = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$

‣ Consider extreme cases:

‣ What happens if $x_i = y_i$?

‣ What happens if $x_i = -y_i$?

‣ What are the units of the covariance?


Correlation
‣ Correlation

‣ Covariance depends on the scaling of the variables

‣ But correlation does not

$$\operatorname{corr}(x, y) = \frac{\operatorname{cov}(x, y)}{\operatorname{std}(x)\,\operatorname{std}(y)}$$

‣ Consider extreme cases:

‣ What happens if $x_i = y_i$?

‣ What happens if $x_i = -y_i$?

‣ What is the unit of the correlation? It has no unit, and its value lies in [-1, 1].


Covariance and Correlations

‣ Calculations
[ 144 91 93 15 73 153 176 164 87 50 113 110 225 178 123 86 72 52
86 158 130 178 30 16 161 68 148 145 109 18]

[12022 9727 10551 7054 11141 9061 11053 9861 9154 7392 16056 8746 8760
12292 10053 8263 7068 8696 6669 7557 9726 7495 1555 1179 19950 5305 7143
9241 8033 1363]

Covariance = 115786 (hard to interpret), correlation = 0.55 (easier to interpret)

Correlation is more informative than covariance.
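
A plain-Python sketch of these calculations follows. The helper functions and the `ddof` parameter are names introduced for this illustration. The reported covariance of 115786 appears to correspond to the 1/(n-1) normalization (`ddof=1`), while the 1/n formula gives a slightly smaller number; the correlation is the same under either normalization.

```python
import math

# Daily Covid deaths and cases in Sept. 2021 in California (from this lecture).
deaths = [144, 91, 93, 15, 73, 153, 176, 164, 87, 50, 113, 110, 225, 178, 123,
          86, 72, 52, 86, 158, 130, 178, 30, 16, 161, 68, 148, 145, 109, 18]
cases = [12022, 9727, 10551, 7054, 11141, 9061, 11053, 9861, 9154, 7392,
         16056, 8746, 8760, 12292, 10053, 8263, 7068, 8696, 6669, 7557,
         9726, 7495, 1555, 1179, 19950, 5305, 7143, 9241, 8033, 1363]


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def covariance(x, y, ddof=0):
    """Mean product of deviations; ddof=1 divides by n - 1 instead of n."""
    mx, my = mean(x), mean(y)
    total = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return total / (len(x) - ddof)


def correlation(x, y):
    """Covariance rescaled by the standard deviations of both variables."""
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))


print(round(covariance(deaths, cases)))          # 1/n normalization
print(round(covariance(deaths, cases, ddof=1)))  # 1/(n - 1) normalization
print(round(correlation(deaths, cases), 2))

# Rescaling one variable changes the covariance but not the correlation.
cases_in_thousands = [c / 1000 for c in cases]
print(covariance(deaths, cases_in_thousands), correlation(deaths, cases_in_thousands))
```

The extreme cases from the earlier questions can be tried with the same helpers: `correlation(deaths, deaths)` gives 1 and `correlation(deaths, [-d for d in deaths])` gives -1, up to floating-point rounding.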


Summary

‣ Data transformations

‣ Features, observations, data type, tidy data

‣ Data visualizations

‣ Markers, plots, comparisons

‣ Simple Statistics

‣ Mean, median, variance, correlation
