FIT1043 - Lecture 3 - 2024
Mahsa Salehi*
Semester 2, 2024
[Figure: unit overview mapping teaching weeks (Weeks 2 & 8, 3, 4, 5-7, 9-10, 11, 12) to topics; recoverable topic labels: "Regression analysis", "Tools for data science"]
Assessments Overview
Assessments:
• Assignment 1 (Weeks 2,3,4)
• Assignment 2 (Weeks 2-7)
• Assignment 3 (Weeks 8,9, 10)
• Final Exam (Weeks 1-12)
Assessment
• Assignment 1
• Python assessment
• Will be released later this week
Our Standard Value Chain
[Figure: the standard value chain, with last week's and this week's stages marked; recoverable label: "Tools for data science"]
Aggregation and groupby
Input:
Gender   Age
female   38
male     22
female   26
female   35
male     35
Split (group 1):
Gender   Age
female   38
female   26
female   35
Split (group 2):
Gender   Age
male     22
male     35
Apply (mean):
Gender   Age
female   33
male     28.5
Combine:
Class    Average Age
female   33
male     28.5
Poll:
Write the Python code.
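One possible answer to the poll, as a minimal sketch. The DataFrame is re-typed from the slide's input table; the variable names `df` and `class_average` are illustrative and not from the lecture.

```python
import pandas as pd

# Input table from the slide (column names as shown there).
df = pd.DataFrame({
    "Gender": ["female", "male", "female", "female", "male"],
    "Age": [38, 22, 26, 35, 35],
})

# Split by Gender, apply the mean to Age, combine into one Series.
class_average = df.groupby("Gender")["Age"].mean()
print(class_average)
```

The three steps of split, apply and combine all happen inside the single `groupby(...).mean()` call.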
Advanced Aggregation (1)
Run multiple aggregation operators at once:
>>> fun = {'who':'count','age':'mean'}
>>> groupbyClass = titanic.groupby('class').agg(fun)
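To try this without the lecture's dataset, here is a small sketch. The rows are made up and only stand in for the real `titanic` DataFrame; the column names `class`, `who` and `age` are taken from the slide.

```python
import pandas as pd

# Hypothetical stand-in for the lecture's `titanic` DataFrame (made-up rows).
titanic = pd.DataFrame({
    "class": ["First", "First", "Third", "Third", "Third"],
    "who": ["woman", "man", "man", "child", "woman"],
    "age": [38.0, 54.0, 22.0, 4.0, 27.0],
})

# One aggregation per column: count the `who` entries, average the `age` values.
fun = {"who": "count", "age": "mean"}
groupbyClass = titanic.groupby("class").agg(fun)
print(groupbyClass)
```

The result has one row per class and one column per key of the dictionary.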
Advanced Aggregation (2)
Write custom aggregators using anonymous
functions:
>>> fun = {'age': ['nunique', lambda x: sum(e > 50 for e in x)]}
>>> groupbyClass = titanic.groupby('class').agg(fun)
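The same idea can be sketched with pandas' named aggregation, which gives the lambda's output a readable column name. This is a variation on the slide's code rather than the lecture's own version; the rows are made up, and the output names `unique_ages` and `over_50` are illustrative.

```python
import pandas as pd

# Hypothetical stand-in for the lecture's `titanic` DataFrame (made-up rows).
titanic = pd.DataFrame({
    "class": ["First", "First", "First", "Third", "Third"],
    "age": [38.0, 54.0, 54.0, 22.0, 61.0],
})

# Named aggregation: output_column=(input_column, aggregator).
groupbyClass = titanic.groupby("class").agg(
    unique_ages=("age", "nunique"),                     # built-in aggregator
    over_50=("age", lambda x: sum(e > 50 for e in x)),  # custom: count ages above 50
)
print(groupbyClass)
```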
▪ Data visualisation
▪ Why?
▪ Basic data types
▪ Different graphical representations
▪ Descriptive statistics
• Categorical-Nominal:
• Discrete number of values, no inherent ordering
• E.g., country of birth, sex
• Categorical-Ordinal:
• Discrete number of states, but with an ordering
• E.g., Education status, State of disease progression
• Numeric-Discrete:
• Numeric, but the values are enumerable
• E.g., Number of live births, Age (in whole years)
• Numeric-Continuous:
• Numeric, not enumerable (i.e., real numbers)
• E.g., Weight, Height, Distance from CBD
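The four data types above can be sketched in pandas as follows. The records, the column names and the ordering of the education levels are all made up for illustration.

```python
import pandas as pd

# Made-up records, one column per data type from the list above.
people = pd.DataFrame({
    "country_of_birth": ["Australia", "India", "China"],  # categorical-nominal
    "education": ["secondary", "bachelor", "primary"],    # categorical-ordinal
    "live_births": [2, 0, 1],                             # numeric-discrete
    "height_cm": [171.5, 162.0, 180.2],                   # numeric-continuous
})

# Nominal: categories without an ordering.
people["country_of_birth"] = people["country_of_birth"].astype("category")

# Ordinal: categories with an explicit ordering (assumed here for illustration).
levels = ["primary", "secondary", "bachelor"]
people["education"] = pd.Categorical(
    people["education"], categories=levels, ordered=True
)

# Ordering-based operations such as min() make sense only for the ordinal column.
print(people["education"].min())
print(people.dtypes)
```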
Data Visualisation
• It is often useful to visualise data
• Can sometimes quickly reveal patterns
• However, going beyond two dimensions is
problematic
• For categorical data, standard visualisations
include:
• Bar graphs
• Pie charts
• For numeric data (continuous and discrete), we
can use:
• Histograms
• Box plots
Frequency Tables
Age (years)    Number of People
0-9            2,967,425
10-19          2,818,778
20-29          3,231,395
30-39          3,265,526
40-49          3,164,712
50-59          2,977,883
60-69          2,488,396
70-79          1,540,373
80+            947,411
[Figure: bar chart with one bar per age group from 0-9 to 80+; x-axis: Age (years), y-axis: Millions of People]
where
$w = \frac{\max\{y\} - \min\{y\}}{k}$
is the width of the bins
→ plot $v_1, \ldots, v_k$ using a bar chart
• Histograms are a type of bar chart, but specifically for
continuous data.
• Ordinary bar charts are applicable only to categorical data
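The bin-width formula can be sketched in NumPy. The surviving slide text does not define v_1, …, v_k, so the code assumes the usual meaning: v_j is the number of observations that fall into bin j. The sample is the one from the lecture's mean/median/mode poll, and k = 4 is an arbitrary choice.

```python
import numpy as np

y = np.array([1.0, 2.0, 2.0, 3.0, 4.0, 7.0, 9.0])  # sample from the lecture's poll
k = 4                                               # number of bins (arbitrary choice)

w = (y.max() - y.min()) / k             # bin width: (max{y} - min{y}) / k
edges = y.min() + w * np.arange(k + 1)  # k + 1 equally spaced bin edges

# Assumed meaning of v_1, ..., v_k: the count of observations in each bin.
v, _ = np.histogram(y, bins=edges)
print(w, edges, v)
```

Plotting `v` against the bins as a bar chart gives the histogram; choosing a different `k` changes the bin width and therefore the shape of the plot.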
Histograms: Example
[Figure: three histograms over the range -4 to 4; y-axis: Counts]
History
• The GapMinder technology was bought by Google, and
motion charts were renamed bubble charts
• The GapMinder website is now back up, run as a
not-for-profit
Motion Charts
Visualizing data in five dimensions: x-axis,
y-axis, size of bubble, color of bubble, and time
Motion Charts
Advantages:
► time dimension allows deeper insights and observing trends
► good for exploratory work
Poll
Compute Mean, Median and Mode of
1,2,2,3,4,7,9
A. 4,2,3
B. 5,3,2
C. 4,3,3
D. 4,3,2
Type      Example                    Result
Mean      (1+2+2+3+4+7+9) / 7        4
Median    1, 2, 2, 3, 4, 7, 9        3
Mode      1, 2, 2, 3, 4, 7, 9        2
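The same three values can be checked with Python's standard `statistics` module. This is a minimal sketch; the variable names are illustrative.

```python
import statistics

data = [1, 2, 2, 3, 4, 7, 9]

mean = statistics.mean(data)      # (1+2+2+3+4+7+9) / 7
median = statistics.median(data)  # middle value of the sorted sample
mode = statistics.mode(data)      # most frequent value
print(mean, median, mode)
```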
Mean vs Median
• The mean uses all the values of the sample
• Any change to any sample changes the mean
• The mean can be changed as much as desired by changing
just one sample by a large enough amount
• Example:
y = (1, 2, 3, 4, 5) ⇒ ȳ = 3, med(y) = 3
y = (1, 2, 3, 4, 50) ⇒ ȳ = 12, med(y) = 3
• Why might we want to use mean over median
then?
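The sensitivity of the mean to a single value can be sketched by making the largest sample progressively more extreme. The helper `mean_and_median` and the value 5000 are illustrative additions; the first two samples are the ones from the example above.

```python
import statistics

def mean_and_median(sample):
    """Return (mean, median) of a list of numbers."""
    return statistics.mean(sample), statistics.median(sample)

# Replace only the largest value and watch which statistic moves.
for largest in (5, 50, 5000):
    y = [1, 2, 3, 4, largest]
    y_bar, med = mean_and_median(y)
    print(f"largest={largest}: mean={y_bar}, median={med}")
```

Only the mean follows the extreme value; the median stays at the middle of the sorted sample.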
Mean vs Median: Symmetric
Distributions
The distribution refers to how the data is spread out around certain values or ranges
[Figure: histograms with the mean and median marked; one panel over the range -4 to 4, further panels over the range 0 to 40]
Measures of Spread (1)
[Figure: box plot of BMI (kg/m2), labelled from top to bottom: maximum, 3rd quartile, median, 1st quartile, minimum]
[Figure: box plot of BMI (kg/m2) with outliers, labelled: outliers, "whiskers", upper quartile (Q3, 3rd quartile), median, lower quartile (Q1, 1st quartile)]
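The quantities labelled on a box plot can be computed directly, as sketched below. The BMI values are made up, and the 1.5 × IQR rule for whiskers and outliers is an assumption: it is a common convention, but the surviving slide text does not say which rule the lecture's plot uses.

```python
import numpy as np

# Made-up BMI values (kg/m2); the last one is deliberately extreme.
bmi = np.array([19.0, 21.5, 23.0, 24.5, 26.0, 27.5, 29.0, 31.0, 52.0])

q1, median, q3 = np.percentile(bmi, [25, 50, 75])  # lower quartile, median, upper quartile
iqr = q3 - q1                                      # interquartile range

# Assumption: the common 1.5 * IQR convention for whiskers and outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
inside = bmi[(bmi >= lower_fence) & (bmi <= upper_fence)]
outliers = bmi[(bmi < lower_fence) | (bmi > upper_fence)]

# Whiskers end at the most extreme observations that are not outliers.
whisker_low, whisker_high = inside.min(), inside.max()
print(q1, median, q3, whisker_low, whisker_high, outliers)
```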
Example
[Figure: histogram over the range -8 to 8]
[Figure: scatter plot, x from -4 to 4]
R ≈ 0.44
Correlation/Scatter Plot
Example (2)
[Figure: scatter plot, x and y from -4 to 4]
R = 0.9
Correlation/Scatter Plot
Example (3)
[Figure: scatter plot, x and y from -4 to 4]
R ≈ 0.999
Correlation/Scatter Plot
Example (4)
[Figure: scatter plot, x from -4 to 4, y from 0 to 15]
Poll:
Is there a linear association between x
and y?
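A short sketch of why a scatter plot should accompany the correlation coefficient. It assumes R is Pearson's correlation coefficient, and it uses a quadratic relationship purely as an illustration; the data behind the slide's figure is not recoverable from the text.

```python
import numpy as np

x = np.linspace(-4, 4, 81)  # evenly spaced and symmetric around 0

y_linear = 2 * x + 1        # exact straight line
y_quadratic = x ** 2        # strong association, but not a linear one

# np.corrcoef returns the 2x2 correlation matrix; entry [0, 1] is R for the pair.
r_linear = np.corrcoef(x, y_linear)[0, 1]
r_quadratic = np.corrcoef(x, y_quadratic)[0, 1]
print(r_linear, r_quadratic)
```

R measures linear association only, so a value near zero does not rule out a clear non-linear pattern.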
[Figure: plots of Price ($1000s) for two suburbs, Bayswater and Knox]
[Figure: plot comparing two groups labelled Bulgarian and German]
Poll:
"Frequency of cancer" changes
substantially with blood pressure?
Data visualisation in Python
From the Python Data Science Handbook by
J. VanderPlas
Scatter Plots
Histograms
>>> df.col_name.hist(bins=4)
Boxplots
>>> df.boxplot(column='col_name')
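The calls above can be tried together on a small DataFrame, as in this sketch. The data and the column names `X` and `Y` are made up, the scatter-plot call `df.plot.scatter` is one pandas option and not necessarily the one shown in the lecture, and the `Agg` backend and the file name `plots.png` are choices made so that the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

# Made-up data; `X` and `Y` are placeholder column names.
df = pd.DataFrame({
    "X": [1, 2, 3, 4, 5, 6],
    "Y": [2.1, 3.9, 6.2, 8.1, 9.8, 12.2],
})

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
df.plot.scatter(x="X", y="Y", ax=axes[0])  # scatter plot of Y against X
df["Y"].hist(bins=4, ax=axes[1])           # histogram with 4 bins
df.boxplot(column="Y", ax=axes[2])         # box plot of one column
fig.savefig("plots.png")
```

In a Jupyter notebook the backend line and `savefig` can be dropped, since the figure is displayed inline.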
Home Activity: Motion Chart
Open your terminal (macOS) or Command Prompt (Windows)
and enter the following:
pip install motionchart
pip install pyperclip
• Data visualisation