
FIT1043 - Lecture 3 - 2024


FIT1043 Lecture 3

Introduction to Data Science

Mahsa Salehi*

Faculty of Information Technology, Monash University

Semester 2, 2024

*with material from © Daniel Schmidt


Unit Schedule
Week  Activities                                       Assignments
1     Overview of data science
2     Introduction to Python for data science
3     Data visualisation and descriptive statistics
4     Data sources and data wrangling
5     Data analysis theory                             Assignment 1
6     Regression analysis
7     Classification and clustering
8     Introduction to R for data science
9     Characterising data and "big" data               Assignment 2
10    Big data processing
11    Issues in data management
12    Industry guest lecture                           Assignment 3


Unit Overview
Collection: getting the data
Engineering: storage and computational resources
Governance: overall management of data
Wrangling: data preprocessing, cleaning
Analysis: discovery (learning, visualisation, etc.)
Presentation: arguing that results are significant and useful
Operationalisation: putting the results to work
[Figure: the weeks of the unit mapped onto the steps above. Labels: Week 1 (overview of data science); Weeks 9-10; Week 3; Week 4; Weeks 5-7; Week 11; Week 12; Weeks 2 & 8 (tools for data science).]
Assessments Overview
Assessments:
• Assignment 1 (Weeks 2,3,4)
• Assignment 2 (Weeks 2-7)
• Assignment 3 (Weeks 8,9, 10)
• Final Exam (Weeks 1-12)

Assessment
• Assignment 1
• Python assessment
• Will be released later this week
Our Standard Value Chain

[Figure: the standard value chain, annotated with "This week!", "Last week" and "Tools for data science".]
Aggregation and groupby

Input:
Gender  Age
female  38
male    22
female  26
female  35
male    35

Split (one group per gender):
Gender  Age        Gender  Age
female  38         male    22
female  26         male    35
female  35

Apply (mean):
Gender  Age        Gender  Age
female  33         male    28.5

Combine:
Gender  Average Age
female  33
male    28.5
Poll:
Write the Python code.
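One possible answer to the poll, as a minimal sketch. It assumes the input table is held in a pandas DataFrame with columns named Gender and Age; the variable names df and average_age are illustrative and do not come from the slides.

```python
import pandas as pd

# The input table from the slide
df = pd.DataFrame({
    'Gender': ['female', 'male', 'female', 'female', 'male'],
    'Age': [38, 22, 26, 35, 35],
})

# Split by Gender, apply the mean to Age, combine into one result
average_age = df.groupby('Gender')['Age'].mean()
print(average_age)
```

The result has one row per gender, matching the "Combine" table: 33 for female and 28.5 for male.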
Advanced Aggregation (1)
Run multiple aggregation operators at once:
>>> fun = {'who':'count','age':'mean'}
>>> groupbyClass = titanic.groupby('class').agg(fun)
Advanced Aggregation (2)
Write custom aggregators using anonymous
functions:
>>> fun = {'age': ['nunique', lambda x: sum(e > 50 for e in x)]}
>>> groupbyClass = titanic.groupby('class').agg(fun)

The syntax to create an anonymous function is to write 'lambda x:' followed by the function itself, where 'x' is the name of the variable that appears in the function.
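The two aggregation slides can be tried end to end with the sketch below. The small table is a made-up stand-in for the titanic data (the slides do not show how titanic is loaded), and the output column names passengers, mean_age and over_50 are illustrative. It uses pandas named aggregation, which is one of several ways to pass multiple aggregators.

```python
import pandas as pd

# Hypothetical stand-in for the titanic table; the values are made up
titanic = pd.DataFrame({
    'class': ['First', 'First', 'Second', 'Third', 'Third'],
    'who': ['man', 'woman', 'man', 'child', 'woman'],
    'age': [54.0, 38.0, 27.0, 4.0, 61.0],
})

# Several aggregators at once, including a custom anonymous function
summary = titanic.groupby('class').agg(
    passengers=('who', 'count'),                           # rows per class
    mean_age=('age', 'mean'),                              # average age per class
    over_50=('age', lambda x: sum(e > 50 for e in x)),     # passengers older than 50
)
print(summary)
```

Each keyword names an output column and pairs an input column with the operator to run on it.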
Outline

▪ Data visualisation
▪ Why?
▪ Basic data types
▪ Different graphical representations

▪ Descriptive statistics

▪ Python for data visualisation


Learning Outcomes (Week 3)

By the end of this week you should be able to:


► Comprehend the importance/power of data visualisation
► Differentiate between approaches for data visualisation,
and explain where each approach is appropriate to be
used
► Explain/differentiate different concepts
in descriptive statistics
► Comprehend more sophisticated group-by operations
and graphing in Python
Data Visualisation
From Introduction to Probability and Statistics for
Engineers and Scientists, by S. M. Ross

Data visualisation is useful as a preliminary form of data analysis, to get a "feel" for the data.
Motivation: Data Visualisation

Extract from "Turning powerful stats into art" by Chris Jordan, starting at minute 1:00
Basic Types of Data

• Categorical-Nominal:
• Discrete number of values, no inherent ordering
• E.g., country of birth, sex

• Categorical-Ordinal:
• Discrete number of states, but with an ordering
• E.g., Education status, State of disease progression

• Numeric-Discrete:
• Numeric, but the values are enumerable
• E.g., Number of live births, Age (in whole years)

• Numeric-Continuous:
• Numeric, not enumerable (i.e., real numbers)
• E.g., Weight, Height, Distance from CBD
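As a rough illustration, the four data types can be represented in pandas as follows. The table, its column names and its values are all made up for this sketch; only the four type categories come from the slide.

```python
import pandas as pd

people = pd.DataFrame({
    # Categorical-nominal: no ordering between the values
    'country_of_birth': pd.Categorical(['Australia', 'India', 'Australia']),
    # Categorical-ordinal: the categories have a defined order
    'education': pd.Categorical(
        ['secondary', 'tertiary', 'primary'],
        categories=['primary', 'secondary', 'tertiary'],
        ordered=True,
    ),
    # Numeric-discrete: whole-number counts
    'live_births': [2, 0, 1],
    # Numeric-continuous: real-valued measurements
    'height_cm': [171.5, 160.2, 182.0],
})

print(people.dtypes)
print(people['education'].min())  # ordering makes min/max meaningful
```

Because education is ordered, comparisons such as min and max are defined for it, whereas they are not meaningful for country_of_birth.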
Data Visualisation
• It is often useful to visualise data
• Can sometimes quickly reveal patterns
• However, going beyond two dimensions is
problematic
• For categorical data, standard visualisations
include:
• Bar graphs
• Pie charts
• For numeric data (continuous and discrete), we
can use:
• Histograms
• Box plots
Frequency Tables
Age (years)   Number of People
0-9           2,967,425
10-19         2,818,778
20-29         3,231,395
30-39         3,265,526
40-49         3,164,712
50-59         2,977,883
60-69         2,488,396
70-79         1,540,373
80+             947,411

Australian Population by Age (2016 Census)
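A frequency table like this one can be built from raw values by counting how many fall in each band. The sketch below uses a short made-up list of ages (not census data) and a hypothetical helper age_band that mirrors the ten-year bands of the table.

```python
from collections import Counter

# Made-up ages, for illustration only
ages = [3, 15, 27, 34, 38, 41, 59, 62, 85, 91]

def age_band(age):
    """Return the label of the ten-year band containing age."""
    if age >= 80:
        return '80+'
    lower = (age // 10) * 10
    return f'{lower}-{lower + 9}'

# Count the number of people in each band
frequency = Counter(age_band(a) for a in ages)
for band, count in sorted(frequency.items(), key=lambda item: int(item[0][:2].rstrip('-+'))):
    print(band, count)
```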
Bar Charts
[Figure: bar chart of the Australian population by age (2016 Census). x-axis: Age (years), in the bands 0-9 to 80+; y-axis: Millions of People, from 0 to 3.5.]
Pie Chart

• Pie Chart is a type of graph in which a circle is divided into sectors that each represent a proportion of the whole

Image src: varsitytutors.com


Histograms
• Group numeric data into categories by putting the values into bins
• If $y = (y_1, \ldots, y_n)$ are our data points, we divide them between $K$ equally spaced bins
• The number of samples that fall in bin (category) $k$ is
  $v_k = \#\{ y_j \in (\min\{y\} + (k-1)w,\ \min\{y\} + kw) \}$
  where
  $w = \frac{\max\{y\} - \min\{y\}}{K}$
  is the width of the bins
  → plot $v_1, \ldots, v_K$ using a bar chart
• Histograms are a type of bar chart but specifically for continuous data.
• Bar charts are only applicable to categorical data
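The bin counts can be computed directly from the definition, as in this sketch. The function name histogram_counts is illustrative. Two assumptions go beyond the slide: each bin includes its left edge, and the maximum value is placed in the last bin, so that every sample is counted exactly once. The data must contain at least two distinct values, otherwise the width w is zero.

```python
def histogram_counts(y, K):
    """Count the samples in each of K equally spaced bins."""
    lowest, highest = min(y), max(y)
    w = (highest - lowest) / K          # bin width
    counts = [0] * K
    for value in y:
        k = int((value - lowest) / w)   # zero-based bin index
        k = min(k, K - 1)               # the maximum goes in the last bin
        counts[k] += 1
    return counts, w

counts, width = histogram_counts([1, 2, 2, 3, 4, 7, 9], K=4)
print(counts, width)
```

Plotting counts as a bar chart, one bar per bin, gives the histogram.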
Histograms: Example
[Figure: histogram; x-axis from -4 to 4, y-axis: Counts]
Histogram with K = 20 bins

Histograms: Example
[Figure: histogram; x-axis from -4 to 4, y-axis: Counts]
Histogram with K = 50 bins; looks smoother

Histograms: Example
[Figure: histogram; x-axis from -4 to 4, y-axis: Counts]
Histogram with K = 100; starting to look ragged


Why Motion Chart
Motivation
• Motion Charts are interactive multi-dimensional data
visualisations
• Originally introduced to the world as GapMinder by
Hans Rosling and made famous by his TED talks.

History
• The GapMinder technology was bought by Google
and the name of motion charts changed to bubble
charts
• But the GapMinder website is now up as a not-for-profit.
Motion Charts
Visualizing data in five dimensions: x-axis,
y-axis, size of bubble, color of bubble, and time
Motion Charts
Advantages:
► time dimension allows deeper insights and observing trends
► good for exploratory work

► “appeal to the brain at a more instinctual intuitive level”


Disadvantages:
► not suited for static media
► display can be overwhelming, and controls are complex
► “data scientists who branch into visualization must be
aware of the limitations of uses”
Descriptive Statistics
From Introduction to Probability and Statistics for
Engineers and Scientists, by S. M. Ross

The main objective is to interpret key features of a dataset numerically.
Descriptive Statistics
• Descriptive statistics summarise aspects of the data
• Usually lose information, but gain easy
comprehension
• Contrast with inferential statistics

• But what is a “statistic”?


• Let y denote a sample of data
• Then a statistic is any function s(y) of the data
• Some functions (statistics) more useful than others
• But all describe properties of the data
Measures of Centrality
• Let y = (y1, . . . , y n ) be a sample of n data points
• The most common measure of centrality, or averageness, is the arithmetic mean
  $\bar{y} = \frac{1}{n} \sum_{j=1}^{n} y_j$
• The mode is the most frequently occurring value in the sample
• Another common measure is the median, med(y)
  • Value such that 50% of samples have values less than med(y)
  • Easily found by sorting samples and finding middle sample
Poll
Which option is the Mean, Median and Mode
of the following set of values respectively?
1,2,2,3,4,7,9

A. 4,2,3
B. 5,3,2
C. 4,3,3
D. 4,3,2
Poll
Compute Mean, Median and Mode of
1,2,2,3,4,7,9

Type     Example                  Result
Mean     (1+2+2+3+4+7+9) / 7      4
Median   1, 2, 2, 3, 4, 7, 9      3
Mode     1, 2, 2, 3, 4, 7, 9      2
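The poll answer can be checked with Python's standard statistics module. This is only a quick verification sketch; the variable names are illustrative.

```python
import statistics

values = [1, 2, 2, 3, 4, 7, 9]

mean_value = statistics.mean(values)      # sum of the values divided by their count
median_value = statistics.median(values)  # middle value after sorting
mode_value = statistics.mode(values)      # most frequently occurring value

print(mean_value, median_value, mode_value)
```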
Mean vs Median
• The mean uses all the values of the sample
• Any change to any sample changes the mean
• The mean can be changed as much as desired by changing
just one sample by a large enough amount

• The median uses at most two of the values of the sample
  • Is very resistant to changes to the samples not in the middle

• Example:
  $y = (1, 2, 3, 4, 5) \Rightarrow \bar{y} = 3,\ \mathrm{med}(y) = 3$
  $y = (1, 2, 3, 4, 50) \Rightarrow \bar{y} = 12,\ \mathrm{med}(y) = 3$
• Why might we want to use mean over median
then?
Mean vs Median: Symmetric
Distributions
The distribution refers to how the data is spread out around certain values or ranges
[Figure: histogram of a symmetric distribution (x-axis from -4 to 4), with the mean and median marked]
Symmetric distribution of data; mean and median (nearly) the same


Mean vs Median: Positively
Skewed Data
[Figure: histogram of positively skewed data (x-axis from 0 to 40), with the mean and median marked]
Positively skewed data; mean greater than median


Mean vs Median: Negatively
Skewed Data
[Figure: histogram of negatively skewed data (x-axis from -40 to 0)]
Negatively skewed data; mean less than median


Percentiles
• More generally, we can define the percentiles
• The p-th percentile is the value, Q(y, p) such that
p% of the values of the sample are lower than
Q(y, p)

• The median is simply the 50th percentile, Q(y, 50)


• Other important percentiles are the 1st and 3rd
quartiles
• i.e., the 25th and 75th percentiles
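The quartiles can be computed with the standard statistics module, as sketched below on the sample used in the earlier poll. Note that several conventions exist for interpolating percentiles between data points; the 'inclusive' method chosen here is one assumption among them, and other tools may return slightly different values.

```python
import statistics

y = [1, 2, 2, 3, 4, 7, 9]

# n=4 splits the data into four parts, returning the three cut points:
# the 25th, 50th and 75th percentiles
q1, median, q3 = statistics.quantiles(y, n=4, method='inclusive')

print(q1, median, q3)
```

The middle cut point agrees with the median found earlier for this sample.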
Percentiles
[Figure: histogram (x-axis from 0 to 40), with the 25th percentile, median and 75th percentile marked]
Measures of Spread (1)

• Measures of centrality tell us about the typical value of the sample
• Measures of spread tell us how much the samples differ, on average, from the typical value

• The most straightforward is the range
  rng(y) = max{y} − min{y}
  where
  min{y} denotes the minimum value in the sample;
  max{y} denotes the maximum value in the sample.
Boxplots
[Figure: boxplot of BMI (kg/m²), with the maximum, 3rd quartile, median, 1st quartile and minimum labelled]

• Boxplot displays the five-number summary of a set of data.
• Boxplot graphically captures centrality, spread and skewness in one plot
Visualising Continuous Data:
Boxplots
[Figure: boxplot of BMI (kg/m²), with the outliers, "whiskers", upper quartile (Q3), median and lower quartile (Q1) labelled]

• Interquartile range (IQR) = Q3 – Q1
• Add 1.5 x (IQR) to the third quartile. Any number greater than this is a suspected outlier.
• Likewise, subtract 1.5 x (IQR) from the first quartile. Any number less than this is a suspected outlier.
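The outlier rule can be applied in a few lines, as in this sketch. The BMI values are made up for illustration, and the quartiles use the 'inclusive' interpolation method of the standard statistics module, which is an assumption; other quartile conventions can move the fences slightly. The lower fence (Q1 minus 1.5 x IQR) is the usual mirror image of the upper one.

```python
import statistics

# Made-up BMI values, for illustration only
bmi = [18, 21, 22, 24, 25, 27, 30, 62]

q1, _, q3 = statistics.quantiles(bmi, n=4, method='inclusive')
iqr = q3 - q1

upper_fence = q3 + 1.5 * iqr   # values above this are suspected outliers
lower_fence = q1 - 1.5 * iqr   # values below this are suspected outliers

outliers = [v for v in bmi if v > upper_fence or v < lower_fence]
print(iqr, lower_fence, upper_fence, outliers)
```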
Measures of Spread (2)
• The most common measure of spread used is the sample standard deviation

• The sample standard deviation is the square root of the arithmetic mean of the squared deviations from the sample mean
  $s(y) = \sqrt{\frac{1}{n} \sum_{j=1}^{n} (y_j - \bar{y})^2}$
  (some definitions divide by $n - 1$ instead of $n$)

• Like the mean, is sensitive to changes in the sample

• Often, the sample variance
  $v(y) = s^2(y)$
  is used, as it can be easier to work with
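The measures of spread can be computed as sketched below. The sample is chosen for easy arithmetic. The standard statistics module offers both conventions: pvariance and pstdev divide by n (the arithmetic mean of the squared deviations), while variance and stdev divide by n - 1.

```python
import statistics

y = [1, 2, 3, 4, 5]

value_range = max(y) - min(y)             # rng(y)

variance_n = statistics.pvariance(y)      # mean of squared deviations (divide by n)
std_n = statistics.pstdev(y)              # its square root

variance_n_minus_1 = statistics.variance(y)   # alternative convention (divide by n - 1)

print(value_range, variance_n, std_n, variance_n_minus_1)
```

For this sample the squared deviations from the mean of 3 are 4, 1, 0, 1 and 4, so dividing their sum by n gives 2 and dividing by n - 1 gives 2.5.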
Measures of Spread: Example
[Figure: histogram; x-axis from -8 to 8]
rng(y) = 4.63 (min{y} = −2.61, max{y} = 2.01), s(y) = 0.5


Measures of Spread: Example
[Figure: histogram; x-axis from -8 to 8]
rng(y) = 13.89 (min{y} = −7.84, max{y} = 6.05), s(y) = 1.5


Association Between Two
Continuous Variables
• Let x = (x 1 , . . . , x n ) and y = (y1 , . . . , y n ) be two
numeric variables measured on the same objects
• We might ask if there is an association between x and y
• Pearson correlation measures linear association

• Correlation is always between -1 (completely negatively correlated) and 1 (completely positively correlated)
• A correlation of zero implies there is no linear association
  ⇒ does not imply no non-linear association
• Remember: correlation does not equal causation!
Scatter Plots

• Scatter plots help us visualise relationships between two (usually) numeric variables
  • Plot points, with one variable on x-axis and one on y-axis
  • Can be used to visually look for association
• Correlation coefficients are statistics that quantitatively measure the strength of the association between two variables
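Correlation/Scatter Plot Example (1)
[Figure: scatter plot; x-axis from -4 to 4, y-axis from -10 to 10]
R ≈ 0.44

Correlation/Scatter Plot Example (2)
[Figure: scatter plot; x-axis from -4 to 4, y-axis from -4 to 4]
R = 0.9

Correlation/Scatter Plot Example (3)
[Figure: scatter plot; x-axis from -4 to 4, y-axis from -4 to 4]
R ≈ 0.999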
Correlation/Scatter Plot Example (4)
[Figure: scatter plot; x-axis from -4 to 4, y-axis from 0 to 15]

Poll:
Is there a linear association between x and y?

R ≈ 0.01, though clearly associated, as y = x² + noise


Correlation/Scatter Plot Example (5)
[Figure: scatter plot; x-axis from -1 to 1, y-axis from -1 to 1]
R = 0, though there is a deterministic association between x and y
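The effect in the last two examples can be reproduced numerically. The sketch below computes the Pearson correlation from its standard definition (the covariance divided by the product of the spreads), which the slides do not spell out; the function name pearson_r and the tiny data sets are illustrative.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)

x = [-2, -1, 0, 1, 2]
linear = [2 * a + 1 for a in x]     # perfectly linear in x
quadratic = [a ** 2 for a in x]     # fully determined by x, but not linear

r_linear = pearson_r(x, linear)
r_quadratic = pearson_r(x, quadratic)
print(r_linear, r_quadratic)
```

The quadratic case gives a correlation of zero even though y is completely determined by x, because the positive and negative halves cancel.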


Association Between Categorical
and Numeric Variables
• If x is categorical, and y is numeric, how to visualise?
• A standard approach is the side-by-side boxplot
  • Divide the data between categories, then plot boxplots for each group
  • Do the boxplots look different?

• If x and y are both categorical, we can use a side-by-side bar graph instead
  • Are the distributions/bar graphs different between categories? If so, there is a possible association
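For two categorical variables, the counts behind a side-by-side bar graph can be tabulated with pandas crosstab, as in this sketch. The patient records and column names are entirely made up for illustration and carry no real-world meaning.

```python
import pandas as pd

# Made-up records, for illustration only
patients = pd.DataFrame({
    'blood_pressure': ['low', 'low', 'low', 'low', 'high', 'high', 'high', 'high'],
    'cancer': ['no', 'no', 'no', 'yes', 'yes', 'yes', 'yes', 'no'],
})

# Counts for each combination of categories
counts = pd.crosstab(patients['blood_pressure'], patients['cancer'])

# Proportions within each blood pressure group (each row sums to 1)
rates = pd.crosstab(patients['blood_pressure'], patients['cancer'], normalize='index')

print(counts)
print(rates)
```

Calling counts.plot.bar() would draw the side-by-side bars; if the proportions differ clearly between rows, there is a possible association.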
Example: Categorical and Numeric Variables (1)
[Figure: side-by-side boxplots of Price ($1000s) for Bayswater and Knox]
Distribution of price similar between suburbs


Example: Categorical and Numeric Variables (2)
[Figure: side-by-side boxplots of Price ($1000s) for 1 room, 2-4 rooms and > 5 rooms]
Distribution of price varies greatly with number of rooms


Example: Two Categorical Variables (1)
[Figure: side-by-side bar graph of the counts of "No cancer" and "Cancer" for Bulgarian and German groups]
Frequency of cancer does not seem to change with ethnicity; unlikely to be associated
Example: Two Categorical Variables (2)
[Figure: side-by-side bar graph of the counts of "No cancer" and "Cancer" for low blood pressure and high blood pressure]

Poll:
Does the "frequency of cancer" change substantially with blood pressure?
Data visualisation in Python
From Python Data Science Handbook by
J. Vanderplas

Plotting data with Matplotlib


Plotting Data in Python:
Matplotlib
We can use the matplotlib library to plot data in Python
>>> import matplotlib.pyplot as plt
>>> import pandas as pd

Define a table with the data to plot:
>>> myd = {'Class': ['First', 'Second', 'Third'],
...        'Passengers': [194, 177, 450],
...        'Average Age': [39, 30, 25]}
>>> df = pd.DataFrame(myd)

There are many types of plots for visualising data in Python
Bar Charts
>>> plt.bar(df['Class'],df['Passengers'])
Pie Chart
>>> df.plot.pie(y='Average Age')

You can add labels:
>>> df.plot.pie(y='Average Age', labels=df['Class'])

You can add an index like this:
>>> df1 = pd.DataFrame({'Passengers': [194, 177, 450],
...                     'Average Age': [39, 30, 25]},
...                    index=['First', 'Second', 'Third'])
>>> df1.plot.pie(y='Average Age')
Basic plots

>>> df = pd.DataFrame({'X': [0, 1, 2, 3, 4, 5, 6],
...                    'Y': [0, 1, 4, 9, 16, 25, 36]})
>>> plt.plot(df.col_name)  # col_name stands for a column name, e.g. Y
Scatter Plots

>>> plt.scatter(df['X'], df['Y'])


>>> plt.show()
Histograms

>>> df.col_name.hist(bins=4)
Boxplots

>>> df.boxplot(column='col_name')
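The plot types above can be combined into one self-contained script, sketched below. It reuses the two small tables from the slides (renamed classes and points here to keep them apart) and selects the non-interactive "Agg" backend, which is an assumption made so that the script also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line to use plt.show()
import matplotlib.pyplot as plt
import pandas as pd

classes = pd.DataFrame({'Class': ['First', 'Second', 'Third'],
                        'Passengers': [194, 177, 450]})
points = pd.DataFrame({'X': [0, 1, 2, 3, 4, 5, 6],
                       'Y': [0, 1, 4, 9, 16, 25, 36]})

# One figure with a 2 x 2 grid of axes, one plot type per cell
fig, axes = plt.subplots(2, 2, figsize=(8, 6))

axes[0, 0].bar(classes['Class'], classes['Passengers'])
axes[0, 0].set_title('Bar chart')

axes[0, 1].scatter(points['X'], points['Y'])
axes[0, 1].set_title('Scatter plot')

axes[1, 0].hist(points['Y'], bins=4)
axes[1, 0].set_title('Histogram (4 bins)')

axes[1, 1].boxplot(points['Y'])
axes[1, 1].set_title('Boxplot')

fig.tight_layout()
```

To see the result, call fig.savefig with a file name of your choice, or remove the backend line and call plt.show().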
Home Activity: Motion Chart
Open your terminal (Mac) or command prompt (Windows), and enter the following:
pip install motionchart
pip install pyperclip

>>> from motionchart.motionchart import MotionChart
>>> mChart = MotionChart(df = sampleData)
>>> mChart.to_notebook()
Learning Outcomes (Recap)

This week we learnt the following:


► Comprehend the importance/power of data visualisation
► Differentiate between approaches for data visualisation,
and explain where each approach is appropriate to be
used
► Explain/differentiate different concepts
in descriptive statistics
► Comprehend more sophisticated group-by operations
and graphing in Python
Applied Session- Week 3

• Advanced data aggregation

• Data visualisation

• Solutions will be released end of the week
