SCA - Module 4
Week 4
Dealing with uncertainty
Decision making process
Statistical methods
Descriptive statistics
▪ It is the discipline of quantitatively describing the main features of a
collection of data
▪ Collecting, summarizing, processing data into useful information
Inferential statistics
▪ Provides the basis for predictions and estimates that transform
information about a sample into inferences about a population
Descriptive vs. inferential
Statistical analysis
▪ Data/information (what)
▪ Scientific reasoning (who? where? how? what happens?)
▪ Finding (what results?)
▪ Lesson/conclusion (so what? so how?)
Key definitions
Descriptive statistics
Why they are useful:
• Help us summarize the characteristics of our data/observations
The Mean:
• Mean is another word for average.
• Most commonly used statistic that tells us about the centre of a data
set.
• It is the average of the numbers in your dataset.
• The sum of all values divided by the number of observations.
Example (n=4): the four values sum to 307, so the mean is 307 ÷ 4 = 76.75
• The mean is important because it is involved in many
statistical tests.
• E.g., T-test, ANOVA etc.
The Median:
• Is the middle point of your ordered data
Example 2: 1,3,10,5,8 (n=5)
• Step 1: Order the observations:
1,3,5,8,10
• Step 2: With 5 observations, the middle (3rd) value is the median: 5
Example 1: 4,12,0,21,4,13 (n=6)
• Step 1: Order the observations:
0,4,4,12,13,21
• Step 2: With 6 observations there is no single point in the middle,
so the median is the mean of the two middle values: (4 + 12) ÷ 2 = 8
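The two median examples (odd and even n) can be sketched in Python as follows. The function name `median` is only illustrative; the standard library's `statistics.median` does the same job.

```python
def median(values):
    """Return the middle point of the ordered data."""
    ordered = sorted(values)              # Step 1: order the observations
    n = len(ordered)
    if n == 0:
        raise ValueError("median of empty data is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]               # odd n: a single middle value
    # even n: mean of the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([1, 3, 10, 5, 8]))        # Example 2 -> 5
print(median([4, 12, 0, 21, 4, 13]))   # Example 1 -> 8.0
```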
Difference between Mean and Median
• Both are single values that represent the center of a data set
• But the median is more robust, i.e. less affected by outliers
Example 3: 3,1,4,2.5,50 (n=5)
• 50 is a value quite extreme compared to the rest (outlier)
• Mean = (3 + 1 + 4 + 2.5 + 50) ÷ 5 = 12.1
• Median: ordered values are 1, 2.5, 3, 4, 50, so the median = 3
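As a quick check of Example 3, the sketch below uses Python's standard `statistics` module to compare the mean and the median with and without the outlier. The variable names are illustrative.

```python
from statistics import mean, median

data = [3, 1, 4, 2.5, 50]                       # Example 3; 50 is the outlier
without_outlier = [x for x in data if x != 50]

# With the outlier: the mean is pulled up to 12.1, the median stays at 3
print(mean(data), median(data))

# Without the outlier: mean 2.625, median 2.75
print(mean(without_outlier), median(without_outlier))
```

Dropping the single extreme value moves the mean from 12.1 to 2.625, while the median only moves from 3 to 2.75.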
Measures of spread
The Variance (s²)
• A measure to show how spread out our observations are
• Based on the distance of each observation from the mean
• Roughly the average of the squared differences from the mean,
so it is expressed in squared units (e.g., kg²)
Dog 1: 10kg
Dog 2: 12kg
Dog 3: 13kg
Dog 4: 17kg
Dog 5: 20kg
Dog 6: 24kg
Mean = 16kg
1. Work out the difference from the mean and square each value
2. Add all the squared values together

Dog     Difference from mean (kg)   Squared difference (kg²)
Dog 1   -6                          36
Dog 2   -4                          16
Dog 3   -3                          9
Dog 4   1                           1
Dog 5   4                           16
Dog 6   8                           64
Sum                                 142

Mean = 16kg
3. Divide the sum of the squared values by the number of
observations minus one (n − 1); this gives the sample variance.
n=6
s² = 142 ÷ (6 − 1) = 142 ÷ 5
Variance value:
28.4kg²
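The three variance steps can be sketched in Python with the six dog weights. Note that the worked example divides 142 by 5, that is by n − 1, which is the sample variance; `statistics.variance` uses the same convention. Variable names are illustrative.

```python
from statistics import variance

weights_kg = [10, 12, 13, 17, 20, 24]    # the six dog weights

n = len(weights_kg)
mean_kg = sum(weights_kg) / n            # 16.0

# Step 1: difference from the mean, squared
squared_diffs = [(w - mean_kg) ** 2 for w in weights_kg]

# Step 2: add the squared values together
sum_sq = sum(squared_diffs)              # 142.0

# Step 3: divide by n - 1 (sample variance)
sample_variance = sum_sq / (n - 1)       # 28.4

print(mean_kg, sum_sq, sample_variance)
print(variance(weights_kg))              # cross-check with the standard library
```

Dividing by n instead (the population variance, `statistics.pvariance`) would give a smaller value, about 23.7 kg².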
The Standard deviation (s) :
• It is the square root of variance
• The result is in the same measurement units as our data (e.g., kg)
• This makes it easier to interpret and understand the spread of the
data than variance alone
s = √28.4 ≈ 5.3kg
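A short check of this value in Python, assuming the sample variance of 28.4 kg² from the dog example; `statistics.stdev` computes the same quantity directly from the raw weights.

```python
import math
from statistics import stdev

weights_kg = [10, 12, 13, 17, 20, 24]   # the six dog weights
sample_variance = 28.4                  # kg², from the variance example

s = math.sqrt(sample_variance)          # back in kg, the unit of the data
print(round(s, 1))                      # 5.3
print(round(stdev(weights_kg), 1))      # 5.3, computed from the raw data
```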
• Often plotted as error bars above and below the mean (±5.3kg
in our example).
• A small value indicates the data are gathered close to the mean
• A large value indicates the data are spread far from the mean
[Figure: mean weight with standard-deviation error bars; y-axis: Weight (kg), 0 to 25]
Know the data and explore it
Purpose and Benefits
Exploratory data analysis (EDA): initial investigation of data using summary
statistics and diagrams
Objectives of EDA are to
▪ understand the data (what it is, where it comes from, what it represents,
the kind of values, specific characteristics of the data)
▪ find out whether there are missing values (and how to deal with them)
▪ spot anomalies (are there outliers?)
▪ discover patterns (what does the data look like?)
▪ understand relationships between features (measure similarity, distance
and relationship type)
▪ check the assumptions
▪ visually describe the data
Data object and Attribute
Data object
▪ represents an entity in the data set
▪ also called data item, point, instance, example, sample, row, observation
▪ e.g. a patient, movie, student, customer, product, book, tweet
▪ described by a set of attributes
Attribute
▪ is a data field, representing a feature/characteristic of data objects
▪ also called variable, feature, dimension, column, coordinate, field
▪ e.g. reaction to a test, course, address, price/category, author, publisher
Size and dimension of the data
▪ Size of Data refers to number of data objects
▪ Dimension of Data refers to number of attributes
Sparsity in Data
If most of the feature values are missing, the data is called sparse
▪ Missing values may be represented as NaN, blank, -, or 0
▪ This can be a problem for many statistical methods
▪ For efficient computation, libraries for sparse data can be used
▪ e.g. sparse matrix multiplication, sparse storage schemes
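Size, dimension and sparsity can be sketched on a small data set as below. The rows, attribute names and the use of `None` for missing values are all hypothetical, and the dictionary at the end is only one simple way to store just the values that are present.

```python
# Hypothetical data set: each row is a data object, each key an attribute.
rows = [
    {"age": 34,   "height_cm": None, "weight_kg": 70.5},
    {"age": None, "height_cm": None, "weight_kg": None},
    {"age": 51,   "height_cm": 168,  "weight_kg": None},
    {"age": None, "height_cm": None, "weight_kg": 82.0},
]
attributes = ["age", "height_cm", "weight_kg"]

size = len(rows)                # number of data objects
dimension = len(attributes)     # number of attributes

# Count missing values (None here) across all rows and attributes
missing = sum(1 for row in rows for a in attributes if row[a] is None)
missing_fraction = missing / (size * dimension)

# A simple sparse storage scheme: keep only the values that are present,
# keyed by (row index, attribute name).
sparse = {
    (i, a): row[a]
    for i, row in enumerate(rows)
    for a in attributes
    if row[a] is not None
}

print(size, dimension, round(missing_fraction, 2), len(sparse))
```

Here 7 of the 12 cells are missing, so only 5 values need to be stored.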
Resolution of data
Different resolutions reveal different patterns
▪ If the resolution is too fine, a pattern may be buried in noise
▪ If the resolution is too coarse, the pattern may disappear
▪ e.g. the number of bins in a histogram
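The effect of resolution can be sketched by counting the same values with different numbers of equal-width bins. The `histogram` helper and the observations are hypothetical, chosen so that two clusters are visible.

```python
def histogram(values, bins):
    """Count how many values fall into each of `bins` equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    if width == 0:                       # all values identical
        counts[0] = len(values)
        return counts
    for v in values:
        # The maximum value goes into the last bin rather than a new one.
        index = min(int((v - lo) / width), bins - 1)
        counts[index] += 1
    return counts


# Hypothetical observations with two clusters, around 2 and around 8
values = [1, 2, 2, 3, 2, 1, 3, 8, 9, 7, 8, 8, 9, 7]

for bins in (2, 4, 16):
    print(bins, histogram(values, bins))
```

With 2 bins the counts are [7, 7], with 4 bins they are [5, 2, 0, 7], and with 16 bins the same 14 values are scattered over mostly empty bins.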
Types of data
Types of data based on number of attributes
▪ Univariate Data: one attribute
▪ Bivariate Data: two attributes
▪ Multivariate Data: more than two attributes
Normal Distribution (Bell-curve)
3-Sigma Rule
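The 3-sigma rule says that, for a normal distribution, about 68%, 95% and 99.7% of values lie within 1, 2 and 3 standard deviations of the mean. A minimal check, assuming a standard normal distribution and using `statistics.NormalDist` (Python 3.8+):

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

shares = []
for k in (1, 2, 3):
    # Probability mass between mean - k*sigma and mean + k*sigma
    share = standard_normal.cdf(k) - standard_normal.cdf(-k)
    shares.append(share)
    print(f"within {k} sigma: {share:.4f}")
```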
Bivariate measures
Contingency Table
Covariance and correlation
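A minimal sketch of sample covariance and Pearson correlation for two attributes, assuming the same n − 1 convention as the sample variance. The function names and the paired measurements are hypothetical.

```python
import math


def covariance(xs, ys):
    """Sample covariance: summed co-deviations from the means over n - 1."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def correlation(xs, ys):
    """Pearson correlation: covariance rescaled to lie between -1 and 1."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))


# Hypothetical paired measurements
hours = [1, 2, 3, 4, 5]
score = [52, 55, 61, 64, 68]

print(covariance(hours, score))              # 10.25
print(round(correlation(hours, score), 3))   # close to +1
```

Covariance depends on the units of both attributes, whereas correlation is unit-free, which makes it easier to compare across pairs of attributes.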
Diagrammatic Representations of Data
▪ Easy to understand:
  ▪ Numbers do not tell the whole story.
  ▪ A diagrammatic representation of data makes it easier to understand
▪ Simplified presentation:
  ▪ Large volumes of complex data can be represented in a simplified, diagrammatic form
▪ Reveals hidden facts:
  ▪ Diagrams help bring out facts and relationships between data that are not
  noticeable in raw/tabular form
▪ Easy to compare:
  ▪ Diagrams make it easier to compare data
Diagrammatic Representations of Data
▪ Bar Charts
▪ Histogram
▪ Box Plot
▪ Scatter Plot
▪ Heat map
▪ Line Graph
Bar Chart
Histogram
Box plot
▪ Top and bottom lines of the box are the 3rd and 1st
quartiles of the data
▪ Length of the box is the inter-quartile range (IQR)
▪ The line in the middle of the box is the median of the data
▪ The top whisker extends to the largest value that is not an
outlier (commonly, within 1.5 × IQR above the 3rd quartile)
▪ Similarly, the bottom whisker extends to the smallest value
that is not an outlier
▪ Any point above the top whisker or below the bottom whisker
is considered an outlier
▪ Relative location of the median within the box tells us
about the data distribution
▪ It also shows at which end the outliers lie, if any
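The quantities a box plot draws can be sketched numerically. This assumes the common convention that whiskers stop at the most extreme values within 1.5 × IQR of the box, so that points beyond them count as outliers. It also assumes the "inclusive" quartile method of `statistics.quantiles`; other quartile methods give slightly different boxes. The data are the values 3, 1, 4, 2.5, 50 from the mean vs. median example.

```python
from statistics import quantiles

data = [3, 1, 4, 2.5, 50]

# Quartiles: bottom of the box, median line, top of the box
q1, med, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1                                  # length of the box

# Assumed convention: points beyond 1.5 * IQR from the box are outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

inside = [x for x in data if lower_fence <= x <= upper_fence]
outliers = [x for x in data if x < lower_fence or x > upper_fence]

# Whiskers reach the most extreme values that are not outliers
bottom_whisker, top_whisker = min(inside), max(inside)

print(q1, med, q3, iqr)                        # 2.5 3.0 4.0 1.5
print(bottom_whisker, top_whisker, outliers)   # 1 4 [50]
```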
Side by Side box plot
▪ Extremely useful for comparing two or
more variables.
▪ To compare numeric variables, we draw
their box plots in parallel
▪ Groups are based on the values of a categorical
variable
▪ It shows whether the location of the data
differs between groups
▪ To some extent it also reveals whether the
distribution and variation differ between
groups
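The comparison a side-by-side box plot supports can be sketched by summarising a numeric variable per group. The records and group labels are hypothetical, and the "inclusive" quartile method is an assumption.

```python
from collections import defaultdict
from statistics import median, quantiles

# Hypothetical records: (group label from a categorical variable, numeric value)
records = [
    ("A", 10), ("A", 12), ("A", 13), ("A", 15), ("A", 16),
    ("B", 14), ("B", 18), ("B", 22), ("B", 26), ("B", 30),
]

# Split the numeric values by group
groups = defaultdict(list)
for label, value in records:
    groups[label].append(value)

# One box per group: location (median) and variation (IQR)
summary = {}
for label, values in groups.items():
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    summary[label] = {"median": median(values), "iqr": q3 - q1}

for label, stats in sorted(summary.items()):
    print(label, stats)
```

In this made-up data, group B has both a higher median (22 vs. 13) and a wider box (IQR 8 vs. 3) than group A.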
Scatter plot
Heat maps
▪ Presents pairwise relationships between
attributes of multivariate data
▪ Provides a numerical value of the
correlation between each pair of variables
▪ Also provides an easy-to-understand visual
representation of those numbers (color
shades)
  ▪ Darker red shows high correlation
  ▪ Dark blue shows no or negative
  correlation
▪ Can be used to visualize any matrix
Line graphs
Type I and II Error