
SCA - Module 4

The document discusses descriptive statistics and exploratory data analysis techniques. It covers topics like measures of central tendency, variance, standard deviation, data types, bivariate analysis, contingency tables, covariance, correlation, and diagrammatic representations of data.

Uploaded by

mahnoor

Descriptive Statistics

Visualizing a Single Variable


Examining Multiple Variables
Type I and Type II Errors

Week 4
Dealing with uncertainty

2
Decision making process

3
Statistical methods

Descriptive statistics
▪ It is the discipline of quantitatively describing the main features of a
collection of data
▪ Collecting, summarizing, processing data into useful information
Inferential statistics
▪ Provides the basis for predictions and estimates that turn
information about a sample into inferences about the population

4
Descriptive vs. inferential

5
Statistical analysis

▪ An analysis must have four elements

▪ Data/information (what)
▪ Scientific reasoning (who? where? how? what happens?)
▪ Finding (what results?)
▪ Lesson/conclusion (so what? so how?)
Key definitions

Population vs. sample

7
Descriptive statistics
Why they are useful:
• Help us summarize the characteristics of our data/observations

• They form the initial step in a statistical analysis

• We get information about the centrality (e.g., mean, median)

• We get information about the variability (e.g., variance, standard
deviation)

8
Descriptive statistics
The Mean:
• Mean is another word for average.
• Most commonly used statistic that tells us about the centre of a data
set.
• It is the average of the numbers in your dataset.
• The sum of all values divided by the number of observations.

Example (n = 4): the four observed values sum to 307
Mean = 307 ÷ 4 = 76.75
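
The definition above (sum of all values divided by the number of observations) can be sketched in Python. The four values below are hypothetical: the slide's actual values are not recoverable from the text, so they are chosen only so that they sum to 307 as in the example.

```python
from statistics import mean

# Hypothetical observations (assumption): any four values that sum to 307.
values = [70, 75, 80, 82]

n = len(values)                  # number of observations
total = sum(values)              # sum of all values
manual_mean = total / n          # the definition: sum divided by n
library_mean = mean(values)      # same result from the standard library

print(n, total, manual_mean)
```

Any other four values with the same sum would give the same mean, which is why the mean alone says nothing about how spread out the data are.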

9
Descriptive statistics
The Mean:
• The mean is important because it is involved in many
statistical tests.
• E.g., T-test, ANOVA etc.

10
Descriptive statistics
The Median :
• Is the middle point of your ordered data

Example 2: 1,3,10,5,8
Step1: Order the observations
1,3,5,8,10

The median is the number in the middle


median=5

11
Descriptive statistics
Example 1: 4,12,0,21,4,13 (n=6)
• Step 1: Order the observations:
0,4,4,12,13,21
• With 6 observations there is no single point in the middle

• Step 2: Average the two middle points

$\text{median} = \frac{4 + 12}{2} = 8$
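
The two median examples (odd and even number of observations) can be reproduced with a short sketch. The helper name `median_by_hand` is illustrative only; `statistics.median` from the Python standard library is used as a cross-check.

```python
from statistics import median

def median_by_hand(observations):
    """Order the observations, then take the middle point
    (or average the two middle points when n is even)."""
    ordered = sorted(observations)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # single middle point
    return (ordered[mid - 1] + ordered[mid]) / 2  # average of the two middle points

odd_example = [1, 3, 10, 5, 8]          # n = 5
even_example = [4, 12, 0, 21, 4, 13]    # n = 6

print(median_by_hand(odd_example), median_by_hand(even_example))
```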

12
Descriptive statistics
Difference between Mean and Median
• Both are single values that represent the center of a data set
• But the median is more robust, i.e. less affected by outliers
Example 3: 3,1,4,2.5,50 (n=5)
• 50 is a value quite extreme compared to the rest (outlier)

• Mean: $\frac{3 + 1 + 4 + 2.5 + 50}{5} = 12.1$
• Median: ordered data are 1,2.5,3,4,50, so the median is 3
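
As a rough illustration of this robustness, the sketch below recomputes both statistics for Example 3 with and without the extreme value. Dropping the value 50 is done here only for comparison; it is not a recommendation from the slides on how to treat outliers.

```python
from statistics import mean, median

data = [3, 1, 4, 2.5, 50]
without_outlier = [x for x in data if x != 50]  # comparison only

# With the outlier the mean is pulled far from the bulk of the data;
# the median barely moves.
print("with outlier:   ", mean(data), median(data))
print("without outlier:", mean(without_outlier), median(without_outlier))
```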

13
Measures of spread

14
Descriptive statistics
The Variance ($s^2$)
• A measure to show how spread out our observations are
• Based on the distance of each observation from the mean

15
Descriptive statistics
The Variance ($s^2$)
• Squared
• The average of the squared differences from the mean

Dog 1: 10kg
Dog 2: 12kg
Dog 3: 13kg
Dog 4: 17kg
Dog 5: 20kg
Dog 6: 24kg
Mean = 16kg

16
Descriptive statistics
The Variance ($s^2$)
1. Work out the difference from the mean and square each value
2. Add all the squared values together

        Difference from mean (kg)   Squared difference (kg²)
Dog 1:  -6                          36
Dog 2:  -4                          16
Dog 3:  -3                           9
Dog 4:   1                           1
Dog 5:   4                          16
Dog 6:   8                          64
Mean = 16kg; sum of squared differences = 142

17
Descriptive statistics
The Variance ($s^2$)
3. Divide the sum of the squared values by the number of
observations minus one (n − 1), which gives the sample variance.
n=6

$s^2 = \frac{142}{6 - 1} = \frac{142}{5}$
Variance value:
28.4kg²
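
The dog-weight calculation can be checked step by step with the sketch below. The value 28.4 comes from dividing 142 by n − 1 = 5 (the sample variance); dividing by n = 6 instead would give the population variance, which is shown only for contrast and is not the figure used in the slides.

```python
from statistics import variance, pvariance

weights_kg = [10, 12, 13, 17, 20, 24]

n = len(weights_kg)
mean_kg = sum(weights_kg) / n                             # 16 kg
squared_diffs = [(w - mean_kg) ** 2 for w in weights_kg]  # step 1
sum_sq = sum(squared_diffs)                               # step 2
sample_var = sum_sq / (n - 1)                             # step 3, dividing by n - 1
population_var = sum_sq / n                               # alternative, dividing by n

print(mean_kg, sum_sq, sample_var, population_var)
print(variance(weights_kg), pvariance(weights_kg))        # library cross-check
```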
18
Descriptive statistics
The Standard deviation (s) :
• It is the square root of variance
• The result is in the same measurement units as our data (e.g., kg)
• This makes it easier to interpret and understand the spread of the
data than variance alone

19
Descriptive statistics
The Standard deviation (s) :

$s = \sqrt{28.4} \approx 5.3$ kg
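
A minimal check of this step, assuming the same six dog weights as in the variance example: taking the square root of the sample variance gives a value in kilograms, and `statistics.stdev` returns the same number.

```python
from math import sqrt
from statistics import stdev

weights_kg = [10, 12, 13, 17, 20, 24]

sample_variance = 28.4             # kg², from the variance example
s = sqrt(sample_variance)          # back in the original units (kg)

print(round(s, 1), round(stdev(weights_kg), 1))
```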

20
Descriptive statistics
The Standard deviation (s) :
• Often plotted as error bars above and below the mean (5.3kg
in our example).
• Small value indicates data are gathered close to the mean
• Large value indicates data are gathered far from the mean
[Figure: mean with error bars; y-axis: Weight (kg), 0 to 25]
21
Know the data and explore it

▪ Exploratory data analysis (EDA): Purpose and Benefits


▪ Size, Dimension, and Resolution of Data
▪ Types of Attributes
▪ Statistical EDA
▪ Measures of Central Tendencies and Spread
▪ Bivariate EDA: Correlation, Contingency Table
▪ Graphical EDA: Types of Diagrams

22
Purpose and Benefits
EDA: Initial investigation of data using summary statistics and diagrams
Objectives of EDA are to
▪ understand the data (what it is, where it comes from, what it represents,
kinds of values, specific characteristics)
▪ find out whether there are missing values (and how to deal with them)
▪ spot anomalies (are there outliers?)
▪ discover patterns (what does the data look like?)
▪ understand relationships between features (measure similarity, distance
and relationship type)
▪ check the assumptions
▪ visually describe the data

23
Purpose and Benefits

24
Data object and Attribute
Data object
▪ represents an entity in the data set
▪ also called data item, point, instance, example, sample, row, observation
▪ e.g. a patient, movie, student, customer, product, book, tweet
▪ described by a set of attributes
Attribute
▪ is a data field, representing a feature/characteristic of data objects
▪ also called variable, feature, dimension, column, coordinate, field
▪ e.g. reaction to a test, course, address, price/category, author, publisher
25
Size and dimension of the data
▪ Size of Data refers to number of data objects
▪ Dimension of Data refers to number of attributes

Sparsity in Data
If most of the feature values are missing, then the data is called sparse
▪ Missing values could be represented as NaN, blank, -, 0
▪ This could be a problem for many statistical methods
▪ For efficient computation, libraries for sparse data can be used
▪ e.g. sparse matrix multiplication, sparse storage schemes
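
One simple sparse storage scheme keeps only the entries that are actually present. The sketch below uses a plain dictionary keyed by (row, column); the small table and the use of `None` for missing values are assumptions made for illustration, and real projects would more likely rely on a dedicated sparse-matrix library.

```python
# Dense representation: most entries are missing (None stands in for NaN/blank).
dense = [
    [None, None, 4.0, None],
    [None, None, None, None],
    [1.5, None, None, None],
]

# Sparse "dictionary of keys" storage: keep only the observed entries.
sparse = {
    (row, col): value
    for row, values in enumerate(dense)
    for col, value in enumerate(values)
    if value is not None
}

total_cells = sum(len(values) for values in dense)
sparsity = 1 - len(sparse) / total_cells   # fraction of missing cells

print(sparse, round(sparsity, 2))
```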
26
Resolution of data
Different resolutions reveal different patterns
▪ If resolution is too fine, a pattern may be buried in noise
▪ See number of bins in histograms below
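
The effect of the number of bins can be sketched without any plotting library. The data and the helper `histogram` below are hypothetical; the point is only that the same values produce very different pictures at different resolutions.

```python
def histogram(values, bins, low, high):
    """Count how many values fall into each of `bins` equal-width bins on [low, high]."""
    width = (high - low) / bins
    counts = [0] * bins
    for v in values:
        index = min(int((v - low) / width), bins - 1)  # a value equal to `high` goes in the last bin
        counts[index] += 1
    return counts

# Hypothetical sample with two clusters (around 3 and around 9).
data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 7, 8, 8, 9, 9, 9, 10]

coarse = histogram(data, bins=2, low=0, high=10)   # two wide bins
fine = histogram(data, bins=10, low=0, high=10)    # ten narrow bins

print(coarse)
print(fine)
```

With two bins the sample looks evenly split; with ten bins the two clusters become visible. With many more bins than data points, most bins would hold zero or one value and the shape would dissolve into noise.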

27
Types of data
Types of data based on number of attributes
▪ Univariate Data
▪ Bivariate Data
▪ Multivariate Data

28
Types of data

29
Types of data

30
Normal Distribution (Bell-curve)

31
3-Sigma Rule
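
The body of this slide is not preserved in the text. As the rule is commonly stated, about 68%, 95% and 99.7% of normally distributed values lie within 1, 2 and 3 standard deviations of the mean. The sketch below checks those shares with `statistics.NormalDist` from the Python standard library; the helper name `share_within` is illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

shares = {k: share_within(k) for k in (1, 2, 3)}
for k, share in shares.items():
    print(f"within {k} sigma: {share:.4f}")
```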

32
Bivariate measures

33
Contingency Table

34
Covariance and correlation

35
Covariance and correlation

36
Covariance and correlation

37
Covariance and correlation
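
The formulas on these slides are not preserved in the text. As a hedged sketch, the code below uses the usual sample covariance (dividing by n − 1, consistent with the variance example earlier) and the Pearson correlation coefficient. The data and the function names are hypothetical.

```python
from math import sqrt

def covariance(xs, ys):
    """Sample covariance: sum of products of deviations from the means, divided by n - 1."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance rescaled by both standard deviations."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))

# Hypothetical paired observations.
hours = [1, 2, 3, 4, 5]
score = [2, 4, 5, 4, 5]

print(covariance(hours, score), round(correlation(hours, score), 4))
```

Covariance depends on the units of both variables, while correlation is unit-free and always lies between -1 and 1, which makes it easier to compare across pairs of attributes.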

38
Diagrammatic Representations of Data
▪ Easy to understand:
  ▪ Numbers do not tell the whole story.
  ▪ Diagrammatic representation of data makes it easier to understand
▪ Simplified presentation:
  ▪ Large volumes of complex data can be represented in a simplified diagram
▪ Reveals hidden facts:
  ▪ Diagrams help in bringing out facts and relationships in the data that are
    not noticeable in raw/tabular form
▪ Easy to compare:
  ▪ Diagrams make it easier to compare data

39
Diagrammatic Representations of Data
▪ Bar Charts
▪ Histogram
▪ Box Plot
▪ Scatter Plot
▪ Heat map
▪ Line Graph

40
Bar Chart

41
Histogram

42
Histogram

43
Box plot
▪ Top and bottom lines of the box are the 3rd and 1st quartiles of the data
▪ Length of the box is the inter-quartile range
▪ The line in the middle of the box is the median of the data
▪ The top whisker denotes the largest value in the data that is not an outlier
▪ Similarly, the bottom whisker denotes the smallest value that is not an outlier
▪ Anything above the top whisker or below the bottom whisker is considered an outlier
▪ Relative location of the median within the box tells us about the data distribution
▪ We can see at which end the outliers lie, if any
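
The quantities a box plot draws can be computed directly. The sketch below assumes the common 1.5 × IQR convention for deciding which points count as outliers; the slides do not state which convention they use, and the weights are hypothetical. Quartiles come from `statistics.quantiles` with its default method, and other quartile definitions can give slightly different values.

```python
from statistics import quantiles

# Hypothetical weights (kg) with one extreme value.
data = [5, 10, 12, 13, 15, 17, 19, 20, 24, 26, 60]

q1, q2, q3 = quantiles(data, n=4)     # 1st quartile, median, 3rd quartile
iqr = q3 - q1                         # length of the box

# Assumption: points beyond 1.5 * IQR from the box are treated as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

inside = [x for x in data if lower_fence <= x <= upper_fence]
outliers = [x for x in data if x < lower_fence or x > upper_fence]
bottom_whisker, top_whisker = min(inside), max(inside)

print(q1, q2, q3, iqr)
print(bottom_whisker, top_whisker, outliers)
```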
44
Side by Side box plot
▪ Extremely useful for comparisons of two or more variables.
▪ To compare numeric variables, we draw their box plots in parallel
▪ Groups are based on values of a categorical variable
▪ It addresses whether the location of the data differs between groups
▪ To some extent it also reveals whether distribution and variation differ
between groups

45
Scatter plot

46
Heat maps
▪ Presents pairwise relationships between attributes of multivariate data
▪ Provides a numerical value of the correlation between each pair of variables
▪ Also provides an easy-to-understand visual representation of those numbers
(color shades)
▪ Darker red showing high correlation
▪ Dark blue showing none or negative correlation
▪ Can be used to visualize any matrix

47
Line graphs

48
Type I and II Error

                                The true status of the null hypothesis…
The researcher’s decision
about the null hypothesis…      True              False
True                            Correct           Type II Error
False                           Type I Error      Correct
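
The table can be read as a small decision function. In the sketch below, the researcher deciding that the null hypothesis is "False" is treated as rejecting it; the function and argument names are illustrative only.

```python
def classify_decision(null_is_true, reject_null):
    """Map the true status of the null hypothesis and the researcher's decision
    to the outcome given in the table."""
    if null_is_true and reject_null:
        return "Type I Error"      # a true null hypothesis is rejected
    if not null_is_true and not reject_null:
        return "Type II Error"     # a false null hypothesis is not rejected
    return "Correct"

for null_is_true in (True, False):
    for reject_null in (False, True):
        print(null_is_true, reject_null, classify_decision(null_is_true, reject_null))
```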
