Descriptive Stats

Uploaded by

e.markowicz86

DESCRIPTIVE STATISTICS

Lecturer: Vania Filipova


Statistics Meaning?

 The transformation of raw data into a form that will be easy
to understand and interpret; rearranging, ordering, and
manipulating data to generate descriptive information
 Number of people/variables in population/measured
characteristics of population
 Trends in employment
 The objective is to determine a set of statistics that
summarise or represent data
Distribution in statistics

 The distribution of a statistical data set (or a
population) is a listing or function showing all the
possible values (or intervals) of the data and how
often they occur.
Types of Probability Distributions

Examples of a discrete distribution are:

 The number of students in a class.
 The number of children in a family.
 The number of cars entering a carwash in an hour.
 The number of home mortgages approved by Irish
Life and Permanent this week.
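The Normal Distribution

 “Bell shaped”
 Symmetrical
 Mean, median and mode are equal
 Interquartile range equals approximately 1.33 σ (about 1.35 σ when computed exactly)
 Random variable has infinite range
[Figure: bell curve f(X) plotted against X, with the mean, median and mode marked at the centre μ]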
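The properties listed on this slide can be checked numerically. The sketch below is an illustration only: the use of Python, of the standard library's `statistics.NormalDist`, and of a standard normal (mean 0, standard deviation 1) are all assumptions and not part of the original slides.

```python
from statistics import NormalDist

# Assumed example: a standard normal distribution (mean 0, standard deviation 1).
std_normal = NormalDist(mu=0.0, sigma=1.0)

# For a normal distribution the mean, median and mode coincide.
centre_values = (std_normal.mean, std_normal.median, std_normal.mode)

# Quartiles come from the inverse cumulative distribution function.
q1 = std_normal.inv_cdf(0.25)
q3 = std_normal.inv_cdf(0.75)

# Express the interquartile range in units of the standard deviation.
iqr_in_sigmas = (q3 - q1) / std_normal.stdev

print(centre_values)
print(round(iqr_in_sigmas, 2))
```

The computed interquartile range comes out slightly above the slide's rounded figure of 1.33 standard deviations.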

© 2003 Prentice-Hall, Inc.


Chap 6-5
Types of Probability Distributions

Examples of a continuous distribution include:

 The distance students travel to class.
 The time it takes to drive here from Blackrock.
 The length of an afternoon nap.
 The length of time of a particular phone call.
Frequency distribution

 Business researchers often answer research
questions based on a single variable; this is useful
for large quantities of data
 How many users of the brand may be characterized as loyal?
 What percentage of the market consists of heavy users, medium
users, light users and non-users?
 Frequency distributions examine one variable at a
time and provide counts of the different responses for
the various values of the variable. The objective of a
frequency distribution is to display the number of
responses associated with each value of a variable.
 Can help detect item non-response.
Frequency distribution
Frequency Distribution for Variable X15 “Proud” on Employee Survey with Missing Data

                      Frequency   Percent   Valid percent   Cumulative percent
Valid responses
  4                       10        14.1         14.5              14.5
  5                       28        39.4         40.6              55.1
  6                       21        29.6         30.4              85.5
  Strongly agree = 7      10        14.1         14.5             100.0
  Total                   69        97.2        100.0
Missing                    2         2.8
Total                     71       100.0
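The columns of this table can be reproduced from the raw counts alone. The following is a minimal Python sketch, assuming the counts shown in the table; the variable names and the use of `None` to mark missing responses are illustrative choices, not part of the slides.

```python
# Counts taken from the table; the None key marks missing responses (an assumed encoding).
counts = {4: 10, 5: 28, 6: 21, 7: 10, None: 2}

total = sum(counts.values())                                       # all respondents
valid_total = sum(c for v, c in counts.items() if v is not None)   # respondents who answered

rows = []
cumulative_count = 0
for value, count in counts.items():
    if value is None:
        continue  # missing responses have no valid or cumulative percent
    cumulative_count += count
    rows.append({
        "value": value,
        "frequency": count,
        "percent": round(100 * count / total, 1),              # share of all respondents
        "valid_percent": round(100 * count / valid_total, 1),  # share of valid responses
        "cumulative_percent": round(100 * cumulative_count / valid_total, 1),
    })

missing_percent = round(100 * counts[None] / total, 1)

for row in rows:
    print(row)
print("missing percent:", missing_percent)
```

"Percent" divides by all 71 respondents, while "valid percent" divides by the 69 who answered, which is why the two columns differ once there is item non-response.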
Simple survey example
Frequency distribution
Descriptive statistics can also be used
to check for mistakes…
Bar charts show the data in the form of bars that
can be displayed either horizontally or vertically
Pie Charts

 Good for categorical data
 Good for reporting survey results
 Danger of misunderstanding or confusion if too many segments
 Pull out a segment for emphasis
 Easy to construct from Excel
[Figure: two pie charts titled "Number of People", with segments Green, Blue and Brown]
Conclusions

 Statistical charts are about communication
 When assessing a chart you need to ask if it
succeeds in telling you what is going on.
 There are few “right” answers in terms of which
diagram to use, but some may be viewed as “more
appropriate” than others
 https://www.youtube.com/watch?time_continue=7&v=kiQ6MUQZHSs
 https://www.youtube.com/watch?time_continue=79&v=EqeVXI4WNHM
Measures of location, variability
and shape
 Measures of location (measures of central
tendency)
 Mean (average)

$$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$$

 Where
▪ X_i = the i-th observed value of variable X
▪ n = number of observations
Measures of location, variability and
shape
 Mode
 The value that occurs most frequently

 Median
 Middle value when the data are arranged in ascending or
descending order
Calculating the mean = average

 Example: 5 salaries: £6500 £6500 £6500 £6500 £10500
 To calculate the mean, we add to find the total and
divide by the number included.

$$\bar{x} = \frac{\sum x}{n} = \frac{6500 + 6500 + 6500 + 6500 + 10500}{5} = \frac{36500}{5} = \text{£}7300$$
The Median and the Mode
 This list is already in order:
 £6500 £6500 £6500 £6500 £10500
 The middle one is the third value
 median = £6500
 The most frequently occurring value is the salary of
£6500
 Mode = £6500
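The three measures of location for the salary example can be verified with a few lines of code. This is a small sketch using Python's standard `statistics` module; the choice of Python is an assumption, and the data are the five salaries from the slides.

```python
from statistics import mean, median, mode

# The five salaries from the example, in pounds.
salaries = [6500, 6500, 6500, 6500, 10500]

salary_mean = mean(salaries)      # total divided by the number of values
salary_median = median(salaries)  # middle value of the sorted list
salary_mode = mode(salaries)      # most frequently occurring value

print(salary_mean, salary_median, salary_mode)
```

The single high salary pulls the mean above the median and the mode, which both stay at the typical salary.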
Exercise
 Determine the mean and the median from the following data.
 The weekly pay (x-variable) of a sample of 6 workers is as follows:
 €220, €220, €180, €215, €208, €207
 Determine the mode: the attendance at five mathematics tutorials is as
follows:
 15, 18, 17, 17, 20
Second Moment
The spread of the data
Measures of location, variability and
shape

 Measures of variability
 Range (max – min)
 Variance: measures the spread of the data around the mean
▪ The difference between an observed value and the mean is
called the deviation from the mean
▪ The variance is the average of the squared deviations
▪ When the data points are clustered around the mean the
variance is small
 Standard deviation
▪ Square root of the variance
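As an illustration, the measures of variability can be computed for the five salaries used earlier in the deck. The slides do not say whether the population form (divide by n) or the sample form (divide by n - 1) is intended, so this sketch shows both; the choice of Python and its `statistics` module is an assumption.

```python
from statistics import pstdev, pvariance, stdev, variance

# The five salaries from the earlier example, in pounds.
salaries = [6500, 6500, 6500, 6500, 10500]

# Range: largest value minus smallest value.
salary_range = max(salaries) - min(salaries)

# Population form: average squared deviation from the mean (divide by n).
population_variance = pvariance(salaries)
population_sd = pstdev(salaries)

# Sample form: divide by n - 1 instead of n.
sample_variance = variance(salaries)
sample_sd = stdev(salaries)

print(salary_range, population_variance, population_sd)
print(sample_variance, round(sample_sd, 2))
```

The standard deviation is in the same units as the data (pounds), which usually makes it easier to interpret than the variance.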
Normal (Gaussian) Distribution

[Figure: two normal curves, one labelled "small variance" and one labelled "large variance"]
Skewness

 It is a measure of the asymmetry (lack of symmetry) of
a distribution
 The NORMAL DISTRIBUTION is perfectly symmetric, so its
mean, median and mode are identical; not every symmetric
distribution is normal
 A distribution, or data set, is symmetric if it looks
the same to the left and right of the centre point.
DIRECTION OF SKEW

Consider the distributions in the figure. These tapering sides are called tails (or snakes),
and they provide a visual means for determining which of the two kinds of skewness a
distribution has:

1. Positive skew: The right tail is longer; the mass of the distribution is concentrated
on the left of the figure. The distribution is said to be right-skewed. An example
would be an income distribution in which there are a few very high incomes.
2. Negative skew: The left tail is longer; the mass of the distribution is concentrated
on the right of the figure. The distribution is said to be left-skewed.
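The direction of skew can also be read from a number. The slides give no formula, so the sketch below assumes one common definition, the moment coefficient of skewness (third central moment divided by the 1.5 power of the second central moment); the function name `moment_skewness` is hypothetical.

```python
from statistics import mean

def moment_skewness(values):
    """Moment coefficient of skewness: m3 / m2 ** 1.5 (an assumed definition)."""
    n = len(values)
    centre = mean(values)
    m2 = sum((x - centre) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - centre) ** 3 for x in values) / n  # third central moment
    if m2 == 0:
        raise ValueError("skewness is undefined when all values are equal")
    return m3 / m2 ** 1.5

# One high salary creates a long right tail, so the skew is positive.
salaries = [6500, 6500, 6500, 6500, 10500]
salary_skew = moment_skewness(salaries)

# A symmetric data set has zero skew.
symmetric_skew = moment_skewness([1, 2, 3, 4, 5])

print(salary_skew, symmetric_skew)
```

A positive result corresponds to a right-skewed distribution and a negative result to a left-skewed one.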
Parametric vs. Non-parametric tests
 Parametric tests
 Ratio or interval scales
 Large samples
 More powerful
 Stringent assumptions

 Non-parametric tests
 Nominal or ordinal scales
 Small samples
 Fewer assumptions
 Non-parametric counterparts exist for many
parametric techniques
 Not as powerful/less sensitive
Scatter plot

 The scatter plot is one of the most important tools
in data visualization.
 A scatter plot is based on two axes: the horizontal
axis represents one feature and the vertical axis
represents a second.
 Each instance in a dataset is represented by a point
on the plot determined by the values for that
instance of the two features involved.
Age and height data set
Age Height
32 5.3
34 5.4
36 5.5
16 4
18 4.2
22 4.4
34 5.6
56 6
24 6.2
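Each row of this table becomes one (x, y) point on the scatter plot. As a numeric companion to the picture, the sketch below pairs the values into points and computes Pearson's correlation coefficient; the correlation is an addition not mentioned in the slides, and the function name `pearson_r` is hypothetical.

```python
from math import sqrt

# Age and height values copied from the table above.
ages = [32, 34, 36, 16, 18, 22, 34, 56, 24]
heights = [5.3, 5.4, 5.5, 4, 4.2, 4.4, 5.6, 6, 6.2]

# One (age, height) point per instance, as it would be drawn on the plot.
points = list(zip(ages, heights))

def pearson_r(xs, ys):
    """Pearson correlation: co-variation divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

r = pearson_r(ages, heights)
print(points)
print(round(r, 2))
```

A value of r between 0 and 1 indicates that height tends to increase with age in this data set, which is the pattern a scatter plot would show as points rising from left to right.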
Scatter plot of age and height
Data Quality
 Missing values
 Outliers
Missing values
 The data quality report highlights
the percentage of missing values
for each feature in your table
 If features have missing values, the
first step is to try to determine why
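The percentage of missing values per feature can be computed directly. The sketch below uses a small hypothetical table, not data from the slides, and assumes missing entries are stored as `None`; the function name `percent_missing` is also an illustrative choice.

```python
# Hypothetical records; None marks a missing value (an assumed encoding).
records = [
    {"age": 32, "height": 5.3},
    {"age": 34, "height": None},
    {"age": None, "height": 5.5},
    {"age": 16, "height": None},
]

def percent_missing(rows):
    """Return the percentage of missing values for each feature."""
    features = rows[0].keys()
    return {
        feature: 100 * sum(row[feature] is None for row in rows) / len(rows)
        for feature in features
    }

missing_report = percent_missing(records)
print(missing_report)
```

A feature with a high percentage of missing values deserves a closer look at how the data were collected before any analysis is run on it.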
Outliers
 Outliers are values that lie far away from the central
tendency of a feature. These values are exceptionally
far away from the mainstream of the data.
 There are two kinds of outliers that might occur in your
sample data:
 Invalid outliers are values that have
been included in a sample through error and are often
referred to as noise in the data.
 Valid outliers are correct values that are simply very
different to all of the rest of the values for a feature.
To identify outliers

 Order data by size and scan top and bottom (min
and max values)
 Difference between mean and median
 Range, standard deviation
 Visualisations: bar plots, box plots
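Two of these checks can be sketched in code. The slides do not name a numerical rule, so this example assumes the 1.5 × IQR fences commonly used for box plot whiskers; the weekly pay values are hypothetical, with 2200 standing in for a data entry error, and the function name `find_outliers` is illustrative.

```python
from statistics import mean, median, quantiles

# Hypothetical weekly pay values; 2200 stands in for a data entry error.
weekly_pay = [198, 205, 207, 208, 210, 215, 220, 2200]

def find_outliers(values, k=1.5):
    """Flag values beyond k * IQR from the quartiles (assumed box plot rule)."""
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return [x for x in sorted(values) if x < lower_fence or x > upper_fence]

outliers = find_outliers(weekly_pay)

# A large gap between mean and median is another warning sign.
mean_median_gap = mean(weekly_pay) - median(weekly_pay)

print(outliers)
print(mean_median_gap)
```

Whether a flagged value is invalid (an error) or valid (a genuine extreme) still has to be decided by going back to the source of the data.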
Dealing with outliers

 Invalid outliers should either be marked as missing values
or, if possible, replaced with valid values sourced from
original data sources.
 Valid outliers can be allowed to remain in your
data or removed.
Crosstabulations
 While a frequency distribution describes one
variable at a time, a cross-tabulation describes
two or more variables simultaneously
 Cross-tabulation results in tables that reflect the
joint distribution of two or more variables with a
limited number of categories or distinct values
Crosstabulation: Gender and Internet Usage

                         Gender
Internet usage     Male    Female    Row total
Light                 5        10           15
Heavy                10         5           15
Column total         15        15           30
Internet usage by gender

                         Gender
Internet usage     Male      Female
Light              33.3%     66.7%
Heavy              66.7%     33.3%
Column total       100%      100%
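The column percentages follow from the cell counts: each cell is divided by the total for its gender column. The sketch below rebuilds the 30 observations from the counts and tabulates them with `collections.Counter`; the label "Heavy" for the second usage category and the variable names are assumptions of this sketch.

```python
from collections import Counter

# Rebuild one (usage, gender) pair per respondent from the cell counts.
# "Heavy" as the label of the second usage category is an assumption.
observations = (
    [("Light", "Male")] * 5
    + [("Light", "Female")] * 10
    + [("Heavy", "Male")] * 10
    + [("Heavy", "Female")] * 5
)

# Joint counts for each cell, and totals for each gender column.
cell_counts = Counter(observations)
column_totals = Counter(gender for _, gender in observations)

# Column percentage: the cell count as a share of its gender column.
column_percent = {
    (usage, gender): round(100 * count / column_totals[gender], 1)
    for (usage, gender), count in cell_counts.items()
}

for cell, percent in column_percent.items():
    print(cell, percent)
```

Percentaging down the gender columns answers the question "what share of each gender is a light or heavy user?"; percentaging across the rows would answer a different question.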
Any questions?
