08.11 Week 5, Class 2 - Descriptive Analytics and Data Wrangling With Pandas
Descriptive Analytics (Recap)
• Helps us to understand what happened, e.g. the number of
accidents increased year-on-year.
• We can get an idea of the main trends in the data.
• Typically, when describing data, there are two types of analysis that we
can consider:
• Measures of Central Tendency (help to understand the typical center of the
data)
• Measures of Dispersion (help to understand how far apart samples are
from the central value)
Measures of Central Tendency
• Mean - sum of data points divided by the number of data points.
• Median - middle value in an ordered sample.
• Mode - most frequent value and usually the preferred measure for
categorical data.
Measures of Central Tendency (Mean)
• Most common measure for numerical data, e.g. consider these values for
age:
54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
• The mean is the sum (623) / number of samples (11) ≈ 56.6 years.
• One issue with the mean is that it cannot be used for categorical
data.
• It is heavily influenced by the presence of outliers, e.g. for the values below,
the mean would be about 78.5 (863 / 11):
54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 300
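The two means above can be checked with a minimal pandas sketch. The variable names (ages, ages_with_outlier) are illustrative and not part of the original slides.

```python
import pandas as pd

# Age values from the slide, and the same values with 60 replaced by the outlier 300
ages = pd.Series([54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60])
ages_with_outlier = pd.Series([54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 300])

# Mean = sum of data points / number of data points
mean_age = ages.sum() / len(ages)              # 623 / 11
mean_with_outlier = ages_with_outlier.mean()   # 863 / 11, using the built-in method

print(round(mean_age, 1))            # 56.6
print(round(mean_with_outlier, 1))   # 78.5
```

A single extreme value moves the mean by roughly 22 years, even though ten of the eleven values are unchanged.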
Measures of Central Tendency (Median)
• Another common measure for numerical data, e.g. considering the
same values as in the previous slide:
54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
• The median value is 57 (remember to order the data).
• It is a better measure than the mean when there are outliers.
• However, it cannot be used for categorical (nominal) data because such
data cannot be ordered.
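As a rough illustration of why the median copes better with outliers, the sketch below compares the median of the age values with and without the outlier value 300 used on the mean slide. Variable names are illustrative only.

```python
import pandas as pd

ages = pd.Series([54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60])
ages_with_outlier = pd.Series([54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 300])

# Series.median() orders the values internally, so the input order does not matter
median_age = ages.median()
median_with_outlier = ages_with_outlier.median()

# Doing it by hand: sort, then take the middle (6th of 11) value
middle_value = ages.sort_values().iloc[len(ages) // 2]

print(median_age)            # 57.0
print(median_with_outlier)   # 57.0
print(middle_value)          # 57
```

The outlier changes the mean substantially but leaves the median at 57.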
Measures of Central Tendency (Mode)
• Unlike mean and median, it can be used for both categorical and
numerical data.
54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
• We need to get the frequency (number of occurrences) of each value
to get the mode.
• Here the mode is 54 as it has the highest frequency.
• The most common issue is multi-modal data, i.e. where more than
one value has the highest number of occurrences, e.g. below both 54 and 58 occur three times:
54, 54, 54, 55, 56, 57, 57, 58, 58, 58, 60, 60
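The frequency counting described above can be sketched in pandas as follows. The colours series is a made-up categorical example, not data from the slides.

```python
import pandas as pd

ages = pd.Series([54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60])
ages_multimodal = pd.Series([54, 54, 54, 55, 56, 57, 57, 58, 58, 58, 60, 60])

# Frequency (number of occurrences) of each value
frequencies = ages.value_counts()
print(frequencies[54])   # 3

# Series.mode() returns a Series because there can be more than one mode
single_mode = ages.mode().tolist()
multiple_modes = ages_multimodal.mode().tolist()
print(single_mode)       # [54]
print(multiple_modes)    # [54, 58]

# Hypothetical categorical data: the mode still works where mean and median do not
colours = pd.Series(["red", "blue", "red", "green"])
colour_mode = colours.mode().tolist()
print(colour_mode)       # ['red']
```

Because mode() always returns a Series, code that expects a single value should decide explicitly how to handle ties.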
Measures of Dispersion
• The central value describes the typical data point in the sample,
whereas dispersion measures how far the samples lie from that
center.
• Range tells us the difference between the smallest and largest
value in the data.
• Variance and Standard Deviation tell us how spread out the data are
around the mean.
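A minimal sketch of these three measures, reusing the age values from the earlier slides. Note that pandas computes the sample variance and sample standard deviation by default. Variable names are illustrative.

```python
import pandas as pd

ages = pd.Series([54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60])

# Range: difference between the largest and smallest value
age_range = ages.max() - ages.min()

# Variance and standard deviation around the mean (sample versions by default)
age_variance = ages.var()
age_std = ages.std()

print(age_range)               # 6
print(round(age_variance, 2))  # 5.05
print(round(age_std, 2))       # 2.25
```

The standard deviation is the square root of the variance, so it is expressed in the same unit as the data (years here).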
Measures of Dispersion (Variance)
• Can be calculated for the population or a sample.
• The smaller the variance, the closer the data points are to the mean, and
vice versa.
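The population/sample distinction can be sketched with the ddof ("delta degrees of freedom") argument of pandas' var(): ddof=0 divides by n (population), while the default ddof=1 divides by n - 1 (sample). The data are the age values from the earlier slides, and the variable names are illustrative.

```python
import pandas as pd

ages = pd.Series([54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60])
ages_with_outlier = pd.Series([54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 300])

# Sum of squared differences from the mean, shared by both formulas
squared_diffs = ((ages - ages.mean()) ** 2).sum()

population_variance = ages.var(ddof=0)   # squared_diffs / n
sample_variance = ages.var(ddof=1)       # squared_diffs / (n - 1), the pandas default

print(round(population_variance, 2))     # 4.6
print(round(sample_variance, 2))         # 5.05

# A larger variance means the data lie further from the mean
print(ages_with_outlier.var() > ages.var())   # True
```

Dividing by n - 1 makes the sample variance slightly larger than the population variance computed from the same values.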