08.11 Week 5, Class 2 - Descriptive Analytics and Data Wrangling With Pandas

AI/ML

Data Analytics
Descriptive

Data Analytics
Descriptive Analytics (Recap)
• Helps us understand what happened, e.g. the number of
accidents increased year-on-year.
• We can get an idea of the main trends in the data.
• Typically, when describing data, there are two types of analysis that we
can consider:
• Measures of Central Tendency (help to understand the typical center of the
data)
• Measures of Dispersion (help to understand how far apart samples are
from the central value)
Measures of Central Tendency
• Mean - sum of data points divided by the number of data points.
• Median - middle value in an ordered sample.
• Mode - most frequent value and usually the preferred measure for
categorical data.
Measures of Central Tendency (Mean)
• Most common measure for numerical data, e.g. consider these values for
age:
54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
• The mean is the sum (623) / number of samples (11) = 56.6 years.
• One of the issues with the mean is that it cannot be used for categorical
data.
• It is heavily influenced by the presence of outliers, e.g. for the values below,
the mean would be 78.5:
54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 300
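
The mean calculation above can be sketched in a few lines of Python. This is a minimal illustration using the standard-library statistics module; the variable names (ages, ages_with_outlier) are illustrative and do not come from the slides.

```python
import statistics

# Age values from the slide
ages = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]
# Same values, with the last one replaced by an outlier
ages_with_outlier = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 300]

mean_age = statistics.mean(ages)                    # 623 / 11
mean_outlier = statistics.mean(ages_with_outlier)   # 863 / 11

print(round(mean_age, 1))      # 56.6
print(round(mean_outlier, 1))  # 78.5
```

A single extreme value moves the mean by more than 20 years, which is why the mean alone can be misleading when outliers are present.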
Measures of Central Tendency (Median)
• Another common measure for numerical data, e.g. consider the
same values as in the previous slide:
54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
• The median value is 57 (remember to order the data).
• It is a better measure than the mean when there are outliers.
• However, it cannot be used for categorical (nominal) data because such
data cannot be ordered.
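
As a small sketch of the points above (again using the standard-library statistics module, with illustrative variable names), the median of the age values can be computed as follows. The input list is deliberately unordered here, since statistics.median sorts the data internally.

```python
import statistics

# The slide's age values, deliberately shuffled
ages = [60, 54, 58, 57, 54, 56, 55, 58, 54, 57, 60]
# Ordered values with the largest one replaced by an outlier
ages_with_outlier = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 300]

median_age = statistics.median(ages)
median_outlier = statistics.median(ages_with_outlier)

print(sorted(ages))     # [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]
print(median_age)       # 57
print(median_outlier)   # 57
```

The outlier leaves the median unchanged at 57, whereas it shifts the mean from 56.6 to 78.5.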
Measures of Central Tendency (Mode)
• Unlike mean and median, it can be used for both categorical and
numerical data.
54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
• We need to get the frequency (number of occurrences) of each value
to get the mode.
• Here the mode is 54 as it has the highest frequency.
• The most common issue is multi-modal data, i.e. where more than
one value has the highest number of occurrences, e.g.
54, 54, 54, 55, 56, 57, 57, 58, 58, 58, 60, 60
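
The frequency counting described above can be sketched with the standard library. This is a minimal illustration: the variable names and the list of colours are made up for the example, and statistics.multimode requires Python 3.8 or later.

```python
import statistics
from collections import Counter

ages = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]
ages_multimodal = [54, 54, 54, 55, 56, 57, 57, 58, 58, 58, 60, 60]

# Frequency (number of occurrences) of each value
frequencies = Counter(ages)
print(frequencies.most_common(1))   # [(54, 3)]

# Single mode
age_mode = statistics.mode(ages)
print(age_mode)                     # 54

# Multi-modal data: every value that shares the highest frequency
age_modes = statistics.multimode(ages_multimodal)
print(age_modes)                    # [54, 58]

# The mode also works for categorical data (hypothetical values)
colours = ["red", "blue", "red", "green"]
colour_mode = statistics.mode(colours)
print(colour_mode)                  # red
```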
Measures of Dispersion
• The central value describes the typical data point in the sample
whereas dispersion measures the difference between a sample
and the center.
• Range tells us the difference between the smallest and largest
value in the data.
• Variance and Standard Deviation tell us how spread out the data are
around the mean.
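
As a quick sketch of the range, using the age values from the earlier slides (the variable names are illustrative):

```python
ages = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]
ages_with_outlier = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 300]

# Range = largest value - smallest value
age_range = max(ages) - min(ages)
outlier_range = max(ages_with_outlier) - min(ages_with_outlier)

print(age_range)       # 6
print(outlier_range)   # 246
```

Because the range depends only on the two extreme values, a single outlier changes it drastically.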
Measures of Dispersion (Variance)
• Can be calculated for the population or a sample.
• The smaller the variance, the closer the data points are to the mean, and
vice versa.

[Formulas shown as images on the slide: Population Variance (Credit) and Sample Variance (Credit)]
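
The two variance formulas differ only in the divisor: the population variance divides the sum of squared deviations from the mean by the number of data points N, while the sample variance divides by n - 1. The sketch below computes both by hand for the age values from the earlier slides and cross-checks them against the standard-library statistics module. The variable names are illustrative, not from the slides.

```python
import math
import statistics

ages = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]

n = len(ages)
mean_age = sum(ages) / n

# Squared deviation of each value from the mean
squared_deviations = [(x - mean_age) ** 2 for x in ages]

# Population variance: divide by N
population_variance = sum(squared_deviations) / n
# Sample variance: divide by n - 1
sample_variance = sum(squared_deviations) / (n - 1)

# Cross-check against the standard library
print(math.isclose(population_variance, statistics.pvariance(ages)))
print(math.isclose(sample_variance, statistics.variance(ages)))
```

For the same data the sample variance is always slightly larger than the population variance, because the divisor is smaller.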


Measures of Dispersion (Standard Deviation)
• Is the square root of the variance. This brings the measure of
spread back to the same units as the samples.

• Consider the dataset 4, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 8.


• Mean μ = 72 / 12 = 6
• Xi - μ = -2, -1, -1, -1, 0, 0, 0, 0, 1, 1, 1, 2
• (Xi - μ)² = 4, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 4
• Variance σ² = 14 / 12 = 1.17 (population variance)
• Standard Deviation σ = √1.17 = 1.08
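
The worked example above can be checked with a short sketch using the standard-library statistics module (variable names are illustrative). Note that the slide's values correspond to the population formulas, i.e. dividing by N = 12.

```python
import statistics

data = [4, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 8]

mean = statistics.mean(data)            # 72 / 12
variance = statistics.pvariance(data)   # 14 / 12 (population variance)
std_dev = statistics.pstdev(data)       # square root of the population variance

print(mean)                 # 6
print(round(variance, 2))   # 1.17
print(round(std_dev, 2))    # 1.08
```

If the 12 values were treated as a sample instead, statistics.stdev(data) would divide by n - 1 = 11 and give about 1.13.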
Measures of Dispersion (Standard Deviation)

[Figure: visual illustration of the spread of data around the mean]


Pandas for Data Manipulation
• Pandas is one of the most popular libraries for handling data in
Python.
• Over the next 2 weeks we are going to learn how to use pandas to:
• load and interact with data
• modify the data
• explore data
• visualize data (with matplotlib and seaborn)
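
As a small preview, the descriptive measures from this class can be computed with pandas. This is only a sketch: it assumes pandas is installed, and the DataFrame and its column name "age" are made up for illustration, reusing the age values from the earlier slides.

```python
import pandas as pd

# Hypothetical DataFrame built from the age values used in the slides
df = pd.DataFrame({"age": [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]})

age_mean = df["age"].mean()
age_median = df["age"].median()
age_modes = df["age"].mode().tolist()   # mode() returns all modes as a Series
age_var = df["age"].var()               # sample variance (ddof=1 by default)
age_std = df["age"].std()               # sample standard deviation (ddof=1 by default)

print(round(age_mean, 1))   # 56.6
print(age_median)           # 57.0
print(age_modes)            # [54]

# describe() summarizes count, mean, std, min, quartiles and max in one call
print(df.describe())
```

Note that pandas uses the sample formulas by default; passing ddof=0 to var() or std() gives the population versions.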
