Unit 5 Descriptive Statistics
Data science lets practitioners apply a range of mathematical operations to data in order to extract
insight and reach a desired objective, and Python makes these operations convenient. In mathematical
terms, central tendency means the center of a distribution: it gives an idea of the typical or average
value. Measures of spread, covered later in this unit, indicate how widely the values are spread around
that center. There are three main measures of central tendency, each of which can be calculated using
the Pandas library in Python, namely,
Mean
Median
Mode
Mean can be defined as the average of the data observations, calculated by adding up all the numbers
in the data and dividing the sum by the total number of data terms. The mean is preferred when the
data is normally distributed.
Mean = x̄ = ∑x / N
Median can be defined as the middle value in a given set of observations, calculated by arranging the
data in ascending or descending order and taking the middle value; when the number of observations is
even, it is the average of the two middle values. The median is best used when the data is skewed.
Mode can be defined as the value that occurs with the highest frequency in a given dataset. If every
value is unique, no value is more frequent than another, so there is no single mode.
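As a small illustration of the three definitions, the sketch below uses a made-up list of values (not one of this unit's datasets) in which a single large value skews the data. The mean is pulled toward the large value while the median is not.

```python
import pandas as pd

# Hypothetical values chosen for illustration; 100 skews the data to the right.
values = pd.Series([2, 3, 3, 4, 5, 6, 100])

mean_value = values.mean()      # sum of the values / number of values
median_value = values.median()  # middle value after sorting
mode_values = values.mode()     # most frequent value(s), returned as a Series

print(mean_value)            # 123 / 7, roughly 17.57
print(median_value)          # 4.0
print(mode_values.tolist())  # [3]
```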
import pandas as pd
Load Data:
Pandas can work with various types of data, such as CSV files, Excel files, SQL databases, or
even tables from web URLs. To load a dataset, you use functions like pd.read_csv(), pd.read_excel(),
pd.read_sql(), or pd.read_html() depending on the data source.
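A minimal sketch of loading data with pd.read_csv() follows. To keep it runnable without any file on disk, it reads from an in-memory string; the CSV text, the names in it and the variable names are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk or a URL.
csv_text = "name,marks\nAsha,81\nRavi,67\nMeena,90\n"

# pd.read_csv accepts a path, a URL, or a file-like object such as StringIO.
marks = pd.read_csv(io.StringIO(csv_text))

print(marks.shape)   # (3, 2)
print(marks.head())  # first rows, to inspect the structure
```

With a real file the call would look like pd.read_csv("some_file.csv"), where the file name is whatever your dataset is called.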
Series: a one-dimensional labeled array holding data of any type such as integers, strings, Python
objects etc.
DataFrame: a two-dimensional data structure that holds data like a two-dimensional array or a table
with rows and columns.
   Upendra  Jayesh  Rahul  Puja
0       98      88     90    88
1       87      52     92    85
2       76      69     71    79
3       88      79     60    81
4       96      80     64    91
The data frame has been created using pd.DataFrame and is stored in the variable df. The values are
then displayed as output.
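The cell that built this data frame is not shown. One way to construct a frame with the same values is sketched below, assuming the column names that appear in the mean and median outputs of this unit (Upendra, Jayesh, Rahul, Puja); the original construction may have differed.

```python
import pandas as pd

# Values copied from the displayed table; column names are taken from the
# df.mean() / df.median() outputs. This is a reconstruction, not the original cell.
df = pd.DataFrame({
    "Upendra": [98, 87, 76, 88, 96],
    "Jayesh": [88, 52, 69, 79, 80],
    "Rahul": [90, 92, 71, 60, 64],
    "Puja": [88, 85, 79, 81, 91],
})

# Selecting a single column of a DataFrame gives a Series.
upendra = df["Upendra"]

print(df)
print(type(upendra).__name__)  # Series
print(df.mean(axis=0))         # column-wise means
```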
In [2]: df.mean(axis = 0)
Out[2]:
Upendra 89.0
Jayesh 73.6
Rahul 75.4
Puja 84.8
dtype: float64
In [3]: df.median(axis = 0)
Out[3]:
Upendra 88.0
Jayesh 79.0
Rahul 71.0
Puja 85.0
dtype: float64
In [4]: df.mode()
Out[4]:
   Upendra  Jayesh  Rahul  Puja
0       76      52     60    79
1       87      69     64    81
2       88      79     71    85
3       96      80     90    88
4       98      88     92    91
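Each column above contains five distinct marks, so every value ties for "most frequent" and mode() returns all of them in sorted order. That is why the output looks like a sorted copy of the frame. A minimal sketch of this behaviour: the first Series reuses the first column of marks, the second is a made-up variant with one repeated value.

```python
import pandas as pd

# All values distinct: each occurs once, so all are returned, sorted.
unique_marks = pd.Series([98, 87, 76, 88, 96])
print(unique_marks.mode().tolist())    # [76, 87, 88, 96, 98]

# Hypothetical variant with 87 repeated: only that value is returned.
repeated_marks = pd.Series([98, 87, 87, 88, 96])
print(repeated_marks.mode().tolist())  # [87]
```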
Measures Of Spread
Measures of spread describe how far apart the data points are from one another. Some examples of
measures of spread are quantiles, variance, standard deviation and mean absolute deviation.
Quantiles
Quantiles are values that split sorted data or a probability distribution into equal parts. There
are several different types of quantiles; common examples are quartiles (4 equal parts), deciles
(10 equal parts) and percentiles (100 equal parts).
data = {'hour': [2.5, 5.1, 3.2, 8.5, 3.5, 1.5, 9.2, 5.5, 8.3, 2.7, 7.7,
                 5.9, 4.5, 3.3, 1.1, 8.9, 2.5, 1.9, 6.1, 7.4, 2.7, 4.8, 3.8, 6.9, 7.8],
        'Scores': [21, 47, 27, 75, 30, 20, 88, 60, 81, 25, 85, 62, 41, 42, 17, 95, 30, 24,
                   67, 69, 30, 54, 35, 76, 86]}
In [7]: df = pd.DataFrame(data)
print(df)
hour Scores
0 2.5 21
1 5.1 47
2 3.2 27
3 8.5 75
4 3.5 30
5 1.5 20
6 9.2 88
7 5.5 60
8 8.3 81
9 2.7 25
10 7.7 85
11 5.9 62
12 4.5 41
13 3.3 42
14 1.1 17
15 8.9 95
16 2.5 30
17 1.9 24
18 6.1 67
19 7.4 69
20 2.7 30
21 4.8 54
22 3.8 35
23 6.9 76
24 7.8 86
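Let's calculate the quartiles for the scores. These are the 5 data points (minimum, Q1, median, Q3 and
maximum) that divide the sorted scores into 4 equal parts. The same approach extends to finer splits:
the output below lists the deciles, the 11 points that divide the scores into 10 equal parts.
[17. 22.2 26.6 30. 38.6 47. 60.8 68.6 77. 85.6 95. ]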
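The cell that produced these numbers is not shown. One way to compute both the quartiles and the deciles of the scores is np.quantile, sketched here; the variable names are illustrative, and the scores are copied from the printed table.

```python
import numpy as np

# Scores copied from the printed data frame above.
scores = [21, 47, 27, 75, 30, 20, 88, 60, 81, 25, 85, 62, 41,
          42, 17, 95, 30, 24, 67, 69, 30, 54, 35, 76, 86]

# Quartiles: cut points at 0%, 25%, 50%, 75% and 100%.
quartiles = np.quantile(scores, [0, 0.25, 0.5, 0.75, 1])
print(quartiles)  # [17. 30. 47. 75. 95.]

# Deciles: cut points at 0%, 10%, ..., 100%.
deciles = np.quantile(scores, np.linspace(0, 1, 11))
print(deciles)
```

np.quantile interpolates linearly between neighbouring sorted values by default, which is why some deciles (for example 22.2) are not values that occur in the data.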
The interquartile range (IQR) is the difference between the 3rd and the 1st quartile. The IQR tells
the spread of the middle half of the data.
45.0
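A minimal sketch of how this value can be obtained with pandas alone, by subtracting the first quartile from the third; the variable names are illustrative.

```python
import pandas as pd

# Scores copied from the printed data frame above.
scores = pd.Series([21, 47, 27, 75, 30, 20, 88, 60, 81, 25, 85, 62, 41,
                    42, 17, 95, 30, 24, 67, 69, 30, 54, 35, 76, 86])

q1 = scores.quantile(0.25)  # first quartile
q3 = scores.quantile(0.75)  # third quartile
iqr_value = q3 - q1         # spread of the middle half of the data

print(q1, q3)     # 30.0 75.0
print(iqr_value)  # 45.0
```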
Another way we can get the IQR is by using iqr( ) from the scipy library
from scipy.stats import iqr
IQR = iqr(df['Scores'])
print(IQR)
45.0
Outliers
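These are data points that differ markedly from, or lie detached from, the rest of the data points.
The rows listed below are the ones whose score falls outside the middle half of the data, that is,
below the 1st quartile or above the 3rd quartile.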
hour Scores
0 2.5 21
2 3.2 27
5 1.5 20
6 9.2 88
8 8.3 81
9 2.7 25
10 7.7 85
14 1.1 17
15 8.9 95
17 1.9 24
23 6.9 76
24 7.8 86
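The listing above can be reproduced by keeping the rows whose score is below Q1 (30) or above Q3 (75). The sketch below does this and, for comparison, also applies the widely used 1.5 × IQR rule. That second rule is an addition here and is not part of the original notebook; under it, no score in this dataset is flagged.

```python
import pandas as pd

# Scores copied from the printed data frame above.
scores = pd.Series([21, 47, 27, 75, 30, 20, 88, 60, 81, 25, 85, 62, 41,
                    42, 17, 95, 30, 24, 67, 69, 30, 54, 35, 76, 86])

q1, q3 = scores.quantile(0.25), scores.quantile(0.75)
iqr_value = q3 - q1

# Rows outside the middle half of the data (matches the listing above).
outside_middle = scores[(scores < q1) | (scores > q3)]
print(outside_middle.index.tolist())
# [0, 2, 5, 6, 8, 9, 10, 14, 15, 17, 23, 24]

# Alternative convention: flag points more than 1.5 * IQR beyond the quartiles.
lower_fence = q1 - 1.5 * iqr_value
upper_fence = q3 + 1.5 * iqr_value
fence_outliers = scores[(scores < lower_fence) | (scores > upper_fence)]
print(len(fence_outliers))  # 0
```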
Variance
Variance is the average of the squared distance between each data point and the mean of the data.
639.4266666666666
With 'ddof=1' included, the sum of squared distances is divided by N - 1 and we get the sample
variance; if it is excluded, np.var divides by N and we get the population variance.
In [16]: print(np.var(df['Scores']))
613.8496
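To make the role of ddof concrete, the sketch below computes both variances by hand and compares them with NumPy and pandas. Note that the defaults differ: np.var uses ddof=0, while the pandas var( ) method uses ddof=1. Variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Scores copied from the printed data frame above.
scores = pd.Series([21, 47, 27, 75, 30, 20, 88, 60, 81, 25, 85, 62, 41,
                    42, 17, 95, 30, 24, 67, 69, 30, 54, 35, 76, 86])

n = len(scores)
squared_dev = (scores - scores.mean()) ** 2

population_var = squared_dev.sum() / n     # divide by N      (ddof=0)
sample_var = squared_dev.sum() / (n - 1)   # divide by N - 1  (ddof=1)

values = scores.to_numpy()
print(np.isclose(population_var, np.var(values)))      # True: NumPy default
print(np.isclose(sample_var, np.var(values, ddof=1)))  # True
print(np.isclose(sample_var, scores.var()))            # True: pandas default
```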
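Standard deviation
The standard deviation is the square root of the variance, so it measures spread in the same units as
the data.
Mean absolute deviation
This is the average of the absolute distance between each data point and the mean of the data.
22.4192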
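The value 22.4192 printed above equals the mean absolute deviation of the scores, while their sample standard deviation is about 25.29. The sketch below computes both directly from their definitions; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Scores copied from the printed data frame above.
scores = pd.Series([21, 47, 27, 75, 30, 20, 88, 60, 81, 25, 85, 62, 41,
                    42, 17, 95, 30, 24, 67, 69, 30, 54, 35, 76, 86])

# Standard deviation: square root of the variance.
sample_std = scores.std()                   # pandas default, ddof=1
population_std = np.std(scores.to_numpy())  # NumPy default, ddof=0

# Mean absolute deviation: average absolute distance from the mean.
mad = (scores - scores.mean()).abs().mean()

print(round(sample_std, 6))  # 25.286887
print(round(mad, 4))         # 22.4192
```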
describe( ) method
The pandas describe( ) method can be used to calculate summary statistics of a dataframe or of a
single column. By default it summarizes only the numerical columns when the dataframe contains any.
We can make use of it to get some of the measurements that have been mentioned above.
In [20]: df['Scores'].describe()
Out[20]:
count 25.000000
mean 51.480000
std 25.286887
min 17.000000
25% 30.000000
50% 47.000000
75% 75.000000
max 95.000000
Name: Scores, dtype: float64
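The result of describe( ) is itself a Series indexed by the statistic names, so individual entries can be read out and combined. A minimal sketch, with illustrative variable names, that derives the IQR and the range from it:

```python
import pandas as pd

# Scores copied from the printed data frame above.
scores = pd.Series([21, 47, 27, 75, 30, 20, 88, 60, 81, 25, 85, 62, 41,
                    42, 17, 95, 30, 24, 67, 69, 30, 54, 35, 76, 86],
                   name="Scores")

summary = scores.describe()  # Series indexed by 'count', 'mean', 'std', ...

iqr_from_summary = summary["75%"] - summary["25%"]
value_range = summary["max"] - summary["min"]

print(iqr_from_summary)  # 45.0
print(value_range)       # 78.0
```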
Assignment Questions
1. Load/define a dataset using pandas and display the first 5 rows to inspect its structure. Also
calculate the mean, median, and mode of a specific column in the dataset.
2. Determine the range of values in a particular column of the dataset. Also compute the variance
and standard deviation of a numeric column in the dataset.
3. Identify any missing values in the dataset and count the total number of missing entries.
4. Determine the top 5 highest values in a specific column.
5. Find the interquartile range (IQR) for a numeric column in the dataset.
6. Use Pandas and Matplotlib for any real-life application and demonstrate with an example.