Unit 5 Descriptive Statistics

python pandas

Uploaded by

upendra maurya

Descriptive statistics

Data science lets practitioners apply mathematical operations to data in order to gain insight into it
and reach a desired objective, and Python makes these operations convenient to carry out. In mathematical
terms, central tendency means the center of a distribution: a single value that summarizes where the
observations cluster. It is usually reported together with an indication of how widely the values are
spread. There are three main measures of central tendency, all of which can be calculated with the pandas
library in Python, namely:

Mean
Median
Mode

Mean can be defined as the average of the observations, calculated by adding up all the numbers in the
data and dividing the sum by the total number of observations. The mean is preferred when the data is
roughly normally distributed.

Mean = x̄ = ∑x / N

where ∑x is the sum of all observations and N is the number of observations.

Median can be defined as the middle value in a given set of observations, found by sorting the data in
ascending or descending order and taking the value in the middle. The median is best used when the data
is skewed.

Median = ((n + 1)/2)th observation if the total number of observations n is odd.
Median = average of the (n/2)th and (n/2 + 1)th observations if n is even.

Mode can be defined as the value that occurs most frequently in a given set of observations. If every
value occurs exactly once, the data has no single mode.
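As a quick check of the three definitions above, here is a minimal sketch using Python's built-in statistics module. The list `values` is made-up illustration data and is not part of these notes.

```python
import statistics

# Hypothetical sample data, chosen only to illustrate the definitions
values = [2, 3, 3, 5, 7]

mean_value = statistics.mean(values)      # sum of the values / number of values
median_value = statistics.median(values)  # middle value of the sorted data
mode_value = statistics.mode(values)      # most frequently occurring value

print(mean_value, median_value, mode_value)
```

Here the sum is 20 over 5 values, the third sorted value is the middle one, and 3 is the only value that repeats.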

We are going to use the popular pandas library to explore descriptive statistics. For more details,
please refer to the documentation: https://pandas.pydata.org/docs/index.html

Import Pandas:
You need to import pandas into your Python script so you can use its functionality. You can do this by
adding the following line at the beginning of your script:

import pandas as pd

Load Data:
Pandas can work with various types of data sources, such as CSV files, Excel files, SQL databases, or
even web URLs. To load a dataset, you use functions like pd.read_csv(), pd.read_excel(),
pd.read_sql(), or pd.read_html(), depending on the data source.
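The loading step can be sketched as follows. To keep the example self-contained, the CSV content is held in an in-memory string; the names `csv_text` and `marks_df` and the data itself are assumptions made for illustration. In practice a file path or URL would be passed to pd.read_csv() instead.

```python
import io
import pandas as pd

# Hypothetical CSV content held in memory; a file path or URL could be passed instead
csv_text = "name,marks\nAsha,81\nRavi,74\nMeena,92\n"

marks_df = pd.read_csv(io.StringIO(csv_text))

print(marks_df.head())   # head() shows the first 5 rows (only 3 exist here)
print(marks_df.shape)    # (number of rows, number of columns)
```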

Basic data structures in pandas

Pandas provides two types of classes for handling data:

Series: a one-dimensional labeled array holding data of any type such as integers, strings, Python
objects etc.

DataFrame: a two-dimensional data structure that holds data like a two-dimensional array or a table
with rows and columns.
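A minimal sketch of the two structures, using made-up marks and index labels (`test1`, `test2`, `test3` are illustrative names only):

```python
import pandas as pd

# A Series: one-dimensional, each value has an index label
marks = pd.Series([98, 87, 76], index=["test1", "test2", "test3"])

# A DataFrame: two-dimensional, every column is itself a Series
table = pd.DataFrame({"maths": [98, 87, 76], "science": [88, 52, 69]},
                     index=["test1", "test2", "test3"])

print(marks["test2"])          # look up a value by its label
print(table.shape)             # (rows, columns)
print(type(table["maths"]))    # a single column is a Series
```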

Creating the dataset

In [1]: import pandas as pd

        # Creating the dataframe of students' marks
        # (column names are written without trailing spaces so that df["Upendra"] works)
        df = pd.DataFrame({"Upendra": [98, 87, 76, 88, 96],
                           "Jayesh": [88, 52, 69, 79, 80],
                           "Rahul": [90, 92, 71, 60, 64],
                           "Puja": [88, 85, 79, 81, 91]})

        # Printing the dataframe
        df

Out[1]:    Upendra  Jayesh  Rahul  Puja
        0       98      88     90    88
        1       87      52     92    85
        2       76      69     71    79
        3       88      79     60    81
        4       96      80     64    91

The data frame has been created using pd.DataFrame and is stored in the variable df. The values are
then displayed as output.

Now let's calculate the mean. With axis = 0 the statistic is computed down each column, giving one
value per student.

In [2]: df.mean(axis = 0)

Out[2]: Upendra    89.0
        Jayesh     73.6
        Rahul      75.4
        Puja       84.8
        dtype: float64
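To make the role of the axis argument concrete, the sketch below rebuilds the same marks table and compares axis=0 with axis=1. The variable names `column_means` and `row_means` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"Upendra": [98, 87, 76, 88, 96],
                   "Jayesh": [88, 52, 69, 79, 80],
                   "Rahul": [90, 92, 71, 60, 64],
                   "Puja": [88, 85, 79, 81, 91]})

column_means = df.mean(axis=0)   # one mean per student (down each column)
row_means = df.mean(axis=1)      # one mean per row (across the four students)

print(column_means)
print(row_means)
```

For row 0 the four marks are 98, 88, 90 and 88, so the row mean is 364 / 4 = 91.0.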

Now, let's calculate the median

In [3]: df.median(axis = 0)

Out[3]: Upendra    88.0
        Jayesh     79.0
        Rahul      71.0
        Puja       85.0
        dtype: float64

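Now, we will find the mode

In [4]: df.mode()

Out[4]:    Upendra  Jayesh  Rahul  Puja
        0       76      52     60    79
        1       87      69     64    81
        2       88      79     71    85
        3       96      80     90    88
        4       98      88     92    91

Every mark in each column occurs exactly once, so no value is more frequent than the others. In that
case mode( ) returns all the values of each column in sorted order.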
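The behaviour of mode( ) is easier to see on data that does contain repeats. The two small Series below are made-up examples, one with a single most frequent value and one with a tie.

```python
import pandas as pd

single_mode = pd.Series([70, 85, 85, 90]).mode()   # 85 occurs twice
tied_modes = pd.Series([70, 70, 85, 85]).mode()    # 70 and 85 tie

print(single_mode.tolist())
print(tied_modes.tolist())
```

mode( ) always returns a Series rather than a single number, because several values may share the highest frequency.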

Measures Of Spread
Measures of spread tell how spread out the data points are. Some examples of measures of spread are
quantiles, variance, standard deviation and mean absolute deviation.

Quantiles

Quantiles are values that split sorted data or a probability distribution into equal parts. There
are several different types of quantiles; here are some examples:

Quartiles - Divide the data into 4 equal parts.
Quintiles - Divide the data into 5 equal parts.
Deciles - Divide the data into 10 equal parts.
Percentiles - Divide the data into 100 equal parts.

In [5]: import numpy as np
        import pandas as pd

In [6]: # let's make the dataset: hours studied and the score obtained
        data = {'hour': [2.5, 5.1, 3.2, 8.5, 3.5, 1.5, 9.2, 5.5, 8.3, 2.7, 7.7,
                         5.9, 4.5, 3.3, 1.1, 8.9, 2.5, 1.9, 6.1, 7.4, 2.7, 4.8, 3.8, 6.9, 7.8],
                'Scores': [21, 47, 27, 75, 30, 20, 88, 60, 81, 25, 85, 62, 41, 42, 17, 95, 30,
                           24, 67, 69, 30, 54, 35, 76, 86]}

In [7]: df = pd.DataFrame(data)
print(df)

hour Scores
0 2.5 21
1 5.1 47
2 3.2 27
3 8.5 75
4 3.5 30
5 1.5 20
6 9.2 88
7 5.5 60
8 8.3 81
9 2.7 25
10 7.7 85
11 5.9 62
12 4.5 41
13 3.3 42
14 1.1 17
15 8.9 95
16 2.5 30
17 1.9 24
18 6.1 67
19 7.4 69
20 2.7 30
21 4.8 54
22 3.8 35
23 6.9 76
24 7.8 86

Let's calculate the quartiles for the scores. Together with the minimum and the maximum, they give the
5 points (minimum, 1st quartile, median, 3rd quartile, maximum) that divide the sorted scores into 4
equal parts.

In [8]: print(np.quantile(df['Scores'], [0, 0.25, 0.5, 0.75, 1]))

[17. 30. 47. 75. 95.]
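The same quartiles can also be obtained without NumPy, through the quantile( ) method of a pandas Series. This is a sketch of an alternative, not the approach used in these notes; the scores are retyped here so the snippet runs on its own.

```python
import pandas as pd

scores = pd.Series([21, 47, 27, 75, 30, 20, 88, 60, 81, 25, 85, 62, 41,
                    42, 17, 95, 30, 24, 67, 69, 30, 54, 35, 76, 86])

# Passing a list of fractions returns a Series indexed by those fractions
quartiles = scores.quantile([0, 0.25, 0.5, 0.75, 1])
print(quartiles)
```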

Quantiles using linspace( )

It can become quite tedious to list all the points when getting quantiles, more so for higher-order
quantiles such as deciles and percentiles. For such cases we can make use of np.linspace(start, stop, num),
which returns num evenly spaced values from start to stop, both included.
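A short sketch of what linspace( ) itself returns may help: to split the data into k equal parts, k + 1 cut points are needed. The names `k` and `cut_points` are illustrative.

```python
import numpy as np

print(np.linspace(0, 1, 5))    # 5 points -> 4 equal parts (quartiles)
print(np.linspace(0, 1, 11))   # 11 points -> 10 equal parts (deciles)

# For k equal parts, k + 1 cut points are needed
k = 100                        # percentiles
cut_points = np.linspace(0, 1, k + 1)
print(len(cut_points))
```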

Let's get the quartiles of the scores

In [9]: print(np.quantile(df['Scores'], np.linspace(0, 1, 5)))

[17. 30. 47. 75. 95.]

Let's get the quintiles

In [10]: print(np.quantile(df['Scores'], np.linspace(0, 1, 6)))

[17. 26.6 38.6 60.8 77. 95. ]

Let's get the deciles

In [11]: print(np.quantile(df['Scores'], np.linspace(0, 1, 11)))

[17. 22.2 26.6 30. 38.6 47. 60.8 68.6 77. 85.6 95. ]
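Percentiles are listed above but not computed. A minimal sketch, assuming np.percentile( ) is acceptable here: it works like np.quantile( ) but takes values from 0 to 100 instead of 0 to 1. The scores are retyped so the snippet runs on its own.

```python
import numpy as np

scores = [21, 47, 27, 75, 30, 20, 88, 60, 81, 25, 85, 62, 41,
          42, 17, 95, 30, 24, 67, 69, 30, 54, 35, 76, 86]

# 10th, 50th and 90th percentiles of the scores
p10, p50, p90 = np.percentile(scores, [10, 50, 90])
print(p10, p50, p90)

# np.percentile(x, 90) and np.quantile(x, 0.9) describe the same point
same_point = np.isclose(np.percentile(scores, 90), np.quantile(scores, 0.9))
print(same_point)
```

These three values match the 1st, 5th and 9th deciles computed above.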

Interquartile Range (IQR)

This is the difference between the 3rd and the 1st quartile. The IQR tells the spread of the middle half
of the data.

Let's get the IQR for the scores

In [12]: IQR = np.quantile(df['Scores'], 0.75) - np.quantile(df['Scores'], 0.25)

print(IQR)

45.0

Another way we can get IQR is by using iqr( ) from the scipy library

In [13]: from scipy.stats import iqr

IQR = iqr(df['Scores'])
print(IQR)

45.0

Outliers

These are data points that are unusually different or detached from the rest of the data points.

A data point is an outlier if:

data < 1st quartile − 1.5 * IQR

or

data > 3rd quartile + 1.5 * IQR

Let's get the outliers in the scores

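In [14]: # first get the IQR (stored as iqr_scores so the imported iqr() function is not overwritten)
         iqr_scores = iqr(df['Scores'])
         # then get lower & upper thresholds: 1.5 * IQR beyond the 1st and 3rd quartiles
         lower_threshold = np.quantile(df['Scores'], 0.25) - 1.5 * iqr_scores
         upper_threshold = np.quantile(df['Scores'], 0.75) + 1.5 * iqr_scores
         # then find outliers
         outliers = df[(df['Scores'] < lower_threshold) | (df['Scores'] > upper_threshold)]
         print(outliers)

Empty DataFrame
Columns: [hour, Scores]
Index: []

With a 1st quartile of 30, a 3rd quartile of 75 and an IQR of 45, the thresholds are 30 − 67.5 = −37.5
and 75 + 67.5 = 142.5. Every score lies between 17 and 95, so this dataset contains no outliers.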
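To see the 1.5 * IQR rule flag a value, here is a minimal sketch on made-up data containing one obviously extreme point. The helper name `iqr_outliers` and the list `sample` are hypothetical and not part of these notes.

```python
import numpy as np

def iqr_outliers(values):
    """Return the values lying outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.quantile(values, [0.25, 0.75])
    spread = q3 - q1                  # the interquartile range
    lower = q1 - 1.5 * spread
    upper = q3 + 1.5 * spread
    return values[(values < lower) | (values > upper)]

# Hypothetical data with one obviously extreme value
sample = [10, 12, 11, 13, 12, 11, 50]
flagged = iqr_outliers(sample)
print(flagged)
```

For this sample the quartiles are 11 and 12.5, so the thresholds are 8.75 and 14.75 and only the value 50 falls outside them.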

Variance

Variance is the average of the squared distance between each data point and the mean of the data.

Let's calculate the variance of the scores. We will use np.var( )

In [15]: print(np.var(df['Scores'], ddof=1))

639.4266666666666

With ddof=1 included, the result is the sample variance: the sum of squared distances is divided by
N − 1. If it is excluded, np.var( ) uses ddof=0 and we get the population variance, where the sum is
divided by N.

Let's see that below.

In [16]: print(np.var(df['Scores']))
613.8496
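The effect of ddof can be checked by computing both variances by hand. The array `data` below is made-up illustration data chosen so that the arithmetic is easy to follow (its mean is 5 and the squared distances sum to 32).

```python
import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9], dtype=float)

squared_dev = (data - data.mean()) ** 2
population_var = squared_dev.sum() / len(data)        # divide by N
sample_var = squared_dev.sum() / (len(data) - 1)      # divide by N - 1

print(population_var, np.var(data))           # both use ddof=0
print(sample_var, np.var(data, ddof=1))       # both use ddof=1
```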

Standard deviation

This is the square root of the variance.

Let's get the standard deviation of the scores

In [17]: print(np.sqrt(np.var(df['Scores'], ddof=1)))

25.28688724747802

Another way we can get standard deviation is by np.std( )

Let's use that

In [18]: print(np.std(df['Scores'], ddof=1))

25.28688724747802
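One point worth keeping in mind is that NumPy and pandas use different defaults for ddof. The sketch below, on made-up data, compares np.std( ) on a plain array with the std( ) method of a pandas Series.

```python
import numpy as np
import pandas as pd

values = [2, 4, 4, 4, 5, 5, 7, 9]

numpy_std = np.std(np.array(values))          # NumPy default: ddof=0 (population)
pandas_std = pd.Series(values).std()          # pandas default: ddof=1 (sample)
pandas_pop_std = pd.Series(values).std(ddof=0)

print(numpy_std, pandas_std, pandas_pop_std)
```

Passing ddof explicitly, as the cells above do, avoids any ambiguity about which version is being computed.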

Mean Absolute Deviation

This is the average of the absolute distance between each data point and the mean of the data.

Let's find the mean absolute deviation of the scores

In [19]: # first find the distance between the data points and the mean
dists = df['Scores'] - np.mean(df['Scores'])
# find the mean absolute
print(np.mean(np.abs(dists)))

22.4192
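The two steps above can be wrapped in a small reusable function. The name `mean_absolute_deviation` is a hypothetical helper introduced here for illustration, and the input list is made-up data.

```python
import numpy as np

def mean_absolute_deviation(values):
    """Average absolute distance of the values from their mean."""
    values = np.asarray(values, dtype=float)
    return np.mean(np.abs(values - values.mean()))

mad = mean_absolute_deviation([2, 4, 4, 4, 5, 5, 7, 9])
print(mad)
```

For this data the mean is 5 and the absolute distances sum to 12, so the result is 12 / 8 = 1.5.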

describe( ) method

The pandas describe( ) method can be used to calculate summary statistics of a dataframe or of a
single column. For a dataframe with mixed column types, it summarizes only the numerical columns by
default.

We can make use of it to get some of the measurements that have been mentioned above.

In [20]: df['Scores'].describe()

Out[20]: count    25.000000
         mean     51.480000
         std      25.286887
         min      17.000000
         25%      30.000000
         50%      47.000000
         75%      75.000000
         max      95.000000
         Name: Scores, dtype: float64
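describe( ) can also be called on a whole dataframe, and individual statistics can then be picked out of the result by label. The small table `small_df` below is made-up data used only to keep the sketch short.

```python
import pandas as pd

small_df = pd.DataFrame({"hour": [1.0, 2.0, 3.0, 4.0],
                         "Scores": [10, 20, 30, 40]})

summary = small_df.describe()            # one column of statistics per numeric column
print(summary)
print(summary.loc["mean", "Scores"])     # pick out a single statistic by row and column label
```

Note that the std row reports the sample standard deviation (ddof=1), which is why it matches np.std( ) only when ddof=1 is passed.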


Assignment Questions
1. Load or define a dataset using pandas and display the first 5 rows to inspect its structure. Also
calculate the mean, median, and mode of a specific column in the dataset.
2. Determine the range of values in a particular column of the dataset. Also compute the variance
and standard deviation of a numeric column in the dataset.
3. Identify any missing values in the dataset and count the total number of missing entries.
4. Determine the top 5 highest values in a specific column.
5. Find the interquartile range (IQR) for a numeric column in the dataset.
6. Use pandas and Matplotlib in any real-life application and demonstrate it with an example.
