Pandas Tutorial1 - Informatics
» Introduction
» Descriptive Statistics
» Data Aggregations
» Sorting a DataFrame
» GROUP BY Functions
» Altering the Index
» Other DataFrame Operations
» Handling Missing Values
» Import and Export of Data between Pandas and MySQL

3.1 INTRODUCTION
As discussed in the previous chapter, Pandas is a
well-established Python library used for manipulation,
processing and analysis of data. We have already
discussed the basic operations on Series and DataFrame,
such as creating them and then accessing data from
them. Pandas provides more powerful and useful
functions for data analysis.
In this chapter, we will work with more advanced
features of DataFrame, such as sorting data, answering
analytical questions using the data, cleaning data and
applying different useful functions on the data. Below
is the example data on which we will apply the
advanced features of Pandas.
2021–22
Case Study
Let us consider the data of marks scored in unit tests
held in school. For each unit test, the marks scored by
all students of the class are recorded. Maximum marks
are 25 in each subject. The subjects are Maths, Science,
Social Studies (S.St.), Hindi, and English. For simplicity,
we assume there are 4 students in the class. Table 3.1
shows their marks in Unit Test 1, Unit Test 2 and Unit
Test 3.
Table 3.1 Case Study: Result
Name      Unit Test  Maths  Science  S.St.  Hindi  Eng
Raman     1          22     21       18     20     21
Raman     2          21     20       17     22     24
Raman     3          14     19       15     24     23
Zuhaire   1          20     17       22     24     19
Zuhaire   2          23     15       21     25     15
Zuhaire   3          22     18       19     23     13
Aashravy  1          23     19       20     15     22
Aashravy  2          24     22       24     17     21
Aashravy  3          12     25       19     21     23
Mishti    1          15     22       25     22     22
Mishti    2          18     21       25     24     23
Mishti    3          17     18       20     25     20
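The data of Table 3.1 can be loaded into a DataFrame along the following lines. This is only a sketch: the column labels (Name, UT, Maths, Science, S.St, Hindi, Eng) are taken from the outputs printed later in the chapter, and the variable name rows is an assumption made for illustration.

```python
import pandas as pd

# Each tuple is one row of Table 3.1:
# name, unit test number, then marks in the five subjects.
rows = [
    ("Raman", 1, 22, 21, 18, 20, 21),
    ("Raman", 2, 21, 20, 17, 22, 24),
    ("Raman", 3, 14, 19, 15, 24, 23),
    ("Zuhaire", 1, 20, 17, 22, 24, 19),
    ("Zuhaire", 2, 23, 15, 21, 25, 15),
    ("Zuhaire", 3, 22, 18, 19, 23, 13),
    ("Aashravy", 1, 23, 19, 20, 15, 22),
    ("Aashravy", 2, 24, 22, 24, 17, 21),
    ("Aashravy", 3, 12, 25, 19, 21, 23),
    ("Mishti", 1, 15, 22, 25, 22, 22),
    ("Mishti", 2, 18, 21, 25, 24, 23),
    ("Mishti", 3, 17, 18, 20, 25, 20),
]
columns = ["Name", "UT", "Maths", "Science", "S.St", "Hindi", "Eng"]

df = pd.DataFrame(rows, columns=columns)
print(df)
```

The DataFrame has 12 rows (4 students in 3 unit tests) and 7 columns.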
>>> print(df.max(numeric_only=True))
UT 3
Maths 24
Science 25
S.St 25
Hindi 25
Eng 24
dtype: int64
Program 3-2 Write the statements to output the
maximum marks obtained in each
subject in Unit Test 2.
UT 2
Maths 24
Science 22
S.St 25
Hindi 25
Eng 24
dtype: int64
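The statements of Program 3-2 are not reproduced above, so the following is only one possible way to obtain this output, assuming the DataFrame df holds the data of Table 3.1. The names dfUT2 and max_ut2 are chosen for illustration.

```python
import pandas as pd

# Data of Table 3.1, written column by column.
names = ["Raman"] * 3 + ["Zuhaire"] * 3 + ["Aashravy"] * 3 + ["Mishti"] * 3
df = pd.DataFrame({
    "Name": names,
    "UT": [1, 2, 3] * 4,
    "Maths":   [22, 21, 14, 20, 23, 22, 23, 24, 12, 15, 18, 17],
    "Science": [21, 20, 19, 17, 15, 18, 19, 22, 25, 22, 21, 18],
    "S.St":    [18, 17, 15, 22, 21, 19, 20, 24, 19, 25, 25, 20],
    "Hindi":   [20, 22, 24, 24, 25, 23, 15, 17, 21, 22, 24, 25],
    "Eng":     [21, 24, 23, 19, 15, 13, 22, 21, 23, 22, 23, 20],
})

# Keep only the rows of Unit Test 2, then take the column-wise maximum.
dfUT2 = df[df["UT"] == 2]
max_ut2 = dfUT2.max(numeric_only=True)
print(max_ut2)
```

The condition df["UT"] == 2 produces a Boolean Series, and indexing df with it keeps only the rows where the condition is True.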
By default, the max() method finds the maximum
value of each column (which means, axis=0). However,
to find the maximum value of each row, we have to
specify axis = 1 as its argument.
#maximum marks for each student in each unit test among all the subjects
>>> df.max(axis=1, numeric_only=True)
0 22
1 24
2 24
3 24
4 25
5 23
6 23
7 24
8 25
9 25
10 25
11 25
dtype: int64
>>> dfRaman[['Maths','Science','S.St','Hindi','Eng']].sum()
Maths      57
Science    60
S.St       50
Hindi      66
Eng        68
dtype: int64

#To print total marks scored by Raman in all subjects in each Unit Test
>>> dfRaman[['Maths','Science','S.St','Hindi','Eng']].sum(axis=1)
0    102
1    104
2     95
dtype: int64

Activity 3.1
Write the python statements to print the sum of the english
marks scored by Mishti.
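The statement that creates dfRaman does not appear above. A minimal sketch, assuming dfRaman is simply the subset of rows whose Name is 'Raman', is given below. Only the rows of Raman and Zuhaire from Table 3.1 are used here to keep the example short.

```python
import pandas as pd

# Rows of Raman and Zuhaire from Table 3.1.
df = pd.DataFrame({
    "Name": ["Raman"] * 3 + ["Zuhaire"] * 3,
    "UT": [1, 2, 3] * 2,
    "Maths":   [22, 21, 14, 20, 23, 22],
    "Science": [21, 20, 19, 17, 15, 18],
    "S.St":    [18, 17, 15, 22, 21, 19],
    "Hindi":   [20, 22, 24, 24, 25, 23],
    "Eng":     [21, 24, 23, 19, 15, 13],
})
subjects = ["Maths", "Science", "S.St", "Hindi", "Eng"]

# Assumed definition: all rows that belong to Raman.
dfRaman = df[df["Name"] == "Raman"]

# axis=0 (default): one total per subject, added over the unit tests.
subject_totals = dfRaman[subjects].sum()

# axis=1: one total per unit test, added over the subjects.
test_totals = dfRaman[subjects].sum(axis=1)

print(subject_totals)
print(test_totals)
```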
Name 12
UT 12
Maths 12
Science 12
S.St 12
Hindi 12
Eng 12
dtype: int64
UT 2.5
Maths 19.0
Science 20.0
S.St 19.5
Hindi 21.5
Eng 21.0
dtype: float64
>>> dfMaths=df['Maths']
>>> dfMathsUT1=dfMaths[df.UT==1]
>>> print("Displaying the marks scored in Mathematics in UT1\n",dfMathsUT1)
>>> dfMathMedian=dfMathsUT1.median()
>>> print("Displaying the median of Mathematics in UT1\n",dfMathMedian)
0 24
dtype: int64
>>> df.quantile(q=.25, numeric_only=True)
UT 1.00
Maths 16.50
Science 18.00
S.St 18.75
Hindi 20.75
Eng 19.75
Name: 0.25, dtype: float64
>>> df.quantile(q=.75, numeric_only=True)
UT 3.00
Maths 22.25
Science 21.25
S.St 22.50
Hindi 24.00
Eng 23.00
Name: 0.75, dtype: float64
>>> dfSubject=df[['Maths','Science','S.St','Hindi','Eng']]
>>> print("Marks of all the subjects:\n",dfSubject)
>>> dfQ=dfSubject.quantile([.25,.75])
>>> print("First and third quartiles of all the subjects:\n",dfQ)
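To see where the quartile values come from, the sketch below repeats the calculation for two subjects only, using the Maths and Science marks of Table 3.1. By default, quantile() uses linear interpolation. For n sorted values, the quantile q lies at position q × (n − 1), counting from zero. For the 12 Maths marks and q = 0.25 this position is 2.75, which falls between the sorted values 15 and 17, giving 15 + 0.75 × (17 − 15) = 16.5.

```python
import pandas as pd

# Maths and Science marks of all 12 rows of Table 3.1.
dfSubject = pd.DataFrame({
    "Maths":   [22, 21, 14, 20, 23, 22, 23, 24, 12, 15, 18, 17],
    "Science": [21, 20, 19, 17, 15, 18, 19, 22, 25, 22, 21, 18],
})

# First (0.25) and third (0.75) quartiles of each column.
dfQ = dfSubject.quantile([.25, .75])
print(dfQ)

# The same first quartile of Maths, worked out by hand.
maths_sorted = sorted(dfSubject["Maths"])
position = 0.25 * (len(maths_sorted) - 1)        # 2.75
lower = int(position)                            # index 2
fraction = position - lower                      # 0.75
manual_q1 = maths_sorted[lower] + fraction * (
    maths_sorted[lower + 1] - maths_sorted[lower])
print(manual_q1)
```

The result dfQ is itself a DataFrame whose row labels are the requested quantiles, 0.25 and 0.75.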
Hindi 9.969697
Eng 11.363636
dtype: float64
Maths 3.980064
Science 2.667140
S.St 3.146667
Hindi 3.157483
Eng 3.370999
dtype: float64
DataFrame.describe() function displays the
descriptive statistical values in a single command. These
values help us describe a set of data in a DataFrame.
>>> df.describe()
UT Maths Science S.St Hindi Eng
count 12.000000 12.000000 12.00000 12.000000 12.000000 12.000000
mean 2.000000 19.250000 19.75000 20.416667 21.833333 20.500000
std 0.852803 3.980064 2.66714 3.146667 3.157483 3.370999
min 1.000000 12.000000 15.00000 15.000000 15.000000 13.000000
25% 1.000000 16.500000 18.00000 18.750000 20.750000 19.750000
50% 2.000000 20.500000 19.50000 20.000000 22.500000 21.500000
75% 3.000000 22.250000 21.25000 22.500000 24.000000 23.000000
max 3.000000 24.000000 25.00000 25.000000 25.000000 24.000000
>>> df.aggregate('max')
Science 25
S.St 25
Hindi 25
Eng 24
dtype: object
>>> df['Maths'].aggregate(['max','min'])
max 24
min 12
Name: Maths, dtype: int64
Note: We can also use the parameter axis with the
aggregate function. By default, the value of axis is zero,
which means the function is applied to each column.
#Using the above statement with axis=0 gives
the same result
>>> df['Maths'].aggregate(['max','min'],axis=0)
max 24
min 12
Name: Maths, dtype: int64
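The effect of the axis parameter is easier to see on a DataFrame than on a single column. The following sketch uses only Raman's Maths and Science marks from Table 3.1. The variable names marks, col_stats and row_max are chosen for illustration.

```python
import pandas as pd

# Raman's Maths and Science marks in the three unit tests.
marks = pd.DataFrame({
    "Maths":   [22, 21, 14],
    "Science": [21, 20, 19],
})

# axis=0 (default): each function is applied to every column.
col_stats = marks.aggregate(["max", "min"])

# axis=1: the function is applied to every row instead.
row_max = marks.aggregate("max", axis=1)

print(col_stats)
print(row_max)
```

With a list of functions the result is a DataFrame with one row per function. With a single function the result is a Series.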
>>> print(dfUT2.sort_values(by=['Science']))
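The output of the above statement is not shown here. The sketch below rebuilds dfUT2 from the Unit Test 2 rows of Table 3.1 and sorts it by the Science marks, first in ascending order (the default) and then in descending order using the ascending parameter of sort_values().

```python
import pandas as pd

# Unit Test 2 rows of Table 3.1.
dfUT2 = pd.DataFrame({
    "Name":    ["Raman", "Zuhaire", "Aashravy", "Mishti"],
    "UT":      [2, 2, 2, 2],
    "Maths":   [21, 23, 24, 18],
    "Science": [20, 15, 22, 21],
    "S.St":    [17, 21, 24, 25],
    "Hindi":   [22, 25, 17, 24],
    "Eng":     [24, 15, 21, 23],
})

# Ascending order of Science marks (default).
by_science = dfUT2.sort_values(by=["Science"])

# Descending order of Science marks.
by_science_desc = dfUT2.sort_values(by=["Science"], ascending=False)

print(by_science)
print(by_science_desc)
```

sort_values() returns a new DataFrame. The original dfUT2 is left unchanged, and the rows keep their original index labels.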
#Displaying the first entry from each group
>>> g1.first()
UT Maths Science S.St Hindi Eng
Name
Ashravy 1 23 19 20 15 22
Mishti 1 15 22 25 22 22
Raman 1 22 21 18 20 21
Zuhaire 1 20 17 22 24 19
>>> g2.first()
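The statements that create the groups are not reproduced above. As a sketch, assuming g1 is obtained by grouping the data of Table 3.1 on the column Name, the first entry of each group can be displayed as follows. The definition of g2 is not known from this text and is therefore not shown.

```python
import pandas as pd

# Data of Table 3.1, written column by column.
names = ["Raman"] * 3 + ["Zuhaire"] * 3 + ["Aashravy"] * 3 + ["Mishti"] * 3
df = pd.DataFrame({
    "Name": names,
    "UT": [1, 2, 3] * 4,
    "Maths":   [22, 21, 14, 20, 23, 22, 23, 24, 12, 15, 18, 17],
    "Science": [21, 20, 19, 17, 15, 18, 19, 22, 25, 22, 21, 18],
    "S.St":    [18, 17, 15, 22, 21, 19, 20, 24, 19, 25, 25, 20],
    "Hindi":   [20, 22, 24, 24, 25, 23, 15, 17, 21, 22, 24, 25],
    "Eng":     [21, 24, 23, 19, 15, 13, 22, 21, 23, 22, 23, 20],
})

# Assumed definition: one group per student.
g1 = df.groupby("Name")

# First entry of each group; here it is the Unit Test 1 row of each student.
first_rows = g1.first()
print(first_rows)

# Number of rows in each group.
print(g1.size())
```

The group keys become the index of the result and are arranged in sorted order.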
>>> data={'Store':['S1','S4','S3','S1','S2','S3','S1','S2','S3'],
          'Year':[2016,2016,2016,2017,2017,2017,2018,2018,2018],
          'Total_sales(Rs)':[12000,330000,420000,20000,10000,450000,30000,11000,89000],
          'Total_profit(Rs)':[1100,5500,21000,32000,9000,45000,3000,1900,23000]
         }
>>> df=pd.DataFrame(data)
>>> print(df)
Store Year Total_sales(Rs) Total_profit(Rs)
0 S1 2016 12000 1100
1 S4 2016 330000 5500
2 S3 2016 420000 21000
>>> pivot_table1
sum max mean
Color Black Blue Red Black Blue Red Black Blue Red
Item
Pen NaN 69.0
30.0 60.0 NaN 55.0 50.0 NaN 34.5
Pencil 81.0 NaN
NaN NaN 47.0 NaN NaN 40.5 NaN
Pivoting can also be done on multiple columns.
Further, different aggregate functions can be applied on
different columns. The following example demonstrates
pivoting on two columns - Price(Rs) and Units_in_stock.
Also, the application of len() function on the column
>>> pivot_table1
Price(Rs) Units_in_stock
Color Black Blue Red Black Blue Red
Item
Pen NaN 2.0 2.0 NaN 34.5 30.0
Pencil 2.0 NaN NaN 40.5 NaN NaN
Program 3-11 Write the statement to print the maximum
price of pen of each color.
>>> dfpen=df[df.Item=='Pen']
>>> pivot_redpen=dfpen.pivot_table(index='Item',columns=['Color'],values=['Price(Rs)'],aggfunc=[max])
>>> print(pivot_redpen)
max
Price(Rs)
Color Blue Red
Item
Pen 50 25
>>> print(df['Hindi'].isnull().any())
False
0 21.0
1 20.0
2 19.0
3 NaN
Name: Science, dtype: float64
2 19.0
3 0.0
Name: Science, dtype: float64
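The two outputs above show a Science column before and after its missing value is replaced by zero. The statements that produced them are not reproduced here, so the following is only a sketch. It builds a Series with the same four values and applies isnull(), fillna() and dropna() to it. The variable names are chosen for illustration.

```python
import pandas as pd

# Science marks with one missing value (NaN), as in the output above.
science = pd.Series([21.0, 20.0, 19.0, float("nan")], name="Science")

# Check whether any value is missing.
has_missing = science.isnull().any()
print(has_missing)

# Strategy 1: replace the missing value by zero.
filled = science.fillna(0)
print(filled)

# Strategy 2: drop the entry that has the missing value.
dropped = science.dropna()
print(dropped)
```

Both fillna() and dropna() return a new object. The original Series still contains the NaN value.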
where,
Driver = mysql+pymysql
username = user name of the MySQL account (normally it is root)
password = password of the MySQL account
port = port number of the MySQL server; we usually connect to
localhost with port number 3306 (the default port number)
Name of the Database = your database
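The parts listed above are combined into a single connection string. The sketch below assumes the usual SQLAlchemy URL layout, driver://username:password@host:port/database. The helper build_mysql_url and the password "secret" are hypothetical and used only for illustration. The sketch only builds the string, so it runs without a MySQL server.

```python
from urllib.parse import quote_plus


def build_mysql_url(username, password, host, port, database):
    """Compose a connection string for the mysql+pymysql driver.

    Hypothetical helper for illustration only. quote_plus() escapes
    special characters (such as @) that may occur in a password.
    """
    return (f"mysql+pymysql://{username}:{quote_plus(password)}"
            f"@{host}:{port}/{database}")


# Placeholder credentials; replace them with your own.
url = build_mysql_url("root", "secret", "localhost", 3306, "CARSHOWROOM")
print(url)

# With sqlalchemy and pymysql installed, the string would then be
# passed to create_engine():
#     from sqlalchemy import create_engine
#     engine = create_engine(url)
```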
In the following subsections, importing and exporting
data between Pandas and MySQL applications are
demonstrated. For this, we will use the same database
CARSHOWROOM and Table INVENTORY created in
Chapter 1 of this book.
mysql> use CARSHOWROOM ;
Database changed
mysql> select * from INVENTORY;
+-------+---------+-----------+-----------+-----------------+----------+
| CarId | CarName | Price     | Model     | YearManufacture | Fueltype |
+-------+---------+-----------+-----------+-----------------+----------+
| D001  | Car1    | 582613.00 | LXI       | 2017            | Petrol   |
| D002  | Car1    | 673112.00 | VXI       | 2018            | Petrol   |
| B001  | Car2    | 567031.00 | Sigma1.2  | 2019            | Petrol   |
| B002  | Car2    | 647858.00 | Delta1.2  | 2018            | Petrol   |
| E001  | Car3    | 355205.00 | 5 STR STD | 2017            | CNG      |
| E002  | Car3    | 654914.00 | CARE      | 2018            | CNG      |
| S001  | Car4    | 514000.00 | LXI       | 2017            | Petrol   |
| S002  | Car4    | 614000.00 | VXI       | 2018            | Petrol   |
+-------+---------+-----------+-----------+-----------------+----------+
8 rows in set (0.00 sec)
3.9.1 Importing Data from MySQL to Pandas
Importing data from MySQL to pandas refers to the
process of reading a table from a MySQL database and
loading it into a pandas DataFrame. After establishing
the connection, we can fetch data from a table of the
database using the following three functions:
1) pandas.read_sql_query(query,sql_conn)
It is used to read the result of an SQL query (query)
into a DataFrame using the connection identifier
(sql_conn) returned from create_engine().
2) pandas.read_sql_table(table_name,sql_conn)
It is used to read an SQL table (table_name) into a
DataFrame using the connection identifier (sql_conn).
3) pandas.read_sql(sql, sql_conn)
It is used to read either an SQL query or an SQL
table (sql) into a DataFrame using the connection
identifier (sql_conn).
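The use of read_sql_query() can be tried out without a MySQL server. The sketch below is an assumption-laden stand-in: it uses Python's built-in sqlite3 module with an in-memory database in place of MySQL, and loads only three rows of the INVENTORY table shown above. With MySQL, the connection identifier returned from create_engine() would take the place of conn. Note that read_sql_table() needs an SQLAlchemy connection and is therefore not used here.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database standing in for the MySQL database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE INVENTORY (CarId TEXT, CarName TEXT, Price REAL, "
    "Model TEXT, YearManufacture INTEGER, Fueltype TEXT)"
)
conn.executemany(
    "INSERT INTO INVENTORY VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("D001", "Car1", 582613.00, "LXI", 2017, "Petrol"),
        ("D002", "Car1", 673112.00, "VXI", 2018, "Petrol"),
        ("E001", "Car3", 355205.00, "5 STR STD", 2017, "CNG"),
    ],
)
conn.commit()

# Read the complete table into a DataFrame.
df_all = pd.read_sql_query("SELECT * FROM INVENTORY", conn)
print(df_all)

# Read only the rows and columns selected by the query.
df_cng = pd.read_sql_query(
    "SELECT CarId, Price FROM INVENTORY WHERE Fueltype = 'CNG'", conn
)
print(df_cng)

conn.close()
```

The column names of the table become the column labels of the DataFrame.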
>>> df=pd.DataFrame(data)
>>> df.to_sql('showroom_info',engine,if_exists="replace",index=False)
After running this Python script, a MySQL table
with the name “showroom_info” will be created in the
database.
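The behaviour of to_sql() with if_exists="replace" can be checked with the sketch below. As before, an in-memory SQLite database from Python's built-in sqlite3 module stands in for MySQL, so conn takes the place of engine. The contents of data are not shown above, so the two columns used here (ShowroomId and Location) are hypothetical.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database standing in for the MySQL database.
conn = sqlite3.connect(":memory:")

# Hypothetical data, for illustration only.
data = {"ShowroomId": [1, 2], "Location": ["Delhi", "Mumbai"]}
df = pd.DataFrame(data)

# Export the DataFrame twice. With if_exists="replace" the table is
# dropped and created again, so the rows are not duplicated.
df.to_sql("showroom_info", conn, if_exists="replace", index=False)
df.to_sql("showroom_info", conn, if_exists="replace", index=False)

# Read the table back to confirm what was stored.
stored = pd.read_sql_query("SELECT * FROM showroom_info", conn)
print(stored)

conn.close()
```

With index=False the index of the DataFrame is not written as a column of the table.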
Summary
• Descriptive Statistics are used to quantitatively
summarise the given data.
• Pandas provides many statistical functions for
analysis of data. Some of the functions are max(),
min(), mean(), median(), mode(), std(), var() etc.
• Sorting is used to arrange data in a specified
order, i.e. either ascending or descending.
• Indexes or labels of a row or column can be
changed in a DataFrame. This process is known
as Altering the index. Two functions reset_index
and set_index are used for that purpose.
• Missing values are a hindrance in data analysis
and must be handled properly.
• There are two main strategies for handling
missing data. Either the row (or column) having
the missing value is removed completely from the
analysis, or the missing value is replaced by some
appropriate value (which may be zero or one or
average etc.).
• The process of changing the structure of the
DataFrame is known as Reshaping. Pandas provides
two basic functions for this, pivot() and pivot_table().
• pymysql and sqlalchemy are two mandatory
libraries for facilitating import and export of data
between Pandas and MySQL. Before import and
export, a connection needs to be established from
python script to MySQL database.
• Importing data from MySQL to Pandas refers to
the process of fetching data from a MySQL table
or database to a pandas DataFrame.
• Exporting data from Pandas to MySQL refers to the
process of storing data from a pandas DataFrame
to a MySQL table or database.
Exercise
1. Write the statement to install the python connector to
connect MySQL i.e. pymysql.
2. Explain the difference between the pivot() and
pivot_table() functions.
3. What is sqlalchemy?
4. Can you sort a DataFrame with respect to multiple
columns?
5. What are missing values? What are the strategies to
handle them?
6. Define the following terms: Median, Standard
Deviation and variance.
7. What do you understand by the term MODE? Name
the function which is used to calculate it.
8. Write the purpose of Data aggregation.
9. Explain the concept of GROUP BY with the help of an
example.
10. Write the steps required to read data from a MySQL
database to a DataFrame.
11. Explain the importance of reshaping of data with an
example.
TV LG 12000 700
TV VIDEOCON 10000 650
TV LG 15000 800
AC SONY 14000 750