Pandas Plots

Pandas plots
Pandas also provides us options to visualize the data. Here are some of the examples:
Syntax:
df.plot(X, y, marker, kind)
X = value on X axis
y = value on y axis
marker = shape in case of specific plots like a scatter plot
kind = type of plot
A scatter plot to visualize the trend of acceleration in different years.
1. df.plot(x = 'model_year', y = 'acceleration', marker = 'o',

kind = 'scatter');
2.
A bar plot to visualize mean acceleration in different years.
1. df.groupby('model_year').mean()[['acceleration']].plot(kind
= 'bar');
2.
A histogram to visualize the frequency distribution of cylinders
1. df['cylinders'].plot(kind = 'hist')
2.
A scatter plot to visualize the relationship between weight and mpg.
1. df.plot(x = 'weight', y = 'mpg', kind = 'scatter')
2.
A bar plot to visualize the sorted mean values of acceleration with respect to number of
cylinders.
1. df.groupby('cylinders').mean().sort_values('acceleration')
[['acceleration']].plot(kind = 'bar')
2.
Pandas Exercise
Problem Statement:
to find the rainfall dataset that is to be considered for the exercise.. This data contains
region(district) wise rainfall across India.
1. Import the data into Python environment as a Pandas DataFrame.

2. Check for missing values, if any and drop the corresponding rows.
3. Find the district that gets the highest annual rainfall.
4. Display the top 5 states that get the highest annual rainfall.
5. Drop the columns 'Jan-Feb', 'Mar-May', 'Jun-Sep', 'Oct-Dec'.
6. Display the state-wise mean rainfall for all the months using a pivot table.
7. Display the count of districts in each state.
8. For each state, display the district that gets the highest rainfall in May. Also display the
recorded rainfall.
Exercise
Problem Statement:
Given below is a DataFrame df:
1. import pandas as pd
2.
3. df = pd.DataFrame([[0.23,'f1'],[5.36,'f2']],
4. index = list('pq'),
5. columns = list('ab'))
6.
Do the following:
1. Change the column name from ' a' to ' A'.
2. Add a new column ' c' filled with random values.
3. Change the datatype of column ' A' values to complex.
4. Display rows whose any of the element matches with any element of the given list:
1. lst = ['f30','f50','f2','f0']
Merging
There are few more commonly used methods involved in Pandas as listed:
 Merging
 Reshaping
 Pivot Tables
 Grouping
Let us understand their importance by some illustrations starting with merging of datasets.
Given two datasets from a conducted experiment with each dataset having its own feature. Our
task is to form a single dataset combining all the features particular to each observation. To do so
we can take help of concat() function.
2.
3. data1 = pd.DataFrame([[15, 12, -3],
4. [33, 54, 21],
5. [10, 32, 22]],
6. columns = list('ABC'))
7. data2 = pd.DataFrame([[10, 1, 3],
8. [33, -54, 2],
9. [10, 0.32, 2]],
10. columns = list('DEF'))
11. print(data1)
12. # A B C
13. # 0 15 12 -3
14. # 1 33 54 21
15. # 2 10 32 22
16.
17. print(data2)
18. # D E F
19. # 0 10 1.00 3
20. # 1 33 -54.00 2
21. # 2 10 0.32 2
22.
23. print(pd.concat( [data1,data2], axis = 1 ))
24. # A B C D E F
25. # 0 15 12 -3 10 1.00 3
26. # 1 33 54 21 33 -54.00 2
27. # 2 10 32 22 10 0.32 2
28.
29.
Suppose an individual took observation of 3 different features in two instances. Next, she wants
to combine all these samples to form a single dataset. We can achieve this task using the
same concat() function, but this time with the change of axis.
1. import numpy as np
3.
4. data1 = pd.DataFrame(np.random.randn(9).reshape(3,3),
6. data2 = pd.DataFrame(np.arange(9).reshape(3,3),
8. print(data1) # Random values
9. # A B C
10. # 0 1.957218 0.433266 1.214950
11. # 1 -0.143500 -0.092030 -0.823898
12. # 2 0.481486 -0.024111 -0.769195

13.
14. print(data2)
15. # A B C
16. # 0 0 1 2
17. # 1 3 4 5
18. # 2 6 7 8
19.
20. print(pd.concat( [data1,data2], axis = 0 ))
21. # A B C
22. # 0 1.957218 0.433266 1.214950
23. # 1 -0.143500 -0.092030 -0.823898
24. # 2 0.481486 -0.024111 -0.769195
25. # 0 0.000000 1.000000 2.000000
26. # 1 3.000000 4.000000 5.000000
27. # 2 6.000000 7.000000 8.000000
Pivot Tables
Reshaping a dataset is a perfect seed for quick text visualization. However, similar to reshaping we have
another term named as pivot tables which are more efficient in delivering a better visualization.
To understand pivot tables we take the same last dataframe except adding a new feature 'score'.
2.
3. df = pd.DataFrame([
4. ['IND', 'Gold', 'Game1', '9.9'],

5. ['IND', 'Bronze', 'Game2', '8'],
6. ['USA', 'Silver', 'Game1', '9.5'],
7. ['USA', 'Gold', 'Game2', '8.6'],
8. ], columns = ['Country', 'Medal',
9. 'Game', 'Score'],
10. index = ['Year1', 'Year2','Year1', 'Year2'])
11.
12. print(df)
13. # Country Medal Game Score
14. # Year1 IND Gold Game1 9.9
15. # Year2 IND Bronze Game2 8
16. # Year1 USA Silver Game1 9.5
17. # Year2 USA Gold Game2 8.6
18.
Pivot tables come handy when we have to break down a large dataset (in terms of features) into fewer features
for quick visualization. For example, finding which medal is common to both IND and USA, listing game(s) in
which India won Silver, etc.
We implement pivot table on the given dataset:
1. # Listing all the features
2. print(df.pivot(index = 'Country', columns = 'Medal'))
3. # Game Score
4. # Medal Bronze Gold Silver Bronze Gold Silver
5. # Country
6. # IND Game2 Game1 None 8 9.9 None

7. # USA None Game2 Game1 None 8.6 9.5
8.
9. # Listing only Score feature
10. print(df.pivot(index = 'Country', columns = 'Medal',
11. values = 'Score'))
12. # Medal Bronze Gold Silver
13. # Country
14. # IND 8 9.9 None
15. # USA None 8.6 9.5
16.
Pivot table limitations

The previous dataset doesn't include any redundant sample, for instance, let's say 2 people win gold as well as
silver medals in Year1 of Game1 from India. In that case, the above function named pivot() breaks due to
ambiguity in Game column. Therefore, to overcome this limitation, we have another function named
as pivot_table() which requires an argument named as aggfun through which we can clear the ambiguity. Let's
understand it through an example.
1. import numpy as np
3.
4. df = pd.DataFrame([
5. ['IND', 'Gold', 'Game1', '9.9'],
6. ['IND', 'Silver', 'Game1', '9.5'],
7. ['IND', 'Bronze', 'Game2', '8'],
8. ['USA', 'Bronze', 'Game1', '9.0'],

9. ['USA', 'Silver', 'Game2', '8.6'],
10. ], columns = ['Country', 'Medal',
11. 'Game', 'Score'],
12. index = ['Year1', 'Year1', 'Year2','Year1',

'Year2'])
13.
14. print(df)
15. # Country Medal Game Score
16. # Year1 IND Gold Game1 9.9
17. # Year1 IND Silver Game1 9.5
18. # Year2 IND Bronze Game2 8
19. # Year1 USA Bronze Game1 9.0
20. # Year2 USA Silver Game2 8.6
21.
22. print(df.pivot_table(index = 'Country',
23. columns = 'Game',
24. values = 'Score',
25. aggfunc = np.max))
26. # Game Game1 Game2
27. # Country
28. # IND 9.9 8
29. # USA 9.0 8.6
Grouping
To understand grouping, a concept similar to databases, let us consider that we have been given a dataset for
the sales of laptop and desktop systems. The observations for a particular type can be repeated for the different
sales price. In this case, if we need to calculate the total sales of each category then we can group similar data
and apply a certain function.
To do so we create a sample dummy dataset and sum up the total sales particular to each category.
2.
3. df = pd.DataFrame([["Laptop", 1000],
4. ["Laptop", 2520],
5. ["Desktop", 3000],
6. ["Desktop", 400]], columns =

['Category','Sales'])
7.
8. print(df)
9. # Category Sales
10. # 0 Laptop 1000
11. # 1 Laptop 2520
12. # 2 Desktop 3000
13. # 3 Desktop 400
14.
15. print(df.groupby(['Category'], sort = False).sum())
16. # Sales
17. # Category
18. # Laptop 3520
19. # Desktop 3400

Exercise
Problem Statement:
Given a dataframe df which has three attributes defining: set_name: system names, spd_per_day:
Speed per day, speed: Network speed in MBps
1. sys = ['s1','s1','s1','s1',
2. 's2','s2','s2','s2']
3. net_day = ['d1','d1','d2','d2',
4. 'd1','d1','d2','d2']
5. spd = [1.3, 11.4, 5.6, 12.3,
6. 6.2, 1.1, 20.0, 8.8]
7. df = pd.DataFrame({'set_name':sys,
8. 'spd_per_day':net_day,
9. 'speed':spd})
10.
Do the following:
1. Construct a dataframe new_df where the given dataset is grouped based on each system
(s1 and s2) and speed per day (d1 and d2) with the median speed each day per system.
Also, provide a secondary name ' Median' for the speed attribute.
2. Sort the dataframe new_df in the ascending order of the median speed.
Hints/Answers:
1. Answer:
1. set_name spd_per_day speed
2. Median
3. 0 s1 d1 6.35
4. 1 s1 d2 8.95
5. 2 s2 d1 3.65
6. 3 s2 d2 14.40
7.
2. Answer:
1. set_name spd_per_day speed
2. Median
3. 2 s2 d1 3.65
4. 0 s1 d1 6.35
5. 1 s1 d2 8.95
6. 3 s2 d2 14.40

Pandas Plots

Uploaded by

Document Informationclick to expand document information

Copyright:

Available Formats

Pandas Plots

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Pandas Plots

Uploaded by

Copyright:

Available Formats

Pandas plots

df.plot(X, y, marker, kind)

marker = shape in case of specific plots like a scatter plot

kind = type of plot

A scatter plot to visualize the trend of acceleration in different years.

1. df.plot(x = 'model_year', y = 'acceleration', marker = 'o',

A histogram to visualize the frequency distribution of cylinders

1. df.plot(x = 'weight', y = 'mpg', kind = 'scatter')

1. Import the data into Python environment as a Pandas DataFrame.

1. Change the column name from ' a' to ' A'.

2. Add a new column ' c' filled with random values.

3. Change the datatype of column ' A' values to complex.

3. data1 = pd.DataFrame([[15, 12, -3],

4. [33, 54, 21],

5. [10, 32, 22]],

7. data2 = pd.DataFrame([[10, 1, 3],

8. [33, -54, 2],

9. [10, 0.32, 2]],

10. columns = list('DEF'))

23. print(pd.concat( [data1,data2], axis = 1 ))

8. print(data1) # Random values

10. # 0 1.957218 0.433266 1.214950

11. # 1 -0.143500 -0.092030 -0.823898

12. # 2 0.481486 -0.024111 -0.769195

20. print(pd.concat( [data1,data2], axis = 0 ))

22. # 0 1.957218 0.433266 1.214950

23. # 1 -0.143500 -0.092030 -0.823898

24. # 2 0.481486 -0.024111 -0.769195

25. # 0 0.000000 1.000000 2.000000

26. # 1 3.000000 4.000000 5.000000

27. # 2 6.000000 7.000000 8.000000

4. ['IND', 'Gold', 'Game1', '9.9'],

6. ['USA', 'Silver', 'Game1', '9.5'],

7. ['USA', 'Gold', 'Game2', '8.6'],

8. ], columns = ['Country', 'Medal',

10. index = ['Year1', 'Year2','Year1', 'Year2'])

13. # Country Medal Game Score

14. # Year1 IND Gold Game1 9.9

15. # Year2 IND Bronze Game2 8

16. # Year1 USA Silver Game1 9.5

17. # Year2 USA Gold Game2 8.6

We implement pivot table on the given dataset:

1. # Listing all the features

2. print(df.pivot(index = 'Country', columns = 'Medal'))

4. # Medal Bronze Gold Silver Bronze Gold Silver

6. # IND Game2 Game1 None 8 9.9 None

9. # Listing only Score feature

10. print(df.pivot(index = 'Country', columns = 'Medal',

11. values = 'Score'))

12. # Medal Bronze Gold Silver

14. # IND 8 9.9 None

15. # USA None 8.6 9.5

Pivot table limitations

5. ['IND', 'Gold', 'Game1', '9.9'],

6. ['IND', 'Silver', 'Game1', '9.5'],

7. ['IND', 'Bronze', 'Game2', '8'],

8. ['USA', 'Bronze', 'Game1', '9.0'],

10. ], columns = ['Country', 'Medal',