Applied Tech Lesson 45: Pie Chart & Bell Curve
October 6, 2020
Particulars   Description
Topic         Pie Chart & Bell Curve
Class         C45
Instructions for the Teacher:
• Click on the “Quiz Time” button on the bottom-right corner of your screen to start the In-Class Quiz.
• The quiz will be visible to both you and the student. Encourage the student to answer the quiz question.
• If the student chooses the wrong option, help the student think through the question and answer again.
• After the student selects the correct option, the “End Quiz” button will appear on your screen.
• Click the “End Quiz” button to close the quiz pop-up and continue the class.
• Do not spend more than 2 minutes on this quiz.
1.0.3 Recap
[1]: # Run the code cell.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# 'csv_file' holds the path/URL to the air-quality dataset used in the previous class.
df = pd.read_csv(csv_file, sep=';')

# 'dt_series' was built from the 'Date' & 'Time' columns in the previous class.
dt_series = pd.to_datetime(dt_series)

# Remove the 'Date' & 'Time' columns from the DataFrame and insert 'dt_series' in it.

# Get the Pandas series containing the days of the week, i.e., Monday, Tuesday, Wednesday etc.
day_name_series = dt_series.dt.day_name()

# Add the 'Year', 'Month', 'Day' and 'Day Name' columns to the DataFrame.
df['Year'] = year_series
df['Month'] = month_series
df['Day'] = day_series
df['Day Name'] = day_name_series
# Sort the DataFrame by the 'DateTime' values in ascending order. Also, display the first 10 rows of the DataFrame.
df = df.sort_values(by='DateTime')

# Remove all the columns from the 'df' DataFrame containing more than 10% garbage values.

# Create a new DataFrame containing records for the years 2004 and 2005.
aq_2004_df = df[df['Year'] == 2004]
aq_2005_df = df[df['Year'] == 2005]

# Replace the -200 values with the median values for each column having indices between 1 and -4 (excluding both) for the 2004 DataFrame.
/usr/local/lib/python3.6/dist-packages/statsmodels/tools/_testing.py:19:
FutureWarning: pandas.util.testing is deprecated. Use the functions in the
public API at pandas.testing instead.
import pandas.util.testing as tm
/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:66:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
The description of all the columns containing data for air pollutants, temperature, relative humidity and absolute humidity is provided below.
Columns        Description
PT08.S1(CO)    PT08.S1 (tin oxide) hourly averaged sensor response (nominally CO targeted)
C6H6(GT)       True hourly averaged Benzene concentration in µg/m³
PT08.S2(NMHC)  PT08.S2 (titania) hourly averaged sensor response (nominally NMHC targeted)
PT08.S3(NOx)   PT08.S3 (tungsten oxide) hourly averaged sensor response (nominally NOx targeted)
PT08.S4(NO2)   PT08.S4 (tungsten oxide) hourly averaged sensor response (nominally NO2 targeted)
PT08.S5(O3)    PT08.S5 (indium oxide) hourly averaged sensor response (nominally O3 targeted)
T              Temperature in °C
RH             Relative Humidity (%)
AH             Absolute Humidity
[4]: # Concatenate the two DataFrames for 2004 and 2005 to obtain one DataFrame.
df = pd.concat([aq_2004_df, aq_2005_df])
df.head()
[4]: DateTime PT08.S1(CO) C6H6(GT) … Month Day Day Name
510 2004-01-04 00:00:00 1143.0 6.3 … 1 4 Sunday
511 2004-01-04 01:00:00 1044.0 5.1 … 1 4 Sunday
512 2004-01-04 02:00:00 1034.0 4.1 … 1 4 Sunday
513 2004-01-04 03:00:00 956.0 4.0 … 1 4 Sunday
514 2004-01-04 04:00:00 909.0 2.4 … 1 4 Sunday
[5 rows x 14 columns]
<class 'pandas.core.frame.DataFrame'>
Int64Index: 9357 entries, 510 to 8813
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 DateTime 9357 non-null datetime64[ns]
1 PT08.S1(CO) 9357 non-null float64
2 C6H6(GT) 9357 non-null float64
3 PT08.S2(NMHC) 9357 non-null float64
4 PT08.S3(NOx) 9357 non-null float64
5 PT08.S4(NO2) 9357 non-null float64
6 PT08.S5(O3) 9357 non-null float64
7 T 9357 non-null float64
8 RH 9357 non-null float64
9 AH 9357 non-null float64
10 Year 9357 non-null int64
11 Month 9357 non-null int64
12 Day 9357 non-null int64
13 Day Name 9357 non-null object
dtypes: datetime64[ns](1), float64(9), int64(3), object(1)
memory usage: 1.1+ MB
Activity 1: Pie Charts
A pie chart displays the various proportions of data in a dataset through a circular representation wherein each proportion is represented by a slice. The larger the slice, the larger the proportion.
E.g., we can visualise the percentages of observations recorded in 2004 and 2005 using a pie chart.

To create a pie chart, first define the slice proportions through a list, a tuple, a series or an array, and pass it as an input to the pie() function.

Syntax: plt.pie(slice_proportions_array)

The slice proportions can be either the total number of values or the total percentage of values. Either way, the pie() function returns a pie chart in which each slice represents the percentage of values.

You can use the dpi attribute to control the quality of the charts/plots created using the matplotlib and seaborn modules. The term dpi stands for dots per inch.
[6]: # S1.1: Create a pie chart to display the percentage of data collected in 2004 and 2005.
# 'year_slices' is assumed to hold the percentage of records per year, e.g.:
# year_slices = df['Year'].value_counts() * 100 / df['Year'].count()
plt.figure(dpi=108)
plt.title("Percentage of Data Collected in 2004 and 2005")
plt.pie(year_slices)
plt.show()
[7]: # S1.2: Create a pie chart to display the percentage of data collected in 2004 and 2005 without calculating the percentage values for slices.
year_slices = df['Year'].value_counts()
plt.figure(dpi=108)
plt.title("Percentage of Data Collected in 2004 and 2005")
plt.pie(year_slices)
plt.show()
Let’s label the slices in the pie chart so we can identify which slice is 2004 and which is 2005. For this, pass a list, a tuple, a series or an array as an input to the labels parameter of the pie() function.

Additionally, you can pass the {'edgecolor':'red'} dictionary as an input to the wedgeprops parameter of the pie() function to define the colour of the outline of the pie chart.
[8]: # S1.3: Label the slices of a pie chart with their corresponding year values. Also, set 'red' as the outline colour of the chart.
year_labels = year_slices.index  # the year values, i.e., 2004 and 2005
plt.figure(dpi=108)
plt.title("Percentage of Data Collected in 2004 and 2005")
plt.pie(year_slices, labels=year_labels, wedgeprops={'edgecolor':'red'})
plt.show()
You can add the percentage values to the corresponding slices by passing autopct='%1.1f%%' as another parameter to the pie() function. If you change the numeral after the dot (or period) to 2, the pie chart will display the percentage values up to 2 places after the decimal.
[9]: # S1.4: Add percentage values to the corresponding slices.
plt.figure(dpi=108)
plt.title("Percentage of Data Collected in 2004 and 2005")
plt.pie(year_slices, labels=year_labels, autopct='%1.1f%%', wedgeprops={'edgecolor':'red'})
plt.show()
You can separate one or more slices from a pie by passing another parameter called explode to the pie() function. The input to the explode parameter should be a list, tuple etc. containing the amount by which each slice should move away from the centre of the pie.

E.g., let’s move the slice for the year 2005 away from the centre of the pie by a distance of 15% of the radius. For this, we create a list, tuple etc. whose first value is 0 (the slice for 2004 does not move) and whose second value is 0.15, denoting that the slice for 2005 should move away from the centre by 15% of the radius of the pie.
[10]: # T1.1: Separate the slice for the year 2005.
explode = [0, 0.15]
plt.figure(dpi=108)
plt.title("Percentage of Data Collected in 2004 and 2005")
plt.pie(year_slices, labels=year_labels, explode=explode, autopct='%1.2f%%', wedgeprops={'edgecolor':'red'})
plt.show()
You can also give a 3D effect to the pie by adding the shadow=True parameter to the pie() function.
[11]: # S1.5: Add 3D effect to the pie.
plt.figure(dpi=108)
plt.title("Percentage of Data Collected in 2004 and 2005")
plt.pie(year_slices, labels=year_labels, explode=explode, autopct='%1.1f%%', shadow=True, wedgeprops={'edgecolor':'red'})
plt.show()
Similarly, you can create a pie chart to visualise the proportion of the observations recorded in the various months of the year 2005. For this, we need to add another column to the df DataFrame containing the month name for each record.

To get a series containing the month names from a series of datetime objects, use the series.dt.month_name() function.
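For instance, on a small hand-made datetime series (the dates are chosen arbitrarily for illustration):

```python
import pandas as pd

# dt.month_name() maps each datetime value to its full month name.
s = pd.to_datetime(pd.Series(['2004-01-04', '2004-02-10', '2005-03-15']))
print(s.dt.month_name().tolist())  # → ['January', 'February', 'March']
```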
[12]: # S1.6: Get the month names from the 'DateTime' column for each record.
df['DateTime'].dt.month_name()

# S1.7: Add the 'Month Name' column to the 'df' DataFrame and print the first five rows of the updated DataFrame.
df['Month Name'] = df['DateTime'].dt.month_name()
df.head()

[5 rows x 15 columns]
Now create a pie chart for the year 2005 displaying the top 5 months having the most observations. Label the slices with the month names.
[13]: # T1.2: Create a pie chart for 2005 displaying the top 5 months having the most observations. Label the slices with the month names.
# 'data' is assumed to hold the top 5 month counts for 2005, e.g.:
# data = aq_2005_df['Month Name'].value_counts().head(5)
# 'explode' must contain one offset per slice here, e.g.: explode = [0.05] * 5
plt.figure(dpi=108)
plt.title("Percentage of Data Collected in 2005")
plt.pie(data, labels=data.index,
        explode=explode, autopct='%1.2f%%',
        startangle=30,  # the first slice is placed at an angle of 30 degrees w.r.t. the horizontal axis in the anti-clockwise direction
        shadow=True,
        wedgeprops={'edgecolor':'r'})
plt.show()
Note: The sum of all the percentages in a pie chart is always 100.
When not to create a pie chart?

A pie chart has limited use. It should be used to plot the proportions of only a few categories. If there are many category proportions to be visualised, don’t use a pie chart.
[14]: # S1.8: Create a pie chart to visualise the proportions of the observations recorded in each month of the year 2004.
plt.figure(dpi=108)
plt.title("Percentage of Data Collected in 2004")
plt.pie(df['Month'].value_counts())
plt.show()
As you can see, there are too many proportions (12) plotted in the above pie chart, which makes it quite hard to interpret even if we label it.
Activity 2: Bell Curve
Nature loves symmetry. How? Consider a small experiment in which you measure the heights of, say, 10,000 men (or women) in a city. Many individuals would be equally tall. The height of most people will be close to the mean (or average) height of all 10,000 people. Also, 50% of the population (or 5,000 individuals) will have a height less than or equal to the mean height and the other 50% (5,000 individuals) will have a height greater than the mean height. So the distribution of heights will be symmetric around the mean height. Such a distribution is called the normal distribution.
If you create a histogram of the heights of the 10,000 individuals, the arrangement of the bars in the histogram will appear to form a bell shape.

Let’s create a histogram to understand this concept better. First, we will create a numpy array containing 10,000 numbers denoting the heights of 10,000 individuals in a city. Let their mean height be 165 cm and the standard deviation in heights be 15 cm.
[15]: # T2.1: Create a NumPy array containing 10,000 random normally distributed numbers having a mean of 165 cm and a standard deviation of 15 cm.
height_mean = 165
height_std = 15
heights = np.random.normal(height_mean, height_std, size=10000)
heights[:10]
The np.random.normal() function takes the mean, the standard deviation and the size (the number of numbers to be generated) as inputs and returns a numpy array of normally distributed random numbers whose mean and standard deviation are very close to the provided values.
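A quick check of this behaviour (the exact sample statistics will vary slightly from run to run):

```python
import numpy as np

# Draw 10,000 values from a normal distribution with mean 165 and std 15.
sample = np.random.normal(165, 15, size=10000)
# With this many draws, the sample statistics land very close to the
# requested mean (165) and standard deviation (15).
print(np.mean(sample), np.std(sample))
```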
[16]: # T2.2: Calculate the mean and standard deviation of the normally distributed heights.
np.mean(heights), np.std(heights)
There will be a small error between the requested and the calculated mean and standard deviation.
[17]: # T2.3: Create a histogram for the heights.
plt.figure(figsize=(20, 5))
plt.title('Histogram for Heights (in cm)')
plt.hist(heights, bins='sturges', edgecolor='black')  # 'sturges' is one of the ways to compute the number of bins in a histogram
plt.axvline(np.mean(heights), color='red', linewidth=2)  # vertical line at the mean height
plt.show()
The axvline() function creates a vertical line intersecting the x-axis at x = np.mean(heights).
In the above histogram, you can see that the arrangement of bars appears to make a bell shape.
We can create a bell-shaped curve using the distplot() function.
Note: Here is the list of the number of bins determiners that you could use to get the near-perfect
bell curve depending upon the number of data points.
bin_num_determiners = ('fd', 'doane', 'scott', 'stone', 'rice', 'sturges', 'sqrt')
[18]: # T2.4: Create a bell curve using the 'distplot()' function.
plt.figure(figsize=(15, 5), dpi=96)
plt.title("Bell Curve for Heights (in cm)")
sns.distplot(heights, hist=False, bins='sturges', label='Heights')  # 'sturges' is one of the ways to compute the number of bins in a histogram
plt.legend()
plt.grid(which='major', axis='y', color='lightgrey')
plt.show()
Note: At this point, you don’t need to know how the probability density values are computed, because that requires knowledge of the probability density function, which you will learn in subsequent classes. Right now, you just need to know how to interpret a bell curve.
The above graph is a bell curve created using the distplot() function in which the bars are
disabled by passing the False value to the hist parameter. The 'sturges' value passed to the
bins parameter ensures that the bell curve created is equivalent to the histogram created earlier
using the hist() function.
The great thing about the normally distributed values (or values in a bell-shaped curve) is that
approximately
• 68% of the values lie between µ − σ and µ + σ. In other words, approx. 68% of the values lie
within one-sigma around the mean.
• 95% of the values lie between µ − 2σ and µ + 2σ. In other words, approx. 95% of the values
lie within two-sigma around the mean.
• 99.7% of the values lie between µ − 3σ and µ + 3σ. In other words, approx. 99.7% of the
values lie within three-sigma around the mean.
where µ and σ are mean and standard deviation respectively.
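These proportions can be verified empirically. A vectorised sketch using NumPy boolean masks (any freshly drawn normal sample will do):

```python
import numpy as np

# Sample 10,000 heights, as in the activity above.
heights = np.random.normal(165, 15, size=10000)
mu, sigma = np.mean(heights), np.std(heights)

for k in (1, 2, 3):
    # Fraction of values inside [mu - k*sigma, mu + k*sigma].
    within = np.mean((heights >= mu - k * sigma) & (heights <= mu + k * sigma))
    print(f"{k}-sigma: {within * 100:.2f}%")
```

The three printed percentages should come out close to 68%, 95% and 99.7% respectively.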
[19]: # S2.1: Create a bell curve with vertical lines denoting the mean value and the one-sigma interval.
plt.figure(figsize=(15, 5), dpi=96)
sns.distplot(heights, hist=False, bins='sturges', label='Heights')
plt.axvline(np.mean(heights), color='r', label=f'mean = {np.mean(heights):.2f} cm', linewidth=2)
# One-sigma
plt.axvline(np.mean(heights) - np.std(heights), color='b',
            label=f'mu - sigma = {np.mean(heights) - np.std(heights):.2f} cm', linewidth=2)
plt.axvline(np.mean(heights) + np.std(heights), color='b',
            label=f'mu + sigma = {np.mean(heights) + np.std(heights):.2f} cm', linewidth=2)
plt.ylabel("Probability density")
plt.legend()
plt.grid(which='major', axis='y', color='lightgrey')
plt.show()
[20]: # S2.2: Get the percentage of the values lying within one-sigma around the mean.
one_sigma_count = 0
for num in heights:
  if (num >= np.mean(heights) - np.std(heights)) and (num <= np.mean(heights) + np.std(heights)):
    one_sigma_count += 1
round(one_sigma_count * 100 / len(heights), 2)

[20]: 68.22
Another way to interpret the one-sigma interval is that about 68% of the values in a dataset lie close to its mean value; the remaining 32% lie further from the mean. This shows that the mean is indeed the central value.
Another interesting property of a bell curve (or normal distribution) is that the mean, median and
mode values are the same.
[21]: # S2.3: Calculate the mean and the median height values.
np.mean(heights), np.median(heights)  # the calculated mean and median values will be almost the same
From the histogram, it is clear that the mean value has the greatest count, so it is also the modal value, or simply the mode.
The parameters mean, median and mode are called the measures of central tendency: if you were to represent all the observations in a dataset with exactly one central value, what would that value be?
[22]: # S2.4: Create a bell curve with vertical lines denoting the mean value and the two-sigma interval.
plt.figure(figsize=(15, 5), dpi=96)
sns.distplot(heights, hist=False, bins='sturges', label='Heights')
plt.axvline(np.mean(heights), color='r', label=f'mean = {np.mean(heights):.2f} cm', linewidth=2)
# Two-sigma
plt.axvline(np.mean(heights) - 2 * np.std(heights), color='g',
            label=f'mu - 2 * sigma = {np.mean(heights) - 2 * np.std(heights):.2f} cm', linewidth=2)
plt.axvline(np.mean(heights) + 2 * np.std(heights), color='g',
            label=f'mu + 2 * sigma = {np.mean(heights) + 2 * np.std(heights):.2f} cm', linewidth=2)
plt.ylabel("Probability density")
plt.legend()
plt.grid(which='major', axis='y', color='lightgrey')
plt.show()
[23]: # S2.5: Get the percentage of the values lying within two-sigma around the mean.
two_sigma_count = 0
for num in heights:
  if (num >= np.mean(heights) - 2 * np.std(heights)) and (num <= np.mean(heights) + 2 * np.std(heights)):
    two_sigma_count += 1
round(two_sigma_count * 100 / len(heights), 2)

[23]: 95.61
[24]: # S2.6: Create a bell curve with vertical lines denoting the mean value and the three-sigma interval.
plt.figure(figsize=(15, 5), dpi=96)
sns.distplot(heights, hist=False, bins='sturges', label='Heights')
plt.axvline(np.mean(heights), color='r', label=f'mean = {np.mean(heights):.2f} cm', linewidth=2)
# Three-sigma
plt.axvline(np.mean(heights) - 3 * np.std(heights), color='m',
            label=f'mu - 3 * sigma = {np.mean(heights) - 3 * np.std(heights):.2f} cm', linewidth=2)
plt.axvline(np.mean(heights) + 3 * np.std(heights), color='m',
            label=f'mu + 3 * sigma = {np.mean(heights) + 3 * np.std(heights):.2f} cm', linewidth=2)
plt.ylabel("Probability density")
plt.legend()
plt.grid(which='major', axis='y', color='lightgrey')
plt.show()
[25]: # S2.7: Get the percentage of the values lying within three-sigma around the mean.
three_sigma_count = 0
for num in heights:
  if (num >= np.mean(heights) - 3 * np.std(heights)) and (num <= np.mean(heights) + 3 * np.std(heights)):
    three_sigma_count += 1
round(three_sigma_count * 100 / len(heights), 2)

[25]: 99.76
The point of the normal distribution (or bell curve) is that if a set of values follows the normal distribution, then we can make the best guess with
• 68% confidence that the value lies within one sigma of the mean,
• 95% confidence that the value lies within two sigma of the mean,
• 99.7% confidence that the value lies within three sigma of the mean.
Activity 3: Applying Normal Distribution Concepts
Assuming that the relative humidity values in the df DataFrame follow the normal distribution, you can say with 68% confidence that a relative humidity value will lie in the range 32.24% to 66.20%.
[35]: # S3.1: Compute the one-sigma interval for the relative humidity values.
print("Required one-sigma interval ==>", (df['RH'].mean() - df['RH'].std(), df['RH'].mean() + df['RH'].std()), "\n")

# S3.2: Create a histogram for relative humidity values and find out whether it follows a bell curve or not.
plt.figure(figsize=(20, 5))
plt.title("Histogram for RH (in %)")
plt.hist(df['RH'], bins='sturges', edgecolor='black')
plt.axvline(df['RH'].mean(), color='red', label=f"Mean RH = {df['RH'].mean():.2f}", linewidth=2)
# One-sigma
plt.axvline(df['RH'].mean() - df['RH'].std(), color='gold',
            label=f"mu - sigma = {df['RH'].mean() - df['RH'].std():.2f}", linewidth=3)
plt.xlabel("RH")
plt.ylabel("Count")
plt.legend()
plt.show()
Also, the larger the standard deviation, the wider the bell curve; the lower the standard deviation, the narrower the bell curve.
[28]: # T3.1: Create 3 arrays having normally distributed random values. They should have the same length, same mean but different standard deviations.
mu = 150
array1 = np.random.normal(mu, 10, 10000)  # first array having 10,000 values and std = 10
array2 = np.random.normal(mu, 30, 10000)  # second array; an intermediate std of 30 is assumed here
array3 = np.random.normal(mu, 50, 10000)  # third array having 10,000 values and std = 50
plt.figure(figsize=(15, 5), dpi=96)
plt.title("Bell Curve")
sns.distplot(array1, hist=False, bins='sturges', label='First array')   # bell curve for the first array
sns.distplot(array2, hist=False, bins='sturges', label='Second array')  # bell curve for the second array
sns.distplot(array3, hist=False, bins='sturges', label='Third array')   # bell curve for the third array
plt.legend()
plt.show()
In the above graph, the first array has the lowest standard deviation; hence, its bell curve is the narrowest. The third array has the greatest standard deviation; hence, its bell curve is the widest. Evidently, the standard deviation also affects the height of the bell curve.
Not just heights: many physical quantities, such as weights, blood pressures, marks scored by students in an exam etc., follow the bell curve (or normal distribution).
Activity 1: Stack Plots (or Area Plots)
A stack plot, or an area plot, is another good way to visualise the progression of an event. It is similar to a line plot, with the addition of a shaded area below the lines.
Let’s create a stack plot for the daily variation in temperature in the year 2004.
To create a stack plot, use the stackplot() function. It requires two inputs: the values to be plotted on the x-axis and the values to be plotted on the y-axis.

Syntax: plt.stackplot(x_values, y_values)
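A minimal, self-contained sketch of the syntax above, using made-up monthly temperatures rather than the dataset’s values:

```python
import matplotlib.pyplot as plt

# Hypothetical median temperatures for six months.
month_nums = [1, 2, 3, 4, 5, 6]
median_temps = [10.2, 11.5, 14.1, 16.8, 20.3, 24.6]

plt.figure(dpi=96)
plt.title("Monthly Median Temperature (hypothetical values)")
plt.stackplot(month_nums, median_temps)  # line plot with the area below shaded
plt.show()
```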
[29]: # S1.1: Create a stack plot for the daily variation in temperature in the year 2004.
# One plausible version, plotting the hourly temperature values for 2004:
plt.figure(figsize=(20, 5))
plt.title("Variation in Temperature in 2004")
plt.stackplot(aq_2004_df['DateTime'], aq_2004_df['T'])
plt.show()
Now create a stack plot for the monthly median temperature in the years 2004 and 2005.

[30]: # S1.2: Create a stack plot for the monthly median temperature in the years 2004 and 2005.
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
Teacher Activities
1. Pie Chart & Bell Curve (Class Copy)
   https://colab.research.google.com/drive/1LnClZ_A2nIx_ONskdCNc5fPgyTPtaMTj
2. Pie Chart & Bell Curve (Reference)
   https://colab.research.google.com/drive/1u56Gk11U3GzgPFBDvjcGU21224B_cejo