
Applied Tech Lesson 45

October 6, 2020

1 Lesson 45: Pie Chart & Bell Curve


WARNING: The reference notebook is meant ONLY for a teacher. Please DO NOT share it
with any student. The contents of the reference notebook are meant only to prepare a teacher for
a class. To conduct the class, use the class copy of the reference notebook.

Particulars          Description
Topic                Pie Chart & Bell Curve
Class Description    In this class, a student will learn to create a pie chart and a bell curve
Class                C45
Class Time           45 minutes
Goals                - Create a pie chart to summarise the distribution of data across categories
                     - Create a bell curve to compute the confidence interval to allow an analyst to make the best guess based on a normal distribution
Teacher Resources    - Google Account
                     - Laptop with internet connectivity
                     - Earphones with mic
Student Resources    - Google Account
                     - Laptop with internet connectivity
                     - Earphones with mic

1.0.1 Warm-up Quiz


TEACHER
I have an exciting quiz question for you! Are you ready to answer this question?

EXPECTED STUDENT RESPONSE
Yes.

Instructions for the Teacher:
• Please click on the “Quiz Time” button on the bottom right corner of your screen to start the In-Class Quiz.
• A quiz will be visible to both you and the student. Encourage the student to answer the quiz question.
• The student may choose the wrong option; help the student to think correctly about the question and then answer again.
• After the student selects the correct option, the “End Quiz” button will appear on your screen.
• Click the “End Quiz” button to close the quiz pop-up and continue the class.
• Do not spend more than 2 minutes on this quiz.

1.0.2 Teacher-Student Activities


In this class, we will learn to create pie charts and bell curves.
Let's quickly run the code covered in the previous classes and begin this session with Activity 1: Pie Charts.
1.0.3 Recap
[1]: # Run the code cell.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset.
csv_file = 'https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/whitehat-ds-datasets/air-quality/AirQualityUCI.csv'
df = pd.read_csv(csv_file, sep=';')

# Drop the 'Unnamed: 15' & 'Unnamed: 16' columns.
df = df.drop(columns=['Unnamed: 15', 'Unnamed: 16'], axis=1)

# Drop the null values.
df = df.dropna()

# Create a Pandas series containing 'datetime' objects.
dt_series = df['Date'] + ' ' + pd.Series(data=[str(item).replace(".", ":") for item in df['Time']], index=df.index)
dt_series = pd.to_datetime(dt_series)

# Remove the 'Date' & 'Time' columns from the DataFrame and insert 'dt_series' in it.
df = df.drop(columns=['Date', 'Time'], axis=1)
df.insert(loc=0, column='DateTime', value=dt_series)

# Get the Pandas series containing the year values as integers.
year_series = dt_series.dt.year

# Get the Pandas series containing the month values as integers.
month_series = dt_series.dt.month

# Get the Pandas series containing the day values as integers.
day_series = dt_series.dt.day

# Get the Pandas series containing the days of the week, i.e., Monday, Tuesday, Wednesday etc.
day_name_series = dt_series.dt.day_name()

# Add the 'Year', 'Month', 'Day' and 'Day Name' columns to the DataFrame.
df['Year'] = year_series
df['Month'] = month_series
df['Day'] = day_series
df['Day Name'] = day_name_series

# Sort the DataFrame by the 'DateTime' values in ascending order.
df = df.sort_values(by='DateTime')

# Create a function to replace the commas with periods in a Pandas series.
def comma_to_period(series):
    new_series = pd.Series(data=[float(str(item).replace(',', '.')) for item in series], index=df.index)
    return new_series

# Apply the 'comma_to_period()' function on the 'CO(GT)', 'C6H6(GT)', 'T', 'RH' and 'AH' columns.
cols_to_correct = ['CO(GT)', 'C6H6(GT)', 'T', 'RH', 'AH']  # Create a list of column names.
for col in cols_to_correct:  # Iterate through each column.
    df[col] = comma_to_period(df[col])  # Replace the original column with the new series.

# Remove all the columns from the 'df' DataFrame containing more than 10% garbage values.
df = df.drop(columns=['NMHC(GT)', 'CO(GT)', 'NOx(GT)', 'NO2(GT)'], axis=1)

# Create new DataFrames containing records for the years 2004 and 2005.
aq_2004_df = df[df['Year'] == 2004]
aq_2005_df = df[df['Year'] == 2005]

# Replace the -200 values with the median values for each column having indices between 1 and -4 (excluding both) for the 2004 DataFrame.
for col in aq_2004_df.columns[1:-4]:
    median = aq_2004_df[col].median()
    aq_2004_df[col] = aq_2004_df[col].replace(to_replace=-200, value=median)

# Repeat the same exercise for the 2005 DataFrame.
for col in aq_2005_df.columns[1:-4]:
    median = aq_2005_df[col].median()
    aq_2005_df[col] = aq_2005_df[col].replace(to_replace=-200, value=median)

/usr/local/lib/python3.6/dist-packages/statsmodels/tools/_testing.py:19:
FutureWarning: pandas.util.testing is deprecated. Use the functions in the
public API at pandas.testing instead.
  import pandas.util.testing as tm
/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:66:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:71:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

[2]: # Get the columns in the original DataFrame.
df.columns

[2]: Index(['DateTime', 'PT08.S1(CO)', 'C6H6(GT)', 'PT08.S2(NMHC)', 'PT08.S3(NOx)',
       'PT08.S4(NO2)', 'PT08.S5(O3)', 'T', 'RH', 'AH', 'Year', 'Month', 'Day',
       'Day Name'],
      dtype='object')

The description for all the columns containing data for air pollutants, temperature, relative humidity and absolute humidity is provided below.

Columns          Description
PT08.S1(CO)      PT08.S1 (tin oxide) hourly averaged sensor response (nominally CO targeted)
C6H6(GT)         True hourly averaged Benzene concentration in µg/m³
PT08.S2(NMHC)    PT08.S2 (titania) hourly averaged sensor response (nominally NMHC targeted)
PT08.S3(NOx)     PT08.S3 (tungsten oxide) hourly averaged sensor response (nominally NOx targeted)
PT08.S4(NO2)     PT08.S4 (tungsten oxide) hourly averaged sensor response (nominally NO2 targeted)
PT08.S5(O3)      PT08.S5 (indium oxide) hourly averaged sensor response (nominally O3 targeted)
T                Temperature in °C
RH               Relative Humidity (%)
AH               Absolute Humidity

[3]: # Group the DataFrames by the 'Month' column.
group_2004_month = aq_2004_df.groupby(by='Month')
group_2005_month = aq_2005_df.groupby(by='Month')

[4]: # Concatenate the two DataFrames for 2004 and 2005 to obtain one DataFrame.
df = pd.concat([aq_2004_df, aq_2005_df])
df.head()

[4]: DateTime PT08.S1(CO) C6H6(GT) … Month Day Day Name
510 2004-01-04 00:00:00 1143.0 6.3 … 1 4 Sunday
511 2004-01-04 01:00:00 1044.0 5.1 … 1 4 Sunday
512 2004-01-04 02:00:00 1034.0 4.1 … 1 4 Sunday
513 2004-01-04 03:00:00 956.0 4.0 … 1 4 Sunday
514 2004-01-04 04:00:00 909.0 2.4 … 1 4 Sunday

[5 rows x 14 columns]

[5]: # Display information about the DataFrame.
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9357 entries, 510 to 8813
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 DateTime 9357 non-null datetime64[ns]
1 PT08.S1(CO) 9357 non-null float64
2 C6H6(GT) 9357 non-null float64
3 PT08.S2(NMHC) 9357 non-null float64
4 PT08.S3(NOx) 9357 non-null float64
5 PT08.S4(NO2) 9357 non-null float64
6 PT08.S5(O3) 9357 non-null float64
7 T 9357 non-null float64
8 RH 9357 non-null float64
9 AH 9357 non-null float64
10 Year 9357 non-null int64
11 Month 9357 non-null int64
12 Day 9357 non-null int64
13 Day Name 9357 non-null object
dtypes: datetime64[ns](1), float64(9), int64(3), object(1)
memory usage: 1.1+ MB

Activity 1: Pie Charts
A pie chart displays the various proportions of data in a dataset through a circular representation wherein each proportion is represented by a slice. The larger the slice, the larger the proportion.
E.g., we can visualise the percentages of observations recorded in 2004 and 2005 using a pie chart.
To create a pie chart, first you need to define the slice proportions through a list, a tuple, a series or an array and pass it as an input to the pie() function.
Syntax: plt.pie(slice_proportions_array)
The slice proportions can either be the total number of values or the total percentage of values. Either way, the pie() function returns a pie chart in which each slice represents the percentage of values.
You can use the dpi parameter to control the quality of the charts/plots created using the matplotlib and seaborn modules. The term dpi stands for dots per inch.
[6]: # S1.1: Create a pie chart to display the percentage of data collected in 2004 and 2005.
year_slices = df['Year'].value_counts() * 100 / df.shape[0]

plt.figure(dpi=108)
plt.title("Percentage of Data Collected in 2004 and 2005")
plt.pie(year_slices)
plt.show()

[7]: # S1.2: Create a pie chart to display the percentage of data collected in 2004 and 2005 without calculating the percentage values for slices.
year_slices = df['Year'].value_counts()

plt.figure(dpi=108)
plt.title("Percentage of Data Collected in 2004 and 2005")
plt.pie(year_slices)
plt.show()

Let's label the slices in the pie chart to identify the slices for 2004 and 2005. For this, you need to pass a list, a tuple, a series or an array as an input to the labels parameter inside the pie() function.
Additionally, you can pass the {'edgecolor':'red'} dictionary as an input to the wedgeprops parameter inside the pie() function to define the colour of the outline of a pie chart.
[8]: # S1.3: Label the slices of a pie chart with their corresponding year values. Also, set 'red' as the outline colour of the chart.
year_labels = ['Year 2004', 'Year 2005']

plt.figure(dpi=108)
plt.title("Percentage of Data Collected in 2004 and 2005")
plt.pie(year_slices, labels=year_labels, wedgeprops={'edgecolor':'red'})
plt.show()

You can add the percentage values to the corresponding slices by passing autopct='%1.1f%%' as another parameter to the pie() function. If you change the numeral after the dot (or period) to 2, the pie chart will display the percentage values up to 2 places after the decimal.
[9]: # S1.4: Add percentage values to the corresponding slices.
plt.figure(dpi=108)
plt.title("Percentage of Data Collected in 2004 and 2005")
plt.pie(year_slices, labels=year_labels, autopct='%1.1f%%', wedgeprops={'edgecolor':'red'})
plt.show()

You can separate a slice (or more slices) from a pie by passing another parameter called explode to the pie() function. The input to the explode parameter should be a list, tuple etc. containing the amount by which each slice should move away from the centre of the pie.
E.g., let's move the slice for the year 2005 away from the centre of the pie by a distance of 15% of the radius of the pie. For this, we will have to create a list, tuple etc. containing 0 as the first value (the distance the slice for 2004 should move) and 0.15 as the second value, denoting that the slice for the year 2005 should move away from the centre of the pie by a distance of 15% of the radius of the pie.
[10]: # T1.1: Separate the slice for the year 2005.
explode = [0, 0.15]

plt.figure(dpi=108)
plt.title("Percentage of Data Collected in 2004 and 2005")
plt.pie(year_slices, labels=year_labels, explode=explode, autopct='%1.2f%%', wedgeprops={'edgecolor':'red'})
plt.show()

You can also give a 3D effect to the pie by adding the shadow=True parameter to the pie() function.
[11]: # S1.5: Add a 3D effect to the pie.
plt.figure(dpi=108)
plt.title("Percentage of Data Collected in 2004 and 2005")
plt.pie(year_slices, labels=year_labels, explode=explode, autopct='%1.1f%%', shadow=True, wedgeprops={'edgecolor':'red'})
plt.show()

Similarly, you can create a pie chart to visualise the proportion of the observations recorded in various months in the year 2005. For this, we need to add another column to the df DataFrame containing the month name for each record.
To get a series containing the month names from a series containing datetime objects, you can use the series.dt.month_name() function.
[12]: # S1.6: Get the month names from the 'DateTime' column for each record.
df['DateTime'].dt.month_name()

# S1.7: Add the 'Month Name' column to the 'df' DataFrame and print the first five rows of the updated DataFrame.
df['Month Name'] = df['DateTime'].dt.month_name()
df.head()

[12]: DateTime PT08.S1(CO) C6H6(GT) … Day Day Name Month Name
510 2004-01-04 00:00:00 1143.0 6.3 … 4 Sunday January
511 2004-01-04 01:00:00 1044.0 5.1 … 4 Sunday January
512 2004-01-04 02:00:00 1034.0 4.1 … 4 Sunday January
513 2004-01-04 03:00:00 956.0 4.0 … 4 Sunday January
514 2004-01-04 04:00:00 909.0 2.4 … 4 Sunday January

[5 rows x 15 columns]

Now create a pie chart for the year 2005 displaying the top 5 months having the most observations. Label the slices with the month names.
[13]: # T1.2: Create a pie chart for 2005 displaying the top 5 months having the most observations. Label the slices with the month names.
data = df.loc[df['Year'] == 2005, 'Month Name'].value_counts()[:5]
explode = np.linspace(0, 0.5, 5)  # Shift the slices away from the centre of the pie.

plt.figure(dpi=108)
plt.title("Percentage of Data Collected in 2005")
plt.pie(data, labels=data.index,
        explode=explode, autopct='%1.2f%%',
        startangle=30,  # The first slice is placed at an angle of 30 degrees w.r.t. the horizontal axis in the anti-clockwise direction.
        shadow=True,
        wedgeprops={'edgecolor':'r'})
plt.show()

Note: The sum of all the percentages in a pie chart should always be 100.
When not to create a pie chart?
A pie chart has limited use. It should be used to plot the proportions of only a few categories. If there are many category proportions to be visualised, don't use a pie chart.
[14]: # S1.8: Create a pie chart to visualise the proportions of the observations recorded in each month in the year 2004.
plt.figure(dpi=108)
plt.title("Percentage of Data Collected in 2004")
plt.pie(df.loc[df['Year'] == 2004, 'Month'].value_counts())  # Filter to 2004 so the chart matches its title.
plt.show()

As you can see, there are too many proportions, i.e., 12, to be plotted in the above pie chart, which is quite hard to interpret even if we label it.

Activity 2: Bell Curve
Nature loves symmetry. How? Consider a small experiment in which you measure the heights of, say, 10,000 men (or women) in a city. There would be many individuals who are equally tall. The heights of most people will be close to the mean (or average) height of all 10,000 people. Also, 50% of the population (or 5,000 individuals) will have a height less than or equal to the mean height and the other 50% of the population (or 5,000 individuals) will have a height greater than the mean height. So the distribution of heights will be symmetric around the mean of the heights. Such a distribution is called the normal distribution.
If you create a histogram to plot the heights of the 10,000 individuals, the arrangement of the bars in the histogram will appear to form a bell shape.
Let's create a histogram to understand this concept better. First, we will have to create a NumPy array containing 10,000 numbers denoting the heights of 10,000 individuals in a city. Let their mean height be 165 cm and the standard deviation in heights be 15 cm.
[15]: # T2.1: Create a NumPy array containing 10,000 random normally distributed numbers having a mean of 165 cm and a standard deviation of 15 cm.
height_mean = 165
height_std = 15
heights = np.random.normal(height_mean, height_std, size=10000)
heights[:10]

[15]: array([166.01153866, 156.97932004, 162.05261284, 191.17481898,
       187.5502948 , 147.72496541, 163.59929782, 182.18742029,
       184.61383796, 170.10937814])

The np.random.normal() function takes the mean, standard deviation and size (the count of numbers to be generated) as inputs and returns a NumPy array containing normally distributed random numbers whose mean and standard deviation are very close to the provided mean and standard deviation values.
[16]: # T2.2: Calculate the mean and standard deviation of the normally distributed heights.
np.mean(heights), np.std(heights)

[16]: (165.09044751335384, 15.066292539602626)

There will be a small error between the requested and the calculated mean and standard deviation after creating the array.
[17]: # T2.3: Create a histogram for the heights.
plt.figure(figsize=(20, 5))
plt.title('Histogram for Heights (in cm)')
plt.hist(heights, bins='sturges', edgecolor='black')  # 'sturges' is one of the ways to compute the number of bins in a histogram.
plt.axvline(np.mean(heights), color='red', label=f'Mean height = {np.mean(heights):.2f} cm', linewidth=2)  # Creates a vertical line.
plt.xlabel('Height (in cm)')
plt.ylabel("Number of observations")
plt.legend()
plt.show()

The axvline() function creates a vertical line intersecting the x-axis at x = np.mean(heights).

In the above histogram, you can see that the arrangement of the bars appears to make a bell shape.
We can create a bell-shaped curve using the distplot() function.
Note: Here is the list of bin-number determiners that you could use to get a near-perfect bell curve depending upon the number of data points.
bin_num_determiners = ('fd', 'doane', 'scott', 'stone', 'rice', 'sturges', 'sqrt')
[18]: # T2.4: Create a bell curve using the 'distplot()' function.
plt.figure(figsize=(15, 5), dpi=96)
plt.title("Bell Curve for Heights (in cm)")
sns.distplot(heights, hist=False, bins='sturges')  # 'sturges' is one of the ways to compute the number of bins in a histogram.
plt.axvline(np.mean(heights), color='red', label=f'Mean height = {np.mean(heights):.2f} cm', linewidth=2)
plt.ylabel("Probability density")  # The y-axis of the bell curve represents the probability density.
plt.legend()
plt.grid(which='major', axis='y', color='lightgrey')
plt.show()

Note: At this point, you don't need to know how the probability density values are computed, because that requires knowledge of the probability density function, which you will learn in subsequent classes. Right now, you just need to know how to interpret a bell curve.
The above graph is a bell curve created using the distplot() function in which the bars are disabled by passing the False value to the hist parameter. The 'sturges' value passed to the bins parameter ensures that the bell curve created is equivalent to the histogram created earlier using the hist() function.
The great thing about normally distributed values (or values in a bell-shaped curve) is that approximately
• 68% of the values lie between µ − σ and µ + σ. In other words, approx. 68% of the values lie within one sigma around the mean.
• 95% of the values lie between µ − 2σ and µ + 2σ. In other words, approx. 95% of the values lie within two sigma around the mean.
• 99.7% of the values lie between µ − 3σ and µ + 3σ. In other words, approx. 99.7% of the values lie within three sigma around the mean.
where µ and σ are the mean and the standard deviation, respectively.
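These percentages can be cross-checked in one vectorized step (a quick sketch on synthetic data, separate from the lesson's loop-based checks):

```python
import numpy as np

rng = np.random.default_rng(42)
heights = rng.normal(165, 15, 10000)  # synthetic heights, as in the lesson
mu, sigma = heights.mean(), heights.std()

# Fraction of values within k standard deviations of the mean, for k = 1, 2, 3.
for k in (1, 2, 3):
    within = np.mean(np.abs(heights - mu) <= k * sigma)
    print(f"within {k} sigma: {within:.1%}")
```

The three printed fractions should come out close to 68%, 95% and 99.7% for any reasonably large normal sample.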
[19]: # S2.1: Create a bell curve with vertical lines denoting the mean value and the one-sigma interval.
plt.figure(figsize=(15, 5), dpi=96)
plt.title("Bell Curve for Heights (in cm)")
sns.distplot(heights, hist=False, bins='sturges')
plt.axvline(np.mean(heights), color='red', label=f'Mean height = {np.mean(heights):.2f} cm', linewidth=2)

# One-sigma
plt.axvline(np.mean(heights) - np.std(heights), color='b',
            label=f'mu - sigma = {np.mean(heights) - np.std(heights):.2f} cm', linewidth=2)
plt.axvline(np.mean(heights) + np.std(heights), color='b',
            label=f'mu + sigma = {np.mean(heights) + np.std(heights):.2f} cm', linewidth=2)

plt.ylabel("Probability density")
plt.legend()
plt.grid(which='major', axis='y', color='lightgrey')
plt.show()

[20]: # S2.2: Get the percentage of the values lying within one sigma around the mean.
one_sigma_count = 0
for num in heights:
    if (num >= np.mean(heights) - np.std(heights)) and (num <= np.mean(heights) + np.std(heights)):
        one_sigma_count += 1

one_sigma_count * 100 / len(heights)

[20]: 68.22

Another way to interpret the one-sigma interval is that about 68% of the total values in a dataset lie around its mean value. The remaining 32% of the total values lie away from the mean. This shows that the mean is indeed the central value.
Another interesting property of a bell curve (or normal distribution) is that the mean, median and mode values are the same.
[21]: # S2.3: Calculate the mean and the median height values.
np.mean(heights), np.median(heights)  # The calculated mean and median values will be almost the same.

[21]: (165.09044751335384, 165.11234221075898)

From the histogram, it is very clear that the mean value has the greatest count, so it is also the modal value, or simply the mode.
The mean, median and mode are called measures of central tendency. In other words, they answer the question: if you were to represent all the observations in a dataset with exactly one central value, what would that value be?
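Because heights are continuous values, the mode is usually estimated rather than counted directly. One simple estimate (a sketch, not part of the lesson code) takes the midpoint of the tallest histogram bin:

```python
import numpy as np

rng = np.random.default_rng(7)
heights = rng.normal(165, 15, 10000)  # synthetic heights, as in the lesson

# Midpoint of the tallest 'sturges' bin serves as a rough mode estimate.
counts, edges = np.histogram(heights, bins='sturges')
tallest = np.argmax(counts)
mode_estimate = (edges[tallest] + edges[tallest + 1]) / 2
print(round(mode_estimate, 2))
```

For a normal sample, this estimate lands near the mean, illustrating that the mean, median and mode coincide.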
[22]: # S2.4: Create a bell curve with vertical lines denoting the mean value and the two-sigma interval.
plt.figure(figsize=(15, 5), dpi=96)
plt.title("Bell Curve for Heights (in cm)")
sns.distplot(heights, hist=False, bins='sturges')
plt.axvline(np.mean(heights), color='red', label=f'Mean height = {np.mean(heights):.2f} cm', linewidth=2)

# Two-sigma
plt.axvline(np.mean(heights) - 2 * np.std(heights), color='g',
            label=f'mu - 2 * sigma = {np.mean(heights) - 2 * np.std(heights):.2f} cm', linewidth=2)
plt.axvline(np.mean(heights) + 2 * np.std(heights), color='g',
            label=f'mu + 2 * sigma = {np.mean(heights) + 2 * np.std(heights):.2f} cm', linewidth=2)

plt.ylabel("Probability density")
plt.legend()
plt.grid(which='major', axis='y', color='lightgrey')
plt.show()

[23]: # S2.5: Get the percentage of the values lying within two sigma around the mean.
two_sigma_count = 0
for num in heights:
    if (num >= np.mean(heights) - 2 * np.std(heights)) and (num <= np.mean(heights) + 2 * np.std(heights)):
        two_sigma_count += 1

two_sigma_count * 100 / len(heights)

[23]: 95.61

[24]: # S2.6: Create a bell curve with vertical lines denoting the mean value and the three-sigma interval.
plt.figure(figsize=(15, 5), dpi=96)
plt.title("Bell Curve for Heights (in cm)")
sns.distplot(heights, hist=False, bins='sturges')
plt.axvline(np.mean(heights), color='red', label=f'Mean height = {np.mean(heights):.2f} cm', linewidth=2)

# Three-sigma
plt.axvline(np.mean(heights) - 3 * np.std(heights), color='m',
            label=f'mu - 3 * sigma = {np.mean(heights) - 3 * np.std(heights):.2f} cm', linewidth=2)
plt.axvline(np.mean(heights) + 3 * np.std(heights), color='m',
            label=f'mu + 3 * sigma = {np.mean(heights) + 3 * np.std(heights):.2f} cm', linewidth=2)

plt.ylabel("Probability density")
plt.legend()
plt.grid(which='major', axis='y', color='lightgrey')
plt.show()

[25]: # S2.7: Get the percentage of the values lying within three sigma around the mean.
three_sigma_count = 0
for num in heights:
    if (num >= np.mean(heights) - 3 * np.std(heights)) and (num <= np.mean(heights) + 3 * np.std(heights)):
        three_sigma_count += 1

three_sigma_count * 100 / len(heights)

[25]: 99.76

The point of the normal distribution (or bell curve) is that if a certain set of values follows the normal distribution, then we can make the best guess with
• 68% confidence, that a value lies within one sigma of the mean
• 95% confidence, that a value lies within two sigma of the mean
• 99.7% confidence, that a value lies within three sigma of the mean
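The intervals behind these confidence statements can be packaged into a small helper; sigma_interval is a hypothetical name used only for this sketch, not something from the lesson notebooks:

```python
import pandas as pd

def sigma_interval(series, k=1):
    """Return the (mu - k*sigma, mu + k*sigma) interval for a numeric series."""
    mu = series.mean()
    sigma = series.std()
    return (mu - k * sigma, mu + k * sigma)

# Toy example; the lesson applies the same idea to df['RH'].
rh = pd.Series([40.0, 45.0, 50.0, 55.0, 60.0])
lo, hi = sigma_interval(rh, k=1)
print(lo, hi)
```

Passing k=1, 2 or 3 gives the 68%, 95% and 99.7% best-guess ranges, respectively, for a normally distributed column.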

Activity 3: Applying Normal Distribution Concepts
Assuming that the relative humidity values in the df DataFrame follow the normal distribution, then with 68% confidence you can say that a relative humidity value lies between 32.24% and 66.20%.
[35]: # S3.1: Compute the one-sigma interval for the relative humidity values.
print("Required one-sigma interval ==>", (df['RH'].mean() - df['RH'].std(), df['RH'].mean() + df['RH'].std()), "\n")

# S3.2: Create a histogram for the relative humidity values and find out whether it follows a bell curve or not.
plt.figure(figsize=(20, 5))
plt.title("Histogram for RH (in %)")
plt.hist(df['RH'], bins='sturges', edgecolor='black')
plt.axvline(df['RH'].mean(), color='red', label=f"Mean RH = {df['RH'].mean():.2f}", linewidth=2)

# One-sigma
plt.axvline(df['RH'].mean() - df['RH'].std(), color='gold',
            label=f"mu - sigma = {df['RH'].mean() - df['RH'].std():.2f}", linewidth=3)
plt.axvline(df['RH'].mean() + df['RH'].std(), color='gold',
            label=f"mu + sigma = {df['RH'].mean() + df['RH'].std():.2f}", linewidth=3)

plt.xlabel("RH")
plt.ylabel("Number of observations")
plt.legend()
plt.show()

Required one-sigma interval ==> (32.24369369233582, 66.19907642629283)

Also, the larger the standard deviation, the wider the bell curve; the lower the standard deviation, the narrower the bell curve.
[28]: # T3.1: Create 3 arrays having normally distributed random values. They should have the same length, the same mean but different standard deviations.
mu = 150
array1 = np.random.normal(mu, 10, 10000)  # First array having 10,000 values and std = 10
array2 = np.random.normal(mu, 30, 10000)  # Second array having 10,000 values and std = 30
array3 = np.random.normal(mu, 50, 10000)  # Third array having 10,000 values and std = 50

# T3.2: Create bell curves for the above three arrays.
plt.figure(figsize=(14, 5), dpi=96)
plt.title("Bell Curve")
sns.distplot(array1, hist=False, bins='sturges', label='First array')  # Bell curve for the first array
sns.distplot(array2, hist=False, bins='sturges', label='Second array')  # Bell curve for the second array
sns.distplot(array3, hist=False, bins='sturges', label='Third array')  # Bell curve for the third array
plt.axvline(mu, color='black', label=f'Mean = {mu}', linewidth=2)

plt.ylabel("Probability density")
plt.legend(loc='upper left')  # The 'loc' parameter sets the location of the legend on the graph.
plt.grid(which='major', axis='y', color='lightgrey')
plt.show()

In the above graph, the first array has the lowest standard deviation; hence, its bell curve is the narrowest. The third array has the greatest standard deviation; hence, its bell curve is the widest. Evidently, the standard deviation also affects the height of the bell curve.
Not just heights: a lot of physical quantities, such as weights, blood pressures, marks scored by students in an exam etc., follow the bell curve (or normal distribution).

1.0.4 Additional Activities


The activities starting from this point are optional. Please do these activities ONLY if you have
time to spare in the class. Otherwise, skip to the Wrap-Up section. The additional activities will
not be available in the class copy of the notebook. You will have to manually add these activities
in the class copy by adding new text and code cells.
Moreover, you don’t have to do all the additional activities. Depending on the availability of time
in a class, you can choose the number of additional activities to perform from this collection.

Activity 1: Stack Plots (or Area Plots)
A stack plot (or area plot) is another good way to visualise the progression of an event. It is similar to a line plot with the addition of a shaded area below the lines.
Let's create a stack plot for the daily variation in temperature in the year 2004.
To create a stack plot, you need to use the stackplot() function. It requires two inputs: the values to be plotted on the x-axis and the values to be plotted on the y-axis.
Syntax: plt.stackplot(x_values, y_values)
[29]: # S1.1: Create a stack plot for the daily variation in temperature in the year 2004.
plt.figure(figsize=(16, 5), dpi=96)
plt.stackplot(aq_2004_df['DateTime'], aq_2004_df['T'])
plt.xticks(rotation=45)
plt.show()

Now create a stack plot for the monthly median temperature in the years 2004 and 2005.
[30]: # S1.2: Create a stack plot for the monthly median temperature in the years 2004 and 2005.
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

plt.figure(figsize=(16, 5), dpi=96)
plt.title("Monthly Median Temperature Variation in 2004 and 2005")
plt.stackplot(np.arange(1, 13), group_2004_month.median()['T'], group_2005_month.median()['T'], labels=['2004', '2005'])
plt.xticks(ticks=np.arange(1, 13), labels=months)
plt.legend()
plt.show()

Teacher Activities
1. Pie Chart & Bell Curve (Class Copy)
   https://colab.research.google.com/drive/1LnClZ_A2nIx_ONskdCNc5fPgyTPtaMTj
2. Pie Chart & Bell Curve (Reference)
   https://colab.research.google.com/drive/1u56Gk11U3GzgPFBDvjcGU21224B_cejo
