Applied Tech Lesson 45: Pie Chart & Bell Curve
October 6, 2020
Particulars   Description
Topic         Pie Chart & Bell Curve
Class         C45
Instructions for the Teacher:
• Click on the “Quiz Time” button on the bottom-right corner of your screen to start the In-Class Quiz.
• The quiz will be visible to both you and the student. Encourage the student to answer the quiz question.
• If the student chooses the wrong option, help the student think through the question and answer again.
• After the student selects the correct option, the “End Quiz” button will appear on your screen.
• Click the “End Quiz” button to close the quiz pop-up and continue the class.
• Do not spend more than 2 minutes on this quiz.
1.0.3 Recap
[1]: # Run the code cell.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# 'csv_file' holds the path/URL to the air-quality dataset used in the previous class.
df = pd.read_csv(csv_file, sep=';')

# 'dt_series' was built from the 'Date' & 'Time' columns in the previous class.
dt_series = pd.to_datetime(dt_series)

# Remove the 'Date' & 'Time' columns from the DataFrame and insert 'dt_series' in it.

# Get the Pandas series containing the days of the week, i.e., Monday, Tuesday, Wednesday etc.
day_name_series = dt_series.dt.day_name()

# Add the 'Year', 'Month', 'Day' and 'Day Name' columns to the DataFrame.
df['Year'] = year_series
df['Month'] = month_series
df['Day'] = day_series
df['Day Name'] = day_name_series
# Sort the DataFrame by the 'DateTime' values in ascending order. Also, display the first 10 rows of the DataFrame.
df = df.sort_values(by='DateTime')

# Remove all the columns from the 'df' DataFrame containing more than 10% garbage values.

# Create a new DataFrame containing records for the years 2004 and 2005.
aq_2004_df = df[df['Year'] == 2004]
aq_2005_df = df[df['Year'] == 2005]

# Replace the -200 values with the median values for each column having indices between 1 and -4 (excluding both) for the 2004 DataFrame.
/usr/local/lib/python3.6/dist-packages/statsmodels/tools/_testing.py:19:
FutureWarning: pandas.util.testing is deprecated. Use the functions in the
public API at pandas.testing instead.
import pandas.util.testing as tm
/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:66:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
The description of all the columns containing data for air pollutants, temperature, relative humidity and absolute humidity is provided below.
Columns        Description
PT08.S1(CO)    PT08.S1 (tin oxide) hourly averaged sensor response (nominally CO targeted)
C6H6(GT)       True hourly averaged Benzene concentration in µg/m³
PT08.S2(NMHC)  PT08.S2 (titania) hourly averaged sensor response (nominally NMHC targeted)
PT08.S3(NOx)   PT08.S3 (tungsten oxide) hourly averaged sensor response (nominally NOx targeted)
PT08.S4(NO2)   PT08.S4 (tungsten oxide) hourly averaged sensor response (nominally NO2 targeted)
PT08.S5(O3)    PT08.S5 (indium oxide) hourly averaged sensor response (nominally O3 targeted)
T              Temperature in °C
RH             Relative Humidity (%)
AH             Absolute Humidity
[4]: # Concatenate the two DataFrames for 2004 and 2005 to obtain one DataFrame.
df = pd.concat([aq_2004_df, aq_2005_df])
df.head()
[4]: DateTime PT08.S1(CO) C6H6(GT) … Month Day Day Name
510 2004-01-04 00:00:00 1143.0 6.3 … 1 4 Sunday
511 2004-01-04 01:00:00 1044.0 5.1 … 1 4 Sunday
512 2004-01-04 02:00:00 1034.0 4.1 … 1 4 Sunday
513 2004-01-04 03:00:00 956.0 4.0 … 1 4 Sunday
514 2004-01-04 04:00:00 909.0 2.4 … 1 4 Sunday
[5 rows x 14 columns]
<class 'pandas.core.frame.DataFrame'>
Int64Index: 9357 entries, 510 to 8813
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 DateTime 9357 non-null datetime64[ns]
1 PT08.S1(CO) 9357 non-null float64
2 C6H6(GT) 9357 non-null float64
3 PT08.S2(NMHC) 9357 non-null float64
4 PT08.S3(NOx) 9357 non-null float64
5 PT08.S4(NO2) 9357 non-null float64
6 PT08.S5(O3) 9357 non-null float64
7 T 9357 non-null float64
8 RH 9357 non-null float64
9 AH 9357 non-null float64
10 Year 9357 non-null int64
11 Month 9357 non-null int64
12 Day 9357 non-null int64
13 Day Name 9357 non-null object
dtypes: datetime64[ns](1), float64(9), int64(3), object(1)
memory usage: 1.1+ MB
Activity 1: Pie Charts
A pie chart displays the various proportions of data in a dataset through a circular representation wherein each proportion is represented by a slice. The larger the slice, the larger the proportion.
E.g., we can visualise the percentages of observations recorded in 2004 and 2005 using a pie chart.

To create a pie chart, first define the slice proportions through a list, a tuple, a series or an array, and pass it as an input to the pie() function.

Syntax: plt.pie(slice_proportions_array)

The slice proportions can be either the total number of values or the total percentage of values. Either way, the pie() function returns a pie chart in which each slice represents the percentage of values.

You can use the dpi attribute to control the quality of the charts/plots created using the matplotlib and seaborn modules. The term dpi stands for dots per inch.
[6]: # S1.1: Create a pie chart to display the percentage of data collected in 2004 and 2005.
# 'year_slices' is assumed to hold the percentage of records per year, e.g.:
# year_slices = df['Year'].value_counts() * 100 / df['Year'].count()
plt.figure(dpi=108)
plt.title("Percentage of Data Collected in 2004 and 2005")
plt.pie(year_slices)
plt.show()
[7]: # S1.2: Create a pie chart to display the percentage of data collected in 2004 and 2005 without calculating the percentage values for slices.
year_slices = df['Year'].value_counts()
plt.figure(dpi=108)
plt.title("Percentage of Data Collected in 2004 and 2005")
plt.pie(year_slices)
plt.show()
Let’s label the slices in the pie chart so we can identify which slice is 2004 and which is 2005. For this, pass a list, a tuple, a series or an array as an input to the labels parameter of the pie() function.

Additionally, you can pass the {'edgecolor':'red'} dictionary as an input to the wedgeprops parameter of the pie() function to define the colour of the outline of the pie chart.
[8]: # S1.3: Label the slices of a pie chart with their corresponding year values. Also, set 'red' as the outline colour of the chart.
year_labels = year_slices.index  # the year values, i.e., 2004 and 2005
plt.figure(dpi=108)
plt.title("Percentage of Data Collected in 2004 and 2005")
plt.pie(year_slices, labels=year_labels, wedgeprops={'edgecolor':'red'})
plt.show()
You can add the percentage values to the corresponding slices by passing autopct='%1.1f%%' as another parameter to the pie() function. If you change the numeral after the dot (or period) to 2, the pie chart will display the percentage values up to 2 places after the decimal.
[9]: # S1.4: Add percentage values to the corresponding slices.
plt.figure(dpi=108)
plt.title("Percentage of Data Collected in 2004 and 2005")
plt.pie(year_slices, labels=year_labels, autopct='%1.1f%%', wedgeprops={'edgecolor':'red'})
plt.show()
You can separate one or more slices from a pie by passing another parameter called explode to the pie() function. The input to the explode parameter should be a list, tuple etc. containing the amount by which each slice should move away from the centre of the pie.

E.g., let’s move the slice for the year 2005 away from the centre of the pie by a distance of 15% of the radius. For this, we create a list, tuple etc. whose first value is 0 (the slice for 2004 does not move) and whose second value is 0.15, denoting that the slice for 2005 should move away from the centre by 15% of the radius of the pie.
[10]: # T1.1: Separate the slice for the year 2005.
explode = [0, 0.15]
plt.figure(dpi=108)
plt.title("Percentage of Data Collected in 2004 and 2005")
plt.pie(year_slices, labels=year_labels, explode=explode, autopct='%1.2f%%', wedgeprops={'edgecolor':'red'})
plt.show()
You can also give a 3D effect to the pie by adding the shadow=True parameter to the pie() function.
[11]: # S1.5: Add 3D effect to the pie.
plt.figure(dpi=108)
plt.title("Percentage of Data Collected in 2004 and 2005")
plt.pie(year_slices, labels=year_labels, explode=explode, autopct='%1.1f%%', shadow=True, wedgeprops={'edgecolor':'red'})
plt.show()
Similarly, you can create a pie chart to visualise the proportion of the observations recorded in the various months of the year 2005. For this, we need to add another column to the df DataFrame containing the month name for each record.

To get a series containing the month names from a series of datetime objects, use the series.dt.month_name() function.
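For instance, on a small hand-made datetime series (the dates are chosen arbitrarily for illustration):

```python
import pandas as pd

# dt.month_name() maps each datetime value to its full month name.
s = pd.to_datetime(pd.Series(['2004-01-04', '2004-02-10', '2005-03-15']))
print(s.dt.month_name().tolist())  # → ['January', 'February', 'March']
```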
[12]: # S1.6: Get the month names from the 'DateTime' column for each record.
df['DateTime'].dt.month_name()

# S1.7: Add the 'Month Name' column to the 'df' DataFrame and print the first five rows of the updated DataFrame.
df['Month Name'] = df['DateTime'].dt.month_name()
df.head()

[5 rows x 15 columns]
Now create a pie chart for the year 2005 displaying the top 5 months having the most observations. Label the slices with the month names.
[13]: # T1.2: Create a pie chart for 2005 displaying the top 5 months having the most observations. Label the slices with the month names.
# 'data' is assumed to hold the top 5 month counts for 2005, e.g.:
# data = aq_2005_df['Month Name'].value_counts().head(5)
# 'explode' must contain one offset per slice here, e.g.: explode = [0.05] * 5
plt.figure(dpi=108)
plt.title("Percentage of Data Collected in 2005")
plt.pie(data, labels=data.index,
        explode=explode, autopct='%1.2f%%',
        startangle=30,  # the first slice is placed at an angle of 30 degrees w.r.t. the horizontal axis in the anti-clockwise direction
        shadow=True,
        wedgeprops={'edgecolor':'r'})
plt.show()
Note: The sum of all the percentages in a pie chart is always 100.
When not to create a pie chart?

A pie chart has limited use. It should be used to plot the proportions of only a few categories. If there are many category proportions to be visualised, don’t use a pie chart.
[14]: # S1.8: Create a pie chart to visualise the proportions of the observations recorded in each month of the year 2004.
plt.figure(dpi=108)
plt.title("Percentage of Data Collected in 2004")
plt.pie(df['Month'].value_counts())
plt.show()
As you can see, there are too many proportions (12) plotted in the above pie chart, which makes it quite hard to interpret even if we label it.
Activity 2: Bell Curve
Nature loves symmetry. How? Consider a small experiment in which you measure the heights of, say, 10,000 men (or women) in a city. Many individuals would be equally tall. The height of most people will be close to the mean (or average) height of all 10,000 people. Also, 50% of the population (or 5,000 individuals) will have a height less than or equal to the mean height and the other 50% (5,000 individuals) will have a height greater than the mean height. So the distribution of heights will be symmetric around the mean height. Such a distribution is called the normal distribution.
If you create a histogram of the heights of the 10,000 individuals, the arrangement of the bars in the histogram will appear to form a bell shape.

Let’s create a histogram to understand this concept better. First, we will create a numpy array containing 10,000 numbers denoting the heights of 10,000 individuals in a city. Let their mean height be 165 cm and the standard deviation in heights be 15 cm.
[15]: # T2.1: Create a NumPy array containing 10,000 random normally distributed numbers having a mean of 165 cm and a standard deviation of 15 cm.
height_mean = 165
height_std = 15
heights = np.random.normal(height_mean, height_std, size=10000)
heights[:10]
The np.random.normal() function takes the mean, the standard deviation and the size (the number of numbers to be generated) as inputs and returns a numpy array of normally distributed random numbers whose mean and standard deviation are very close to the provided values.
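A quick check of this behaviour (the exact sample statistics will vary slightly from run to run):

```python
import numpy as np

# Draw 10,000 values from a normal distribution with mean 165 and std 15.
sample = np.random.normal(165, 15, size=10000)
# With this many draws, the sample statistics land very close to the
# requested mean (165) and standard deviation (15).
print(np.mean(sample), np.std(sample))
```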
[16]: # T2.2: Calculate the mean and standard deviation of the normally distributed heights.
np.mean(heights), np.std(heights)
There will be a small error between the requested and the calculated mean and standard deviation.
[17]: # T2.3: Create a histogram for the heights.
plt.figure(figsize=(20, 5))
plt.title('Histogram for Heights (in cm)')
plt.hist(heights, bins='sturges', edgecolor='black')  # 'sturges' is one of the ways to compute the number of bins in a histogram
plt.axvline(np.mean(heights), color='red', linewidth=2)  # vertical line at the mean height
plt.show()
The axvline() function creates a vertical line intersecting the x-axis at x = np.mean(heights).
In the above histogram, you can see that the arrangement of bars appears to make a bell shape.
We can create a bell-shaped curve using the distplot() function.
Note: Here is the list of the number of bins determiners that you could use to get the near-perfect
bell curve depending upon the number of data points.
bin_num_determiners = ('fd', 'doane', 'scott', 'stone', 'rice', 'sturges', 'sqrt')
[18]: # T2.4: Create a bell curve using the 'distplot()' function.
plt.figure(figsize=(15, 5), dpi=96)
plt.title("Bell Curve for Heights (in cm)")
sns.distplot(heights, hist=False, bins='sturges', label='Heights')  # 'sturges' is one of the ways to compute the number of bins in a histogram
plt.legend()
plt.grid(which='major', axis='y', color='lightgrey')
plt.show()
Note: At this point, you don’t need to know how the probability density values are computed, because that requires knowledge of the probability density function, which you will learn in subsequent classes. Right now, you just need to know how to interpret a bell curve.
The above graph is a bell curve created using the distplot() function in which the bars are
disabled by passing the False value to the hist parameter. The 'sturges' value passed to the
bins parameter ensures that the bell curve created is equivalent to the histogram created earlier
using the hist() function.
The great thing about the normally distributed values (or values in a bell-shaped curve) is that
approximately
• 68% of the values lie between µ − σ and µ + σ. In other words, approx. 68% of the values lie
within one-sigma around the mean.
• 95% of the values lie between µ − 2σ and µ + 2σ. In other words, approx. 95% of the values
lie within two-sigma around the mean.
• 99.7% of the values lie between µ − 3σ and µ + 3σ. In other words, approx. 99.7% of the
values lie within three-sigma around the mean.
where µ and σ are mean and standard deviation respectively.
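These proportions can be verified empirically. A vectorised sketch using NumPy boolean masks (any freshly drawn normal sample will do):

```python
import numpy as np

# Sample 10,000 heights, as in the activity above.
heights = np.random.normal(165, 15, size=10000)
mu, sigma = np.mean(heights), np.std(heights)

for k in (1, 2, 3):
    # Fraction of values inside [mu - k*sigma, mu + k*sigma].
    within = np.mean((heights >= mu - k * sigma) & (heights <= mu + k * sigma))
    print(f"{k}-sigma: {within * 100:.2f}%")
```

The three printed percentages should come out close to 68%, 95% and 99.7% respectively.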
[19]: # S2.1: Create a bell curve with vertical lines denoting the mean value and the one-sigma interval.
plt.figure(figsize=(15, 5), dpi=96)
sns.distplot(heights, hist=False, bins='sturges', label='Heights')
plt.axvline(np.mean(heights), color='r', label=f'mean = {np.mean(heights):.2f} cm', linewidth=2)
# One-sigma
plt.axvline(np.mean(heights) - np.std(heights), color='b',
            label=f'mu - sigma = {np.mean(heights) - np.std(heights):.2f} cm', linewidth=2)
plt.axvline(np.mean(heights) + np.std(heights), color='b',
            label=f'mu + sigma = {np.mean(heights) + np.std(heights):.2f} cm', linewidth=2)
plt.ylabel("Probability density")
plt.legend()
plt.grid(which='major', axis='y', color='lightgrey')
plt.show()
[20]: # S2.2: Get the percentage of the values lying within one-sigma around the mean.
one_sigma_count = 0
for num in heights:
  if (num >= np.mean(heights) - np.std(heights)) and (num <= np.mean(heights) + np.std(heights)):
    one_sigma_count += 1
round(one_sigma_count * 100 / len(heights), 2)

[20]: 68.22
Another way to interpret the one-sigma interval is that about 68% of the values in a dataset lie close to its mean value; the remaining 32% lie further from the mean. This shows that the mean is indeed the central value.
Another interesting property of a bell curve (or normal distribution) is that the mean, median and
mode values are the same.
[21]: # S2.3: Calculate the mean and the median height values.
np.mean(heights), np.median(heights)  # the calculated mean and median values will be almost the same
From the histogram, it is clear that the mean value has the greatest count, so it is also the modal value, or simply the mode.
The parameters mean, median and mode are called the measures of central tendency: if you were to represent all the observations in a dataset with exactly one central value, what would that value be?
[22]: # S2.4: Create a bell curve with vertical lines denoting the mean value and the two-sigma interval.
plt.figure(figsize=(15, 5), dpi=96)
sns.distplot(heights, hist=False, bins='sturges', label='Heights')
plt.axvline(np.mean(heights), color='r', label=f'mean = {np.mean(heights):.2f} cm', linewidth=2)
# Two-sigma
plt.axvline(np.mean(heights) - 2 * np.std(heights), color='g',
            label=f'mu - 2 * sigma = {np.mean(heights) - 2 * np.std(heights):.2f} cm', linewidth=2)
plt.axvline(np.mean(heights) + 2 * np.std(heights), color='g',
            label=f'mu + 2 * sigma = {np.mean(heights) + 2 * np.std(heights):.2f} cm', linewidth=2)
plt.ylabel("Probability density")
plt.legend()
plt.grid(which='major', axis='y', color='lightgrey')
plt.show()
[23]: # S2.5: Get the percentage of the values lying within two-sigma around the mean.
two_sigma_count = 0
for num in heights:
  if (num >= np.mean(heights) - 2 * np.std(heights)) and (num <= np.mean(heights) + 2 * np.std(heights)):
    two_sigma_count += 1
round(two_sigma_count * 100 / len(heights), 2)

[23]: 95.61
[24]: # S2.6: Create a bell curve with vertical lines denoting the mean value and the three-sigma interval.
plt.figure(figsize=(15, 5), dpi=96)
sns.distplot(heights, hist=False, bins='sturges', label='Heights')
plt.axvline(np.mean(heights), color='r', label=f'mean = {np.mean(heights):.2f} cm', linewidth=2)
# Three-sigma
plt.axvline(np.mean(heights) - 3 * np.std(heights), color='m',
            label=f'mu - 3 * sigma = {np.mean(heights) - 3 * np.std(heights):.2f} cm', linewidth=2)
plt.axvline(np.mean(heights) + 3 * np.std(heights), color='m',
            label=f'mu + 3 * sigma = {np.mean(heights) + 3 * np.std(heights):.2f} cm', linewidth=2)
plt.ylabel("Probability density")
plt.legend()
plt.grid(which='major', axis='y', color='lightgrey')
plt.show()
[25]: # S2.7: Get the percentage of the values lying within three-sigma around the mean.
three_sigma_count = 0
for num in heights:
  if (num >= np.mean(heights) - 3 * np.std(heights)) and (num <= np.mean(heights) + 3 * np.std(heights)):
    three_sigma_count += 1
round(three_sigma_count * 100 / len(heights), 2)

[25]: 99.76
The point of the normal distribution (or bell curve) is that if a set of values follows the normal distribution, then we can make the best guess with
• 68% confidence that the value lies within one sigma of the mean,
• 95% confidence that the value lies within two sigma of the mean,
• 99.7% confidence that the value lies within three sigma of the mean.
Activity 3: Applying Normal Distribution Concepts
Assuming that the relative humidity values in the df DataFrame follow the normal distribution, you can say with 68% confidence that a relative humidity value will lie in the range 32.24% to 66.20%.
[35]: # S3.1: Compute the one-sigma interval for the relative humidity values.
print("Required one-sigma interval ==>", (df['RH'].mean() - df['RH'].std(), df['RH'].mean() + df['RH'].std()), "\n")

# S3.2: Create a histogram for relative humidity values and find out whether it follows a bell curve or not.
plt.figure(figsize=(20, 5))
plt.title("Histogram for RH (in %)")
plt.hist(df['RH'], bins='sturges', edgecolor='black')
plt.axvline(df['RH'].mean(), color='red', label=f"Mean RH = {df['RH'].mean():.2f}", linewidth=2)
# One-sigma
plt.axvline(df['RH'].mean() - df['RH'].std(), color='gold',
            label=f"mu - sigma = {df['RH'].mean() - df['RH'].std():.2f}", linewidth=3)
plt.xlabel("RH")
plt.ylabel("Count")
plt.legend()
plt.show()
Also, the larger the standard deviation, the wider the bell curve; the lower the standard deviation, the narrower the bell curve.
[28]: # T3.1: Create 3 arrays having normally distributed random values. They should have the same length, same mean but different standard deviations.
mu = 150
array1 = np.random.normal(mu, 10, 10000)  # first array having 10,000 values and std = 10
array2 = np.random.normal(mu, 30, 10000)  # second array; an intermediate std of 30 is assumed here
array3 = np.random.normal(mu, 50, 10000)  # third array having 10,000 values and std = 50
plt.figure(figsize=(15, 5), dpi=96)
plt.title("Bell Curve")
sns.distplot(array1, hist=False, bins='sturges', label='First array')   # bell curve for the first array
sns.distplot(array2, hist=False, bins='sturges', label='Second array')  # bell curve for the second array
sns.distplot(array3, hist=False, bins='sturges', label='Third array')   # bell curve for the third array
plt.legend()
plt.show()
In the above graph, the first array has the lowest standard deviation; hence, its bell curve is the narrowest. The third array has the greatest standard deviation; hence, its bell curve is the widest. Evidently, the standard deviation also affects the height of the bell curve.
Not just heights: many physical quantities, such as weights, blood pressures, marks scored by students in an exam etc., follow the bell curve (or normal distribution).
Activity 1: Stack Plots (or Area Plots)
A stack plot, or an area plot, is another good way to visualise the progression of an event. It is similar to a line plot, with the addition of a shaded area below the lines.
Let’s create a stack plot for the daily variation in temperature in the year 2004.
To create a stack plot, use the stackplot() function. It requires two inputs: the values to be plotted on the x-axis and the values to be plotted on the y-axis.

Syntax: plt.stackplot(x_values, y_values)
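A minimal, self-contained sketch of the syntax above, using made-up monthly temperatures rather than the dataset’s values:

```python
import matplotlib.pyplot as plt

# Hypothetical median temperatures for six months.
month_nums = [1, 2, 3, 4, 5, 6]
median_temps = [10.2, 11.5, 14.1, 16.8, 20.3, 24.6]

plt.figure(dpi=96)
plt.title("Monthly Median Temperature (hypothetical values)")
plt.stackplot(month_nums, median_temps)  # line plot with the area below shaded
plt.show()
```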
[29]: # S1.1: Create a stack plot for the daily variation in temperature in the year 2004.
# One plausible version, plotting the hourly temperature values for 2004:
plt.figure(figsize=(20, 5))
plt.title("Variation in Temperature in 2004")
plt.stackplot(aq_2004_df['DateTime'], aq_2004_df['T'])
plt.show()
Now create a stack plot for the monthly median temperature in the years 2004 and 2005.

[30]: # S1.2: Create a stack plot for the monthly median temperature in the years 2004 and 2005.
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
Teacher Activities
1. Pie Chart & Bell Curve (Class Copy)
   https://colab.research.google.com/drive/1LnClZ_A2nIx_ONskdCNc5fPgyTPtaMTj
2. Pie Chart & Bell Curve (Reference)
   https://colab.research.google.com/drive/1u56Gk11U3GzgPFBDvjcGU21224B_cejo