Lec 7 Data Visualization Basic Statistics Updated 21102024 122008pm


Statistics for data science
Example 1
• Create a new feature called "Family Size" by combining the SibSp (siblings/spouses) and Parch (parents/children) columns. How does family size affect the survival rate? Use a bar plot to visualize the survival rates for different family sizes.
import pandas as pd
import matplotlib.pyplot as plt
# Step 1: Create the "Family Size" feature
df['Family Size'] = df['SibSp'] + df['Parch'] + 1
# Display the result
print(df['Family Size'].value_counts())
# Step 2: Calculate the survival rate for each family size
# Note: count() returns the number of passengers per family size, not a rate;
# the corrected version that follows uses mean() instead.
family_survival_rate = df.groupby('Family Size')['Survived'].count()
print(family_survival_rate)
# Step 3: Plotting the survival rates for different family sizes
plt.figure(figsize=(10, 6))
plt.bar(family_survival_rate.index, family_survival_rate.values, color='skyblue')
plt.title('Survival Rate Based on Family Size')
plt.xlabel('Family Size')
plt.ylabel('Survival Rate')
plt.xticks(family_survival_rate.index)  # To show each family size as a tick
plt.show()
import pandas as pd
import matplotlib.pyplot as plt

# Sample DataFrame creation (Replace this with your actual data loading)
# df = pd.read_csv('your_dataset.csv')

# Step 1: Create the "Family Size" feature
df['Family Size'] = df['SibSp'] + df['Parch'] + 1

# Step 2: Calculate the survival rate for each family size
# mean() calculates the average of the "Survived" column for each group. Since
# the values in "Survived" are either 0 or 1, the mean effectively represents
# the proportion of survivors for each family size.
family_survival_rate = df.groupby('Family Size')['Survived'].mean()

# Step 3: Plotting the survival rates for different family sizes
plt.figure(figsize=(10, 6))
plt.bar(family_survival_rate.index, family_survival_rate.values, color='skyblue')
plt.title('Survival Rate Based on Family Size')
plt.xlabel('Family Size')
plt.ylabel('Survival Rate')
plt.xticks(family_survival_rate.index)  # To show each family size as a tick
plt.show()
family_survival_rate = df.groupby('Family Size')['Survived'].count()
Counts the number of non-null entries in the "Survived" column for each group (i.e., each family size).

family_survival_rate = df.groupby('Family Size')['Survived'].mean()
Calculates the mean (average) of the "Survived" column for each group. Since the values in "Survived" are either 0 or 1, the mean effectively represents the proportion of survivors for each family size.
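The difference between count() and mean() can be checked on a tiny table. The sketch below uses a made-up six-row DataFrame (the values are hypothetical, not taken from the Titanic data) and puts both results side by side.

```python
import pandas as pd

# Hypothetical toy data; only the column names follow the Titanic dataset.
toy = pd.DataFrame({
    "SibSp":    [0, 1, 1, 0, 2, 0],
    "Parch":    [0, 0, 1, 0, 1, 0],
    "Survived": [0, 1, 1, 1, 0, 0],
})
toy["Family Size"] = toy["SibSp"] + toy["Parch"] + 1

grouped = toy.groupby("Family Size")["Survived"]
counts = grouped.count()   # number of passengers in each group
rates = grouped.mean()     # share of survivors in each group (0/1 column)

summary = pd.DataFrame({"passengers": counts, "survival_rate": rates})
print(summary)
```

Plotting `counts` would show how many passengers fall into each family size, while plotting `rates` shows the survival rate the exercise asks for.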
Example 2
Fare denotes the fare paid by a passenger. As the values in this column are continuous, they need to be put into separate bins (divide Fare into 4 bins) to get a clear picture. Plot the relationship between survival and fare.
sns.histplot(data=df, x='Fare Bin', hue='Survived', multiple='stack', stat='count', palette='pastel')

sns.histplot(data=df, x='Fare Bin', hue='Survived', stat='count', palette='pastel')
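The plotting calls rely on a 'Fare Bin' column that the slides do not show being created. A minimal sketch of one way to build it follows, assuming equal-frequency bins via pd.qcut; the bin labels and the sample fares are hypothetical, and pd.cut could be used instead if equal-width bins are wanted.

```python
import pandas as pd

# Hypothetical sample; only the column names follow the Titanic dataset.
fares = pd.DataFrame({
    "Fare":     [7.25, 8.05, 13.0, 26.0, 31.0, 53.1, 71.28, 512.33],
    "Survived": [0,    0,    0,    1,    0,    1,    1,     1],
})

# Assumption: four quantile-based bins with made-up labels.
labels = ["Low", "Medium", "High", "Very High"]
fares["Fare Bin"] = pd.qcut(fares["Fare"], q=4, labels=labels)

# Survival rate per fare bin (mean of a 0/1 column).
rate_by_bin = fares.groupby("Fare Bin", observed=False)["Survived"].mean()
print(rate_by_bin)
```

With the column in place, the histplot calls can use x='Fare Bin' directly.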
• “An Outlier is that observation which is significantly different from
all other observations.”

There are several ways to treat outliers in a dataset, depending on the nature of the outliers and the problem being solved.
Trimming
Trimming excludes the outlier values from the analysis. With this technique, the dataset shrinks noticeably when many outliers are present. Its main advantage is that it is the fastest approach.
• For Normal Distributions
Use the empirical rule of the normal distribution.
The data points that fall below mean - 3*(sigma) or above mean + 3*(sigma) are outliers, where mean and sigma are the average value and standard deviation of a particular column.
• For Skewed Distributions
Use the Inter-Quartile Range (IQR) proximity rule.
The data points that fall below Q1 - 1.5 IQR or above Q3 + 1.5 IQR are outliers, where Q1 and Q3 are the 25th and 75th percentiles of the dataset, respectively. IQR represents the inter-quartile range and is given by Q3 - Q1.
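The two rules above can be sketched as small helper functions. The function names and the sample values are hypothetical; note that pandas' std() uses the sample standard deviation (ddof=1) by default.

```python
import pandas as pd

def three_sigma_bounds(s):
    """Lower/upper limits from the mean +/- 3 * sigma rule."""
    mean, sigma = s.mean(), s.std()
    return mean - 3 * sigma, mean + 3 * sigma

def iqr_bounds(s, k=1.5):
    """Lower/upper limits from the IQR proximity rule."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical, strongly skewed sample with one extreme value.
sample = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

z_low, z_high = three_sigma_bounds(sample)
i_low, i_high = iqr_bounds(sample)
print("3-sigma limits:", z_low, z_high)
print("IQR limits:    ", i_low, i_high)
```

On this sample the value 100 lies above the IQR upper limit but below the 3-sigma upper limit, because the extreme value inflates the standard deviation. This is one reason the IQR rule is preferred for skewed data.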
Z-SCORE Method
• Finding the boundary values
• print("Highest allowed", df['cgpa'].mean() + 3*df['cgpa'].std())
• print("Lowest allowed", df['cgpa'].mean() - 3*df['cgpa'].std())

Output:
Highest allowed 8.808933625397177
Lowest allowed 5.113546374602842
• Finding the outliers
• df[(df['cgpa'] > 8.80) | (df['cgpa'] < 5.11)]
• Capping on outliers
• upper_limit = df['cgpa'].mean() + 3*df['cgpa'].std()
• lower_limit = df['cgpa'].mean() - 3*df['cgpa'].std()
• Now, apply the capping
• df['cgpa'] = df['cgpa'].clip(lower=lower_limit, upper=upper_limit)
• df['cgpa'].describe()
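To see how trimming and capping differ in effect, the sketch below applies both to synthetic data. The 'cgpa' values are randomly generated for illustration (with two extreme values injected by hand) and are not the dataset used in the slides.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic CGPA-like values plus two injected extremes (hypothetical data).
values = np.append(rng.normal(loc=7.0, scale=0.6, size=1000), [3.0, 10.5])
df = pd.DataFrame({"cgpa": values})

mean, std = df["cgpa"].mean(), df["cgpa"].std()
lower, upper = mean - 3 * std, mean + 3 * std

is_outlier = (df["cgpa"] < lower) | (df["cgpa"] > upper)

trimmed = df[~is_outlier]                            # trimming: drop outlier rows
capped = df["cgpa"].clip(lower=lower, upper=upper)   # capping: keep rows, limit values

print("rows before:        ", len(df))
print("rows after trimming:", len(trimmed))
print("rows after capping: ", len(capped))
print("capped range:       ", capped.min(), capped.max())
```

Trimming reduces the number of rows, whereas capping keeps every row and only pulls the extreme values back to the limits.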
Percentile Method
• Step-1: Import necessary dependencies
• import numpy as np
• import pandas as pd
• import seaborn as sns
• Step-2: Read and load the dataset
• df = pd.read_csv('weight-height.csv')
• df.sample(5)
• Plot the box plot of the "Height" feature
• sns.boxplot(df['Height'])
• # Copy of the data
• data_ro = df.copy()
• # Calculate Q1, Q3, and IQR for the 'Height' column
• Q1 = data_ro['Height'].quantile(0.25)
• Q3 = data_ro['Height'].quantile(0.75)
• IQR = Q3 - Q1
• # Remove outliers based on the IQR
• data_ro = data_ro[~((data_ro['Height'] < (Q1 - 1.5 * IQR)) | (data_ro['Height'] > (Q3 + 1.5 * IQR)))]
• # Print the shapes of the original and modified DataFrames
• print("Old Shape: ", df.shape)
• print("New Shape: ", data_ro.shape)
The expression ((data_ro['Height'] < (Q1 - 1.5 * IQR)) | (data_ro['Height'] > (Q3 + 1.5 * IQR))) creates a boolean Series where each entry is True if the corresponding Height value is an outlier, and False otherwise. The ~ operator inverts this Series, so only the non-outlier rows are kept.
I. Measuring the Central Tendency

November 2, 2024

• Mean
• Median
• Example
• Mode
• Example: Mode of Grouped Data
• Midrange
• Example
• Symmetric Data
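The measures listed above can be computed with Python's standard library. This is a minimal sketch on a small, made-up list of values; the slide contents themselves are not reproduced here.

```python
import statistics

data = [2, 4, 4, 5, 7, 8]  # hypothetical values for illustration

mean = statistics.mean(data)            # sum of values / number of values
median = statistics.median(data)        # middle value (average of the two middle ones here)
modes = statistics.multimode(data)      # most frequent value(s)
midrange = (min(data) + max(data)) / 2  # average of the smallest and largest value

print("mean:", mean)
print("median:", median)
print("mode(s):", modes)
print("midrange:", midrange)
```

For perfectly symmetric, unimodal data the mean, median, and mode coincide; here they differ slightly because the sample is not symmetric.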
Karl Pearson’s Co-efficient of Skewness
The formula for measuring skewness using Karl Pearson's coefficient is:
Skewness = 3(Mean - Median) / S.D.
Example 1: Find the skewness for the given data (2, 4, 6, 6).
Solution:
Mean of Data = (2 + 4 + 6 + 6) / 4
= 18 / 4
= 4.5
Median of Data = (4 + 6) / 2
= 10 / 2 = 5
S.D. = √[((4.5 - 2)^2 + (4.5 - 4)^2 + (4.5 - 6)^2 + (4.5 - 6)^2) / 4]
= √[(6.25 + 0.25 + 2.25 + 2.25) / 4]
= √2.75
= 1.658
Applying the skewness formula,
Skewness = 3(Mean - Median) / S.D.
= 3(4.5 - 5) / 1.658
= 3(-0.5) / 1.658
Skewness ≈ -0.905, so the skewness of these data is negative.
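The worked example can be checked with a few lines of Python. This sketch assumes the population standard deviation (dividing by n), as in the calculation above; the function name is hypothetical.

```python
import statistics

def pearson_median_skewness(values):
    """Karl Pearson's coefficient: 3 * (mean - median) / standard deviation."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    sd = statistics.pstdev(values)  # population S.D. (divides by n)
    return 3 * (mean - median) / sd

skew = pearson_median_skewness([2, 4, 6, 6])
print(round(skew, 3))
```

The result is roughly -0.90, confirming a negative skew. Using the sample standard deviation (statistics.stdev) instead would give a slightly smaller magnitude.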
Solve Example

• A boy collects the following amounts (in rupees) over a week: (25, 28, 26, 30, 40, 50, 40). Find the skewness of the given data using the skewness formula.
