Lec 7 Data Visualization Basic Statistics Updated 21102024 122008pm

Uploaded by

hasansiddiqui17098

Statistics for data science
Example 1
• Create a new feature called "Family Size" by combining the SibSp (siblings/spouses) and Parch (parents/children) columns. How does family size affect the survival rate? Use a bar plot to visualize the survival rates for the different family sizes.
import pandas as pd
import matplotlib.pyplot as plt

# df is assumed to be the Titanic DataFrame, already loaded

# Step 1: Create the "Family Size" feature (the +1 counts the passenger)
df['Family Size'] = df['SibSp'] + df['Parch'] + 1

# Display the number of passengers for each family size
family_size_counts = df['Family Size'].value_counts().sort_index()
print(family_size_counts)

# Step 2: Calculate the survival rate for each family size
# (mean of the 0/1 'Survived' column; count() would only give group sizes)
family_survival_rate = df.groupby('Family Size')['Survived'].mean()
print(family_survival_rate)

# Step 3: Plotting the survival rates for different family sizes
plt.figure(figsize=(10, 6))
plt.bar(family_survival_rate.index, family_survival_rate.values, color='skyblue')
plt.title('Survival Rate Based on Family Size')
plt.xlabel('Family Size')
plt.ylabel('Survival Rate')
plt.xticks(family_survival_rate.index)  # To show each family size as a tick
plt.show()
import pandas as pd
import matplotlib.pyplot as plt

# Sample DataFrame creation (Replace this with your actual data loading)
# df = pd.read_csv('your_dataset.csv')

# Step 1: Create the "Family Size" feature
df['Family Size'] = df['SibSp'] + df['Parch'] + 1

# Step 2: Calculate the survival rate for each family size.
# .mean() calculates the mean (average) of the "Survived" column for each group.
# Since the values in "Survived" are either 0 or 1, the mean effectively
# represents the proportion of survivors for each family size.
family_survival_rate = df.groupby('Family Size')['Survived'].mean()

# Step 3: Plotting the survival rates for different family sizes
plt.figure(figsize=(10, 6))
plt.bar(family_survival_rate.index, family_survival_rate.values, color='skyblue')
plt.title('Survival Rate Based on Family Size')
plt.xlabel('Family Size')
plt.ylabel('Survival Rate')
plt.xticks(family_survival_rate.index)  # To show each family size as a tick
plt.show()
family_survival_rate = df.groupby('Family Size')['Survived'].count()
counts the number of non-null entries in the "Survived" column for each group (i.e., each family size).

family_survival_rate = df.groupby('Family Size')['Survived'].mean()
calculates the mean (average) of the "Survived" column for each group. Since the values in "Survived" are either 0 or 1, the mean effectively represents the proportion of survivors for each family size.
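The difference between count() and mean() on a 0/1 column can be checked on a tiny example. The sketch below uses a made-up six-row table, not the Titanic data; the names toy, passengers_per_size and survival_rate are illustrative only.

```python
import pandas as pd

# Toy data, made up for illustration (not the Titanic dataset)
toy = pd.DataFrame({
    "Family Size": [1, 1, 1, 2, 2, 3],
    "Survived":    [0, 1, 0, 1, 1, 0],
})

grouped = toy.groupby("Family Size")["Survived"]

# count(): number of non-null rows in each family-size group
passengers_per_size = grouped.count()

# mean(): share of 1s in each group, i.e. the survival rate
survival_rate = grouped.mean()

print(passengers_per_size)
print(survival_rate)
```

For family size 1, count() reports 3 passengers while mean() reports that one in three survived, which is why only mean() belongs on a "Survival Rate" axis.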
Example 2
Fare denotes the fare paid by a passenger. Because the values in this column are continuous, divide Fare into 4 bins to get a clearer picture, then plot the relationship between survival and fare.
# Stacked bars: survivors and non-survivors are stacked within each fare bin
sns.histplot(data=df, x='Fare Bin', hue='Survived', multiple='stack', stat='count', palette='pastel')
# Default layout: the two groups are drawn as overlapping layers
sns.histplot(data=df, x='Fare Bin', hue='Survived', stat='count', palette='pastel')
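The histplot calls above expect a 'Fare Bin' column, but the slide does not show how it is built. One possible way, sketched here on a small synthetic table, is pd.qcut, which cuts at the quartiles so the 4 bins hold about the same number of passengers. The choice of qcut, the bin labels and the sample values are all assumptions; pd.cut would give equal-width bins instead.

```python
import pandas as pd

# Synthetic stand-in for the Titanic columns (values are made up)
df_demo = pd.DataFrame({
    "Fare":     [7.25, 8.05, 7.9, 13.0, 15.5, 26.0, 30.0, 53.1, 71.3, 80.0, 151.6, 263.0],
    "Survived": [0,    0,    1,   0,    0,    1,    0,    1,    1,    1,    1,     1],
})

# Assumed binning: 4 quantile-based bins with hypothetical labels
df_demo["Fare Bin"] = pd.qcut(
    df_demo["Fare"], q=4, labels=["Low", "Mid", "High", "Very high"]
)

# Survival rate within each fare bin (mean of the 0/1 column)
rate_by_bin = df_demo.groupby("Fare Bin", observed=True)["Survived"].mean()
print(rate_by_bin)
```

With a real DataFrame, the same 'Fare Bin' column can be passed as x to the histplot calls.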
• “An Outlier is that observation which is significantly different from
all other observations.”

There are several ways to treat outliers in a dataset, depending on the nature of the outliers and the problem being solved.
Trimming
Trimming excludes the outlier values from the analysis. When the dataset contains many outliers, this leaves noticeably less data. Its main advantage is that it is quick to apply.
• For Normal Distributions
Use the empirical rule of the normal distribution.
Data points that fall below mean - 3*(sigma) or above mean + 3*(sigma) are outliers, where mean and sigma are the average value and standard deviation of a particular column.
• For Skewed Distributions
Use the Inter-Quartile Range (IQR) proximity rule.
Data points that fall below Q1 – 1.5 IQR or above Q3 + 1.5 IQR are outliers, where Q1 and Q3 are the 25th and 75th percentiles of the dataset, respectively. IQR is the inter-quartile range, given by Q3 – Q1.
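The IQR proximity rule can be wrapped in a small helper. This is a minimal sketch: the function name iqr_bounds, the parameter k and the sample values are hypothetical, chosen only to illustrate the rule stated above.

```python
import pandas as pd

def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up right-skewed sample with one extreme value
sample = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 40])

lower, upper = iqr_bounds(sample)

# Points outside the fences are flagged as outliers
outliers = sample[(sample < lower) | (sample > upper)]
print(lower, upper, outliers.tolist())
```

Here Q1 = 2.25 and Q3 = 4, so the fences are -0.375 and 6.625 and only the value 40 is flagged.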
Z-SCORE Method
• Finding the boundary values
print("Highest allowed", df['cgpa'].mean() + 3*df['cgpa'].std())
print("Lowest allowed", df['cgpa'].mean() - 3*df['cgpa'].std())

Output:
Highest allowed 8.808933625397177
Lowest allowed 5.113546374602842

• Step 5: Finding the outliers
df[(df['cgpa'] > 8.80) | (df['cgpa'] < 5.11)]

• Capping on outliers
upper_limit = df['cgpa'].mean() + 3*df['cgpa'].std()
lower_limit = df['cgpa'].mean() - 3*df['cgpa'].std()

• Now, apply the capping
df['cgpa'] = df['cgpa'].clip(lower=lower_limit, upper=upper_limit)
df['cgpa'].describe()
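To compare trimming and capping side by side, the sketch below applies the same mean ± 3*sigma limits to a synthetic series. The data are randomly generated with two planted extremes and are not the lecture's cgpa dataset; the names is_outlier, trimmed and capped are illustrative.

```python
import numpy as np
import pandas as pd

# Synthetic, roughly normal "cgpa" values plus two planted extremes
rng = np.random.default_rng(0)
cgpa = pd.Series(np.append(rng.normal(7.0, 0.6, 1000), [3.0, 10.5]), name="cgpa")

mean, sigma = cgpa.mean(), cgpa.std()
lower_limit = mean - 3 * sigma
upper_limit = mean + 3 * sigma

# True where a value lies outside the 3-sigma limits
is_outlier = (cgpa < lower_limit) | (cgpa > upper_limit)

# Trimming: drop the outlier rows
trimmed = cgpa[~is_outlier]

# Capping: keep every row, but pull extreme values back to the limits
capped = cgpa.clip(lower=lower_limit, upper=upper_limit)

print(len(cgpa), len(trimmed), len(capped))
```

Trimming shortens the series, while capping keeps its length and only changes the extreme values.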
Percentile Method
• Step-1: Import necessary dependencies
• import numpy as np
• import pandas as pd
• Step-2: Read and Load the dataset
• df = pd.read_csv('weight-height.csv')
• df.sample(5)
• Plot the box-plot of the "Height" feature
import seaborn as sns
sns.boxplot(x=df['Height'])

• Remove the outliers
# Copy of the data
data_ro = df.copy()
# Calculate Q1, Q3, and IQR for the 'Height' column
Q1 = data_ro['Height'].quantile(0.25)
Q3 = data_ro['Height'].quantile(0.75)
IQR = Q3 - Q1
# Boolean Series: True if the corresponding Height value is an outlier, False otherwise
outlier_mask = (data_ro['Height'] < (Q1 - 1.5 * IQR)) | (data_ro['Height'] > (Q3 + 1.5 * IQR))
# Remove outliers based on the IQR (~ keeps the rows where the mask is False)
data_ro = data_ro[~outlier_mask]
# Print the shapes of the original and modified DataFrames
print("Old Shape: ", df.shape)
print("New Shape: ", data_ro.shape)
I. Measuring the Central Tendency

November 2, 2024 22
Mean
Median
Example
Mode
Example: mode of Grouped Data
Midrange
Example
Symmetric Data
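The slide bodies for these measures are not available in this text, so the following is only a sketch using the standard definitions of mean, median, mode and midrange. The sample values are made up for illustration.

```python
import pandas as pd

# Made-up sample for illustration
data = pd.Series([2, 4, 4, 5, 7, 9, 11])

mean = data.mean()                         # sum of the values / number of values
median = data.median()                     # middle value of the sorted data
modes = data.mode().tolist()               # most frequent value(s); can be more than one
midrange = (data.min() + data.max()) / 2   # average of the smallest and largest values

print(mean, median, modes, midrange)
```

For this sample the mean (6.0) is larger than the median (5.0), which hints at a slightly right-skewed distribution.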
Karl Pearson’s Co-efficient of Skewness
Karl Pearson's coefficient measures skewness as:
Skewness = 3(Mean – Median)/S.D.
Example 1: Find the skewness for the given data (2, 4, 6, 6).
Solution:
Mean of Data = (2 + 4 + 6 + 6) / 4
= 18 / 4
= 4.5
Median of Data = (4 + 6) / 2
= 10 / 2 = 5
S.D. = √[((4.5 – 2)² + (4.5 – 4)² + (4.5 – 6)² + (4.5 – 6)²) / 4]
= √[(6.25 + 0.25 + 2.25 + 2.25) / 4]
= √2.75
= 1.658
Skewness = 3(Mean – Median)/S.D.
By applying the skewness formula,
Skewness = 3(4.5 – 5)/1.658
= 3(–0.5)/1.658
Skewness = –0.905, so the skewness of this data is negative.
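The worked example can be checked with a few lines of Python. This is a minimal sketch using the standard-library statistics module; the function name pearson_median_skewness is hypothetical, and the standard deviation divides by n (population form) to match the calculation above.

```python
import statistics

def pearson_median_skewness(values):
    """Karl Pearson's coefficient: 3 * (mean - median) / population S.D."""
    mean = statistics.fmean(values)
    median = statistics.median(values)
    sd = statistics.pstdev(values)  # divides by n, as in the worked example
    return 3 * (mean - median) / sd

skew = pearson_median_skewness([2, 4, 6, 6])
print(round(skew, 3))
```

The same function can be used to check a hand calculation on any other list of values.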
Solve Example

• A boy collects the following amounts (in rupees) over a week: 25, 28, 26, 30, 40, 50, 40. Find the skewness of this data using the skewness formula.
