Assignment 1

This document outlines an assignment focused on exploratory data analysis (EDA) using the Titanic dataset, which includes passenger details for predicting survival. It discusses measures of central tendency, variability, and outlier identification techniques, emphasizing the importance of data preprocessing and feature engineering. The conclusion highlights insights into factors influencing survival and the role of predictive modeling in determining survival probabilities.

Uploaded by

apurva.kondekar6

ASSIGNMENT NO. 1

AIM: To perform exploratory data analysis (EDA) on the Titanic dataset.

PREREQUISITE: Statistics and Python programming

THEORY:

The Titanic dataset consists of passenger details such as age, gender, ticket class, and
survival status. The classification task involves predicting whether a passenger survived
based on these features.

Exploratory data analysis, or “EDA”, is a critical first step in analyzing the data from an
experiment. Here are the main reasons we use EDA:
• detection of mistakes
• checking of assumptions
• preliminary selection of appropriate models
• determining relationships among the explanatory variables, and
• assessing the direction and rough size of relationships between explanatory and outcome
variables.
Loosely speaking, any method of looking at data that does not include formal statistical
modeling and inference falls under the term exploratory data analysis.
Measures of Central Tendency

The central tendency or “location” of a distribution has to do with typical or middle values. The
common, useful measures of central tendency are the statistics called (arithmetic) mean, median,
and sometimes mode.
Other means, such as geometric, harmonic, truncated, or Winsorized means, are occasionally used
as measures of centrality. While most authors use the term “average” as a synonym for the
arithmetic mean, some use average in a broader sense to also include geometric, harmonic, and
other means.
Assuming that we have n data values labeled x1 through xn, the formula for calculating the
sample (arithmetic) mean is

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
The median is another measure of central tendency. The sample median is the middle value after
all of the values are put in an ordered list. If there are an even number of values, take the average
of the two middle values.
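
The two definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the assignment: the `ages` list holds made-up values, and `sample_mean` and `sample_median` are hypothetical helper names. The standard library's `statistics` module provides equivalent functions.

```python
from statistics import mode

# Hypothetical passenger ages (illustrative values, not taken from the Titanic data).
ages = [22, 38, 26, 35, 35, 54, 2, 27]


def sample_mean(values):
    """Arithmetic mean: (x1 + ... + xn) / n."""
    return sum(values) / len(values)


def sample_median(values):
    """Middle value of the ordered list; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


print("mean:  ", sample_mean(ages))
print("median:", sample_median(ages))
print("mode:  ", mode(ages))  # most frequent value
```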
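Comparing the mean, median and mode of each feature in the Titanic dataset indicates how skewed
that feature is: in a right-skewed feature the mean is typically pulled above the median, and in
a left-skewed feature it is typically pulled below it.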
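
As a rough sketch of this comparison, the snippet below labels a feature by the sign of (mean − median). It is only a rule of thumb, not a formal skewness test. The `fares` values are invented for illustration and `skew_direction` is a hypothetical helper name.

```python
from statistics import mean, median


def skew_direction(values):
    """Rule-of-thumb skew label based on comparing the mean with the median."""
    m, med = mean(values), median(values)
    if m > med:
        return "right"      # long tail of large values pulls the mean up
    if m < med:
        return "left"       # long tail of small values pulls the mean down
    return "symmetric"


# Hypothetical ticket fares with one very large value (illustrative only).
fares = [7.25, 7.9, 8.05, 13.0, 26.0, 71.3, 512.3]
print(skew_direction(fares))
```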

The dataset has some missing values, which were replaced by the mean, median or mode, depending
on the data type of the feature and on whether the feature is skewed.
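
A minimal sketch of this kind of imputation is given below. The assignment does not state which statistic was applied to which feature, so the mapping used here (median for a skewed numeric feature, mode for a categorical one) is an assumption based on common practice. The sample values and the helper `fill_missing` are hypothetical.

```python
from statistics import mean, median, mode

# Map a strategy name to the statistic that supplies the fill value.
STATISTICS = {"mean": mean, "median": median, "mode": mode}


def fill_missing(values, strategy):
    """Replace None entries with the chosen statistic of the observed values."""
    observed = [v for v in values if v is not None]
    fill_value = STATISTICS[strategy](observed)
    return [fill_value if v is None else v for v in values]


# Hypothetical columns with gaps (illustrative values only).
ages = [22, None, 26, 35, None, 80]        # numeric and skewed: use the median
ports = ["S", "C", None, "S", "Q"]         # categorical: use the mode

ages_filled = fill_missing(ages, "median")
ports_filled = fill_missing(ports, "mode")
print(ages_filled)
print(ports_filled)
```

The median is less sensitive than the mean to extreme values, which is why it is often preferred for skewed numeric features.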
Measures of Variability
Spread
Several statistics are commonly used as a measure of the spread of a distribution, including
variance, standard deviation, and interquartile range. Spread is an indicator of how far away from
the center we are still likely to find data values.
The variance is a standard measure of spread. It is calculated for a list of numbers, e.g., the n
observations of a particular measurement labeled x1 through xn, based on the n sample
deviations (or just “deviations”). For any data value xi, the corresponding deviation is
(xi − x̄), which is the signed (− for lower and + for higher) distance of the data value from the
mean of all of the n data values. It is not hard to prove that the sum of all of the deviations
of a sample is zero. The variance of a population is defined as the mean squared deviation.
The sample formula for the variance of observed data conventionally has n − 1 in the denominator
instead of n to achieve the property of “unbiasedness”, which roughly means that when
calculated for many different random samples from the same population, the average should
match the corresponding population quantity (here, σ²). The most commonly used symbol for
sample variance is s², and the formula is

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

which is essentially the average of the squared deviations, except for dividing by n − 1 instead of
n. This is a measure of spread, because the bigger the deviations from the mean, the bigger the
variance gets.
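
The following sketch works through the variance calculation step by step on a small made-up sample. The helper names `deviations` and `sample_variance` are hypothetical; `statistics.variance` from the standard library uses the same n − 1 denominator and is printed for comparison.

```python
from statistics import variance


def deviations(values):
    """Signed distances (xi - x_bar) of each value from the sample mean."""
    x_bar = sum(values) / len(values)
    return [x - x_bar for x in values]


def sample_variance(values):
    """Sum of squared deviations divided by n - 1."""
    devs = deviations(values)
    return sum(d * d for d in devs) / (len(values) - 1)


data = [2, 4, 4, 4, 5, 5, 7, 9]   # illustrative values, mean is 5

print("sum of deviations:", sum(deviations(data)))   # zero, as stated in the text
print("sample variance:  ", sample_variance(data))
print("statistics module:", variance(data))
```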
The standard deviation is simply the square root of the variance. Therefore it has the same units
as the original data, which helps make it more interpretable. The sample standard deviation is
usually represented by the symbol s. For a theoretical Gaussian distribution, the mean plus or
minus 1, 2 or 3 standard deviations holds 68.3%, 95.4% and 99.7% of the probability respectively,
and this should be approximately true for real data from a Normal distribution.
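
The percentages quoted above can be checked numerically. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the sample `data` is made up and `coverage` is a hypothetical helper name.

```python
import math
from statistics import NormalDist, stdev, variance

data = [2, 4, 4, 4, 5, 5, 7, 9]   # illustrative values

# The sample standard deviation is the square root of the sample variance.
s = stdev(data)
print("s =", s, " sqrt(s^2) =", math.sqrt(variance(data)))


def coverage(k):
    """Probability that a Gaussian value lies within k standard deviations of its mean."""
    standard_normal = NormalDist()   # mean 0, standard deviation 1
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} sd: {coverage(k):.1%}")
```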
A third measure of spread is the interquartile range. To define IQR, we first need to define the
concepts of quartiles. The quartiles of a population or a sample are the three values which divide
the distribution or observed data into even fourths. So one quarter of the data fall below the first
quartile, usually written Q1; one half fall below the second quartile (Q2); and three fourths fall
below the third quartile (Q3). The astute reader will realize that half of the values fall above Q2,
one quarter fall above Q3, and also that Q2 is a synonym for the median. Once the quartiles are
defined, it is easy to define the IQR as IQR = Q3 − Q1. By definition, half of the values (and
specifically the middle half) fall within an interval whose width equals the IQR. If the data are
more spread out, then the IQR tends to increase, and vice versa.
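
A short sketch of the quartile and IQR calculation follows. Several conventions exist for interpolating quartiles in small samples, so different tools can give slightly different Q1 and Q3 values. The choice of `method="inclusive"` in `statistics.quantiles` is an assumption made here for illustration, and the data are made up.

```python
from statistics import median, quantiles

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]   # illustrative values

# quantiles(..., n=4) returns the three cut points Q1, Q2, Q3.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print("Q1, Q2, Q3:", q1, q2, q3)
print("IQR:", iqr)
print("Q2 equals the median:", q2 == median(data))
```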
Outlier Identification
Boxplots
Another very useful univariate graphical technique is the boxplot. The boxplot will be
described here in its vertical format, which is the most common, but a horizontal format also is
possible. An example of a boxplot is shown in the following figure, which represents the data in
EDA1.dat.

[Figure: example boxplot of the data in EDA1.dat]

Here you can see that the boxplot consists of a rectangular box bounded above and below by
“hinges” that represent the quartiles Q3 and Q1, respectively, and with a horizontal “median”
line through it. You can also see the upper and lower “whiskers”, and a point marking an
“outlier”. The vertical axis is in the units of the quantitative variable.
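
The text does not say how a boxplot decides which points to mark as outliers. A widely used convention, assumed here, is Tukey's rule: points lying more than 1.5 × IQR below Q1 or above Q3 are flagged. The sketch below applies that rule to made-up data; `find_outliers` is a hypothetical helper name and the quartile method is the same illustrative choice as before.

```python
from statistics import quantiles


def find_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule, assumed)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]   # illustrative values with one extreme point
print(find_outliers(data))
```

Flagged points are candidates for inspection rather than automatic removal, since an extreme value may be a genuine observation.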
The box plot for the Titanic dataset looks as follows:

[Figure: box plot for the Titanic dataset]

Other plots, such as histograms, heatmaps, pair plots, bar plots and violin plots, were also
constructed for the different features and gave further insights into the dataset.

REFERENCES:

1. Coursera Course on “What is Data Science?” offered by IBM. Available at
https://www.coursera.org/learn/what-is-datascience?specialization=ibm-data-science

2. Getting Started with Business Analytics: Insightful Decision-Making, David Roi Hardoon,
Galit Shmueli, CRC Press

3. https://medium.com/@ugursavci/complete-exploratory-data-analysis-using-python-9f685d67d1e4

4. https://medium.com/data-and-beyond/mastering-exploratory-data-analysis-eda-everything-you-need-to-know-7e3b48d63a95

CONCLUSION:

The analysis provides insights into key factors influencing survival, such as passenger class,
gender, and age. The predictive model can help determine survival probability based on these
factors. The results highlight the importance of feature engineering and data preprocessing in
improving model accuracy.
