
28 Oct EDA Notes

This document discusses various topics related to mathematics for data science including exploratory data analysis, the life cycle of a data science project, and feature engineering. It provides details on exploring datasets through plots and graphs, handling missing values and outliers, and converting categorical features to numerical values. The purpose of exploratory data analysis is explained as gaining insights from a dataset to determine if predictive models are feasible.

Uploaded by

Prachi kasved

HDSC501

Mathematics for Data Science
-DEEPALI P. KADAM
ASST. PROFESSOR
INFORMATION TECHNOLOGY
DATTA MEGHE COLLEGE OF ENGINEERING, AIROLI
Exploratory Data Analysis
1. Need of exploratory data analysis
2. Cleaning and preparing data
3. Feature engineering
4. Missing values
5. Understand dataset through various plots and
graphs
6. Draw conclusions
7. Deciding appropriate machine learning models.
Life Cycle of Data Science Project
1. Feature Engineering
2. Feature Selection
3. Model Creation
4. Hyperparameter tuning
5. Model Deployment
6. Incremental Learning
Feature Engineering
1. Exploratory Data Analysis
2. Handling the missing values
3. Handling Imbalanced dataset
4. Treating the outliers
5. Scaling the data:
i. Standardization
ii. Normalization
6. Converting the categorical features into
numerical features
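The two scaling options in step 5 can be sketched as follows. This is a minimal illustration using only the Python standard library; the function names and the sample values are assumptions made for the example, not part of the original slides. Standardization is taken here to mean z-score scaling and normalization to mean min-max scaling, which is the usual reading of these terms.

```python
from statistics import mean, pstdev

def standardize(values):
    """Z-score scaling: subtract the mean, divide by the population standard deviation.
    Assumes the values are not all identical (otherwise the deviation is zero)."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

def normalize(values):
    """Min-max scaling: map the smallest value to 0 and the largest to 1.
    Assumes max(values) > min(values)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical sample, used only for illustration.
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]

z_scores = standardize(heights_cm)   # centred on 0 with unit spread
scaled = normalize(heights_cm)       # squeezed into the range [0, 1]
```

After standardization the values have mean 0 and standard deviation 1; after normalization they lie between 0 and 1, so the choice depends on whether the model cares about spread or about a fixed range.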
Need of Exploratory Data Analysis
1. Exploratory data analysis (EDA) involves using statistics and
visualizations to analyze and identify trends in data sets.
2. The primary intent of EDA is to determine whether a
predictive model is a feasible analytical tool for business
challenges or not. 
3. EDA helps data scientists gain an understanding of the data
set beyond the formal modeling or hypothesis testing task.
4. Exploratory data analysis is essential for any research
analysis, so as to gain insights into a data set. 
5. These notes look at the importance, purpose, and objectives
of exploratory data analysis, and at what an analyst would
want to extract from a data set.
Exploratory Data Analysis
1. Analyze how many numerical features are
present, using histograms and PDF (probability
density function) plots with Seaborn, matplotlib, c bond.
2. Analyze how many categorical/discrete features
are present. Are multiple categories present for
each feature?
3. Missing values (visualize all these graphs)
4. Outliers – BoxPlot (Sem 6)
5. Cleaning
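The first four checks above can be sketched on a toy table. The rows, column names, and helper functions below are hypothetical and exist only for illustration. Outliers are flagged with the rule a box plot normally uses (points more than 1.5 times the interquartile range beyond the quartiles); quartile conventions differ slightly between tools, so the exact fences may vary.

```python
from statistics import quantiles

# Hypothetical toy dataset; None marks a missing value.
rows = [
    {"age": 23, "income": 31000, "colour": "red"},
    {"age": 35, "income": None, "colour": "blue"},
    {"age": 41, "income": 52000, "colour": None},
    {"age": 29, "income": 38000, "colour": "red"},
    {"age": 52, "income": 61000, "colour": "green"},
    {"age": 38, "income": 45000, "colour": "blue"},
    {"age": 47, "income": 400000, "colour": "red"},
    {"age": 31, "income": 41000, "colour": "green"},
]
columns = list(rows[0])

def column(name):
    """Return the non-missing values of one column."""
    return [r[name] for r in rows if r[name] is not None]

# Steps 1 and 2: split features into numerical and categorical.
numeric_cols = [c for c in columns
                if all(isinstance(v, (int, float)) for v in column(c))]
categorical_cols = [c for c in columns if c not in numeric_cols]

# Step 3: count missing values per feature.
missing_counts = {c: sum(r[c] is None for r in rows) for c in columns}

# Step 4: flag outliers with the 1.5 * IQR rule used by box plots.
def boxplot_outliers(values):
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]

income_outliers = boxplot_outliers(column("income"))
```

In this toy table the single very large income is the only value that falls outside the box-plot fences.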
Histogram
• https://corporatefinanceinstitute.com/resources/excel/study/histogram/
• Data binning/bucketing: Group a number of more
or less continuous values into a smaller number
of bins/buckets.
– Equal Frequency Binning/ Equal Width Binning
• Reduces the chance of overfitting (especially for
small datasets).
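The difference between the two binning schemes can be sketched as below. The function names and the sample ages are assumptions for illustration; in practice a library routine such as pandas `cut` (equal width) or `qcut` (equal frequency) would normally be used instead.

```python
def equal_width_bins(values, n_bins):
    """Split the range [min, max] into n_bins intervals of equal width.
    Assumes max(values) > min(values)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # The maximum would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def equal_frequency_bins(values, n_bins):
    """Assign bins by rank so each bin holds roughly the same number of values.
    Tied values may be split across neighbouring bins in this simple sketch."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins

# Hypothetical sample: mostly young ages plus a few large values.
ages = [18, 19, 21, 22, 25, 30, 41, 58, 90]

width_bins = equal_width_bins(ages, 3)      # [0, 0, 0, 0, 0, 0, 0, 1, 2]
freq_bins = equal_frequency_bins(ages, 3)   # [0, 0, 0, 1, 1, 1, 2, 2, 2]
```

With skewed data, equal-width binning crowds most values into the first bin, while equal-frequency binning keeps the bins balanced.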
Categorical Features/ Discrete Features
– Age
– Sentiment Analysis/ Opinion Mining: (8th Sem)
– Colour
– Types of machine : heavy/light

Convert into numeric
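Two common ways to convert categories into numbers are sketched below, using the colour and machine-type examples from the list above. The function names, the sample values, and the assumed ordering light < heavy are illustrative assumptions. One-hot encoding suits categories with no natural order, and ordinal encoding suits categories that can be ranked.

```python
def one_hot(values):
    """One column per category; a 1 marks the category of each row."""
    categories = sorted(set(values))
    matrix = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, matrix

def ordinal_encode(values, order):
    """Replace each category by its position in a chosen ordering."""
    mapping = {category: index for index, category in enumerate(order)}
    return [mapping[v] for v in values]

# Hypothetical samples based on the feature examples in these notes.
colours = ["red", "blue", "green", "blue"]
machine_types = ["light", "heavy", "light"]

categories, colour_matrix = one_hot(colours)
machine_codes = ordinal_encode(machine_types, order=["light", "heavy"])
```

Here `categories` is `['blue', 'green', 'red']`, so the first row (red) becomes `[0, 0, 1]`, and the machine types become `[0, 1, 0]`.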


Types of missing data
• Missing data are errors because your data don’t
represent the true values of what you set out to
measure.
• The reason for the missing data is important to
consider, because it helps you determine the
type of missing data and what you need to do
about it.
There are three main types of missing
data:
MCAR
• Missing completely at random
• When data are missing completely at random (MCAR),
the probability of any particular value being missing
from your dataset is unrelated to anything else.
• The missing values are randomly distributed, so they
can come from anywhere in the whole distribution of
your values. These MCAR data are also unrelated to
other unobserved variables.
MAR
• Missing at random
• Data missing at random (MAR) are not actually
missing at random; this term is a bit of a misnomer.
• This type of missing data systematically differs from
the data you’ve collected, but it can be fully
accounted for by other observed variables.
• The likelihood of a data point being missing is
related to another observed variable but not to the
specific value of that data point itself.
MNAR
• Missing not at random
• Data missing not at random (MNAR) are
missing for reasons related to the values
themselves.
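The three mechanisms can be compared with a small simulation. Everything below is an illustrative assumption: the synthetic age and income columns, the missingness probabilities, and the income threshold are invented for the sketch. Income is made missing in three different ways, and the mean of the values that remain is compared with the true mean.

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 2000

# Synthetic data: income rises with age, plus random noise.
ages = [rng.randint(20, 60) for _ in range(n)]
incomes = [20000 + 800 * a + rng.gauss(0, 5000) for a in ages]

def mask(prob_missing):
    """Replace each income by None with the given per-row probability."""
    return [None if rng.random() < p else inc
            for p, inc in zip(prob_missing, incomes)]

# MCAR: every value has the same chance of being missing.
mcar = mask([0.2] * n)
# MAR: missingness depends on age (observed), not on income itself.
mar = mask([0.6 if a < 30 else 0.05 for a in ages])
# MNAR: missingness depends on the income value that goes missing.
mnar = mask([0.6 if inc > 60000 else 0.05 for inc in incomes])

def observed_mean(col):
    """Mean of the values that are still present."""
    return mean(v for v in col if v is not None)

true_mean = mean(incomes)
mcar_mean = observed_mean(mcar)   # close to the true mean
mar_mean = observed_mean(mar)     # too high: young, lower-income rows are lost
mnar_mean = observed_mean(mnar)   # too low: high incomes are lost
```

Under MCAR the observed mean stays close to the true mean. Under MAR it is biased, but the bias can be explained by the observed age column. Under MNAR the bias comes from the missing values themselves, so it cannot be detected from the observed data alone.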

• REFER
• https://www.scribbr.com/statistics/missing-data/
Data are cleaned and prepared during preprocessing, before they
are given as input to the machine learning algorithm.

• https://www.v7labs.com/blog/data-preprocessing-guide
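One simple cleaning step is to fill missing values before modelling. The sketch below assumes median imputation for a numerical column and most-frequent-value imputation for a categorical column; the column contents and the helper name are hypothetical. Simple fills like these ignore why the data are missing, so the type of missing data should be considered first.

```python
from statistics import median, mode

def impute(values, fill):
    """Return a copy of the column with every None replaced by fill."""
    return [fill if v is None else v for v in values]

# Hypothetical columns; None marks a missing value.
incomes = [31000, None, 52000, 38000, None, 45000]
colours = ["red", "blue", None, "red", "green", "red"]

# Numerical feature: fill with the median of the observed values.
observed_incomes = [v for v in incomes if v is not None]
incomes_filled = impute(incomes, median(observed_incomes))

# Categorical feature: fill with the most frequent observed category.
observed_colours = [v for v in colours if v is not None]
colours_filled = impute(colours, mode(observed_colours))
```

The median is used rather than the mean because it is less affected by outliers in the observed values.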
12 Data Plot Types for Visualisation
from Concept to Code

• https://www.analyticsvidhya.com/blog/2021/12/12-data-plot-types-for-visualization/
• https://towardsdatascience.com/11-dimensionality-reduction-techniques-you-should-know-in-2021-dcb9500d388b
The parameters you would take into consideration while deciding which machine learning algorithms to use:

• https://towardsdatascience.com/considerations-when-choosing-a-machine-learning-model-aa31f52c27f3
