2 Eda


Roadmap of ML Journey

Exploratory Data Analysis

• EDA is typically performed as an initial step before data preprocessing and model
building.
• Goal:
• The primary goal of EDA is to visually and statistically explore the dataset to understand the data
and discover patterns, relationships, and potential insights in the dataset.
• Output:
• The primary output of EDA is knowledge and insights about the dataset, including visualizations,
summary statistics, and potential hypotheses.
Exploratory Data Analysis (EDA)

• EDA involves the following activities:
• Univariate, bivariate, and multivariate analysis: examining one, two, or several variables at a
time to understand patterns, trends, and relationships within a dataset.
• Exploring correlations between variables.
• Identifying outliers.
• Data visualization: Creating plots, charts, and graphs to visualize data distributions, relationships,
and patterns.
• Summary statistics: Calculating basic statistics like mean, median, standard deviation, and
quartiles to describe the data.
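The summary statistics listed above can be computed with Python's standard library. This is a minimal sketch: the `values` list is made up for illustration, and the "inclusive" quartile method is one of several conventions.

```python
import statistics

# Hypothetical measurements, made up for illustration only.
values = [4.9, 5.1, 5.4, 5.8, 6.0, 6.3, 6.7, 7.1]

mean = statistics.mean(values)
median = statistics.median(values)
std_dev = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)

# Quartiles: n=4 splits the data into four parts and returns Q1, Q2, Q3.
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")

print(f"mean={mean:.3f} median={median:.3f} std={std_dev:.3f}")
print(f"Q1={q1:.3f} Q2={q2:.3f} Q3={q3:.3f}")
```

Q2 coincides with the median, which is a quick sanity check on any quartile calculation.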
EDA-Uni, Bi, Multivariate Analysis

• Univariate Analysis
• Univariate analysis involves the examination of a single variable, focusing on its distribution and
summary statistics.
• We also check whether that single feature is enough to separate the classes. E.g., in the Iris
data set, analysing how well sepal length alone distinguishes the species.
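One rough way to check whether a single feature separates classes is to compare the range of that feature within each class. The sketch below uses made-up numbers (not real Iris measurements), and the names `samples` and `ranges_overlap` are hypothetical.

```python
# Made-up values of one feature, grouped by class label (illustration only).
samples = {
    "class_a": [1.2, 1.4, 1.5, 1.3],
    "class_b": [4.1, 4.5, 3.9, 4.7],
    "class_c": [4.6, 5.8, 5.1, 6.0],
}

# Per-class (min, max) of the feature.
ranges = {label: (min(vals), max(vals)) for label, vals in samples.items()}


def ranges_overlap(r1, r2):
    """Return True if two closed intervals (lo, hi) share at least one point."""
    return r1[0] <= r2[1] and r2[0] <= r1[1]


labels = list(ranges)
for i, first in enumerate(labels):
    for second in labels[i + 1:]:
        status = "overlap" if ranges_overlap(ranges[first], ranges[second]) else "separable"
        print(f"{first} vs {second}: {status}")
```

In this toy data the feature cleanly isolates one class while the other two overlap, which is the kind of finding univariate analysis is meant to surface.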

Iris data set columns: sepal_length, sepal_width, petal_length, petal_width, species (setosa, virginica, versicolor)


EDA-Uni, Bi, Multivariate Analysis

• Bivariate Analysis
• Bivariate analysis involves the examination of the relationship between two variables.
• We also check whether two features together can separate the classes. E.g., in the Iris data set,
analysing how petal length and sepal width jointly help in classification.



EDA-Uni, Bi, Multivariate Analysis

• Multivariate Analysis
• Multivariate analysis involves the examination of more than two variables simultaneously.



EDA- Exploring correlations between variables

• Correlation Coefficient
• The correlation coefficient is a statistical measure that quantifies the strength and direction of a
linear relationship between two variables.
• A correlation coefficient greater than zero indicates a positive relationship, while a value less
than zero signifies a negative relationship.
• A value close to zero indicates a weak or absent linear relationship between the two variables
being compared; a nonlinear relationship may still exist.
EDA- Exploring correlations between variables

• What Is Meant by Linear Correlation?


• The correlation coefficient is a value between -1 and +1.
• A correlation coefficient of +1 indicates a perfect positive correlation. As variable x increases,
variable y increases. As variable x decreases, variable y decreases.
• A correlation coefficient of -1 indicates a perfect negative correlation. As variable x increases,
variable y decreases. As variable x decreases, variable y increases.
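The slides do not name a specific coefficient. The sketch below assumes the Pearson correlation coefficient, which is the usual measure of linear relationship, and uses made-up data to show the +1, -1, and near-zero cases. The function name `pearson_r` is hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of deviations (numerator) and sums of squared deviations.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


x = [1, 2, 3, 4, 5]
y_up = [2 * v + 1 for v in x]     # y rises linearly with x
y_down = [10 - 3 * v for v in x]  # y falls linearly with x

r_up = pearson_r(x, y_up)
r_down = pearson_r(x, y_down)

# A symmetric nonlinear relationship: y depends on x, yet r is zero.
x_sym = [-2, -1, 0, 1, 2]
y_sq = [v ** 2 for v in x_sym]
r_sq = pearson_r(x_sym, y_sq)

print(r_up, r_down, r_sq)
```

The last case is a reminder that a coefficient near zero only rules out a linear relationship, not every relationship.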
EDA- Exploring correlations between variables

• What Is Considered a Strong Correlation Coefficient?


• Generally, the closer a correlation coefficient is to 1.0 (or -1.0), the stronger the relationship
between the two variables is said to be.
• While there is no clear boundary to what makes a "strong" correlation, a coefficient above 0.75
(or below -0.75) is often considered a high degree of correlation, while one between -0.3 and 0.3
is usually taken as a sign of weak or no correlation.
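The rule of thumb above can be written as a small helper. The 0.75 and 0.3 cut-offs come from the text. The "moderate" label for values in between, and the function name `correlation_strength`, are assumptions added for illustration.

```python
def correlation_strength(r):
    """Label a correlation coefficient using the rule-of-thumb thresholds."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation coefficient must lie in [-1, 1]")
    magnitude = abs(r)
    if magnitude > 0.75:
        return "strong"
    if magnitude <= 0.3:
        return "weak or none"
    # Assumption: the text does not name this middle band.
    return "moderate"


for r in (0.9, -0.8, 0.5, 0.1, -0.2):
    print(r, correlation_strength(r))
```

These bands are conventions rather than fixed rules, so the thresholds are worth adjusting to the domain.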
EDA- Exploring correlations between variables

• Multicollinearity
• Multicollinearity is a statistical phenomenon that occurs when two or more independent
variables are highly correlated with each other. In other words, multicollinearity indicates a
strong linear relationship among the independent variables.
• Multicollinearity mainly affects models that assume the predictor variables are not strongly
linearly related, such as linear regression, where it makes coefficient estimates unstable.
• Highly correlated features provide redundant information about the target variable. Keeping
both features may not add much value to the model, and choosing one might be more efficient.
• Including highly correlated features can make it challenging to interpret the individual
contribution of each feature to the model. This is because changes in one highly correlated
variable are often associated with changes in the other.
EDA- Outliers

• What is an Outlier?
• An outlier is an observation that significantly deviates from the rest of the data points in a
dataset.
• It is a data point that lies an abnormal distance away from other values in a random sample from
a population.
• The presence of outliers can substantially impact statistical analysis, as they can skew results and
lead to incorrect conclusions.
EDA- Outliers

• How do we detect outliers?


• Domain Knowledge: Understanding the domain and context of the data is crucial. In some cases,
data points that statistical methods would flag as outliers are actually valid observations based
on domain knowledge.
• Visual Inspection: Box plots are graphical representations that display the distribution of data,
including the presence of outliers. Points beyond the whiskers of the box plot are often
considered potential outliers.
• Visual Inspection: Visualizing the data using scatter plots can reveal points far from the data’s
main cluster. Outliers may appear as individual points with a significant deviation from the
general pattern.
• Statistical Methods: The interquartile range (IQR) is the distance between the first quartile (Q1)
and the third quartile (Q3). Data points below the lower bound or above the upper bound are
considered outliers.
IQR = Q3 − Q1
Upper Bound = Q3 + 1.5 × IQR
Lower Bound = Q1 − 1.5 × IQR
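The IQR rule can be applied directly with the standard library. This is a minimal sketch on made-up data. The "inclusive" quartile method is an assumption, and other quartile conventions give slightly different bounds.

```python
import statistics

# Made-up values for illustration; 102 is deliberately extreme.
data = [10, 11, 12, 12, 12, 13, 13, 14, 15, 102]

# Q1 and Q3 (the middle value returned is the median and is not needed here).
q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")

iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Anything outside [lower_bound, upper_bound] is flagged as a potential outlier.
outliers = [v for v in data if v < lower_bound or v > upper_bound]

print(f"Q1={q1} Q3={q3} IQR={iqr}")
print(f"bounds=({lower_bound}, {upper_bound}) outliers={outliers}")
```

Flagged points are only candidates: whether to keep, correct, or drop them still depends on domain knowledge.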
EDA- Data Visualization

• Box Plot
• Box plots provide a visual summary of a distribution, including the median, quartiles, and range.
• Outliers can be spotted as individual points outside the whiskers.
EDA- Data Visualization

• Scatter Plots
• Scatter plots visualize relationships between two continuous variables.
• They help identify patterns, trends, and correlations between variables. Clusters of points may
suggest subgroups in the data.
• They also show whether a correlation is positive, negative, or nonexistent.
EDA- Data Visualization

• Line Charts
• Line charts show how a variable changes over time.
• They help identify trends, seasonality, or cyclical patterns in time series data.
EDA- Data Visualization

• Histograms
• Histograms provide a visual representation of the distribution of a single variable.
• They help identify the data's central tendency, spread, and shape, including whether it follows a
normal distribution or exhibits skewness.
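Behind every histogram is a simple count of values per bin. The sketch below computes those counts by hand on made-up data and prints a text histogram. The helper name `histogram_counts` and the choice of four equal-width bins are assumptions.

```python
import statistics


def histogram_counts(values, bins):
    """Count values in `bins` equal-width bins spanning min(values)..max(values)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins  # assumes hi > lo
    counts = [0] * bins
    for v in values:
        # The maximum value would land in bin index `bins`; clamp it into the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(bins + 1)]
    return edges, counts


# Made-up data with a long right tail.
values = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]
edges, counts = histogram_counts(values, bins=4)

for left, right, count in zip(edges, edges[1:], counts):
    print(f"[{left:4.1f}, {right:4.1f}) {'#' * count}")

# Rough skew hint: a mean well above the median suggests a right-skewed shape.
print("mean:", statistics.mean(values), "median:", statistics.median(values))
```

The mean-versus-median comparison is only a quick heuristic; the shape of the bars is the more reliable guide.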
