2 EDA
• EDA is typically performed as an initial step before data preprocessing and model
building.
• Goal:
• The primary goal of EDA is to explore the dataset visually and statistically in order to understand
it and to discover patterns, relationships, and potential insights.
• Output:
• The primary output of EDA is knowledge about the dataset, captured as visualizations,
summary statistics, and potential hypotheses.
Exploratory Data Analysis (EDA)
• Univariate Analysis
• Univariate analysis involves the examination of a single variable, focusing on its distribution and
summary statistics.
• We also check whether that single feature is enough to separate the classes. E.g., in the Iris
dataset, we analyse how well sepal length alone distinguishes the species.
• Bivariate Analysis
• Bivariate analysis involves the examination of the relationship between two variables.
• We also check whether a pair of features can separate the classes. E.g., in the Iris dataset, we
analyse how well petal length and sepal width together distinguish the species.
• Multivariate Analysis
• Multivariate analysis involves the examination of more than two variables simultaneously.
• Correlation Coefficient
• The correlation coefficient is a statistical measure that quantifies the strength and direction of a
linear relationship between two variables.
• The coefficient ranges from -1 to +1. A value greater than zero indicates a positive relationship,
while a value less than zero indicates a negative relationship.
• A value close to zero indicates a weak linear relationship between the two variables being
compared; a nonlinear relationship may still exist.
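The univariate check and the correlation coefficient described above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a definitive implementation: the rows are made-up toy values (not the real Iris measurements), and the helper names `summarize_by_class` and `pearson_r` are hypothetical. The correlation computed here is the Pearson coefficient, which is assumed to be the one the slide has in mind since it measures linear relationships.

```python
from math import sqrt
from statistics import mean

# Made-up toy rows: (species, sepal_length_cm, petal_length_cm).
# Illustrative values only, NOT the real Iris measurements.
rows = [
    ("setosa", 5.0, 1.4),
    ("setosa", 4.8, 1.5),
    ("setosa", 5.2, 1.3),
    ("versicolor", 6.0, 4.2),
    ("versicolor", 5.8, 4.0),
    ("versicolor", 6.3, 4.5),
]


def summarize_by_class(rows, col):
    """Univariate view: (min, mean, max) of one column for each class label."""
    groups = {}
    for row in rows:
        groups.setdefault(row[0], []).append(row[col])
    return {label: (min(v), mean(v), max(v)) for label, v in groups.items()}


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences.

    Assumes neither sequence is constant (otherwise the denominator is zero).
    """
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# If the per-class ranges of a feature do not overlap, that feature alone
# separates the classes in this toy sample.
sepal_summary = summarize_by_class(rows, col=1)
for label, (lo, avg, hi) in sepal_summary.items():
    print(f"{label}: min={lo} mean={avg:.2f} max={hi}")

# Sign gives the direction, magnitude gives the strength of the linear relation.
r = pearson_r([row[1] for row in rows], [row[2] for row in rows])
print(f"Pearson r between sepal and petal length: {r:.3f}")
```

In this toy sample the sepal-length ranges of the two classes do not overlap, which is the kind of evidence univariate analysis looks for when judging whether one feature helps classification.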
EDA- Exploring correlations between variables
• Multicollinearity
• Multicollinearity is a statistical phenomenon that occurs when two or more independent
variables are highly correlated with each other. In other words, multicollinearity indicates a
strong linear relationship among the independent variables.
• Multicollinearity can cause problems for certain models, particularly those that assume the
predictor variables are not strongly related to one another (for example, linear regression).
• Highly correlated features provide redundant information about the target variable. Keeping
both features may not add much value to the model, and choosing one might be more efficient.
• Including highly correlated features can make it challenging to interpret the individual
contribution of each feature to the model. This is because changes in one highly correlated
variable are often associated with changes in the other.
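One simple way to look for the redundancy described above is to compute the pairwise correlation of all features and flag pairs whose absolute correlation exceeds a cutoff. The sketch below assumes a cutoff of 0.9, which is an illustrative choice and not a value given in the slides; the feature values are made up, and `correlated_pairs` is a hypothetical helper name.

```python
from itertools import combinations
from math import sqrt
from statistics import mean


def pearson_r(xs, ys):
    """Pearson correlation coefficient (assumes neither sequence is constant)."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def correlated_pairs(features, threshold=0.9):
    """Return (name_a, name_b, r) for every feature pair with |r| >= threshold.

    The 0.9 default is an assumed, illustrative cutoff.
    """
    flagged = []
    for (a, xs), (b, ys) in combinations(features.items(), 2):
        r = pearson_r(xs, ys)
        if abs(r) >= threshold:
            flagged.append((a, b, r))
    return flagged


# Made-up toy features: height_in is height_cm converted to inches (rounded),
# so the two columns carry almost the same information.
features = {
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "age_years": [35, 22, 41, 29, 33],
}

flagged = correlated_pairs(features)
for a, b, r in flagged:
    print(f"{a} and {b} are highly correlated (r = {r:.3f}); consider keeping one")
```

Pairwise correlation only catches redundancy between two features at a time; a feature that is a combination of several others would need a different check.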
EDA- Outliers
• What is an Outlier?
• An outlier is an observation that significantly deviates from the rest of the data points in a
dataset.
• It is a data point that lies an abnormal distance away from other values in a random sample from
a population.
• The presence of outliers can substantially impact statistical analysis, as they can skew results and
lead to incorrect conclusions.
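The claim that outliers can skew results is easy to see with a small, made-up sample: adding one extreme value moves the mean a long way while the median barely changes. The numbers below are illustrative only.

```python
from statistics import mean, median

# Made-up toy values; the extra entry in `with_outlier` is an injected outlier.
clean = [48, 50, 51, 49, 52, 50]
with_outlier = clean + [500]

mean_shift = mean(with_outlier) - mean(clean)
median_shift = median(with_outlier) - median(clean)

print(f"mean:   {mean(clean):.2f} -> {mean(with_outlier):.2f}")
print(f"median: {median(clean):.2f} -> {median(with_outlier):.2f}")
print(f"shift in mean = {mean_shift:.2f}, shift in median = {median_shift:.2f}")
```

This is why summary statistics are usually reported together: a large gap between mean and median is itself a hint that extreme values are present.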
EDA- Outliers
• Box Plot
• Box plots provide a visual distribution summary, including the median, quartiles, and range.
• Outliers can be spotted as individual points outside the whiskers.
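The numbers behind a box plot can be computed directly. The sketch below assumes the common convention that whiskers extend to the most extreme data points within 1.5 times the interquartile range (IQR) of the quartiles; the slides do not state which whisker rule is used, so treat the factor 1.5 as an assumption. The data are made up, `box_plot_stats` is a hypothetical helper name, and quartile values can differ slightly between tools because several quantile conventions exist.

```python
from statistics import quantiles


def box_plot_stats(values, k=1.5):
    """Compute the quantities a box plot draws.

    k=1.5 is the assumed whisker factor (points beyond q1 - k*IQR or
    q3 + k*IQR are reported as outliers).
    """
    q1, q2, q3 = quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    return {
        "q1": q1,
        "median": q2,
        "q3": q3,
        "whisker_low": min(inside),    # most extreme point still inside the fence
        "whisker_high": max(inside),
        "outliers": [v for v in values if v < lower_fence or v > upper_fence],
    }


# Made-up toy sample with one value far from the rest.
values = [7, 9, 10, 10, 11, 12, 13, 14, 40]
stats = box_plot_stats(values)
for name, value in stats.items():
    print(f"{name}: {value}")
```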
EDA- Data Visualization
• Scatter Plots
• Scatter plots visualize relationships between two continuous variables.
• Identify patterns, trends, and correlations between variables. Clusters of points may suggest
subgroups or patterns in the data.
• Scatter plots help assess the correlation, whether it's positive, negative, or nonexistent.
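To make the idea concrete without depending on a plotting library, the sketch below draws a rough character-grid scatter plot. In practice a plotting library would be used instead; `ascii_scatter` is a hypothetical helper and the data are made up. A cloud of points rising from bottom-left to top-right suggests a positive correlation.

```python
def ascii_scatter(xs, ys, width=20, height=8):
    """Render points on a character grid; '*' marks a cell holding >= 1 point.

    Assumes xs and ys each contain at least two distinct values.
    """
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    grid = [[" "] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        col = round((x - x_min) / (x_max - x_min) * (width - 1))
        row = round((y - y_min) / (y_max - y_min) * (height - 1))
        grid[height - 1 - row][col] = "*"  # flip rows so larger y is drawn higher
    return ["".join(line) for line in grid]


# Made-up toy data with a roughly linear, increasing relationship.
xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 2.9, 4.2, 4.8, 6.1, 7.0]

lines = ascii_scatter(xs, ys)
for line in lines:
    print("|" + line)
print("+" + "-" * 20)
```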
EDA- Data Visualization
• Line Charts
• Line charts show how a variable changes over time.
• Identify trends, seasonality, or cyclical patterns in time series data.
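One way to separate a trend from a repeating pattern before plotting is a simple moving average whose window matches the length of the repeating cycle. The sketch below uses a made-up series built from a steady upward trend plus a pattern that repeats every four steps; the window of 4 is chosen to match that assumed cycle, and `moving_average` is a hypothetical helper name.

```python
from statistics import mean


def moving_average(series, window):
    """Simple moving average; returns len(series) - window + 1 values."""
    return [mean(series[i:i + window]) for i in range(len(series) - window + 1)]


# Made-up toy series: upward trend plus a pattern repeating every 4 steps.
series = [10, 14, 12, 8, 14, 18, 16, 12, 18, 22, 20, 16]

# Averaging over one full cycle cancels the repeating pattern and leaves the trend.
trend = moving_average(series, window=4)

print("raw:  ", series)
print("trend:", trend)
```

Plotting the raw series and the smoothed series on the same line chart makes both the seasonality (the wiggle in the raw line) and the trend (the smoothed line) visible.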
EDA- Data Visualization
• Histograms
• Histograms provide a visual representation of the distribution of a single variable.
• They help identify the data's central tendency, spread, and shape, including whether it follows a
normal distribution or exhibits skewness.
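A histogram is just a count of values per bin, so its shape can be inspected even without a plotting library. The sketch below uses equal-width bins and made-up values; `histogram_counts` is a hypothetical helper name. Comparing mean and median is used here as a rough heuristic for skewness (mean above median often accompanies a longer right tail), not as a formal test.

```python
from statistics import mean, median


def histogram_counts(values, bins=5):
    """Count values in equal-width bins spanning [min, max].

    Assumes the values are not all identical (bin width would be zero).
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)  # the maximum goes in the last bin
        counts[idx] += 1
    return counts, lo, width


# Made-up toy sample with a longer right tail.
values = [1, 1, 2, 2, 2, 3, 3, 4, 5, 6, 8, 11]

counts, lo, width = histogram_counts(values, bins=5)
for i, count in enumerate(counts):
    left = lo + i * width
    print(f"[{left:5.1f}, {left + width:5.1f}) {'#' * count}")

print(f"mean = {mean(values):.2f}, median = {median(values):.2f}")
```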