Assignment EDA
Trillanes
BSIT - 4D
1. Understanding the Data:
Exploratory data analysis (EDA) gives an initial understanding of the dataset, its structure, and the
relationships between variables. This understanding is needed before applying more advanced data
mining techniques.
2. Identifying Patterns and Trends:
Through graphical and statistical methods, EDA helps in identifying patterns, trends, and outliers within
the data. This insight is valuable for formulating hypotheses and designing subsequent mining strategies.
3. Handling Missing Data:
EDA enables the identification and assessment of missing data. Understanding the nature of missing
data helps in making informed decisions on imputation or exclusion, which is crucial for accurate data
mining results.
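As a rough illustration of the missing-data check described above, the sketch below counts missing values per column and then fills one column with its mean. The records, the column names (age, income), and the use of None as the missing marker are assumptions made for this example, and mean imputation is only one of several possible choices.

```python
from statistics import mean

# Hypothetical records; None marks a missing value.
rows = [
    {"age": 25, "income": 30000},
    {"age": None, "income": 42000},
    {"age": 31, "income": None},
    {"age": 40, "income": 58000},
]

def missing_counts(records):
    """Count missing (None) values for every column."""
    counts = {}
    for record in records:
        for column, value in record.items():
            counts[column] = counts.get(column, 0) + (value is None)
    return counts

def impute_mean(records, column):
    """Return copies of the records with None in `column` replaced by the column mean."""
    observed = [r[column] for r in records if r[column] is not None]
    fill = mean(observed)
    return [{**r, column: fill if r[column] is None else r[column]} for r in records]

print(missing_counts(rows))
filled = impute_mean(rows, "age")
print([r["age"] for r in filled])
```

The original rows are left unchanged, so the imputed and raw versions can be compared before deciding between imputation and exclusion.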
4. Feature Selection:
EDA aids in the identification of relevant features or variables for the mining process. Selecting the right
features improves model performance, reduces complexity, and enhances interpretability.
5. Detecting Outliers:
Outliers can significantly impact the performance of data mining models. EDA helps in detecting outliers
and deciding on appropriate strategies for handling them, such as transformation or removal.
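One common way to flag outliers is the interquartile range (IQR) rule. The sketch below is a minimal example of that rule, assuming the usual multiplier of 1.5 and a small made-up list of measurements; it only flags values and leaves the decision to transform or remove them to the analyst.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = quantiles(values, n=4)  # quartiles (default "exclusive" method)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical measurements with one suspicious value.
data = [12, 14, 15, 15, 16, 17, 18, 19, 95]
print(iqr_outliers(data))
```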
6. Data Preprocessing:
EDA guides preprocessing steps, such as normalization, scaling, or encoding categorical variables.
Ensuring that the data is properly preprocessed contributes to the effectiveness of subsequent data
mining techniques.
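Two of the preprocessing steps mentioned above can be sketched in a few lines: min-max scaling for a numeric column and one-hot encoding for a categorical column. The function names and the sample columns (heights, colors) are hypothetical and chosen only for illustration.

```python
def min_max_scale(values):
    """Rescale numeric values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def one_hot(labels):
    """Encode categorical labels as 0/1 indicator columns, one per category."""
    categories = sorted(set(labels))
    encoded = [[int(label == c) for c in categories] for label in labels]
    return categories, encoded

heights = [150, 160, 170, 180]   # hypothetical numeric column
colors = ["red", "blue", "red"]  # hypothetical categorical column

scaled = min_max_scale(heights)
categories, encoded = one_hot(colors)
print(scaled)
print(categories, encoded)
```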
7. Choosing the Right Model:
EDA insights guide the selection of appropriate data mining models. Understanding the distribution of
data and relationships between variables helps in choosing models that align with the inherent
characteristics of the data.
8. Assessing Assumptions:
Data mining algorithms often come with assumptions, such as linearity, normality, or independence of
observations. EDA helps in assessing whether these assumptions are met and, if not, points to the
adjustments or transformations required.
9. Avoiding Overfitting:
EDA can reveal conditions that make overfitting more likely, such as a small sample, noisy or redundant
features, or patterns that rest on only a few observations. Knowing this helps in choosing a model of
suitable complexity, especially in the context of machine learning.
10. Communicating Results:
EDA results are crucial for communicating findings to stakeholders. Visualizations and summaries
generated during EDA provide a clear and interpretable representation of the data, facilitating effective
communication.
1. Descriptive Statistics:
● Purpose: Provides a summary of the main characteristics of a dataset.
● Techniques: Mean, median, mode, range, variance, standard deviation, skewness, kurtosis.
● Visualizations: Box plots, histograms, bar charts.
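The techniques listed above can be computed with Python's standard library. The sketch below is one possible summary function; it assumes a non-constant numeric sample, uses population formulas (dividing by n) for variance, skewness, and excess kurtosis, and the scores list is made up for the example.

```python
from statistics import mean, median, mode, pvariance, pstdev

def describe(values):
    """Summarize a numeric sample with common descriptive statistics."""
    m = mean(values)
    sd = pstdev(values)  # population standard deviation; must be non-zero
    n = len(values)
    # Skewness and excess kurtosis from the standardized moments.
    skew = sum(((v - m) / sd) ** 3 for v in values) / n
    kurt = sum(((v - m) / sd) ** 4 for v in values) / n - 3
    return {
        "mean": m,
        "median": median(values),
        "mode": mode(values),
        "range": max(values) - min(values),
        "variance": pvariance(values),
        "std": sd,
        "skewness": skew,
        "kurtosis": kurt,
    }

scores = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample
summary = describe(scores)
for name, value in summary.items():
    print(f"{name}: {value}")
```

A positive skewness, as in this sample, suggests a longer tail on the right side of the distribution.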
2. Univariate Analysis:
● Purpose: Examines the distribution and characteristics of individual variables.
● Techniques: Frequency distribution, summary statistics.
● Visualizations: Histograms, kernel density plots, bar charts.
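For a single categorical variable, a frequency distribution is often the first thing to look at. The following is a minimal sketch using collections.Counter; the year_level values are hypothetical.

```python
from collections import Counter

def frequency_table(values):
    """Return (value, count, relative frequency) rows, most frequent first."""
    counts = Counter(values)
    total = len(values)
    return [(value, count, count / total) for value, count in counts.most_common()]

# Hypothetical categorical variable.
year_level = ["4th", "3rd", "4th", "2nd", "4th", "3rd"]
for value, count, share in frequency_table(year_level):
    print(f"{value}: {count} ({share:.0%})")
```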
3. Bivariate Analysis:
● Purpose: Explores relationships between two variables.
● Techniques: Correlation, covariance, cross-tabulation.
● Visualizations: Scatter plots, line charts, heatmaps.
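Covariance and correlation for two numeric variables can be written out directly from their definitions. The sketch below uses the sample covariance (dividing by n - 1) and the Pearson correlation coefficient; the hours and scores lists are invented for illustration.

```python
from statistics import mean

def covariance(xs, ys):
    """Sample covariance of two equally long numeric sequences."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def pearson(xs, ys):
    """Pearson correlation coefficient, a value between -1 and 1."""
    return covariance(xs, ys) / (covariance(xs, xs) * covariance(ys, ys)) ** 0.5

hours = [1, 2, 3, 4, 5]         # hypothetical study hours
scores = [52, 55, 61, 70, 72]   # hypothetical exam scores

print(covariance(hours, scores))
print(round(pearson(hours, scores), 3))
```

A coefficient close to 1 points to a strong positive linear relationship, but it does not show causation and can miss non-linear patterns, which is why a scatter plot is usually examined alongside it.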
4. Multivariate Analysis:
● Purpose: Examines relationships between three or more variables.
● Techniques: Factor analysis, principal component analysis (PCA).
● Visualizations: 3D scatter plots, bubble charts.
6. Histograms:
● Purpose: Illustrates the distribution of a single variable.
● Visualizations: Histograms, kernel density plots.
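The idea behind a histogram is simply counting how many values fall into each interval. The sketch below shows that binning step with equal-width bins and prints a text version of the chart; the ages list and the choice of three bins are assumptions for the example.

```python
def histogram(values, bins):
    """Count how many values fall into each of `bins` equal-width intervals."""
    lo, hi = min(values), max(values)
    if hi == lo:  # all values identical: everything goes in the first bin
        return [len(values)] + [0] * (bins - 1)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value is placed in the last bin instead of a new one.
        index = min(int((v - lo) / width), bins - 1)
        counts[index] += 1
    return counts

ages = [18, 19, 19, 20, 21, 21, 22, 24, 27, 30]  # hypothetical sample
counts = histogram(ages, bins=3)
for i, count in enumerate(counts):
    print(f"bin {i}: {'#' * count}")
```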
7. Scatter Plots:
● Purpose: Reveals relationships between two continuous variables.
● Visualizations: Scatter plots, pair plots.
8. Heatmaps:
● Purpose: Visualizes the correlation matrix or pairwise relationships between variables.
● Visualizations: Heatmaps.
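A heatmap of correlations is drawn from a correlation matrix, so the matrix itself is the part worth sketching. The example below builds it for a small, made-up table of three columns (hours, score, absences) and prints it as text; a plotting library would then color each cell by its value.

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cross = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cross / (sx * sy)

def correlation_matrix(columns):
    """Return a nested dict with the correlation of every pair of columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical dataset with three numeric columns.
table = {
    "hours": [1, 2, 3, 4, 5],
    "score": [52, 55, 61, 70, 72],
    "absences": [6, 5, 3, 2, 1],
}
matrix = correlation_matrix(table)
for name, row in matrix.items():
    print(f"{name:>8}", " ".join(f"{r:+.2f}" for r in row.values()))
```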
9. Violin Plots:
● Purpose: Combines aspects of box plots and kernel density plots for a richer representation of
data distribution.
● Visualizations: Violin plots.