Exploratory Data Analysis (EDA): Summary Notes and Explanation

What is Exploratory Data Analysis?

EDA is the initial step in data analysis where graphical and statistical
techniques are used to:

Summarize the main characteristics of the dataset.

Detect patterns, anomalies, or relationships.

Refine hypotheses for further analysis.

Prepare data for formal modeling by identifying trends, distributions, and outliers.

Key Steps in EDA

1. Data Collection and Loading

Gather data from sources like CSV, databases, or APIs.

Load the data into analysis tools (e.g., Python, R, Excel).


2. Data Cleaning

Handle missing data (imputation, removal, or flagging).

Correct data types (e.g., integers, dates, strings).

Identify and address duplicate or inconsistent records.
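
The type-correction and duplicate-handling steps above can be sketched in pandas as follows. This is only an illustration: the table and its column names (`order_id`, `order_date`, `amount`) are hypothetical and do not come from these notes.

```python
import pandas as pd

# Hypothetical raw data: every column arrives as text, and one row is repeated.
raw = pd.DataFrame({
    "order_id": ["1", "2", "2", "3"],
    "order_date": ["2024-01-05", "2024-01-06", "2024-01-06", "2024-01-09"],
    "amount": ["10.5", "20.0", "20.0", "7.25"],
})

# Correct data types: integers, dates, and floats instead of strings.
clean = raw.assign(
    order_id=raw["order_id"].astype(int),
    order_date=pd.to_datetime(raw["order_date"]),
    amount=raw["amount"].astype(float),
)

# Count fully duplicated rows, then drop them (keeping the first occurrence).
n_duplicates = int(clean.duplicated().sum())
clean = clean.drop_duplicates().reset_index(drop=True)

print(clean.dtypes)
print("duplicates removed:", n_duplicates)
```

Whether a repeated row is a true duplicate or a legitimate repeat depends on the dataset, so it is usually worth inspecting the flagged rows before dropping them.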

3. Descriptive Statistics

Central Tendency: Mean, median, mode.

Dispersion: Range, variance, standard deviation, IQR (Interquartile Range).

Distribution: Skewness and kurtosis to understand the shape of data.
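
As a rough illustration, the measures listed above can be computed with pandas on a small, made-up sample. The values are hypothetical and include one large observation so that the skew is visible.

```python
import pandas as pd

# Hypothetical sample with one large value to make the skew visible.
values = pd.Series([2, 3, 3, 4, 5, 5, 5, 6, 7, 30])

summary = {
    "mean": values.mean(),
    "median": values.median(),
    "mode": values.mode().iloc[0],      # most frequent value
    "range": values.max() - values.min(),
    "variance": values.var(),           # sample variance (ddof=1)
    "std": values.std(),                # sample standard deviation
    "iqr": values.quantile(0.75) - values.quantile(0.25),
    "skewness": values.skew(),          # > 0 means a longer right tail
    "kurtosis": values.kurt(),          # excess kurtosis (normal = 0)
}

for name, value in summary.items():
    print(f"{name:>9}: {value:.3f}")
```

Here the mean (7.0) sits above the median (5.0) because the single large value pulls it upward, which is the usual sign of right skew.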

4. Data Visualization

Graphical methods to explore relationships, trends, and patterns.


Essential EDA Techniques

1. Univariate Analysis

Focus on a single variable at a time.

Tools:

Histograms

Boxplots

Frequency tables

Summary statistics

2. Bivariate Analysis

Study relationships between two variables.


Tools:

Scatter plots

Correlation coefficients

Line plots (for trends over time)

Bar plots (categorical vs. numerical relationships)

3. Multivariate Analysis

Explore relationships among three or more variables simultaneously.

Tools:

Pair plots (e.g., seaborn.pairplot in Python)

Heatmaps (correlation matrix visualization)

Parallel coordinate plots


Common Multivariate Statistical Techniques

1. Principal Component Analysis (PCA)

Reduces the dimensionality of data while retaining as much variance as possible.

Visualizes high-dimensional data in 2D or 3D plots.
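
One way to see what PCA does is to compute it directly with NumPy's singular value decomposition. The sketch below is not tied to any particular library's PCA class, and the synthetic 3-feature dataset is an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 samples, 3 features, where the third feature is
# almost a linear combination of the first two.
base = rng.normal(size=(200, 2))
third = base @ np.array([0.5, -1.0]) + 0.05 * rng.normal(size=200)
X = np.column_stack([base, third])

# 1. Center each feature (PCA works on zero-mean columns).
X_centered = X - X.mean(axis=0)

# 2. SVD of the centered data; rows of Vt are the principal directions.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# 3. Fraction of total variance explained by each component.
explained_ratio = S**2 / np.sum(S**2)

# 4. Project onto the first two components for a 2D view.
X_2d = X_centered @ Vt[:2].T

print("explained variance ratio:", np.round(explained_ratio, 3))
print("projected shape:", X_2d.shape)
```

Because the third feature is nearly redundant, the first two components carry almost all of the variance, so the 2D projection loses very little information.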

2. Clustering

Groups similar data points together:

K-Means Clustering

Hierarchical Clustering

3. Multidimensional Scaling (MDS)

Projects high-dimensional data into a lower-dimensional space for visualization.

4. Factor Analysis

Identifies underlying latent variables that influence the observed variables.

5. t-SNE (t-Distributed Stochastic Neighbor Embedding)

Non-linear technique to visualize complex patterns in high-dimensional data.

6. Heatmaps

Used for showing relationships between variables in a matrix format, often paired with correlation coefficients.

Key Questions EDA Aims to Address

1. Data Quality

Are there missing, duplicate, or erroneous entries?

2. Data Distribution

What is the shape of the data (e.g., normal, skewed)?

3. Relationships

How do variables interact with each other?

4. Outliers

Are there unusual data points, and what impact do they have?

Steps to Perform EDA in Python

1. Loading Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

2. Data Loading

df = pd.read_csv('data.csv')  # Load dataset

3. Summary Statistics

df.describe()  # Summary statistics for numeric columns
df.info()      # Column data types and non-null counts

4. Visualization

Histogram:

df['column_name'].hist()

Scatter Plot:

sns.scatterplot(x='feature1', y='feature2', data=df)

Heatmap:

sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')  # numeric columns only

Importance of EDA

Helps in identifying errors and cleaning data.

Provides insights into the structure of the dataset.

Aids in selecting the right modeling techniques.

Reduces time spent troubleshooting issues during modeling.


Exploratory Data Analysis (EDA): Detailed Notes

1. Goals of EDA

Understand the Dataset: Gain insights into the dataset's structure, distributions, and key characteristics.

Identify Patterns: Uncover relationships, trends, and structures in the data.

Detect Anomalies: Spot missing values, outliers, or unusual observations.

Hypothesis Testing: Refine or eliminate hypotheses before formal modeling.

Feature Engineering: Identify features to add, transform, or remove for better modeling results.

2. Types of EDA Techniques

1. Univariate Analysis

Examines a single variable.


Focuses on:

Central tendency (mean, median, mode).

Dispersion (variance, standard deviation, range).

Distribution shape (normal, skewed, etc.).

Visualizations:

Numerical Data: Histograms, boxplots, density plots.

Categorical Data: Bar charts, pie charts.

2. Bivariate Analysis

Analyzes relationships between two variables.

Visualizations:

Numerical-Numerical: Scatter plots, line charts, correlation matrices.

Numerical-Categorical: Boxplots, violin plots, bar plots.


Categorical-Categorical: Grouped bar charts, contingency tables.
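
Alongside the plots, the numerical-categorical and categorical-categorical cases are often summarized in tables first. The sketch below uses a hypothetical survey table whose column names (`region`, `plan`, `spend`) are invented for illustration.

```python
import pandas as pd

# Hypothetical survey data; column names are invented for illustration.
df = pd.DataFrame({
    "region": ["north", "north", "south", "south", "south", "north"],
    "plan": ["basic", "premium", "basic", "basic", "premium", "basic"],
    "spend": [20.0, 55.0, 18.0, 22.0, 60.0, 25.0],
})

# Numerical vs. categorical: compare a numeric column across groups.
spend_by_plan = df.groupby("plan")["spend"].agg(["count", "mean", "median"])

# Categorical vs. categorical: contingency table of joint counts.
contingency = pd.crosstab(df["region"], df["plan"])

print(spend_by_plan)
print(contingency)
```

The grouped summary is the tabular counterpart of a boxplot per category, and the contingency table is what a grouped bar chart would display.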

3. Multivariate Analysis

Explores relationships among three or more variables.

Useful for datasets with high dimensions.

Visualizations:

Pair plots.

3D scatter plots.

Heatmaps (for correlation matrices).

Statistical Techniques:

PCA, t-SNE, clustering methods, multidimensional scaling.


3. EDA Techniques in Depth

Data Cleaning

1. Handling Missing Data:

Strategies:

Deletion: Remove rows or columns with missing values.

Imputation: Replace with mean, median, mode, or interpolation.

Flagging: Add an indicator variable for missing values.
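
The three strategies can be sketched side by side in pandas. The data and column names below are hypothetical, and the choice of median for the numeric column and mode for the categorical one is just one reasonable assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps; column names are invented for illustration.
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 35.0],
    "city": ["Pune", "Delhi", None, "Pune"],
})

# Deletion: drop any row that contains a missing value.
dropped = df.dropna()

# Flagging: record where values were missing before filling them in.
imputed = df.copy()
imputed["age_missing"] = imputed["age"].isna()

# Imputation: median for the numeric column, mode for the categorical one.
imputed["age"] = imputed["age"].fillna(imputed["age"].median())
imputed["city"] = imputed["city"].fillna(imputed["city"].mode().iloc[0])

print(dropped)
print(imputed)
```

Creating the flag before imputing matters: once the gap is filled, the information that the value was originally missing is otherwise lost.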

2. Handling Outliers:

Detect using boxplots, z-scores, or the IQR method.

Options:

Remove.

Transform (log, square root).


Cap values at reasonable thresholds.
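
A minimal sketch of IQR-based detection followed by two of the options above (removal and capping). The 1.5 multiplier is the conventional choice rather than something these notes specify, and the sample values are made up.

```python
import pandas as pd

# Hypothetical measurements with one extreme value.
values = pd.Series([10, 12, 11, 13, 12, 11, 14, 95])

# IQR method: points beyond 1.5 * IQR from the quartiles are flagged.
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (values < lower) | (values > upper)

# Option 1: remove the flagged points.
removed = values[~is_outlier]

# Option 2: cap values at the IQR fences instead of dropping them.
capped = values.clip(lower=lower, upper=upper)

print("outliers:", values[is_outlier].tolist())
print("capped max:", capped.max())
```

Capping keeps the row in the dataset, which can be preferable when the other columns of that observation are still informative.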

Data Transformation

1. Standardization:

Scale data to have a mean of 0 and standard deviation of 1.

Suitable for techniques sensitive to magnitude (e.g., PCA, clustering).

2. Normalization:

Scale data to range [0, 1].

Useful for features with different units.

3. Encoding Categorical Variables:

One-hot encoding, label encoding, or ordinal encoding.
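
The three transformations can be written directly in pandas without a dedicated preprocessing library. This is a sketch under assumptions: the columns (`height_cm`, `income`, `color`) are hypothetical, and the population standard deviation (`ddof=0`) is used for standardization.

```python
import pandas as pd

# Hypothetical features on different scales plus one categorical column.
df = pd.DataFrame({
    "height_cm": [150.0, 160.0, 170.0, 180.0],
    "income": [20000.0, 35000.0, 50000.0, 95000.0],
    "color": ["red", "blue", "red", "green"],
})
numeric = df[["height_cm", "income"]]

# Standardization: subtract the mean, divide by the standard deviation.
# ddof=0 uses the population standard deviation.
standardized = (numeric - numeric.mean()) / numeric.std(ddof=0)

# Normalization (min-max): rescale each column to the range [0, 1].
normalized = (numeric - numeric.min()) / (numeric.max() - numeric.min())

# One-hot encoding: one indicator column per category.
encoded = pd.get_dummies(df["color"], prefix="color")

print(standardized.round(3))
print(normalized.round(3))
print(encoded)
```

When a model is trained later, the means, standard deviations, minima, and maxima should be computed on the training data only and then reused on new data.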


Statistical Analysis

1. Descriptive Statistics:

Mean, median, mode, variance, standard deviation.

Skewness and kurtosis to understand distribution shape.

2. Inferential Statistics:

Hypothesis testing (e.g., t-tests, chi-square tests).

Confidence intervals.

3. Correlation Analysis:

Pearson correlation: Measures linear relationships (-1 to 1).


Spearman correlation: Measures monotonic relationships.
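
The difference between the two correlation measures shows up clearly on a relationship that is monotonic but not linear. In the sketch below the data are made up, and Spearman is computed from its definition (the Pearson correlation of the ranks).

```python
import pandas as pd

# Hypothetical data: y grows monotonically with x, but not linearly.
x = pd.Series(range(1, 11), dtype=float)
y = x ** 3

# Pearson measures the strength of the linear relationship.
pearson = x.corr(y)

# Spearman is the Pearson correlation of the ranks, so any strictly
# increasing relationship scores 1 regardless of its shape.
spearman = x.rank().corr(y.rank())

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

A Spearman value noticeably higher than the Pearson value is a hint that the relationship is monotonic but curved, and that a transformation may be worth trying.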

Advanced Multivariate Techniques

1. Principal Component Analysis (PCA):

Reduces dimensionality while preserving most variance.

Helps visualize high-dimensional data in 2D or 3D.

2. Clustering:

Groups similar data points into clusters.

Examples: K-means, hierarchical clustering, DBSCAN.
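
To make the idea of grouping similar points concrete, here is a bare-bones K-means (Lloyd's algorithm) written in NumPy. It is a teaching sketch, not a replacement for a library implementation; the `kmeans` helper and the two-blob dataset are hypothetical.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Initialize centroids from k distinct data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: nearest centroid by Euclidean distance.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical data: two well-separated blobs of 50 points each.
rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
])

labels, centroids = kmeans(X, k=2)
print("cluster sizes:", np.bincount(labels))
print("centroids:\n", centroids.round(2))
```

K-means relies on Euclidean distance, so features are usually standardized first; otherwise the feature with the largest scale dominates the grouping.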

3. t-SNE:

Visualizes high-dimensional data by projecting it into a 2D or 3D space.

Effective for visualizing cluster structure.

4. Heatmaps:

Displays relationships between variables using color-coded matrices.

Common for correlation analysis.

5. Tools for EDA

Python Libraries:

pandas: Data manipulation and summary statistics.

Matplotlib and seaborn: Visualization.

NumPy: Numerical computations.

SciPy: Statistical computations.

Plotly and Bokeh: Interactive visualizations.

R Libraries:

ggplot2: Data visualization.

dplyr: Data manipulation.

caret: Preprocessing and feature engineering.

Other Tools:

Tableau, Power BI for visualization.

Excel for basic EDA.

6. Practical Tips for Effective EDA


1. Start Simple:

Begin with univariate analysis before diving into complex relationships.

2. Iterative Process:

EDA is not linear; revisit earlier steps as needed.

3. Ask Questions:

What does each variable represent?

Are there any inconsistencies or anomalies?

How do variables interact?

4. Document Findings:

Keep track of key insights and issues for future reference.


5. Automate Repetitive Steps:

Use scripts to standardize data cleaning and visualization.
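
One possible form such a script could take is a small reusable function that produces the same data-quality overview for any table. The `quality_report` helper and the sample data below are hypothetical.

```python
import numpy as np
import pandas as pd

def quality_report(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column overview: dtype, missing count and share, unique values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "pct_missing": (df.isna().mean() * 100).round(1),
        "n_unique": df.nunique(),
    })

# Hypothetical usage on a tiny table.
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 40.0],
    "city": ["Pune", "Delhi", None, "Pune"],
})
report = quality_report(df)
print(report)
print("duplicate rows:", int(df.duplicated().sum()))
```

Running the same report at the start of every analysis makes it easier to compare datasets and to notice when a refreshed file has changed shape.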

7. Example Code for EDA in Python

# Importing Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Load Dataset
df = pd.read_csv('data.csv')

# Basic Information
df.info()             # Data types and non-null counts (prints directly)
print(df.describe())  # Summary statistics
print(df.head())      # First few rows

# Missing Values
print(df.isnull().sum())  # Count missing values per column

# Univariate Analysis
df['column_name'].hist(bins=30)
plt.title('Histogram of column_name')
plt.show()

# Bivariate Analysis
sns.scatterplot(x='feature1', y='feature2', data=df)
plt.title('Scatter Plot of Feature1 vs Feature2')
plt.show()

# Correlation Heatmap (numeric columns only)
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()

# Pair Plot
sns.pairplot(df)
plt.show()
