Document
Exploratory Data Analysis (EDA) is the initial step in data analysis, in which
graphical and statistical techniques are used to:
3. Descriptive Statistics
4. Data Visualization
1. Univariate Analysis
Tools:
Histograms
Boxplots
Frequency tables
Summary statistics
2. Bivariate Analysis
Scatter plots
Correlation coefficients
3. Multivariate Analysis
Tools:
2. Clustering
K-Means Clustering
Hierarchical Clustering
3. Multidimensional Scaling (MDS)
4. Factor Analysis
6. Heatmaps
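To make the K-Means Clustering item above concrete, here is a minimal sketch of the basic algorithm (Lloyd's iteration) written with NumPy only. The function name kmeans, its parameters, and the two synthetic blobs of points are all invented for illustration; in practice a library implementation such as scikit-learn's KMeans class is the more usual choice.

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Illustrative Lloyd's algorithm. Returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Initialise centroids by sampling k distinct data points.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every centroid, shape (n_points, k).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        # Assign each point to its nearest centroid.
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # assignments have stabilised
        centroids = new_centroids
    return labels, centroids

# Hypothetical data: two well-separated 2D blobs of 50 points each.
rng = np.random.default_rng(42)
blob_a = rng.normal(loc=[0, 0], scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=[5, 5], scale=0.5, size=(50, 2))
points = np.vstack([blob_a, blob_b])

labels, centroids = kmeans(points, k=2)
print(centroids)           # one centroid near each blob
print(np.bincount(labels)) # number of points per cluster
```

The number of clusters k has to be chosen by the analyst, which is one reason clustering in EDA is usually run several times with different settings.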
1. Data Quality
2. Data Distribution
3. Relationships
4. Outliers
Are there unusual data points, and what impact do they have?
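One common way to approach the outlier question above is the interquartile range (IQR) rule. The sketch below is only an illustration: the helper name iqr_outlier_mask, the multiplier k = 1.5, and the sample data are assumptions, not part of these notes.

```python
import pandas as pd

# Hypothetical sample data; the value 95 is planted as an obvious outlier.
df = pd.DataFrame({"value": [10, 12, 11, 13, 12, 11, 95, 10, 12, 13]})

def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (series < lower) | (series > upper)

mask = iqr_outlier_mask(df["value"])
print(df[mask])  # rows flagged as unusual

# One way to gauge impact: compare the mean with and without the flagged rows.
mean_all = df["value"].mean()
mean_without = df.loc[~mask, "value"].mean()
print(mean_all, mean_without)
```

A large gap between the two means suggests the flagged points have a strong influence on summary statistics and deserve a closer look before any decision to keep or remove them.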
Steps to Perform EDA in Python
1. Loading Libraries
import pandas as pd
import numpy as np
2. Data Loading
3. Summary Statistics
df.describe()
df.info()
4. Visualization
Histogram:
df['column_name'].hist()
Scatter Plot:
Heatmap:
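The scatter plot and heatmap steps are listed above without code, so here is one possible sketch using pandas and matplotlib only. The DataFrame, the column names x, y and z, and the output file names are invented for illustration; seaborn's heatmap function is a common alternative for the second plot.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical data: y depends on x, z is independent noise.
rng = np.random.default_rng(1)
df = pd.DataFrame({"x": rng.normal(size=100)})
df["y"] = 0.8 * df["x"] + rng.normal(scale=0.5, size=100)
df["z"] = rng.normal(size=100)

# Scatter plot of two numeric columns.
ax = df.plot.scatter(x="x", y="y", title="x vs y")
ax.figure.savefig("scatter.png")

# Heatmap of the correlation matrix.
corr = df.corr()
fig, ax2 = plt.subplots()
im = ax2.imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
ax2.set_xticks(range(len(corr.columns)))
ax2.set_xticklabels(corr.columns)
ax2.set_yticks(range(len(corr.columns)))
ax2.set_yticklabels(corr.columns)
fig.colorbar(im, ax=ax2)
ax2.set_title("Correlation Heatmap")
fig.savefig("heatmap.png")
```

In an interactive session the backend line can be dropped and plt.show() used in place of savefig.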
Importance of EDA
1. Goals of EDA
1. Univariate Analysis
Visualizations:
2. Bivariate Analysis
Visualizations:
3. Multivariate Analysis
Visualizations:
Pair plots.
3D scatter plots.
Statistical Techniques:
Data Cleaning
Strategies:
2. Handling Outliers:
Options:
Remove.
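The cleaning steps above can be sketched in pandas as follows. The specific strategies shown for missing values (dropping rows, or filling with the median or mode) are common choices assumed here for illustration, as are the sample data and the age threshold of 100 used for the Remove option.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing entries and one implausible age.
df = pd.DataFrame({
    "age": [25, 30, np.nan, 40, 35, 120],
    "city": ["A", "B", "B", None, "A", "B"],
})

print(df.isnull().sum())  # missing values per column

# Strategy A: drop every row that has a missing value.
dropped = df.dropna()

# Strategy B: impute, using the median for numeric and the mode for categorical data.
filled = df.copy()
filled["age"] = filled["age"].fillna(filled["age"].median())
filled["city"] = filled["city"].fillna(filled["city"].mode()[0])

# Outlier option "Remove": keep only rows within an assumed plausible range.
cleaned = filled[filled["age"] <= 100]
print(cleaned)
```

Dropping is simplest but loses data; imputation keeps every row at the cost of introducing values that were never observed, so the choice depends on how much data is missing and why.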
Data Transformation
1. Standardization:
2. Normalization:
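As a rough illustration of the two transformations above, the sketch below assumes that standardization means z-scoring (subtract the mean, divide by the standard deviation) and that normalization means min-max scaling to the range [0, 1]. The column name height_cm and its values are invented.

```python
import pandas as pd

# Hypothetical numeric column.
df = pd.DataFrame({"height_cm": [150.0, 160.0, 170.0, 180.0, 190.0]})
col = df["height_cm"]

# Standardization (z-score): result has mean 0 and standard deviation 1.
df["height_std"] = (col - col.mean()) / col.std(ddof=0)

# Normalization (min-max): result is rescaled to the range [0, 1].
df["height_norm"] = (col - col.min()) / (col.max() - col.min())

print(df)
```

Standardization is less sensitive to the exact minimum and maximum, while min-max scaling preserves a fixed range; a single extreme value compresses the rest of a min-max scaled column.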
1. Descriptive Statistics:
2. Inferential Statistics:
Confidence intervals.
3. Correlation Analysis:
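The three techniques above can be illustrated together in a few lines. This is a sketch under stated assumptions: the data are synthetic, the confidence interval uses the normal approximation (the 1.96 multiplier for roughly 95% coverage), and the correlation is Pearson's coefficient, which is the pandas default.

```python
import numpy as np
import pandas as pd

# Synthetic data: y is roughly linear in x plus noise.
rng = np.random.default_rng(0)
x = rng.normal(loc=50, scale=10, size=200)
y = 2.0 * x + rng.normal(scale=5, size=200)
df = pd.DataFrame({"x": x, "y": y})

# Descriptive statistics: count, mean, std, min, quartiles, max.
print(df.describe())

# Inferential statistics: approximate 95% confidence interval for the mean of x.
n = len(df)
mean = df["x"].mean()
sem = df["x"].std(ddof=1) / np.sqrt(n)  # standard error of the mean
ci_low, ci_high = mean - 1.96 * sem, mean + 1.96 * sem
print(f"mean of x: {mean:.2f}, approx. 95% CI: ({ci_low:.2f}, {ci_high:.2f})")

# Correlation analysis: Pearson correlation between x and y.
r = df["x"].corr(df["y"])
print(f"Pearson r: {r:.3f}")
```

For small samples a t-based interval would be more appropriate than the normal approximation used here.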
2. Clustering:
3. t-SNE:
Visualizes high-dimensional data by projecting it into a 2D or 3D space.
4. Heatmaps:
Python Libraries:
R Libraries:
Other Tools:
2. Iterative Process:
3. Ask Questions:
4. Document Findings:
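# Importing Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Load Dataset
df = pd.read_csv('data.csv')
# Basic Information
df.info()  # Column types and non-null counts
print(df.describe())  # Summary statistics for numeric columns
# Missing Values
print(df.isnull().sum())  # Count missing values
# Univariate Analysis
df['column_name'].hist(bins=30)
plt.title('Histogram of column_name')
plt.show()
# Bivariate Analysis (column_x and column_y are placeholder column names)
df.plot.scatter(x='column_x', y='column_y')
plt.show()
# Correlation Heatmap (numeric columns only)
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()
# Pair Plot
sns.pairplot(df)
plt.show()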