0% found this document useful (0 votes)
1 views

EDA_SQL_Document

EDA
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
1 views

EDA_SQL_Document

EDA
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 3

Exploratory Data Analysis (EDA) Using SQL

1. Understanding the Dataset

- Data Overview:

Use SQL queries like SELECT TOP (5) * FROM table_name; or SELECT COLUMN_NAME,

DATA_TYPE FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = 'table_name';

to quickly understand the structure of your dataset. Identify whether the data types are appropriate

(categorical, numerical, dates).

2. Data Cleaning

- Check for Missing Values:

Identify missing values with a query like:

SELECT COUNT(*) AS missing_count FROM table_name WHERE column_name IS NULL;

- Check for Duplicates:

Find duplicate records using:

SELECT column_name, COUNT(*) FROM table_name GROUP BY column_name HAVING

COUNT(*) > 1;

- Outlier Detection:

For numerical data, outliers can be detected using standard deviation:

SELECT AVG(column_name), STDEV(column_name) FROM table_name;

Filter out rows where values fall beyond a certain threshold:

SELECT * FROM table_name WHERE column_name > (AVG(column_name) + 3 *

STDEV(column_name));
- Data Type Corrections:

Use ALTER TABLE statements to ensure columns have the correct data type, e.g.:

ALTER TABLE table_name ALTER COLUMN column_name INT;

3. Descriptive Statistics

- Summary Statistics:

Get summary statistics (mean, min, max, etc.) with:

SELECT MIN(column_name), MAX(column_name), AVG(column_name), COUNT(*) FROM

table_name;

- For categorical data, get frequency distribution:

SELECT column_name, COUNT(*) FROM table_name GROUP BY column_name;

4. Data Relationships

- Correlation Analysis:

Use aggregate SQL functions to mimic correlation.

- Scatter Plots:

SQL cannot directly create plots, but you can retrieve the necessary data for visualization.

5. Data Visualization:

SQL doesn't produce charts directly. Export results for visualization using external tools.

6. Handling Categorical Variables

- Encoding:
Use CASE statements to manually assign numerical values to categories.

- Frequency Analysis:

Analyze category frequency distribution with GROUP BY.

7. Feature Engineering:

Use SQL to create new features, bins, or calculated columns.

8. Outlier Treatment:

Identify and manage outliers using threshold-based queries.

9. Dimensionality Reduction:

SQL supports column filtering via SELECT.

10. Summarizing Findings:

Use GROUP BY and aggregations to reveal trends.

You might also like