
Data Exploration

Preparation/Manipulation
HDSC 103
Data Exploration
 Here we explore and understand the data we have acquired. We
perform descriptive statistics, data visualization, and data cleaning
to identify patterns, correlations, and potential issues. Data
preprocessing techniques such as normalization, handling missing
values, and feature engineering are applied to prepare the data for
modeling.
 It allows us to uncover what the dataset we are working with looks
like:
 How big is the dataset (number of rows, columns/features, the shape of
the data)?
 What are the variables or features of the dataset?
 How are the data points distributed? Are there any outliers?
 What are the relationships between the data points?
Steps to understand, clean, and prepare your data (a code sketch of the first three steps follows the list)

1. Variable Identification
2. Univariate Analysis
3. Bi-variate Analysis
4. Missing values treatment
5. Outlier treatment
6. Variable transformation
7. Variable creation
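
As a quick, illustrative sketch of steps 1 to 3, the snippet below assumes a hypothetical
file 'dataset.csv' with a numeric 'age' column and a categorical 'gender' column;
substitute your own file and column names:

import pandas as pd

df = pd.read_csv('dataset.csv')            # hypothetical file name

# Step 1 - Variable identification: which columns are numeric, which categorical?
print(df.dtypes)

# Step 2 - Univariate analysis: summarize one variable at a time
print(df['age'].describe())                # 'age' is an assumed column

# Step 3 - Bi-variate analysis: how does one variable relate to another?
print(df.groupby('gender')['age'].mean())  # 'gender' is an assumed column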
Data Manipulation and Cleaning
 In this lesson, we will explore various techniques for data
manipulation and cleaning using Python. Data manipulation involves
transforming, reorganizing, and modifying data to extract relevant
information or prepare it for analysis. Data cleaning focuses on
identifying and addressing errors, inconsistencies, and missing
values in the dataset. Python provides powerful libraries and tools
for these tasks, such as Pandas and NumPy.

1. Importing Libraries: Start by importing the necessary libraries for
data manipulation and cleaning:

import pandas as pd
import numpy as np
Cont.

2. Loading Data: Use Pandas to load the dataset into a DataFrame:

df = pd.read_csv('dataset.csv')

3. Exploring the Data: Understand the structure and content of the dataset
using various Pandas functions:

df.head()     # View the first few rows of data
df.shape      # Get the dimensions of the DataFrame (rows, columns)
df.info()     # Get information about the DataFrame (data types, missing values)
df.describe() # Generate descriptive statistics of numerical columns

4. Handling Missing Values: Deal with missing values in the dataset:

df.isnull()      # Identify missing values in the DataFrame
df.dropna()      # Drop rows with missing values
df.fillna(value) # Fill missing values with a specific value
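
Note that dropna() and fillna() return a new DataFrame rather than modifying df in
place, so assign the result back if you want the change to stick. As a small sketch,
assuming hypothetical 'age' (numeric) and 'city' (categorical) columns:

df['age'] = df['age'].fillna(df['age'].median()) # fill numeric gaps with the median
df['city'] = df['city'].fillna('unknown')        # fill categorical gaps with a placeholder
print(df.isnull().sum())                         # confirm how many missing values remain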
Cont.

5. Removing Duplicates: Identify and remove duplicate records from the dataset:

df.duplicated()      # Identify duplicate rows
df.drop_duplicates() # Remove duplicate rows

6. Data Transformation: Perform various data transformations to prepare the data for
analysis:

df.rename(columns={'old_name': 'new_name'}) # Rename columns
df.replace(old_val, new_val)                # Replace values in the DataFrame
df.sort_values('column_name')               # Sort the DataFrame by a column
df.drop(columns=['column_name'])            # Drop columns from the DataFrame

7. Data Filtering and Selection: Select and filter data based on specific conditions:

df['column_name']                    # Access a specific column
df.loc[row_indexer, column_indexer]  # Access subsets of rows and columns using labels
df.iloc[row_indexer, column_indexer] # Access subsets of rows and columns using integer positions
df[df['column_name'] > value]        # Filter rows based on a condition
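
The key difference is that loc selects by label while iloc selects by integer position.
A short sketch, assuming a hypothetical numeric 'age' column:

adults = df[df['age'] >= 18]           # keep only rows matching a condition
subset = df.iloc[0:5, 0:2]             # first five rows, first two columns, by position
ages = df.loc[df['age'] >= 18, 'age']  # same filter, selecting one column by label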
Cont.

8. Data Aggregation: Aggregate and summarize data using group-by operations:

df.groupby('column_name').mean()  # Calculate the mean of each group
df.groupby('column_name').sum()   # Calculate the sum of each group
df.groupby('column_name').count() # Count the number of occurrences in each group
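
In practice it is often clearer to aggregate a specific column, or to apply several
aggregations at once with agg(). A sketch, assuming hypothetical 'city' and 'salary'
columns:

df.groupby('city')['salary'].mean()  # mean salary per city
df.groupby('city').agg(
    avg_salary=('salary', 'mean'),   # named aggregations: one result column each
    headcount=('salary', 'count'),
)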

9. Handling Outliers: Detect and deal with outliers in the dataset:

z_scores = (df['column_name'] - df['column_name'].mean()) / df['column_name'].std()
df_no_outliers = df[abs(z_scores) < threshold] # Remove outliers based on z-score threshold
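
An alternative to the z-score rule is the interquartile range (IQR) method, which is
less sensitive to extreme values. A sketch, assuming a hypothetical numeric 'salary'
column:

q1 = df['salary'].quantile(0.25)
q3 = df['salary'].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr            # the common 1.5 * IQR fences
df_no_outliers = df[df['salary'].between(lower, upper)]  # keep values inside the fences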

10. Data Type Conversion: Convert data types to the appropriate format:

df['column_name'] = df['column_name'].astype('new_type')

11. Handling Categorical Data: Encode categorical variables for analysis:

df['column_name'].unique()       # Get unique categories in a column
df['column_name'].value_counts() # Count occurrences of each category
df['column_name'] = pd.Categorical(df['column_name']) # Convert column to categorical type
df['column_name'] = df['column_name'].cat.codes       # Encode categories as numeric codes
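
Since cat.codes discards the original labels, it is worth keeping a record of which
code maps to which category. A sketch, assuming a hypothetical 'color' column:

df['color'] = pd.Categorical(df['color'])              # assumed column name
mapping = dict(enumerate(df['color'].cat.categories))  # e.g. {0: 'blue', 1: 'red'}
df['color_code'] = df['color'].cat.codes               # note: -1 marks missing values
print(mapping)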
Cont.

12. Handling Dates and Time: Manipulate and extract information from date and time
data:

df['date_column'] = pd.to_datetime(df['date_column']) # Convert column to datetime
df['year'] = df['date_column'].dt.year                # Extract year from date
df['month'] = df['date_column'].dt.month              # Extract month from date
df['weekday'] = df['date_column'].dt.day_name()       # Extract weekday name from date
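
A small end-to-end sketch with made-up dates (note that the older .dt.weekday_name
attribute was removed in pandas 1.0 in favour of the .dt.day_name() method used above):

import pandas as pd

df = pd.DataFrame({'date_column': ['2024-01-15', '2024-02-20']})  # made-up dates
df['date_column'] = pd.to_datetime(df['date_column'])
df['year'] = df['date_column'].dt.year
df['weekday'] = df['date_column'].dt.day_name()  # e.g. 'Monday', 'Tuesday'
print(df)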

These are some of the fundamental techniques for data manipulation and cleaning with
Python. Depending on your specific dataset and analysis needs, you may need to explore
additional methods and functions. Remember to document your steps and consult the
Pandas and NumPy documentation for detailed information on available functions and
options.
Exploratory Data Analysis (EDA)
Exploratory Data Analysis (EDA) is a crucial step in the data analysis process. It involves
understanding the data, discovering patterns, identifying anomalies, and extracting insights.
Python provides several libraries and tools that make EDA efficient and effective. In this
lesson, we will explore some of the popular Python libraries and techniques for performing
EDA.

1. Importing Libraries: Before starting EDA, import the required libraries. Some commonly
used libraries are:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

2. Loading Data: Read the data into a Pandas DataFrame using `pd.read_csv()` or other
relevant functions. For example:

df = pd.read_csv('data.csv')
Cont.

3. Understanding the Data: Explore the basic characteristics of the dataset:

- Check the dimensions: `df.shape`
- View the first few rows: `df.head()`
- View the data types: `df.dtypes`
- Check for missing values: `df.isnull().sum()`
- Summary statistics: `df.describe()`

4. Handling Missing Values: Missing values can impact analysis, so it's important to
handle them appropriately:

- Drop rows or columns with missing values: `df.dropna()`, `df.dropna(axis=1)`
- Fill missing values with appropriate strategies: `df.fillna(value)`
Cont.

5. Data Visualization: Visualizing data helps in identifying patterns, trends, and outliers:

- Histograms: `df['column'].plot.hist()`
- Box plots: `sns.boxplot(x='column', data=df)`
- Scatter plots: `plt.scatter(x='column1', y='column2', data=df)`
- Heatmaps: `sns.heatmap(df.corr(), annot=True)`
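
Putting these together, a minimal sketch that draws a histogram and a correlation
heatmap (assuming df has at least two numeric columns and that the imports from
step 1 have been run):

df['column'].plot.hist(bins=20)                      # 'column' is an assumed numeric column
plt.title('Distribution of column')
plt.show()

sns.heatmap(df.corr(numeric_only=True), annot=True)  # numeric_only avoids errors on
plt.show()                                           # non-numeric columns in newer pandas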

6. Data Cleaning: Clean the data by removing duplicates, handling outliers, and
transforming variables:

- Removing duplicates: `df.drop_duplicates()`
- Handling outliers: `df[(df['column'] > lower_limit) & (df['column'] < upper_limit)]`
- Variable transformations: `df['new_column'] = np.log(df['column'])`

7. Feature Engineering: Create new features or transform existing ones to improve
predictive models:

- Creating new features: `df['new_feature'] = df['feature1'] + df['feature2']`
- Binning: `pd.cut(df['column'], bins=5)`
- One-hot encoding: `pd.get_dummies(df['column'])`
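
For example, binning and one-hot encoding both return new objects that you typically
join back onto the DataFrame. A sketch, assuming hypothetical 'age' and 'city' columns:

df['age_group'] = pd.cut(df['age'], bins=5)          # five equal-width bins
dummies = pd.get_dummies(df['city'], prefix='city')  # one indicator column per category
df = pd.concat([df, dummies], axis=1)                # join the indicator columns back on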
Cont.

8. Correlation Analysis: Explore the relationships between variables using correlation analysis:

- Correlation matrix: `df.corr()`
- Pairplot: `sns.pairplot(df)`
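
Keep in mind that correlations are only defined for numeric columns; in newer pandas
versions you may need to request that explicitly, as in this brief sketch:

corr = df.corr(numeric_only=True)  # pairwise Pearson correlations of numeric columns
print(corr)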

9. Data Transformation: Normalize or scale the data to prepare it for modeling:

- Min-max scaling: `(df - df.min()) / (df.max() - df.min())`
- Standardization: `(df - df.mean()) / df.std()`
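
Applied to a single column, these transformations look like the following sketch,
assuming a hypothetical numeric 'salary' column:

col = df['salary']
df['salary_minmax'] = (col - col.min()) / (col.max() - col.min())  # rescale to [0, 1]
df['salary_std'] = (col - col.mean()) / col.std()                  # zero mean, unit variance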

10. Data Subset: Create subsets of data based on specific criteria:

- Filtering rows: `df[df['column'] > threshold]`
- Selecting columns: `df[['column1', 'column2']]`
