
Data Exploration

Preparation/Manipulation
HDSC 103
Data Exploration
 Here we explore and understand the data we have acquired. We
perform descriptive statistics, data visualization, and data cleaning
to identify patterns, correlations, and potential issues. Data
preprocessing techniques such as normalization, handling missing
values, and feature engineering are applied to prepare the data for
modeling.
 It allows us to uncover what the dataset we are working with looks
like:
 How big is the dataset (number of rows, columns/features, the shape of
the data)?
 What are the variables or features of the dataset?
 How are the data points distributed? Are there any outliers?
 What are the relationships between the data points?
Steps to understand, clean, and prepare your data (a code sketch of the first three steps follows the list)

1. Variable Identification
2. Univariate Analysis
3. Bi-variate Analysis
4. Missing values treatment
5. Outlier treatment
6. Variable transformation
7. Variable creation
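
As a quick, illustrative sketch of steps 1 to 3, the snippet below assumes a hypothetical
file 'dataset.csv' with a numeric 'age' column and a categorical 'gender' column;
substitute your own file and column names:

import pandas as pd

df = pd.read_csv('dataset.csv')            # hypothetical file name

# Step 1 - Variable identification: which columns are numeric, which categorical?
print(df.dtypes)

# Step 2 - Univariate analysis: summarize one variable at a time
print(df['age'].describe())                # 'age' is an assumed column

# Step 3 - Bi-variate analysis: how does one variable relate to another?
print(df.groupby('gender')['age'].mean())  # 'gender' is an assumed column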
Data Manipulation and Cleaning
 In this lesson, we will explore various techniques for data
manipulation and cleaning using Python. Data manipulation involves
transforming, reorganizing, and modifying data to extract relevant
information or prepare it for analysis. Data cleaning focuses on
identifying and addressing errors, inconsistencies, and missing
values in the dataset. Python provides powerful libraries and tools
for these tasks, such as Pandas and NumPy.

1. Importing Libraries: Start by importing the necessary libraries for
data manipulation and cleaning:

import pandas as pd
import numpy as np
Cont.

2. Loading Data: Use Pandas to load the dataset into a DataFrame:

df = pd.read_csv('dataset.csv')

3. Exploring the Data: Understand the structure and content of the dataset
using various Pandas functions:

df.head()     # View the first few rows of data
df.shape      # Get the dimensions of the DataFrame (rows, columns)
df.info()     # Get information about the DataFrame (data types, missing values)
df.describe() # Generate descriptive statistics of numerical columns

4. Handling Missing Values: Deal with missing values in the dataset:

df.isnull()      # Identify missing values in the DataFrame
df.dropna()      # Drop rows with missing values
df.fillna(value) # Fill missing values with a specific value
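
Note that dropna() and fillna() return a new DataFrame rather than modifying df in
place, so assign the result back if you want the change to stick. As a small sketch,
assuming hypothetical 'age' (numeric) and 'city' (categorical) columns:

df['age'] = df['age'].fillna(df['age'].median()) # fill numeric gaps with the median
df['city'] = df['city'].fillna('unknown')        # fill categorical gaps with a placeholder
print(df.isnull().sum())                         # confirm how many missing values remain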
Cont.

5. Removing Duplicates: Identify and remove duplicate records from the dataset:

df.duplicated()      # Identify duplicate rows
df.drop_duplicates() # Remove duplicate rows

6. Data Transformation: Perform various data transformations to prepare the data for
analysis:

df.rename(columns={'old_name': 'new_name'}) # Rename columns
df.replace(old_val, new_val)                # Replace values in the DataFrame
df.sort_values('column_name')               # Sort the DataFrame by a column
df.drop(columns=['column_name'])            # Drop columns from the DataFrame

7. Data Filtering and Selection: Select and filter data based on specific conditions:

df['column_name']                    # Access a specific column
df.loc[row_indexer, column_indexer]  # Access subsets of rows and columns using labels
df.iloc[row_indexer, column_indexer] # Access subsets of rows and columns using integer positions
df[df['column_name'] > value]        # Filter rows based on a condition
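
The key difference is that loc selects by label while iloc selects by integer position.
A short sketch, assuming a hypothetical numeric 'age' column:

adults = df[df['age'] >= 18]           # keep only rows matching a condition
subset = df.iloc[0:5, 0:2]             # first five rows, first two columns, by position
ages = df.loc[df['age'] >= 18, 'age']  # same filter, selecting one column by label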
Cont.

8. Data Aggregation: Aggregate and summarize data using group-by operations:

df.groupby('column_name').mean()  # Calculate the mean of each group
df.groupby('column_name').sum()   # Calculate the sum of each group
df.groupby('column_name').count() # Count the number of occurrences in each group
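
In practice it is often clearer to aggregate a specific column, or to apply several
aggregations at once with agg(). A sketch, assuming hypothetical 'city' and 'salary'
columns:

df.groupby('city')['salary'].mean()  # mean salary per city
df.groupby('city').agg(
    avg_salary=('salary', 'mean'),   # named aggregations: one result column each
    headcount=('salary', 'count'),
)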

9. Handling Outliers: Detect and deal with outliers in the dataset:

z_scores = (df['column_name'] - df['column_name'].mean()) / df['column_name'].std()
df_no_outliers = df[abs(z_scores) < threshold] # Remove outliers based on z-score threshold
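
An alternative to the z-score rule is the interquartile range (IQR) method, which is
less sensitive to extreme values. A sketch, assuming a hypothetical numeric 'salary'
column:

q1 = df['salary'].quantile(0.25)
q3 = df['salary'].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr            # the common 1.5 * IQR fences
df_no_outliers = df[df['salary'].between(lower, upper)]  # keep values inside the fences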

10. Data Type Conversion: Convert data types to the appropriate format:

df['column_name'] = df['column_name'].astype('new_type')

11. Handling Categorical Data: Encode categorical variables for analysis:

df['column_name'].unique()       # Get unique categories in a column
df['column_name'].value_counts() # Count occurrences of each category
df['column_name'] = pd.Categorical(df['column_name']) # Convert column to categorical type
df['column_name'] = df['column_name'].cat.codes       # Encode categories as numeric codes
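
Since cat.codes discards the original labels, it is worth keeping a record of which
code maps to which category. A sketch, assuming a hypothetical 'color' column:

df['color'] = pd.Categorical(df['color'])              # assumed column name
mapping = dict(enumerate(df['color'].cat.categories))  # e.g. {0: 'blue', 1: 'red'}
df['color_code'] = df['color'].cat.codes               # note: -1 marks missing values
print(mapping)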
Cont.

12. Handling Dates and Time: Manipulate and extract information from date and time
data:

df['date_column'] = pd.to_datetime(df['date_column']) # Convert column to datetime
df['year'] = df['date_column'].dt.year                # Extract year from date
df['month'] = df['date_column'].dt.month              # Extract month from date
df['weekday'] = df['date_column'].dt.day_name()       # Extract weekday name from date
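
A small end-to-end sketch with made-up dates (note that the older .dt.weekday_name
attribute was removed in pandas 1.0 in favour of the .dt.day_name() method used above):

import pandas as pd

df = pd.DataFrame({'date_column': ['2024-01-15', '2024-02-20']})  # made-up dates
df['date_column'] = pd.to_datetime(df['date_column'])
df['year'] = df['date_column'].dt.year
df['weekday'] = df['date_column'].dt.day_name()  # e.g. 'Monday', 'Tuesday'
print(df)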

These are some of the fundamental techniques for data manipulation and cleaning with
Python. Depending on your specific dataset and analysis needs, you may need to explore
additional methods and functions. Remember to document your steps and consult the
Pandas and NumPy documentation for detailed information on available functions and
options.
Exploratory Data Analysis (EDA)
Exploratory Data Analysis (EDA) is a crucial step in the data analysis process. It involves
understanding the data, discovering patterns, identifying anomalies, and extracting insights.
Python provides several libraries and tools that make EDA efficient and effective. In this
lesson, we will explore some of the popular Python libraries and techniques for performing
EDA.

1. Importing Libraries: Before starting EDA, import the required libraries. Some commonly
used libraries are:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

2. Loading Data: Read the data into a Pandas DataFrame using `pd.read_csv()` or other
relevant functions. For example:

df = pd.read_csv('data.csv')
Cont.

3. Understanding the Data: Explore the basic characteristics of the dataset:

- Check the dimensions: `df.shape`
- View the first few rows: `df.head()`
- View the data types: `df.dtypes`
- Check for missing values: `df.isnull().sum()`
- Summary statistics: `df.describe()`

4. Handling Missing Values: Missing values can impact analysis, so it's important to
handle them appropriately:

- Drop rows or columns with missing values: `df.dropna()`, `df.dropna(axis=1)`
- Fill missing values with appropriate strategies: `df.fillna(value)`
Cont.

5. Data Visualization: Visualizing data helps in identifying patterns, trends, and outliers:

- Histograms: `df['column'].plot.hist()`
- Box plots: `sns.boxplot(x='column', data=df)`
- Scatter plots: `plt.scatter(x='column1', y='column2', data=df)`
- Heatmaps: `sns.heatmap(df.corr(), annot=True)`
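
Putting these together, a minimal sketch that draws a histogram and a correlation
heatmap (assuming df has at least two numeric columns and that the imports from
step 1 have been run):

df['column'].plot.hist(bins=20)                      # 'column' is an assumed numeric column
plt.title('Distribution of column')
plt.show()

sns.heatmap(df.corr(numeric_only=True), annot=True)  # numeric_only avoids errors on
plt.show()                                           # non-numeric columns in newer pandas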

6. Data Cleaning: Clean the data by removing duplicates, handling outliers, and
transforming variables:

- Removing duplicates: `df.drop_duplicates()`
- Handling outliers: `df[(df['column'] > lower_limit) & (df['column'] < upper_limit)]`
- Variable transformations: `df['new_column'] = np.log(df['column'])`

7. Feature Engineering: Create new features or transform existing ones to improve
predictive models:

- Creating new features: `df['new_feature'] = df['feature1'] + df['feature2']`
- Binning: `pd.cut(df['column'], bins=5)`
- One-hot encoding: `pd.get_dummies(df['column'])`
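
For example, binning and one-hot encoding both return new objects that you typically
join back onto the DataFrame. A sketch, assuming hypothetical 'age' and 'city' columns:

df['age_group'] = pd.cut(df['age'], bins=5)          # five equal-width bins
dummies = pd.get_dummies(df['city'], prefix='city')  # one indicator column per category
df = pd.concat([df, dummies], axis=1)                # join the indicator columns back on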
Cont.

8. Correlation Analysis: Explore the relationships between variables using correlation analysis:

- Correlation matrix: `df.corr()`
- Pairplot: `sns.pairplot(df)`
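
Keep in mind that correlations are only defined for numeric columns; in newer pandas
versions you may need to request that explicitly, as in this brief sketch:

corr = df.corr(numeric_only=True)  # pairwise Pearson correlations of numeric columns
print(corr)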

9. Data Transformation: Normalize or scale the data to prepare it for modeling:

- Min-max scaling: `(df - df.min()) / (df.max() - df.min())`
- Standardization: `(df - df.mean()) / df.std()`
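
Applied to a single column, these transformations look like the following sketch,
assuming a hypothetical numeric 'salary' column:

col = df['salary']
df['salary_minmax'] = (col - col.min()) / (col.max() - col.min())  # rescale to [0, 1]
df['salary_std'] = (col - col.mean()) / col.std()                  # zero mean, unit variance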

10. Data Subset: Create subsets of data based on specific criteria:

- Filtering rows: `df[df['column'] > threshold]`
- Selecting columns: `df[['column1', 'column2']]`
