Hands-on Data Preprocessing in Python

Uploaded by Shahmir Yousaf

1. Loading and Inspecting the Data

• Load a dataset from a CSV file.

• Display the first and last 10 rows of the dataset.

• Identify the data types of each column.

• Count the number of rows and columns in the dataset.

• Check for null (missing) values and count them in each column.
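The inspection tasks above could be sketched with pandas as follows. The CSV text and its column names (name, age, city) are made up for illustration. With a real file you would pass its path to pd.read_csv instead of the in-memory buffer.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a real file on disk.
csv_text = """name,age,city
Ali,25,Lahore
Sara,,Karachi
Omar,31,
Zara,28,Lahore
"""

df = pd.read_csv(io.StringIO(csv_text))

first_rows = df.head(10)          # first 10 rows (fewer if the frame is shorter)
last_rows = df.tail(10)           # last 10 rows
dtypes = df.dtypes                # data type of each column
n_rows, n_cols = df.shape         # number of rows and columns
null_counts = df.isnull().sum()   # missing values per column

print(first_rows)
print(dtypes)
print(n_rows, n_cols)
print(null_counts)
```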

2. Handling Missing Values

• Replace missing numerical values with the mean of the respective column.

• Replace missing numerical values with the median of the respective column.

• Replace missing numerical values with a constant value of your choice.

• Drop rows with missing values.

• Fill missing categorical values with the most frequent value in the column.
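One possible way to try each strategy side by side, assuming a small made-up frame with a numeric column (salary) and a categorical column (dept). Each strategy is stored in its own variable so the results can be compared. In practice you would pick one per column.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing number and one missing category.
df = pd.DataFrame({
    "salary": [100.0, np.nan, 300.0, 800.0],
    "dept": ["hr", "it", None, "it"],
})

mean_filled = df["salary"].fillna(df["salary"].mean())      # fill with column mean
median_filled = df["salary"].fillna(df["salary"].median())  # fill with column median
const_filled = df["salary"].fillna(0)                       # fill with a chosen constant
dropped = df.dropna()                                       # drop rows with any missing value
mode_filled = df["dept"].fillna(df["dept"].mode()[0])       # fill with most frequent value
```

The mean and median skip missing entries by default, so they are computed from the observed values only.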

3. Data Cleaning

• Remove duplicate rows from the dataset.

• Drop unnecessary columns from the dataset.

• Rename columns to have consistent naming conventions.

• Standardize text data to lowercase or uppercase.

• Remove leading and trailing whitespace from text columns.
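A minimal sketch of the cleaning steps, assuming a made-up frame whose column names and values are deliberately messy. The snake_case naming convention and the choice of lowercase are illustrative assumptions, not requirements of the exercise.

```python
import pandas as pd

# Hypothetical messy data: a duplicate row, an unneeded column,
# inconsistent column names, mixed case, and stray whitespace.
df = pd.DataFrame({
    "First Name": ["  Ali ", "  Ali ", "SARA"],
    "Temp ID": [1, 1, 2],
    "City": ["Lahore", "Lahore", " karachi"],
})

clean = df.drop_duplicates()                  # remove duplicate rows
clean = clean.drop(columns=["Temp ID"])       # drop an unnecessary column
clean = clean.rename(                         # consistent snake_case names
    columns=lambda c: c.strip().lower().replace(" ", "_")
)

# Trim whitespace and lowercase every text column.
for col in ["first_name", "city"]:
    clean[col] = clean[col].str.strip().str.lower()

print(clean)
```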

4. Encoding Categorical Data

• Convert categorical variables into numeric form using:

o One-hot encoding.


o Label encoding.

o Mapping (e.g., Male = 0, Female = 1).

• Handle high-cardinality categorical columns (more than 10 unique values).
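The encoding options can be sketched with pandas alone. The data is made up, and the label encoding here uses pandas category codes rather than a dedicated encoder class. For the many-categories case, one common approach (an assumption, not the only option) is to keep the most frequent values and group the rest under "Other". The cutoff of one category below is only to suit the tiny example.

```python
import pandas as pd

# Hypothetical data.
df = pd.DataFrame({
    "gender": ["Male", "Female", "Female", "Male"],
    "city": ["Lahore", "Karachi", "Lahore", "Multan"],
})

# One-hot encoding: one indicator column per category.
one_hot = pd.get_dummies(df["city"], prefix="city")

# Label encoding: integer code per category (categories sorted alphabetically).
label_codes = df["city"].astype("category").cat.codes

# Explicit mapping.
gender_num = df["gender"].map({"Male": 0, "Female": 1})

# Many categories: keep the top-N most frequent, lump the rest together.
top = df["city"].value_counts().nlargest(1).index
city_reduced = df["city"].where(df["city"].isin(top), "Other")
```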

5. Feature Scaling

• Normalize numerical columns to a range of [0, 1].

• Standardize numerical columns to have a mean of 0 and a standard deviation of 1.

• Apply min-max scaling to numerical columns.
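Normalizing to [0, 1] and min-max scaling are the same operation, so the sketch below shows two formulas: min-max scaling and standardization. It is written in plain pandas on made-up columns and assumes no column is constant, since that would make the denominator zero.

```python
import pandas as pd

# Hypothetical numeric data.
df = pd.DataFrame({
    "age": [20.0, 30.0, 40.0],
    "income": [1000.0, 2000.0, 6000.0],
})

# Min-max scaling: (x - min) / (max - min), giving values in [0, 1].
min_max = (df - df.min()) / (df.max() - df.min())

# Standardization: (x - mean) / std, giving mean 0 and standard deviation 1.
# ddof=0 uses the population standard deviation.
standardized = (df - df.mean()) / df.std(ddof=0)

print(min_max)
print(standardized)
```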

6. Outlier Detection and Handling

• Identify outliers in numerical columns using:

o Interquartile Range (IQR).

o Z-score.

• Remove rows with outliers.

• Cap or floor outliers to a maximum or minimum threshold.
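A sketch of both detection rules on a made-up series with one extreme value. The 1.5 × IQR fences are the conventional choice. The z-score cutoff of 2.5 is an assumption chosen for this tiny sample.

```python
import pandas as pd

# Hypothetical values with one obvious outlier (100).
s = pd.Series([10, 12, 11, 13, 12, 11, 10, 13, 12, 100], dtype="float64")

# IQR rule: flag values beyond 1.5 * IQR from the quartiles.
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = (s < lower) | (s > upper)

# Z-score rule. With 10 values and the population standard deviation,
# |z| cannot exceed sqrt(10 - 1) = 3, so a cutoff of 3 would flag nothing here.
z = (s - s.mean()) / s.std(ddof=0)
z_outliers = z.abs() > 2.5

without_outliers = s[~iqr_outliers]          # remove rows with outliers
capped = s.clip(lower=lower, upper=upper)    # cap/floor to the IQR fences
```

The z-score rule is sensitive to the outliers themselves, because they inflate the mean and standard deviation. This is one reason the IQR rule is often preferred on small or skewed samples.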

7. Feature Engineering

• Create new features based on existing columns (e.g., age groups, salary ranges).

• Combine multiple columns into one (e.g., full name from first and last name).

• Extract information from columns (e.g., extracting year from a date column).

• Calculate summary statistics for groups (e.g., average salary by gender).
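The four examples in this section might look like this in pandas. All column names, the age-bin edges, and the group labels are illustrative assumptions.

```python
import pandas as pd

# Hypothetical data.
df = pd.DataFrame({
    "first_name": ["Ali", "Sara", "Omar"],
    "last_name": ["Khan", "Malik", "Raza"],
    "age": [17, 34, 70],
    "gender": ["M", "F", "M"],
    "salary": [1000, 3000, 2000],
    "joined": ["2020-01-15", "2021-06-30", "2019-11-01"],
})

# New feature from an existing column: bin ages into groups.
# Bins are right-inclusive: (0, 18], (18, 65], (65, 120].
df["age_group"] = pd.cut(
    df["age"], bins=[0, 18, 65, 120], labels=["minor", "adult", "senior"]
)

# Combine columns.
df["full_name"] = df["first_name"] + " " + df["last_name"]

# Extract information: year from a date string.
df["join_year"] = pd.to_datetime(df["joined"]).dt.year

# Group summary: average salary by gender.
avg_salary = df.groupby("gender")["salary"].mean()
print(avg_salary)
```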

8. Data Transformation

• Log-transform skewed numerical columns.

• Apply square-root transformation to reduce the impact of large values.

• Normalize text data by removing special characters.


• Split a column into multiple columns (e.g., splitting a full name into first and last names).
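A possible sketch of these transformations on made-up data. It uses log1p, which computes log(1 + x), instead of a plain log so that zero values stay defined. That substitution, and the definition of "special characters" as anything other than letters, digits, and whitespace, are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical data.
df = pd.DataFrame({
    "income": [0.0, 9.0, 99.0, 9999.0],
    "comment": ["Good!!", "so-so...", "#bad", "ok"],
    "full_name": ["Ali Khan", "Sara Malik", "Omar Raza", "Zara Ahmed"],
})

df["income_log"] = np.log1p(df["income"])    # log(1 + x), safe at x = 0
df["income_sqrt"] = np.sqrt(df["income"])    # milder compression than log

# Keep only letters, digits, and whitespace.
df["comment_clean"] = df["comment"].str.replace(
    r"[^A-Za-z0-9\s]", "", regex=True
)

# Split on the first space only, so any extra words stay in the last name.
df[["first_name", "last_name"]] = df["full_name"].str.split(
    " ", n=1, expand=True
)

print(df)
```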

9. Working with Date/Time Data

• Convert a column to datetime format.

• Extract year, month, and day from a date column.

• Calculate the difference in days between two date columns.

• Group data by time periods (e.g., monthly or yearly).
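These date tasks could be sketched as follows, assuming two made-up date columns (order_date, ship_date) stored as ISO-formatted strings.

```python
import pandas as pd

# Hypothetical data with dates stored as text.
df = pd.DataFrame({
    "order_date": ["2024-01-05", "2024-01-20", "2024-02-10"],
    "ship_date": ["2024-01-08", "2024-01-25", "2024-02-11"],
    "amount": [100, 50, 70],
})

# Convert text to datetime.
df["order_date"] = pd.to_datetime(df["order_date"])
df["ship_date"] = pd.to_datetime(df["ship_date"])

# Extract components.
df["year"] = df["order_date"].dt.year
df["month"] = df["order_date"].dt.month
df["day"] = df["order_date"].dt.day

# Difference in days between two date columns.
df["days_to_ship"] = (df["ship_date"] - df["order_date"]).dt.days

# Group by calendar month and total the amounts.
monthly = df.groupby(df["order_date"].dt.to_period("M"))["amount"].sum()
print(monthly)
```

Using to_period("Y") in place of "M" would give yearly groups.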

10. Splitting and Exporting Data

• Split the dataset into training and testing sets.

• Save the cleaned dataset to a new CSV file.

• Save specific columns or subsets of the dataset to a file.
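A minimal sketch of splitting and exporting, using a pandas-only random split in place of a dedicated helper such as scikit-learn's train_test_split. The 80/20 ratio, the file names, and the temporary output directory are all assumptions.

```python
import os
import tempfile
import pandas as pd

# Hypothetical cleaned dataset.
df = pd.DataFrame({"x": range(10), "y": [i % 2 for i in range(10)]})

# Random 80/20 split; the test set is every row not sampled for training.
train = df.sample(frac=0.8, random_state=42)
test = df.drop(train.index)

# Write to a temporary directory so the example leaves no files behind
# in the working directory.
out_dir = tempfile.mkdtemp()
full_path = os.path.join(out_dir, "cleaned.csv")
subset_path = os.path.join(out_dir, "subset.csv")

df.to_csv(full_path, index=False)                             # whole dataset
df.loc[df["y"] == 1, ["x"]].to_csv(subset_path, index=False)  # rows and columns subset

reloaded = pd.read_csv(full_path)
print(reloaded.shape)
```

Passing index=False keeps the row index out of the file, which avoids an extra unnamed column when the CSV is read back.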

Additional Challenges

• Handle imbalanced datasets by oversampling or undersampling.

• Detect and correct inconsistent data (e.g., inconsistent spellings in text columns).

• Identify and remove columns with high correlation (redundant features).

• Visualize missing data and outliers in the dataset.
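The first three challenges can be approached in plain pandas, as in the sketch below. The data, the spelling map, and the 0.9 correlation threshold are assumptions. Visualization is left out because it depends on the plotting library you choose.

```python
import numpy as np
import pandas as pd

# Hypothetical data: imbalanced label, inconsistent city spellings,
# and a redundant feature (height in inches duplicates height in cm).
df = pd.DataFrame({
    "label": [0, 0, 0, 0, 0, 0, 1, 1],
    "city": ["Lahore", "lahore ", "LAHORE", "Karachi",
             "karachi", "Karachi", "Lahore", "Khi"],
    "height_cm": [150.0, 160.0, 170.0, 180.0, 165.0, 175.0, 155.0, 185.0],
    "weight_kg": [70.0, 52.0, 65.0, 80.0, 58.0, 61.0, 77.0, 66.0],
})
df["height_in"] = df["height_cm"] / 2.54

# Random oversampling: resample the minority class with replacement
# until both classes have the same count.
counts = df["label"].value_counts()
n_extra = int(counts.max() - counts.min())
extra = df[df["label"] == counts.idxmin()].sample(
    n=n_extra, replace=True, random_state=0
)
balanced = pd.concat([df, extra], ignore_index=True)

# Inconsistent spellings: normalize case and whitespace, then map known variants.
df["city_clean"] = (
    df["city"].str.strip().str.lower().replace({"khi": "karachi"})
)

# Redundant features: look at each pair once (upper triangle of the
# correlation matrix) and drop the later column of any highly correlated pair.
corr = df[["height_cm", "weight_kg", "height_in"]].corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.9).any()]
reduced = df.drop(columns=to_drop)
```

Oversampling should normally be applied to the training split only, so that duplicated rows do not leak into the test set.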

Instructions for Students

1. Complete each task on the provided dataset or any dataset of your choice.

2. Document the steps taken for each task.

3. Submit a cleaned dataset and a summary of the preprocessing steps performed.

Intro to ML by Sir Asif Ahsan
