Data Cleaning & Preparation
Description:
Data preparation involves data collection and data cleaning. When data comes from multiple
sources, some of it may be incorrect, mislabeled, or duplicated. Such errors lead to
unreliable machine learning models and incorrect results, so it is important to clean the
data and get it into a usable form beforehand. This article covers data cleaning using Pandas.
Data cleaning is the process of dealing with messy, disordered data by removing or correcting
incorrect, missing, or duplicated values in a dataset. It improves the quality and accuracy of
the data fed to the algorithms that will solve your data science problem.
Dataset:
https://www.kaggle.com/datasets/andrewmvd/occupation-salary-and-likelihood-of-automation
Load the dataset and display the first 5 rows [head function]
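A minimal sketch of this step follows. To keep it runnable without the download, it reads a small in-memory CSV instead of the Kaggle file; the column names and values are made up for illustration and are not taken from the real dataset. With the actual file, you would pass its path to read_csv instead.

```python
import io
import pandas as pd

# Hypothetical sample standing in for the downloaded CSV file.
# Column names and values are illustrative, not the real dataset's.
csv_text = """occupation,salary,automation_probability
Accountant,70000,0.94
Nurse,73000,0.01
Cashier,22000,0.97
Teacher,58000,0.02
Electrician,55000,0.15
Librarian,57000,0.65
"""

# With the real file, replace the StringIO object with the CSV's path.
df = pd.read_csv(io.StringIO(csv_text))

# head() returns the first 5 rows by default; head(n) returns the first n.
first_rows = df.head()
print(first_rows)
```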
Get information about the dataset [info function]
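As a rough illustration, info() prints the number of rows, each column's name, its non-null count, and its dtype. The small DataFrame below is a hypothetical stand-in for the dataset, with one missing salary so the non-null counts differ between columns.

```python
import pandas as pd

# Hypothetical sample data; one salary is missing on purpose.
df = pd.DataFrame({
    "occupation": ["Accountant", "Nurse", "Cashier", "Teacher"],
    "salary": [70000.0, None, 22000.0, 58000.0],
    "automation_probability": [0.94, 0.01, 0.97, 0.02],
})

# Prints a summary: index range, columns, non-null counts, dtypes, memory use.
df.info()

# A per-column count of missing values complements the info() summary.
missing_per_column = df.isna().sum()
print(missing_per_column)
```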
Find duplicated values in a DataFrame [duplicated function]
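One way to sketch this: duplicated() returns a boolean Series that is True for every row repeating an earlier row. The sample rows below are hypothetical and include one exact repeat.

```python
import pandas as pd

# Hypothetical sample data; the third row repeats the first.
df = pd.DataFrame({
    "occupation": ["Accountant", "Nurse", "Accountant", "Teacher"],
    "salary": [70000, 73000, 70000, 58000],
})

# True marks a row that is identical to an earlier row (keep="first" is the default).
dup_mask = df.duplicated()
print(dup_mask)

# Use the mask to view the duplicated rows and count them.
print(df[dup_mask])
print("Number of duplicate rows:", dup_mask.sum())
```

Passing subset=["occupation"] would compare only that column instead of whole rows.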
Drop duplicate values in a dataframe [drop_duplicates function]
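A short sketch of dropping duplicates, again on hypothetical sample rows. drop_duplicates() returns a new DataFrame and keeps the first occurrence of each repeated row by default; resetting the index afterwards is optional and is shown here only to keep row labels contiguous.

```python
import pandas as pd

# Hypothetical sample data; the third row repeats the first.
df = pd.DataFrame({
    "occupation": ["Accountant", "Nurse", "Accountant", "Teacher"],
    "salary": [70000, 73000, 70000, 58000],
})

# Remove repeated rows, keeping the first occurrence of each.
deduped = df.drop_duplicates().reset_index(drop=True)
print(deduped)
```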
In many cases, you may need to rename columns so they are easier to interpret [rename
function]
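Renaming can be sketched as below. Both the old and the new column names are hypothetical; substitute whatever names appear in your copy of the dataset. rename() takes a mapping from old names to new names and returns a new DataFrame, leaving the original unchanged.

```python
import pandas as pd

# Hypothetical sample data with terse column names.
df = pd.DataFrame({
    "occ": ["Accountant", "Nurse"],
    "sal": [70000, 73000],
    "prob": [0.94, 0.01],
})

# Map old column names to more descriptive ones.
renamed = df.rename(columns={
    "occ": "occupation",
    "sal": "salary",
    "prob": "automation_probability",
})
print(renamed.columns.tolist())
```

Names missing from the mapping are left as they are, so only the columns you list are renamed.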