Detailed Step-by-Step Data Cleaning in Python
for movies.csv
This document provides a detailed guide for cleaning a movie dataset (movies.csv)
using Python with Pandas and NumPy. Each step includes a purpose, actions,
considerations, and Python code. The dataset is assumed to have columns like
title, year, genre, rating, and release_date.
1 Load and Inspect the Dataset
1.1 Purpose
Load the dataset and examine its structure to identify issues like
missing values, incorrect data types, or anomalies.
1.2 Actions
• Load movies.csv using pd.read_csv().
• Inspect with head(), info(), describe(), and value_counts().
1.3 Considerations
• Ensure the file path is correct (use absolute paths if needed).
• Handle encoding errors (e.g., UTF-8, Latin1) or incorrect delimiters.
1.4 Code
import pandas as pd
import os

# Verify working directory
print("Current working directory:", os.getcwd())

# Load dataset with error handling
try:
    dataset = pd.read_csv('movies.csv', encoding='utf-8')
except FileNotFoundError:
    print("Error: 'movies.csv' not found. Please check the file path.")
    raise
except UnicodeDecodeError:
    dataset = pd.read_csv('movies.csv', encoding='latin1')

# Inspect dataset
print("First 5 rows:\n", dataset.head())
print("\nData types and missing values:")
dataset.info()  # info() prints directly and returns None, so don't wrap it in print()
print("\nSummary statistics:\n", dataset.describe())
print("\nUnique values in 'genre':\n", dataset['genre'].value_counts())
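The considerations above also mention incorrect delimiters. One way to handle an unknown separator is to sniff it with the standard library's csv.Sniffer before calling pd.read_csv(). A minimal sketch, using an in-memory sample string in place of the real file:

```python
import csv

# Toy sample standing in for the first lines of a semicolon-delimited file;
# in practice you would read a few KB from the real movies.csv
sample = "title;year;genre\nInception;2010;sci-fi\nHeat;1995;crime\n"

# Sniffer guesses the delimiter from the sample text
dialect = csv.Sniffer().sniff(sample)
print("Detected delimiter:", repr(dialect.delimiter))

# The result can then be passed along:
# dataset = pd.read_csv('movies.csv', sep=dialect.delimiter)
```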
2 Handle Missing Values
2.1 Purpose
Address missing data (NaN/None) to prevent errors in analysis or
modeling.
2.2 Actions
• Identify missing values with isna().sum().
• Drop rows with missing title using dropna().
• Impute rating with mean and genre with mode.
2.3 Considerations
• Dropping rows reduces data size; impute non-critical columns.
• Check for hidden missing values (e.g., ”N/A” or empty strings).
2.4 Code
# Check missing values
print("Missing values:\n", dataset.isna().sum())

# Replace common placeholders with NaN
dataset = dataset.replace(['N/A', '', 'None'], pd.NA)

# Drop rows with missing 'title'
dataset = dataset.dropna(subset=['title'])

# Impute missing 'rating' with mean (assign back; inplace fillna on a
# column is deprecated in recent pandas)
dataset['rating'] = dataset['rating'].fillna(dataset['rating'].mean())

# Impute missing 'genre' with mode
dataset['genre'] = dataset['genre'].fillna(dataset['genre'].mode()[0])

print("Missing values after handling:\n", dataset.isna().sum())
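If the rating distribution is skewed by a few extreme values, the median is a more robust imputation choice than the mean. A small sketch on a toy frame (the column names mirror the assumed dataset):

```python
import pandas as pd

# Toy stand-in for the loaded dataset
dataset = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'rating': [2.0, 3.0, None, 9.5],
})

# Median is unaffected by the outlying 9.5, unlike the mean
median_rating = dataset['rating'].median()
dataset['rating'] = dataset['rating'].fillna(median_rating)
print(dataset['rating'].tolist())  # the NaN becomes 3.0
```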
3 Remove Duplicates
3.1 Purpose
Eliminate duplicate rows to avoid bias in analysis or modeling.
3.2 Actions
• Identify duplicates with duplicated().sum().
• Remove duplicates based on title and year.
3.3 Considerations
• Decide whether to keep the first, last, or no duplicates.
• Duplicates may arise from multiple entries for the same movie.
3.4 Code
# Check for duplicates
print("Number of duplicate rows:", dataset.duplicated().sum())

# Remove duplicates based on 'title' and 'year'
dataset = dataset.drop_duplicates(subset=['title', 'year'], keep='first')

print("Dataset shape after removing duplicates:", dataset.shape)
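Before choosing keep='first' versus keep='last', it can help to inspect the conflicting rows. With keep=False, duplicated() flags every member of a duplicate group. A sketch on a toy frame:

```python
import pandas as pd

# Toy frame with two entries for the same movie but different ratings
dataset = pd.DataFrame({
    'title': ['Inception', 'Inception', 'Heat'],
    'year': [2010, 2010, 1995],
    'rating': [8.8, 8.7, 8.3],
})

# keep=False marks all rows in each duplicate group for review
dupes = dataset[dataset.duplicated(subset=['title', 'year'], keep=False)]
print(dupes)  # both 'Inception' rows, so the conflict is visible
```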
4 Correct Data Types
4.1 Purpose
Ensure columns have appropriate data types for analysis (e.g., numerical,
datetime).
4.2 Actions
• Check data types with info().
• Convert year to a nullable integer and release_date to datetime.
4.3 Considerations
• Handle conversion errors with errors=’coerce’.
• Verify conversions to ensure correctness.
4.4 Code
# Check data types
print("Data types before conversion:\n", dataset.dtypes)

# Convert 'year' to a nullable integer, coercing bad values to NaN first
# (a plain astype('Int64') would raise on non-numeric entries)
dataset['year'] = pd.to_numeric(dataset['year'], errors='coerce').astype('Int64')

# Convert 'release_date' to datetime
dataset['release_date'] = pd.to_datetime(dataset['release_date'], errors='coerce')

print("Data types after conversion:\n", dataset.dtypes)
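Because errors='coerce' silently turns unparseable values into NaT/NaN, it is worth counting how many rows failed to convert as part of verifying the conversion. A sketch on a toy series:

```python
import pandas as pd

# Toy release-date column with one unparseable entry
raw = pd.Series(['2010-07-16', 'not a date', '1995-12-15'])
converted = pd.to_datetime(raw, errors='coerce')

# Rows that failed to parse become NaT; count them to spot silent data loss
n_failed = converted.isna().sum()
print("Unparseable dates:", n_failed)  # 1
```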
5 Handle Inconsistent Data
5.1 Purpose
Fix inconsistencies in categorical or text data (e.g., typos, mixed
cases).
5.2 Actions
• Standardize genre with str.lower() and str.strip().
• Correct specific inconsistencies with replace().
5.3 Considerations
• Check unique values to identify inconsistencies.
• Use a mapping dictionary for common corrections.
5.4 Code
# Check unique values in 'genre'
print("Unique genres before cleaning:\n", dataset['genre'].value_counts())

# Standardize 'genre'
dataset['genre'] = dataset['genre'].str.lower().str.strip()

# Correct common misspellings (strip() above already handles stray spaces)
genre_mapping = {'scifi': 'sci-fi', 'dram': 'drama'}
dataset['genre'] = dataset['genre'].replace(genre_mapping)

print("Unique genres after cleaning:\n", dataset['genre'].value_counts())
6 Handle Outliers
6.1 Purpose
Identify and address extreme values in numerical columns that may
skew analysis.
6.2 Actions
• Use the Interquartile Range (IQR) method to detect outliers
in rating.
• Remove outliers from the dataset.
6.3 Considerations
• Validate outliers (e.g., extreme rating values).
• Consider capping outliers instead of removing them.
6.4 Code
# Calculate IQR for 'rating'
Q1, Q3 = dataset['rating'].quantile([0.25, 0.75])
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

print("Outliers in 'rating':\n",
      dataset[(dataset['rating'] < lower_bound) | (dataset['rating'] > upper_bound)])

# Remove outliers
dataset = dataset[(dataset['rating'] >= lower_bound) & (dataset['rating'] <= upper_bound)]

print("Summary statistics after removing outliers:\n", dataset.describe())
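The considerations above mention capping outliers instead of removing them. pandas clip() does exactly that: values beyond the IQR fences are pulled to the bounds, so no rows are lost. A sketch on a toy ratings column:

```python
import pandas as pd

# Toy ratings with one low and one high outlier
dataset = pd.DataFrame({'rating': [1.0, 6.5, 7.0, 7.5, 8.0, 20.0]})

# Same IQR fences as the removal approach
Q1, Q3 = dataset['rating'].quantile([0.25, 0.75])
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# clip() caps values at the bounds instead of dropping the rows
dataset['rating'] = dataset['rating'].clip(lower_bound, upper_bound)
print(dataset['rating'].tolist())
```

Capping preserves the sample size, which matters when the dataset is small or the extreme values are legitimate but influential.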
7 Encode Categorical Variables
7.1 Purpose
Convert categorical columns to numerical formats for analysis or
modeling.
7.2 Actions
• Use get_dummies() for one-hot encoding of genre.
7.3 Considerations
• One-hot encoding increases dimensionality for high-cardinality
columns.
• Use label encoding for ordinal categories if applicable.
7.4 Code
# One-hot encode 'genre'
dataset = pd.get_dummies(dataset, columns=['genre'], prefix='genre')

print("Dataset with encoded genres:\n", dataset.head())
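For ordinal categories, the considerations above suggest label encoding instead. An explicit mapping with map() preserves the category order, which one-hot encoding discards. A sketch using a hypothetical 'certificate' column (not part of the assumed movies.csv):

```python
import pandas as pd

# 'certificate' is a hypothetical ordinal column for illustration only
dataset = pd.DataFrame({'certificate': ['G', 'PG', 'R', 'PG']})

# An explicit mapping encodes the intended order G < PG < R
order = {'G': 0, 'PG': 1, 'R': 2}
dataset['certificate_code'] = dataset['certificate'].map(order)
print(dataset['certificate_code'].tolist())  # [0, 1, 2, 1]
```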
8 Clean Text Data
8.1 Purpose
Standardize text columns by removing unwanted characters or extra
spaces.
8.2 Actions
• Remove special characters from title using regular expressions.
• Remove leading/trailing spaces with str.strip().
8.3 Considerations
• Ensure cleaning preserves meaningful data.
• Handle edge cases (e.g., empty strings after cleaning).
8.4 Code
import re

# Clean 'title' by removing special characters
dataset['title'] = dataset['title'].apply(lambda x: re.sub(r'[^a-zA-Z0-9\s]', '', str(x)))

# Remove extra spaces
dataset['title'] = dataset['title'].str.strip()

print("Cleaned titles:\n", dataset['title'].head())
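The edge case noted above — titles that become empty after cleaning — can be handled by treating empty or whitespace-only strings as missing and dropping them. A sketch on a toy frame:

```python
import pandas as pd

# Toy titles: one valid, one empty, one whitespace-only after cleaning
dataset = pd.DataFrame({'title': ['Up', '', '   ']})

# Strip, convert empties to missing, then drop them
dataset['title'] = dataset['title'].str.strip().replace('', pd.NA)
dataset = dataset.dropna(subset=['title'])
print(dataset['title'].tolist())  # ['Up']
```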
9 Filter Irrelevant Data
9.1 Purpose
Remove rows or columns irrelevant to the analysis (e.g., old movies).
9.2 Actions
• Filter movies from 2000 or later.
• Drop irrelevant columns like comments.
9.3 Considerations
• Define relevance based on analysis goals.
• Use errors=’ignore’ to avoid errors if columns are missing.
9.4 Code
# Filter movies from 2000 or later
dataset = dataset[dataset['year'] >= 2000]

# Drop irrelevant columns
dataset = dataset.drop(columns=['comments'], errors='ignore')

print("Filtered dataset:\n", dataset.head())
10 Validate and Save
10.1 Purpose
Verify the cleaned dataset and save it for further use.
10.2 Actions
• Re-check missing values, data types, and duplicates.
• Save to movies_cleaned.csv.
10.3 Considerations
• Ensure no new issues were introduced.
• Set index=False to avoid saving the index.
10.4 Code
# Validate dataset
print("Final data types:\n", dataset.dtypes)
print("Final missing values:\n", dataset.isna().sum())
print("Final duplicates:", dataset.duplicated().sum())

# Save cleaned dataset
dataset.to_csv('movies_cleaned.csv', index=False)
print("Cleaned dataset saved as 'movies_cleaned.csv'")
11 Additional Notes
• Ensure movies.csv is in the working directory or provide the
full path.
• Install required libraries: pip install pandas numpy.
• Adjust steps based on specific dataset issues (e.g., unique
columns, formats).
• Use os.getcwd() to verify the working directory.
• Visualize cleaning progress (e.g., missing values) with plots
if needed.