Detailed Step-by-Step Data Cleaning in Python
for movies.csv
This document provides a detailed guide for cleaning a movie dataset (movies.csv)
using Python with Pandas and NumPy. Each step includes a purpose, actions,
considerations, and Python code. The dataset is assumed to have columns like
title, year, genre, rating, and release_date.
1 Load and Inspect the Dataset
1.1 Purpose
Load the dataset and examine its structure to identify issues like
missing values, incorrect data types, or anomalies.
1.2 Actions
• Load movies.csv using pd.read_csv().
• Inspect with head(), info(), describe(), and value_counts().
1.3 Considerations
• Ensure the file path is correct (use absolute paths if needed).
• Handle encoding errors (e.g., UTF-8, Latin1) or incorrect delimiters.
1.4 Code
import pandas as pd
import os

# Verify working directory
print("Current working directory:", os.getcwd())

# Load dataset with error handling
try:
    dataset = pd.read_csv('movies.csv', encoding='utf-8')
except FileNotFoundError:
    print("Error: 'movies.csv' not found. Please check the file path.")
    raise
except UnicodeDecodeError:
    dataset = pd.read_csv('movies.csv', encoding='latin1')

# Inspect dataset
print("First 5 rows:\n", dataset.head())
print("\nData types and missing values:")
dataset.info()  # info() prints directly and returns None, so don't wrap it in print()
print("\nSummary statistics:\n", dataset.describe())
print("\nUnique values in 'genre':\n", dataset['genre'].value_counts())
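The considerations above also mention incorrect delimiters. One way to handle an unknown separator is to sniff it with the standard library's csv.Sniffer before calling pd.read_csv(). A minimal sketch, using an in-memory sample string in place of the real file:

```python
import csv

# Toy sample standing in for the first lines of a semicolon-delimited file;
# in practice you would read a few KB from the real movies.csv
sample = "title;year;genre\nInception;2010;sci-fi\nHeat;1995;crime\n"

# Sniffer guesses the delimiter from the sample text
dialect = csv.Sniffer().sniff(sample)
print("Detected delimiter:", repr(dialect.delimiter))

# The result can then be passed along:
# dataset = pd.read_csv('movies.csv', sep=dialect.delimiter)
```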
2 Handle Missing Values
2.1 Purpose
Address missing data (NaN/None) to prevent errors in analysis or
modeling.
2.2 Actions
• Identify missing values with isna().sum().
• Drop rows with missing title using dropna().
• Impute rating with mean and genre with mode.
2.3 Considerations
• Dropping rows reduces data size; impute non-critical columns.
• Check for hidden missing values (e.g., ”N/A” or empty strings).
2.4 Code
# Check missing values
print("Missing values:\n", dataset.isna().sum())

# Replace common placeholders with NaN
dataset = dataset.replace(['N/A', '', 'None'], pd.NA)

# Drop rows with missing 'title'
dataset = dataset.dropna(subset=['title'])

# Impute missing 'rating' with mean (assign back; inplace fillna on a
# column is deprecated in recent pandas)
dataset['rating'] = dataset['rating'].fillna(dataset['rating'].mean())

# Impute missing 'genre' with mode
dataset['genre'] = dataset['genre'].fillna(dataset['genre'].mode()[0])

print("Missing values after handling:\n", dataset.isna().sum())
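If the rating distribution is skewed by a few extreme values, the median is a more robust imputation choice than the mean. A small sketch on a toy frame (the column names mirror the assumed dataset):

```python
import pandas as pd

# Toy stand-in for the loaded dataset
dataset = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'rating': [2.0, 3.0, None, 9.5],
})

# Median is unaffected by the outlying 9.5, unlike the mean
median_rating = dataset['rating'].median()
dataset['rating'] = dataset['rating'].fillna(median_rating)
print(dataset['rating'].tolist())  # the NaN becomes 3.0
```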
3 Remove Duplicates
3.1 Purpose
Eliminate duplicate rows to avoid bias in analysis or modeling.
3.2 Actions
• Identify duplicates with duplicated().sum().
• Remove duplicates based on title and year.
3.3 Considerations
• Decide whether to keep the first, last, or no duplicates.
• Duplicates may arise from multiple entries for the same movie.
3.4 Code
# Check for duplicates
print("Number of duplicate rows:", dataset.duplicated().sum())

# Remove duplicates based on 'title' and 'year'
dataset = dataset.drop_duplicates(subset=['title', 'year'], keep='first')

print("Dataset shape after removing duplicates:", dataset.shape)
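Before choosing keep='first' versus keep='last', it can help to inspect the conflicting rows. With keep=False, duplicated() flags every member of a duplicate group. A sketch on a toy frame:

```python
import pandas as pd

# Toy frame with two entries for the same movie but different ratings
dataset = pd.DataFrame({
    'title': ['Inception', 'Inception', 'Heat'],
    'year': [2010, 2010, 1995],
    'rating': [8.8, 8.7, 8.3],
})

# keep=False marks all rows in each duplicate group for review
dupes = dataset[dataset.duplicated(subset=['title', 'year'], keep=False)]
print(dupes)  # both 'Inception' rows, so the conflict is visible
```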
4 Correct Data Types
4.1 Purpose
Ensure columns have appropriate data types for analysis (e.g., numerical,
datetime).
4.2 Actions
• Check data types with info().
• Convert year to a nullable integer and release_date to datetime.
4.3 Considerations
• Handle conversion errors with errors=’coerce’.
• Verify conversions to ensure correctness.
4.4 Code
# Check data types
print("Data types before conversion:\n", dataset.dtypes)

# Convert 'year' to a nullable integer, coercing bad values to NaN first
# (a plain astype('Int64') would raise on non-numeric entries)
dataset['year'] = pd.to_numeric(dataset['year'], errors='coerce').astype('Int64')

# Convert 'release_date' to datetime
dataset['release_date'] = pd.to_datetime(dataset['release_date'], errors='coerce')

print("Data types after conversion:\n", dataset.dtypes)
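Because errors='coerce' silently turns unparseable values into NaT/NaN, it is worth counting how many rows failed to convert as part of verifying the conversion. A sketch on a toy series:

```python
import pandas as pd

# Toy release-date column with one unparseable entry
raw = pd.Series(['2010-07-16', 'not a date', '1995-12-15'])
converted = pd.to_datetime(raw, errors='coerce')

# Rows that failed to parse become NaT; count them to spot silent data loss
n_failed = converted.isna().sum()
print("Unparseable dates:", n_failed)  # 1
```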
5 Handle Inconsistent Data
5.1 Purpose
Fix inconsistencies in categorical or text data (e.g., typos, mixed
cases).
5.2 Actions
• Standardize genre with str.lower() and str.strip().
• Correct specific inconsistencies with replace().
5.3 Considerations
• Check unique values to identify inconsistencies.
• Use a mapping dictionary for common corrections.
5.4 Code
# Check unique values in 'genre'
print("Unique genres before cleaning:\n", dataset['genre'].value_counts())

# Standardize 'genre'
dataset['genre'] = dataset['genre'].str.lower().str.strip()

# Correct common misspellings (strip() above already handles stray spaces)
genre_mapping = {'scifi': 'sci-fi', 'dram': 'drama'}
dataset['genre'] = dataset['genre'].replace(genre_mapping)

print("Unique genres after cleaning:\n", dataset['genre'].value_counts())
6 Handle Outliers
6.1 Purpose
Identify and address extreme values in numerical columns that may
skew analysis.
6.2 Actions
• Use the Interquartile Range (IQR) method to detect outliers
in rating.
• Remove outliers from the dataset.
6.3 Considerations
• Validate outliers (e.g., extreme rating values).
• Consider capping outliers instead of removing them.
6.4 Code
# Calculate IQR for 'rating'
Q1, Q3 = dataset['rating'].quantile([0.25, 0.75])
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

print("Outliers in 'rating':\n",
      dataset[(dataset['rating'] < lower_bound) | (dataset['rating'] > upper_bound)])

# Remove outliers
dataset = dataset[(dataset['rating'] >= lower_bound) & (dataset['rating'] <= upper_bound)]

print("Summary statistics after removing outliers:\n", dataset.describe())
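The considerations above mention capping outliers instead of removing them. pandas clip() does exactly that: values beyond the IQR fences are pulled to the bounds, so no rows are lost. A sketch on a toy ratings column:

```python
import pandas as pd

# Toy ratings with one low and one high outlier
dataset = pd.DataFrame({'rating': [1.0, 6.5, 7.0, 7.5, 8.0, 20.0]})

# Same IQR fences as the removal approach
Q1, Q3 = dataset['rating'].quantile([0.25, 0.75])
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# clip() caps values at the bounds instead of dropping the rows
dataset['rating'] = dataset['rating'].clip(lower_bound, upper_bound)
print(dataset['rating'].tolist())
```

Capping preserves the sample size, which matters when the dataset is small or the extreme values are legitimate but influential.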
7 Encode Categorical Variables
7.1 Purpose
Convert categorical columns to numerical formats for analysis or
modeling.
7.2 Actions
• Use get_dummies() for one-hot encoding of genre.
7.3 Considerations
• One-hot encoding increases dimensionality for high-cardinality
columns.
• Use label encoding for ordinal categories if applicable.
7.4 Code
# One-hot encode 'genre'
dataset = pd.get_dummies(dataset, columns=['genre'], prefix='genre')

print("Dataset with encoded genres:\n", dataset.head())
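For ordinal categories, the considerations above suggest label encoding instead. An explicit mapping with map() preserves the category order, which one-hot encoding discards. A sketch using a hypothetical 'certificate' column (not part of the assumed movies.csv):

```python
import pandas as pd

# 'certificate' is a hypothetical ordinal column for illustration only
dataset = pd.DataFrame({'certificate': ['G', 'PG', 'R', 'PG']})

# An explicit mapping encodes the intended order G < PG < R
order = {'G': 0, 'PG': 1, 'R': 2}
dataset['certificate_code'] = dataset['certificate'].map(order)
print(dataset['certificate_code'].tolist())  # [0, 1, 2, 1]
```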
8 Clean Text Data
8.1 Purpose
Standardize text columns by removing unwanted characters or extra
spaces.
8.2 Actions
• Remove special characters from title using regular expressions.
• Remove leading/trailing spaces with str.strip().
8.3 Considerations
• Ensure cleaning preserves meaningful data.
• Handle edge cases (e.g., empty strings after cleaning).
8.4 Code
import re

# Clean 'title' by removing special characters
dataset['title'] = dataset['title'].apply(lambda x: re.sub(r'[^a-zA-Z0-9\s]', '', str(x)))

# Remove extra spaces
dataset['title'] = dataset['title'].str.strip()

print("Cleaned titles:\n", dataset['title'].head())
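The edge case noted above — titles that become empty after cleaning — can be handled by treating empty or whitespace-only strings as missing and dropping them. A sketch on a toy frame:

```python
import pandas as pd

# Toy titles: one valid, one empty, one whitespace-only after cleaning
dataset = pd.DataFrame({'title': ['Up', '', '   ']})

# Strip, convert empties to missing, then drop them
dataset['title'] = dataset['title'].str.strip().replace('', pd.NA)
dataset = dataset.dropna(subset=['title'])
print(dataset['title'].tolist())  # ['Up']
```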
9 Filter Irrelevant Data
9.1 Purpose
Remove rows or columns irrelevant to the analysis (e.g., old movies).
9.2 Actions
• Filter movies from 2000 or later.
• Drop irrelevant columns like comments.
9.3 Considerations
• Define relevance based on analysis goals.
• Use errors=’ignore’ to avoid errors if columns are missing.
9.4 Code
# Filter movies from 2000 or later
dataset = dataset[dataset['year'] >= 2000]

# Drop irrelevant columns
dataset = dataset.drop(columns=['comments'], errors='ignore')

print("Filtered dataset:\n", dataset.head())
10 Validate and Save
10.1 Purpose
Verify the cleaned dataset and save it for further use.
10.2 Actions
• Re-check missing values, data types, and duplicates.
• Save to movies_cleaned.csv.
10.3 Considerations
• Ensure no new issues were introduced.
• Set index=False to avoid saving the index.
10.4 Code
# Validate dataset
print("Final data types:\n", dataset.dtypes)
print("Final missing values:\n", dataset.isna().sum())
print("Final duplicates:", dataset.duplicated().sum())

# Save cleaned dataset
dataset.to_csv('movies_cleaned.csv', index=False)
print("Cleaned dataset saved as 'movies_cleaned.csv'")
11 Additional Notes
• Ensure movies.csv is in the working directory or provide the
full path.
• Install required libraries: pip install pandas numpy.
• Adjust steps based on specific dataset issues (e.g., unique
columns, formats).
• Use os.getcwd() to verify the working directory.
• Visualize cleaning progress (e.g., missing values) with plots
if needed.