
Detailed Step-by-Step Data Cleaning in Python for movies.csv

This document provides a detailed guide for cleaning a movie dataset (movies.csv)
using Python with Pandas and NumPy. Each step includes a purpose, actions,
considerations, and Python code. The dataset is assumed to have columns like
title, year, genre, rating, and release_date.

1 Load and Inspect the Dataset

1.1 Purpose

Load the dataset and examine its structure to identify issues like
missing values, incorrect data types, or anomalies.

1.2 Actions

• Load movies.csv using pd.read_csv().
• Inspect with head(), info(), describe(), and value_counts().

1.3 Considerations

• Ensure the file path is correct (use absolute paths if needed).


• Handle encoding errors (e.g., UTF-8, Latin1) or incorrect delimiters.

1.4 Code
import pandas as pd
import os

# Verify working directory
print("Current working directory:", os.getcwd())

# Load dataset with error handling
try:
    dataset = pd.read_csv('movies.csv', encoding='utf-8')
except FileNotFoundError:
    print("Error: 'movies.csv' not found. Please check the file path.")
    raise
except UnicodeDecodeError:
    dataset = pd.read_csv('movies.csv', encoding='latin1')

# Inspect dataset
print("First 5 rows:\n", dataset.head())
print("\nData types and missing values:")
dataset.info()  # info() prints directly and returns None, so don't wrap it in print()
print("\nSummary statistics:\n", dataset.describe())
print("\nUnique values in 'genre':\n", dataset['genre'].value_counts())

2 Handle Missing Values

2.1 Purpose

Address missing data (NaN/None) to prevent errors in analysis or modeling.

2.2 Actions

• Identify missing values with isna().sum().


• Drop rows with missing title using dropna().
• Impute rating with mean and genre with mode.

2.3 Considerations

• Dropping rows reduces data size; impute non-critical columns.


• Check for hidden missing values (e.g., ”N/A” or empty strings).

2.4 Code
# Check missing values
print("Missing values:\n", dataset.isna().sum())

# Replace common placeholders with NaN
dataset.replace(['N/A', '', 'None'], pd.NA, inplace=True)

# Drop rows with missing 'title'
dataset.dropna(subset=['title'], inplace=True)

# Impute missing 'rating' with mean (assign back; fillna(inplace=True) on a
# column is deprecated in recent pandas)
dataset['rating'] = dataset['rating'].fillna(dataset['rating'].mean())

# Impute missing 'genre' with mode
dataset['genre'] = dataset['genre'].fillna(dataset['genre'].mode()[0])

print("Missing values after handling:\n", dataset.isna().sum())

3 Remove Duplicates

3.1 Purpose

Eliminate duplicate rows to avoid bias in analysis or modeling.

3.2 Actions

• Identify duplicates with duplicated().sum().


• Remove duplicates based on title and year.

3.3 Considerations

• Decide whether to keep the first, last, or no duplicates.


• Duplicates may arise from multiple entries for the same movie.

3.4 Code
# Check for duplicates
print("Number of duplicate rows:", dataset.duplicated().sum())

# Remove duplicates based on 'title' and 'year'
dataset.drop_duplicates(subset=['title', 'year'], keep='first', inplace=True)

print("Dataset shape after removing duplicates:", dataset.shape)
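Before deciding which copy to keep, it can help to see every row in a duplicate group, not just the repeats. duplicated(keep=False) does exactly that; a minimal sketch on hypothetical data:

```python
import pandas as pd

# Toy frame with one repeated (title, year) pair -- hypothetical data
df = pd.DataFrame({
    'title': ['Heat', 'Heat', 'Up'],
    'year': [1995, 1995, 2009],
    'rating': [8.3, 8.2, 8.3],
})

# keep=False flags every row in a duplicate group, not just the repeats,
# so both copies can be reviewed side by side
dupes = df[df.duplicated(subset=['title', 'year'], keep=False)]
print(dupes)
```

Reviewing the full group makes it easier to choose between keep='first', keep='last', or dropping all copies.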

4 Correct Data Types

4.1 Purpose

Ensure columns have appropriate data types for analysis (e.g., numerical,
datetime).

4.2 Actions

• Check data types with info().


• Convert year to integer and release_date to datetime.

4.3 Considerations

• Handle conversion errors with errors=’coerce’.


• Verify conversions to ensure correctness.

4.4 Code
# Check data types
print("Data types before conversion:\n", dataset.dtypes)

# Convert 'year' to nullable integer (coerce bad values to NaN first,
# otherwise astype() raises on non-numeric strings)
dataset['year'] = pd.to_numeric(dataset['year'], errors='coerce').astype('Int64')

# Convert 'release_date' to datetime
dataset['release_date'] = pd.to_datetime(dataset['release_date'], errors='coerce')

print("Data types after conversion:\n", dataset.dtypes)
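One way to act on the "verify conversions" consideration is to count how many values errors='coerce' silently turned into NaT/NaN. A small sketch with made-up data:

```python
import pandas as pd

# Hypothetical column with one unparseable entry
raw = pd.DataFrame({'release_date': ['2001-05-04', 'not a date', '2010-11-19']})
raw['release_date'] = pd.to_datetime(raw['release_date'], errors='coerce')

# Values coerced to NaT mark entries that need manual inspection
n_bad = raw['release_date'].isna().sum()
print(n_bad, "release_date value(s) failed to parse")
```

A nonzero count after coercion is a signal to inspect the original strings before moving on.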

5 Handle Inconsistent Data

5.1 Purpose

Fix inconsistencies in categorical or text data (e.g., typos, mixed cases).

5.2 Actions

• Standardize genre with str.lower() and str.strip().


• Correct specific inconsistencies with replace().

5.3 Considerations

• Check unique values to identify inconsistencies.


• Use a mapping dictionary for common corrections.

5.4 Code
# Check unique values in 'genre'
print("Unique genres before cleaning:\n", dataset['genre'].value_counts())

# Standardize 'genre'
dataset['genre'] = dataset['genre'].str.lower().str.strip()

# Correct inconsistencies
genre_mapping = {'scifi': 'sci-fi', 'comedy ': 'comedy', 'dram': 'drama'}
dataset['genre'] = dataset['genre'].replace(genre_mapping)

print("Unique genres after cleaning:\n", dataset['genre'].value_counts())

6 Handle Outliers

6.1 Purpose

Identify and address extreme values in numerical columns that may skew analysis.

6.2 Actions

• Use the Interquartile Range (IQR) method to detect outliers in rating.
• Remove outliers from the dataset.

6.3 Considerations

• Validate outliers (e.g., extreme rating values).


• Consider capping outliers instead of removing them.

6.4 Code
# Calculate IQR for 'rating'
Q1, Q3 = dataset['rating'].quantile([0.25, 0.75])
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

print("Outliers in 'rating':\n",
      dataset[(dataset['rating'] < lower_bound) | (dataset['rating'] > upper_bound)])

# Remove outliers
dataset = dataset[(dataset['rating'] >= lower_bound) & (dataset['rating'] <= upper_bound)]

print("Summary statistics after removing outliers:\n", dataset.describe())
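The "capping instead of removing" alternative mentioned in the considerations can be done with Series.clip(), which pins extreme values to the IQR bounds rather than dropping rows. A minimal sketch on toy ratings:

```python
import pandas as pd

# Toy ratings with two extremes (hypothetical values)
ratings = pd.Series([6.5, 7.0, 7.2, 7.8, 8.1, 1.0, 9.9])
Q1, Q3 = ratings.quantile([0.25, 0.75])
IQR = Q3 - Q1
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR

# clip() moves out-of-range values to the bounds instead of dropping rows,
# so the dataset keeps its original size
capped = ratings.clip(lower=lower, upper=upper)
print(capped)
```

Capping (winsorizing) preserves row count, which matters when the outlier rows carry other useful columns.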

7 Encode Categorical Variables

7.1 Purpose

Convert categorical columns to numerical formats for analysis or modeling.

7.2 Actions

• Use get_dummies() for one-hot encoding of genre.

7.3 Considerations

• One-hot encoding increases dimensionality for high-cardinality columns.
• Use label encoding for ordinal categories if applicable.

7.4 Code
# One-hot encode 'genre'
dataset = pd.get_dummies(dataset, columns=['genre'], prefix='genre')

print("Dataset with encoded genres:\n", dataset.head())
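For a genuinely ordinal column (not genre), the label-encoding route mentioned above can be a simple explicit mapping. A sketch using a hypothetical certificate column, which is not in the assumed movies.csv schema:

```python
import pandas as pd

# Hypothetical ordinal column: age certificates have a natural order
df = pd.DataFrame({'certificate': ['G', 'PG', 'R', 'PG']})

# An explicit mapping preserves the ranking, unlike one-hot encoding
order = {'G': 0, 'PG': 1, 'PG-13': 2, 'R': 3}
df['certificate_code'] = df['certificate'].map(order)
print(df)
```

An explicit dictionary is preferable to automatic label encoding here, because it guarantees the numeric codes follow the real-world ordering.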

8 Clean Text Data

8.1 Purpose

Standardize text columns by removing unwanted characters or extra spaces.

8.2 Actions

• Remove special characters from title using regular expressions.


• Remove leading/trailing spaces with str.strip().

8.3 Considerations

• Ensure cleaning preserves meaningful data.


• Handle edge cases (e.g., empty strings after cleaning).

8.4 Code
import re

# Clean 'title' by removing special characters
dataset['title'] = dataset['title'].apply(lambda x: re.sub(r'[^a-zA-Z0-9\s]', '', str(x)))

# Remove extra spaces
dataset['title'] = dataset['title'].str.strip()

print("Cleaned titles:\n", dataset['title'].head())
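One edge case flagged in the considerations: a title made up only of special characters becomes an empty string after the regex pass. A sketch that drops such rows, on hypothetical data:

```python
import pandas as pd
import re

# '###' cleans down to an empty string -- the edge case noted above
df = pd.DataFrame({'title': ['Se7en!', '###', ' Up ']})
df['title'] = df['title'].apply(lambda x: re.sub(r'[^a-zA-Z0-9\s]', '', str(x))).str.strip()

# Drop rows whose title carries no usable text after cleaning
df = df[df['title'] != '']
print(df)
```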

9 Filter Irrelevant Data

9.1 Purpose

Remove rows or columns irrelevant to the analysis (e.g., old movies).

9.2 Actions

• Filter movies from 2000 or later.


• Drop irrelevant columns like comments.

9.3 Considerations

• Define relevance based on analysis goals.


• Use errors=’ignore’ to avoid errors if columns are missing.

9.4 Code
# Filter movies from 2000 or later
dataset = dataset[dataset['year'] >= 2000]

# Drop irrelevant columns
dataset.drop(columns=['comments'], inplace=True, errors='ignore')

print("Filtered dataset:\n", dataset.head())

10 Validate and Save

10.1 Purpose

Verify the cleaned dataset and save it for further use.

10.2 Actions

• Re-check missing values, data types, and duplicates.


• Save to movies_cleaned.csv.

10.3 Considerations

• Ensure no new issues were introduced.


• Set index=False to avoid saving the index.

10.4 Code
# Validate dataset
print("Final data types:\n", dataset.dtypes)
print("Final missing values:\n", dataset.isna().sum())
print("Final duplicates:", dataset.duplicated().sum())

# Save cleaned dataset
dataset.to_csv('movies_cleaned.csv', index=False)
print("Cleaned dataset saved as 'movies_cleaned.csv'")
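Beyond printing, the same checks can be made hard failures with assertions, so a broken cleaning run stops before the CSV is written. A sketch on a toy stand-in for the cleaned dataset (column names assumed from this guide):

```python
import pandas as pd

# Toy stand-in for the cleaned dataset (hypothetical rows)
cleaned = pd.DataFrame({
    'title': ['Movie A', 'Movie B'],
    'year': pd.array([2009, 2015], dtype='Int64'),
    'rating': [8.3, 7.9],
})

# Fail loudly instead of silently saving a bad file
assert cleaned['title'].notna().all(), "missing titles remain"
assert cleaned.duplicated(subset=['title', 'year']).sum() == 0, "duplicates remain"
assert str(cleaned['year'].dtype) == 'Int64', "year is not integer-typed"
print("validation passed")
```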

11 Additional Notes

• Ensure movies.csv is in the working directory or provide the full path.
• Install required libraries: pip install pandas numpy.
• Adjust steps based on specific dataset issues (e.g., unique columns, formats).
• Use os.getcwd() to verify the working directory.
• Visualize cleaning progress (e.g., missing values) with plots if needed.
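On the last note: matplotlib is the usual tool for such plots, but a dependency-free sketch can print a crude text bar of missing counts per column, which is often enough to track progress between cleaning steps:

```python
import pandas as pd

# Toy frame with known gaps (hypothetical data)
df = pd.DataFrame({'title': ['A', None, 'C'], 'rating': [7.0, None, None]})

# Crude text bar chart of missing counts per column
missing = df.isna().sum()
for col, n in missing.items():
    print(f"{col:<10}{'#' * n} ({n})")
```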
