Deleting Duplicate Files Using Python
Last Updated: 18 Mar, 2022
In this article, we are going to use a concept called hashing to identify unique files and delete duplicate files using Python.
Modules required:
- tkinter: We need a way to select the folder in which to do this cleaning, so every time we run the code we should get a file dialog for choosing a folder; we will use the Tkinter library for this. It provides a function called "askdirectory" that asks the user to choose a directory. Tkinter ships with most Python installations as part of the standard library, so it usually needs no separate installation; on some Linux distributions it is packaged separately (for example, sudo apt-get install python3-tk on Debian/Ubuntu).
- hashlib: In order to use the md5 hash function we need the hashlib library. It is part of Python's standard library, so no installation is required (a quick sanity check of md5 digests follows this list).
- os: This module helps us remove duplicate files by providing functions for walking the directory tree and deleting files. It is also part of Python's standard library, so no installation is required.
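Before diving in, here is a minimal, standard-library-only sketch confirming that an md5 digest is always 32 hexadecimal characters, no matter how long the input is:
Python3
import hashlib

# md5 always yields a 128-bit digest, rendered as
# 32 hexadecimal characters by hexdigest().
digest = hashlib.md5(b"hello world").hexdigest()
print(digest)       # 5eb63bbbe01eeed093cb22bb8f5acdc3
print(len(digest))  # 32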
Approach:
- We will ask the user to select a folder and search this root directory and all of its subdirectories for duplicate, redundant files.
- We will take the content of each file and pass it through a hash function, which generates a fixed-size string that is, for all practical purposes, unique to each distinct file.
- The length of the hash string depends on the hash function used; common choices include md5, SHA-1, and SHA-256. In this article we use md5, which always produces a 32-character (128-bit hexadecimal) hash value, irrespective of the size and type of the file.
- In order to detect duplicate files, and then delete them, we will maintain a Python dictionary.
- We will use the hash string of each file inside every subfolder of the root directory as a dictionary key and the file path as its value.
- Every time we insert a new file record, we check whether the hash is already in the dictionary. If it is, we have found a duplicate, so we take the file's path and delete that file (a toy sketch of this check appears right after this list).
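To make the dictionary-based check concrete before touching the file system, here is a toy, in-memory illustration (the file names and contents below are made up purely for demonstration):
Python3
import hashlib

# Toy "files": two of them share identical content.
fake_files = {
    "a.txt": b"same content",
    "b.txt": b"same content",
    "c.txt": b"different content",
}

unique = {}
for path, content in fake_files.items():
    file_hash = hashlib.md5(content).hexdigest()
    if file_hash in unique:
        # Same hash seen before: duplicate detected.
        print(f"{path} is a duplicate of {unique[file_hash]}")
    else:
        unique[file_hash] = path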
Stepwise Implementation
Step 1: Import Tkinter, os, hashlib & pathlib libraries.
Python3
from tkinter.filedialog import askdirectory
from tkinter import Tk
import os
import hashlib
from pathlib import Path
Step 2: We call Tk().withdraw() because we don't want Tkinter's main GUI window to appear on the screen; we only want the file dialog for selecting the folder. The line askdirectory(title="Select a folder") pops up a dialog box through which we can select a folder.
Python3
Tk().withdraw()
file_path = askdirectory(title="Select a folder")
Step 3: Next, we need to list all the files inside our root folder. To do that we use the os module: os.walk() takes the path of our root folder as an argument, walks through each subdirectory of the folder given to it, and lists all the files. It returns a generator of 3-tuples: the first element is the path to a folder, the second is the list of subfolders inside that folder, and the third is the list of files inside that folder.
Python3
list_of_files = os.walk(file_path)
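Purely as an illustration (this loop is not part of the final script), you can print what os.walk() yields. Note that we call os.walk() afresh here, because the generator stored in list_of_files can only be iterated once:
Python3
# Each item yielded by os.walk() is a
# (dirpath, dirnames, filenames) tuple.
for dirpath, dirnames, filenames in os.walk(file_path):
    print("Folder:", dirpath)
    print("  Subfolders:", dirnames)
    print("  Files:", filenames)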
Step 4: Our final goal is to process all the files in the main directory and in every subdirectory, so we run a nested for loop over all the files. We need to open each file and convert its content into a hash string; to do that we define a variable called hash_file. To open a file we first need its full path, so we build one with another os module function, os.path.join(), and open the file in binary read mode ('rb'). hashlib.md5() converts the file's content into an md5 hash, and the hexdigest() method gives us the hash string.
Python3
for root, folders, files in list_of_files:
    for file in files:
        file_path = Path(os.path.join(root, file))
        hash_file = hashlib.md5(open(
            file_path, 'rb').read()).hexdigest()
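One caveat: open(file_path, 'rb').read() loads the entire file into memory at once, which can be wasteful for very large files. A possible refinement, not part of the original article, is to feed md5 the file in fixed-size chunks (the helper name and chunk size below are arbitrary choices):
Python3
def md5_of_file(path, chunk_size=8192):
    # Hash the file incrementally so only one
    # chunk is held in memory at a time.
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()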
Step 5: In order to detect duplicate files, we define an empty dictionary (before the loop in the full script). Its keys are file hashes and its values are file paths. If a file's hash is already in this unique_files dictionary, we have found a duplicate and we delete that file using the os.remove() function; if it is not there, we add it to the dictionary.
Python3
unique_files = dict()

if hash_file not in unique_files:
    unique_files[hash_file] = file_path
else:
    os.remove(file_path)
    print(f"{file_path} has been deleted")
Below is the full implementation:
Python3
# Importing the required libraries.
from tkinter.filedialog import askdirectory
from tkinter import Tk
import os
import hashlib
from pathlib import Path

# We don't want the GUI window of
# tkinter to appear on our screen.
Tk().withdraw()

# Dialog box for selecting a folder.
file_path = askdirectory(title="Select a folder")

# Listing out all the files
# inside our root folder.
list_of_files = os.walk(file_path)

# In order to detect duplicate files we
# maintain a dictionary mapping each file
# hash to the first path seen with it.
unique_files = dict()

for root, folders, files in list_of_files:

    # Running a for loop on all the files
    for file in files:

        # Finding the complete file path
        file_path = Path(os.path.join(root, file))

        # Converting all the content of
        # our file into an md5 hash.
        hash_file = hashlib.md5(open(file_path, 'rb').read()).hexdigest()

        # If the file hash has already been added,
        # this file is a duplicate and we delete it.
        if hash_file not in unique_files:
            unique_files[hash_file] = file_path
        else:
            os.remove(file_path)
            print(f"{file_path} has been deleted")
Output: For every duplicate found, the script prints a line of the form "<file path> has been deleted".
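Because os.remove() deletes files permanently, it can be worth previewing the result first. Below is a dry-run variant (an addition of ours, not part of the original article) that only reports which files would be deleted; it assumes the same imports and the same file_path as the script above:
Python3
# Dry run: report duplicates without deleting anything.
unique_files = dict()
for root, folders, files in os.walk(file_path):
    for file in files:
        path = Path(os.path.join(root, file))
        file_hash = hashlib.md5(open(path, 'rb').read()).hexdigest()
        if file_hash in unique_files:
            print(f"Would delete {path} (duplicate of {unique_files[file_hash]})")
        else:
            unique_files[file_hash] = path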