Efficient Duplicate Image Removal: Using imagededup in Python
In this blog post, we will walk you through the process of identifying and removing duplicate images from your directories using Python. We will leverage perceptual hashing to find similar images and delete duplicates while keeping the largest file in the group. This solution is perfect for users who want to save disk space and keep their image collections organized.
Why You Should Use This Code
Over time, especially when dealing with large collections of images, duplicate files can accumulate. These duplicates take up unnecessary space on your system. Manually sifting through these images can be tedious, but with the help of Python, perceptual hashing, and concurrent processing, this task becomes much easier.
Benefits:
- Efficient Duplicate Detection: By using perceptual hashing (PHash), the code compares images based on their visual content rather than file names or sizes.
- Concurrency: The code uses ThreadPoolExecutor to process multiple images simultaneously, reducing processing time for large datasets.
- Safe Deletion: The largest image in each duplicate group is kept, and the others are safely deleted, ensuring you don’t lose important images.
- Error Handling: The code includes robust error handling for permissions and other file issues, making it resilient during execution.
Prerequisites: Packages to Install
Before using this script, make sure to install the required packages. You can do so with pip:
pip install imagededup Pillow tqdm
- imagededup: A library for perceptual hashing to find duplicate images.
- Pillow: The Python Imaging Library for opening and processing images.
- tqdm: A library to add a progress bar to loops.
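The walkthrough snippets below also assume a few standard-library imports and a module-level logger. Here is a minimal setup sketch (the logging configuration is a suggestion, not part of the original script):
import os
import logging
from pathlib import Path
from concurrent.futures import ThreadPoolExecutor

from PIL import Image
from imagededup.methods import PHash
from tqdm import tqdm

# Simple logger used by the snippets below (configuration is an assumption; adjust to taste)
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)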
Parameters to Set
In the main() function, you need to specify the root directory where your images are stored:
root_dir = '/path/to/your/images' # Update this path
Ensure this directory exists and contains images that you want to process.
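If you would rather not edit the script every time, the path can also be taken from the command line. Below is a small, hypothetical variant using argparse (not part of the original script):
import argparse

def parse_args():
    # Hypothetical helper: pass the image directory as a command-line argument
    parser = argparse.ArgumentParser(description="Find and remove duplicate images.")
    parser.add_argument("root_dir", help="Directory containing the images to scan")
    return parser.parse_args()

# In main(): root_dir = parse_args().root_dir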
Code Walkthrough
1. Finding All Images
The find_all_images function scans the directory and its subdirectories to find all images with the specified extensions (JPEG, PNG, GIF, etc.).
def find_all_images(root_dir, extensions=None):
    if extensions is None:
        extensions = ['.jpg', '.jpeg', '.png', '.heic', '.bmp', '.tiff', '.gif', '.webp',
                      '.ico', '.raw', '.svg', '.pjpeg', '.jfif', '.apng']
    image_paths = []
    for dirpath, _, filenames in os.walk(root_dir):
        for file in filenames:
            if any(file.lower().endswith(ext) for ext in extensions):
                file_path = Path(dirpath) / file
                image_paths.append(str(file_path.resolve()))  # Handle special characters
    return image_paths
Here, we use os.walk to recursively go through the directory and its subdirectories, collecting image file paths. Special care is taken to handle paths with special characters.
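A quick usage sketch (the path below is just a placeholder):
# Example: collect image paths and report how many were found (placeholder path)
image_paths = find_all_images('/path/to/your/images')
print(f"Found {len(image_paths)} images")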
2. Generating Hashes
The generate_hash function uses perceptual hashing to generate a hash for each image. This hash allows us to compare images based on their visual content.
def generate_hash(image_path, phasher):
    try:
        with Image.open(image_path) as img:
            if img.width == 0 or img.height == 0:  # Skip invalid images
                return image_path, None
        encoding = phasher.encode_image(image_file=image_path)
        return image_path, encoding
    except Exception as e:
        logger.error(f"Error processing {image_path}: {e}")
        return image_path, None
If the image is valid, its hash is computed. If the image has invalid dimensions or cannot be opened, it is skipped.
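To try the function on a single file, you can call it directly (a sketch; the path is a placeholder):
# Example: hash one image with a PHash instance (placeholder path)
phasher = PHash()
path, encoding = generate_hash('/path/to/your/images/photo.jpg', phasher)
if encoding is not None:
    print(f"{path} -> {encoding}")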
3. Identifying and Deleting Duplicates
Once hashes for all images are generated, the script identifies duplicate images and deletes the smaller ones, keeping the largest in each group.
def delete_duplicates(duplicates, keep_largest=True):
    total_deleted = 0
    for original, duplicate_list in duplicates.items():
        # find_duplicates lists each group from every member's point of view,
        # so skip files that were already removed in an earlier iteration
        all_files = [f for f in [original] + duplicate_list if os.path.exists(f)]
        if len(all_files) < 2:
            continue
        if keep_largest:
            largest_file = max(all_files, key=os.path.getsize)
            all_files.remove(largest_file)
        for file in all_files:
            try:
                os.remove(file)
                total_deleted += 1
            except Exception as e:
                logger.error(f"Error deleting file {file}: {e}")
    logger.info(f"Total deleted files: {total_deleted}")
4. Parallel Processing with ThreadPoolExecutor
To speed up the hash generation process, the script uses ThreadPoolExecutor to process images concurrently.
with ThreadPoolExecutor(max_workers=8) as executor:
    for image_path, encoding in tqdm(executor.map(lambda path: generate_hash(path, phasher), image_paths), total=len(image_paths), desc="Generating hashes"):
        if encoding:
            encodings[image_path] = encoding
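The worker count of 8 is a fixed choice; if you prefer to scale it with your machine, Python's own default heuristic for ThreadPoolExecutor, min(32, cpu_count + 4), is a reasonable starting point:
# Derive the worker count from the CPU count instead of hard-coding 8 (a heuristic; tune as needed)
workers = min(32, (os.cpu_count() or 1) + 4)
with ThreadPoolExecutor(max_workers=workers) as executor:
    ...  # submit the hashing work exactly as above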
5. Putting It All Together
Finally, the main() function ties everything together. It sets the root directory, finds images, generates hashes, detects duplicates, and deletes them.
def main():
    root_dir = '/path/to/your/images'  # Update this path
    if not os.path.isdir(root_dir):
        logger.error(f"Error: Directory {root_dir} does not exist.")
        return

    image_paths = find_all_images(root_dir)
    if not image_paths:
        logger.warning("No images found.")
        return

    phasher = PHash()
    encodings = {}
    with ThreadPoolExecutor(max_workers=8) as executor:
        for image_path, encoding in tqdm(executor.map(lambda path: generate_hash(path, phasher), image_paths), total=len(image_paths), desc="Generating hashes"):
            if encoding:
                encodings[image_path] = encoding

    duplicates = phasher.find_duplicates(encoding_map=encodings)
    # find_duplicates returns an entry for every image; only delete
    # when at least one entry actually has duplicates
    if any(duplicates.values()):
        delete_duplicates(duplicates, keep_largest=True)
    else:
        logger.info("No duplicates found.")
Conclusion
This script helps you efficiently find and remove duplicate images from your directories, using perceptual hashing to identify visually similar images. With built-in error handling, concurrency, and logging, it’s robust and efficient for managing large image collections.
Feel free to modify the code according to your needs, and be sure to run it on your own image datasets to save valuable storage space!
Happy coding!