Efficient Duplicate Image Removal: Using imagededup in Python
In this blog post, we will walk you through the process of identifying and removing duplicate images from your directories using Python. We will leverage perceptual hashing to find similar images and delete duplicates while keeping the largest file in the group. This solution is perfect for users who want to save disk space and keep their image collections organized.
Why You Should Use This Code
Over time, especially when dealing with large collections of images, duplicate files can accumulate. These duplicates take up unnecessary space on your system. Manually sifting through these images can be tedious, but with the help of Python, perceptual hashing, and concurrent processing, this task becomes much easier.
Benefits:
- Efficient Duplicate Detection: By using perceptual hashing (PHash), the code compares images based on their visual content rather than file names or sizes.
- Concurrency: The code uses ThreadPoolExecutor to process multiple images simultaneously, reducing processing time for large datasets.
- Safe Deletion: The largest image in each duplicate group is kept, and the others are safely deleted, ensuring you don’t lose important images.
- Error Handling: The code includes robust error handling for permissions and other file issues, making it resilient during execution.
Prerequisites: Packages to Install
Before using this script, make sure to install the required packages. You can do so with pip:
pip install imagededup Pillow tqdm
- imagededup: A library for perceptual hashing to find duplicate images.
- Pillow: The Python Imaging Library for opening and processing images.
- tqdm: A library to add a progress bar to loops.
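The walkthrough snippets below also assume a few standard-library imports and a module-level logger. Here is a minimal setup sketch (the logging configuration is a suggestion, not part of the original script):
import os
import logging
from pathlib import Path
from concurrent.futures import ThreadPoolExecutor

from PIL import Image
from imagededup.methods import PHash
from tqdm import tqdm

# Simple logger used by the snippets below (configuration is an assumption; adjust to taste)
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)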
Parameters to Set
In the main() function, you need to specify the root directory where your images are stored:
root_dir = '/path/to/your/images' # Update this path
Ensure this directory exists and contains images that you want to process.
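If you would rather not edit the script every time, the path can also be taken from the command line. Below is a small, hypothetical variant using argparse (not part of the original script):
import argparse

def parse_args():
    # Hypothetical helper: pass the image directory as a command-line argument
    parser = argparse.ArgumentParser(description="Find and remove duplicate images.")
    parser.add_argument("root_dir", help="Directory containing the images to scan")
    return parser.parse_args()

# In main(): root_dir = parse_args().root_dir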
Code Walkthrough
1. Finding All Images
The find_all_images function scans the directory and its subdirectories to find all images with the specified extensions (JPEG, PNG, GIF, etc.).
def find_all_images(root_dir, extensions=None):
    if extensions is None:
        extensions = ['.jpg', '.jpeg', '.png', '.heic', '.bmp', '.tiff', '.gif', '.webp',
                      '.ico', '.raw', '.svg', '.pjpeg', '.jfif', '.apng']
    image_paths = []
    for dirpath, _, filenames in os.walk(root_dir):
        for file in filenames:
            if any(file.lower().endswith(ext) for ext in extensions):
                file_path = Path(dirpath) / file
                image_paths.append(str(file_path.resolve()))  # Handle special characters
    return image_paths
Here, we use os.walk to recursively go through the directory and its subdirectories, collecting image file paths. Special care is taken to handle paths with special characters.
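A quick usage sketch (the path below is just a placeholder):
# Example: collect image paths and report how many were found (placeholder path)
image_paths = find_all_images('/path/to/your/images')
print(f"Found {len(image_paths)} images")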
2. Generating Hashes
The generate_hash function uses perceptual hashing to generate a hash for each image. This hash allows us to compare images based on their visual content.
def generate_hash(image_path, phasher):
    try:
        with Image.open(image_path) as img:
            if img.width == 0 or img.height == 0:  # Skip invalid images
                return image_path, None
        encoding = phasher.encode_image(image_file=image_path)
        return image_path, encoding
    except Exception as e:
        logger.error(f"Error processing {image_path}: {e}")
        return image_path, None
If the image is valid, its hash is computed. If the image has invalid dimensions or cannot be opened, it is skipped.
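To try the function on a single file, you can call it directly (a sketch; the path is a placeholder):
# Example: hash one image with a PHash instance (placeholder path)
phasher = PHash()
path, encoding = generate_hash('/path/to/your/images/photo.jpg', phasher)
if encoding is not None:
    print(f"{path} -> {encoding}")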
3. Identifying and Deleting Duplicates
Once hashes for all images are generated, the script identifies duplicate images and deletes the smaller ones, keeping the largest in each group.
def delete_duplicates(duplicates, keep_largest=True):
    total_deleted = 0
    for original, duplicate_list in duplicates.items():
        # find_duplicates lists each group from every member's point of view,
        # so skip files that were already removed in an earlier iteration
        all_files = [f for f in [original] + duplicate_list if os.path.exists(f)]
        if len(all_files) < 2:
            continue
        if keep_largest:
            largest_file = max(all_files, key=os.path.getsize)
            all_files.remove(largest_file)
        for file in all_files:
            try:
                os.remove(file)
                total_deleted += 1
            except Exception as e:
                logger.error(f"Error deleting file {file}: {e}")
    logger.info(f"Total deleted files: {total_deleted}")
4. Parallel Processing with ThreadPoolExecutor
To speed up the hash generation process, the script uses ThreadPoolExecutor to process images concurrently.
with ThreadPoolExecutor(max_workers=8) as executor:
    for image_path, encoding in tqdm(executor.map(lambda path: generate_hash(path, phasher), image_paths), total=len(image_paths), desc="Generating hashes"):
        if encoding:
            encodings[image_path] = encoding
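The worker count of 8 is a fixed choice; if you prefer to scale it with your machine, Python's own default heuristic for ThreadPoolExecutor, min(32, cpu_count + 4), is a reasonable starting point:
# Derive the worker count from the CPU count instead of hard-coding 8 (a heuristic; tune as needed)
workers = min(32, (os.cpu_count() or 1) + 4)
with ThreadPoolExecutor(max_workers=workers) as executor:
    ...  # submit the hashing work exactly as above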
5. Putting It All Together
Finally, the main() function ties everything together. It sets the root directory, finds images, generates hashes, detects duplicates, and deletes them.
def main():
    root_dir = '/path/to/your/images'  # Update this path
    if not os.path.isdir(root_dir):
        logger.error(f"Error: Directory {root_dir} does not exist.")
        return

    image_paths = find_all_images(root_dir)
    if not image_paths:
        logger.warning("No images found.")
        return

    phasher = PHash()
    encodings = {}
    with ThreadPoolExecutor(max_workers=8) as executor:
        for image_path, encoding in tqdm(executor.map(lambda path: generate_hash(path, phasher), image_paths), total=len(image_paths), desc="Generating hashes"):
            if encoding:
                encodings[image_path] = encoding

    duplicates = phasher.find_duplicates(encoding_map=encodings)
    # find_duplicates returns an entry for every image; only delete
    # when at least one entry actually has duplicates
    if any(duplicates.values()):
        delete_duplicates(duplicates, keep_largest=True)
    else:
        logger.info("No duplicates found.")
Conclusion
This script helps you efficiently find and remove duplicate images from your directories, using perceptual hashing to identify visually similar images. With built-in error handling, concurrency, and logging, it’s robust and efficient for managing large image collections.
Feel free to modify the code according to your needs, and be sure to run it on your own image datasets to save valuable storage space!
Happy coding!