How to remove duplicates from a list in Python

Jul 09, 2024 · #python

A list in Python is a collection data type that is ordered, mutable, and allows duplicate elements. You can change, add, or remove elements after the list has been created.

In many scenarios, you might require unique elements to ensure the accuracy and integrity of the data. For example, a list of user IDs should not contain duplicates to avoid conflicts.

Suppose you have a list of email addresses and you need to send an email to each address exactly once. If your list contains duplicates, you might end up sending multiple emails to the same address, which is inefficient and undesirable.

Here are a few common approaches to remove duplicates from a list:

Using a Set

A set is an unordered collection data type that is iterable, mutable, and has no duplicate elements. Sets are defined by using curly braces {} or the set() function.

When you convert a list to a set, all duplicate elements are automatically removed because sets do not allow duplicates. To maintain the original list data type, you can then convert the set back to a list.

# Original list with duplicates
original_list = [1, 2, 3, 1, 2, 4, 5]

# Convert the list to a set to remove duplicates
unique_set = set(original_list)

# Convert the set back to a list
unique_list = list(unique_set)

print(unique_list)
# [1, 2, 3, 4, 5]

The order of elements in the resulting list may not be the same as the original list because sets are unordered.
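If a sorted result is acceptable, you can combine set() with sorted() to get deterministic output in one step. A minimal variant of the approach above:

```python
# Original list with duplicates
original_list = [1, 2, 3, 1, 2, 4, 5]

# set() removes duplicates; sorted() gives a deterministic, ascending order
unique_sorted = sorted(set(original_list))

print(unique_sorted)
# [1, 2, 3, 4, 5]
```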

If you need to remove duplicates while preserving the order of the original list, you can use a plain loop together with a set that tracks which items have already been seen:

original_list = [1, 2, 3, 1, 2, 4, 5]
unique_list = []
seen = set()

for item in original_list:
    if item not in seen:
        unique_list.append(item)
        seen.add(item)

print(unique_list)
# [1, 2, 3, 4, 5]
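The same loop can be compressed into a single list comprehension that calls set.add() as a side effect. It is more terse, though arguably harder to read than the explicit loop:

```python
original_list = [1, 2, 3, 1, 2, 4, 5]

seen = set()
# set.add() returns None (falsy), so the `or` clause records the item
# as a side effect while keeping it in the result the first time only
unique_list = [x for x in original_list if not (x in seen or seen.add(x))]

print(unique_list)
# [1, 2, 3, 4, 5]
```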

Using the dict.fromkeys() method

Using dict.fromkeys() to remove duplicates from a list is a concise and efficient method that also preserves the order of elements, since dictionaries maintain insertion order in Python 3.7 and later.

It creates a new dictionary with keys from the provided iterable and values set to a specified value (default is None). Since dictionary keys must be unique, using dict.fromkeys() effectively removes duplicates.

# Original list with duplicates
original_list = [1, 2, 3, 1, 2, 4, 5]

# Use dict.fromkeys() to remove duplicates
unique_dict = dict.fromkeys(original_list)

# Convert the dictionary keys back to a list
unique_list = list(unique_dict)

print(unique_list)
# [1, 2, 3, 4, 5]

This method works with any hashable elements, making it a versatile and efficient way to remove duplicates while preserving order.
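The hashability requirement matters in practice: unhashable elements such as lists or dicts raise a TypeError when used as dictionary keys. One workaround (a sketch, not the only option) is to convert nested lists to tuples before deduplicating:

```python
nested = [[1, 2], [3, 4], [1, 2]]

# dict.fromkeys(nested) would raise TypeError: unhashable type: 'list'.
# Converting each inner list to a (hashable) tuple works around this,
# then we convert back to lists at the end.
unique_nested = [list(t) for t in dict.fromkeys(tuple(x) for x in nested)]

print(unique_nested)
# [[1, 2], [3, 4]]
```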

Using the pandas.unique() function

The pandas.unique() function is a convenient way to remove duplicates from a list when working with pandas. It returns the unique values in the input array, preserving the order of their first occurrence.

import pandas as pd

# Original list with duplicates
original_list = [1, 2, 3, 1, 2, 4, 5]

# Use pandas.unique to remove duplicates
unique_array = pd.unique(original_list)

# Convert the resulting array back to a list
unique_list = unique_array.tolist()

print(unique_list)
# [1, 2, 3, 4, 5]

Ensure you have pandas installed and imported in your script. While pandas.unique() can work directly on lists, it’s often used with pandas Series or DataFrames.
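For example, when the data already lives in a Series, drop_duplicates() achieves the same result. A sketch, assuming pandas is installed:

```python
import pandas as pd

original_list = [1, 2, 3, 1, 2, 4, 5]

# Series.drop_duplicates() keeps the first occurrence, preserving order
unique_list = pd.Series(original_list).drop_duplicates().tolist()

print(unique_list)
# [1, 2, 3, 4, 5]
```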

Using the itertools.groupby() function

The itertools.groupby() function groups consecutive elements in an iterable that have the same value. To use it for removing duplicates, you need to sort the list first so that duplicates are consecutive, then group the elements and take one element from each group.

from itertools import groupby

# Original list with duplicates
original_list = [1, 2, 3, 1, 2, 4, 5]

# Sort the list to make duplicates consecutive
sorted_list = sorted(original_list)

# Use itertools.groupby to group and extract unique elements
unique_list = [key for key, _ in groupby(sorted_list)]

print(unique_list)
# [1, 2, 3, 4, 5]

This method can be memory-efficient in certain scenarios because itertools.groupby() uses lazy evaluation: it generates groups one at a time as the input is consumed, avoiding large intermediate data structures.

Unlike sets or dictionaries, which need extra memory to store the unique elements, itertools.groupby() only compares consecutive values. Keep in mind, though, that sorted() itself creates a full copy of the list, so the memory benefit applies mainly when the input is already sorted or arrives as a stream.

Sorting the list has a time complexity of O(n log n), and it also discards the original order of elements, so this approach is best suited to cases where a sorted result is acceptable.
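Because groupby() is lazy, it can also deduplicate an already-sorted stream without loading everything into memory. A sketch using a generator that stands in for, say, lines read from a sorted file:

```python
from itertools import groupby

def unique_from_sorted(iterable):
    """Yield one element per run of equal, consecutive values."""
    for key, _ in groupby(iterable):
        yield key

# A generator stands in for any sorted data source (e.g. a sorted file)
sorted_stream = (x for x in [1, 1, 2, 2, 3, 4, 5, 5])

print(list(unique_from_sorted(sorted_stream)))
# [1, 2, 3, 4, 5]
```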