python programming

Finding duplicates in a List: Python Programming Challenge 1

The Python Programming Challenge series is a new addition at WritersByte where we will explore typical programming problems and find out the best approaches to tackle it, keeping in mind algorithmic design, and time and space complexities. While the series will mostly use Python as the programming language, you’re free to use whatever language you’re comfortable with.

Let’s begin! For our first challenge, we’ll start with a very trivial problem, i.e., finding duplicates in a list. To make things concise, lets set some constraints:

  • There will always be exactly one duplicate number in the list. That number, however, can be duplicated more than 2 times.
  • The array of length n contains integers between 1 and n (inclusive).
  • Time Complexity of the solution must be less than θ(n2).
  • Space Complexity of the solution must be θ(1).
  • The list is strictly read-only i.e. it cannot be modified.

Take 1: Sorting the list, followed by a Linear Traversal

Lets forget time and space complexities for a second, and think about the most easiest way to solve it. Assume the following array:

Now if we sort the above array (lets not talk about what algorithm to use here), the array is looks like below:

After the sort, the duplicate numbers are grouped. Now we can simply iterate the list using two pointers, one at index i and the other at index i+1. If the values at both indexes are equal, we’ve found a duplicate.

  1. In first iteration, one pointer is at index 0, and the other at index 1. Since the values at those pointers are 1 and 3, we haven’t found duplicate.
  2. In second iteration, both pointers move one step ahead. The values now are 3 and 5.
  3. In third iteration, both pointers move one step ahead. The values now are 5 and 5. Since, they both are equal, we have found a duplicate. Break the algorithm here.

Below is a simple Python program that runs the entire algorithm:

def take1_sort_lt(input: list) -> int: # Sort the list first, use default Python sorting implementation sorted_list: list = sorted(input) # Now perform a linear pass for i in range(0, len(sorted_list)): if sorted_list[i] == sorted_list[i+1]: return sorted_list[i] return None
Code language: Python (python)

The default Python sorted routine has an average and worst case time complexity of θ(n*logn). The linear for loop below it has a worst case complexity of θ(n). So, we are well within our time complexity limits. However, since we require an additional array sorted_list, the space complexity grows with the input, thus it is θ(n), and well above our constraints.

Take 2: Index-based List splitting and Traversal

Consider an array of length n. If we are at an index i, we are guaranteed that if the value at index i is a duplicate, it exists in the subarray from index i+1 to array length n.

  1. The algorithm starts at index 0. It uses the subarray from index 1 to end of the list, and performs a linear sweep to check if the value at index 0 appears in the subarray.
  2. Since it doesn’t the index is moved one step ahead to 1. The subarray from index 2 to end of the list is checked.
  3. In the third iteration, the value at index 2 is in subarray. Since a duplicate is found, the routine stops.

See the Python program below that implements this algorithm:

def take2_list_inner(input: list) -> int: for i in range(len(input)): for j in range(i+1, len(input)): if input[i] == input[j]: return input[i] return None
Code language: Python (python)

The time complexity of this approach is θ(n*(n-1)/2) which is dramatically worse than θ(n*logn). The space complexity however, is θ(1) since we don’t have any variable in our code that grows linearly with input.

Take 3: Keeping track of integers encountered using a Hash Map and a single Linear Pass

Another way to approach the problem is to detect duplicates by keeping track of integers encountered in the array. First, we need to think of a data structure that can help keep us track. For this purpose, we can use a Hash Map. A hash map is a special structure that holds key-value pair information. In our case, the key is the integer in the list, while the value saves encounter information. We can store encounter either as a bool i.e. if the integer exists, or as a counter that keeps track of how many times an integer is found in the list. Value can also be completely ignored and we can simply check if the key exists in hash map or not. It must exist if the integer was encountered previously, thus finding the duplicate.

  1. The algorithm starts at index 0. It adds the value 5 as key and sets the value to 1.
  2. It moves to the next index 1, and repeats the process until it reaches index 3.
  3. At index 3, since the key-value pair already exists, it increments the encountered count. The duplicate is found and the algorithm stops.

A simple Python implementation using bool as value is below:

from collections import defaultdict def take3_keep_track(input: list) -> int: store: defaultdict = defaultdict(lambda: True) for inp in input: store[inp] = not store[inp] if store[inp] == True: return inp return None
Code language: Python (python)

Since there’s a single for loop, the time complexity is θ(n) which is the best we have seen so far, and well under our constraints. However, there’s a dictionary that keeps track of duplicates and it grows linearly with the input space. Thus, unfortunately, the space complexity is θ(n) which is worse than our constraint of θ(1). Still, a superior approach compared to our other takes.

Take 4: Using Floyd’s Tortoise and Hare Algorithm

Floyd’s Tortoise and Hare algorithm is originally used to detect loops in linked lists. It makes use of two pointers, where one pointer is traversing the list twice as fast as the other one. The intuition is that if there’s a loop in the list, the two pointers will point at the same value after some time. It works because:

  • If the slow pointer enters the loop, the fast pointer must be already in the loop.
  • Since the fast pointer is moving at twice the speed of slow pointer, the distance between them is increasing by some constant x every iteration. If both pointers are in the loop of length n, after some iterations the distance between the pointers will become n and at that time both pointers are pointing at the same value.

You can read some of the answers in the below link to get a better idea of why this works:

algorithm – How does finding a cycle start node in a cycle linked list work? – Stack Overflow
stackoverflow.com

How does this adapts to our finding duplicates problem? Well, instead of traversing the list linearly, we use the list values to make a decision on where to traverse next in the list. Since the pointers travel at constant speeds, duplicate values in the list will cause the pointers to traverse in a fixed pattern or a loop.

  1. Both tortoise and hare start at index 0. Tortoise takes 1 step i.e. it treats the value at index 0 as the index location to move to, which is 1. Hare takes 2 steps, it first moves to index 1, and then to index 3 (since value at index 1 is 3).
  2. In next iteration, tortoise moves to index 3, while hare takes another two steps, first to index 2, and then to index 4.
  3. In third iteration, tortoise takes one step to index 2. Hare takes two steps, first to index 3, and then to index 2.
  4. Since both hare and tortoise are at same index locations, we are in a loop, and thus we’ve guaranteed that a duplicate exists in the array. This is called an intersection point. An intersection can also occur if hare and tortoise are at different indexes, but the value at those indexes is same. In that case, the value is the duplicate too.

Once we’ve confirmed that a loop exists, the next and final step is to find the opening point of the loop. This opening point, is the duplicate value itself, since the traversing pattern originated from it. Finding the opening is not required if both pointers are pointing at different indexes having same value (see definition in point 4).

To find the opening point of the loop, we move hare back to the beginning, and keep tortoise in its current position. Both pointers will now move at same speed.

  1. Pointer1 is at index 0, while Pointer2 is at index 2.
  2. Both pointers take 1 step forward, so Pointer1 is now at index 1, and pointer 2 at index 4.
  3. Both pointers are now pointing at the same value. This value is the opening point of the loop, and thus the duplicate value too.

See the Python implementation below:

def take4_floyd_ht(input: list) -> int: # Find intersection point inside loop hare: int = input[0] tortoise: int = input[0] while True: hare = input[input[hare]] tortoise = input[tortoise] if hare == tortoise: break # Find Opening Point pointer1: int = input[0] pointer2: int = tortoise while True: pointer1 = input[pointer1] pointer2 = input[pointer2] if pointer1 == pointer2: return pointer1
Code language: Python (python)

The time complexity of Floyd’s algorithm is θ(n*logn). The space complexity of our code is θ(1) since it is independent of input size. Both complexities fully satisfy the constraints we mentioned at the start!

Conclusion

Finding duplicates might seem a trivial task. In this post we discussed a number of ways to efficiently detect duplicate in a list. We started off with simple iterations, and then ended up with a much robust approach using Floyd’s algorithm, that satisfied all of the constraints we set for the problem.

Total
0
Shares
Leave a Reply

Your email address will not be published.

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Prev
3 Best Libraries for Fuzzy Matching In Python
Fuzzy matching in python

3 Best Libraries for Fuzzy Matching In Python

Learn about 3 of the most used fuzzy matching libraries in Python

Next
Deep Learning on the Web using Flask, Python and Jinja (Part 1/3)
Deep Learning on the Web

Deep Learning on the Web using Flask, Python and Jinja (Part 1/3)

Flask is a very powerful framework used to develop highly scalable and

You May Also Like