It is a simple problem, one that any first-year Computer Science student should be able to solve with ease, let alone a fourth-year graduating senior.
Problem: Develop an algorithm for a function that takes an array of integers and an integer containing the size of that array. Remove all duplicate values from the array and return its new size, that is, the number of unique values in the array.
Constraints:
- You may not use any additional data structures. Local variables are allowed.
- The valid data must be accessible using only the value returned. That is, knowing the number of unique values in the array, it must be possible to reach exactly the valid data within the array.
Given this problem, I am ashamed to report that I panicked. I resorted to the embarrassingly obvious naive solution. Moreover, even for a naive solution, it was a particularly poor one.
That algorithm is as follows (a rough C sketch appears after the list):
- Declare a variable to count the number of duplicate entries
- Starting with the first element in the array, compare it to each other element.
- If they are equal, then the second element is considered the duplicate.
- If a duplicate element is found, it is swapped with an element from the back of the array.
- Continue until the element being examined is the last one before the swapped duplicates at the back. That is, stop once its index equals the size of the array minus the number of swapped items minus one.
- Return the size of the array minus the number of duplicate items.
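Here is roughly what that looks like in C. The function name and signature are my own choices; the interview never specified a language, so treat this as a sketch rather than what I actually wrote on the whiteboard.

```c
#include <stddef.h>

/* Naive O(n^2) removal: swap duplicates to the back of the array.
   A sketch with my own names, not a transcript of the interview answer. */
size_t remove_duplicates_naive(int a[], size_t size)
{
    size_t dups = 0;  /* how many duplicates have been swapped to the back */

    for (size_t i = 0; i + 1 < size - dups; i++) {
        for (size_t j = i + 1; j < size - dups; ) {
            if (a[j] == a[i]) {
                /* a[j] duplicates a[i]: swap it with the last valid element */
                int tmp = a[j];
                a[j] = a[size - dups - 1];
                a[size - dups - 1] = tmp;
                dups++;
                /* do not advance j: the value swapped in from the back
                   has not been examined yet */
            } else {
                j++;
            }
        }
    }
    return size - dups;  /* a[0 .. size - dups - 1] holds the unique values */
}
```

Note that the inner loop must not advance past a swap, since the element pulled in from the back is still unchecked; forgetting that is an easy off-by-one.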
And then came my Great Fail. My algorithm is O(n^2) in the worst case, where n is the number of elements in the array, since in the worst case every element is compared against every other. The natural follow-up question is, of course, "What can be done to improve this algorithm?" I was stunned. In the back of my mind I knew this question was coming. I would have asked it myself had our positions been reversed. I had no answer. Did I have an idea? Certainly. I thought of using a binary tree to reduce the number of comparisons necessary. A friend of mine later suggested using a hash table upon hearing this problem. Both of these ideas are foolish. Indeed, they clearly violate the first constraint. So I said nothing. Should I have said something? Talked through my thought process? Perhaps. Although in retrospect the only thing I would have done was suggest a binary tree to reduce the number of comparisons. Perhaps the interviewer would have let me talk and at the end said something to the effect of, "You're not allowed to use data structures." Perhaps he would have cut me off at the mere mention of the words "binary tree." In his place I certainly would have. Perhaps in talking I would have been inspired and come up with my current solution or something like it. We'll never know.

Instead, silence permeated the room until finally the interviewer had had enough. "Do you have any questions for me?" he asked. Still stunned by my inability to respond intelligently to the previous question, I gave a meek "No, not at the moment." "Thank you for your time," he said. And that was the end.
That was more than five hours ago. Now I've come up with another solution to this easy problem. Perhaps it doesn't work or my analysis is flawed. I don't care. I need to write this down.
To understand my solution it's necessary to know a little bit about heapsort. Heapsort is an "in-place" sorting algorithm: it requires only a constant amount of memory beyond the array itself in order to execute. This is key because of the first constraint. Heapsort has a worst-case running time of O(n log n). I won't go into the details of big O notation here. It is sufficient to say that O(n log n) is significantly better than my naive O(n^2) algorithm.
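For completeness, here is a minimal sift-down heapsort in C. This is just the standard textbook algorithm; nothing here beyond the names is mine.

```c
#include <stddef.h>

/* Restore the max-heap property for the subtree rooted at `root`,
   considering only a[0 .. end-1]. */
static void sift_down(int a[], size_t root, size_t end)
{
    while (2 * root + 1 < end) {
        size_t child = 2 * root + 1;
        if (child + 1 < end && a[child] < a[child + 1])
            child++;                      /* take the larger child */
        if (a[root] >= a[child])
            return;                       /* heap property already holds */
        int tmp = a[root];                /* swap root downward and continue */
        a[root] = a[child];
        a[child] = tmp;
        root = child;
    }
}

/* Standard heapsort: in place, O(n log n) worst case. Sorts ascending. */
void heapsort_ints(int a[], size_t size)
{
    if (size < 2)
        return;
    /* build a max-heap, bottom up */
    for (size_t start = size / 2; start-- > 0; )
        sift_down(a, start, size);
    /* repeatedly move the maximum to the end and shrink the heap */
    for (size_t end = size - 1; end > 0; end--) {
        int tmp = a[0];
        a[0] = a[end];
        a[end] = tmp;
        sift_down(a, 0, end);
    }
}
```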
"Solution":
- First, heapsort the array in place. Sorting puts equal values next to each other, which is what makes a single comparison pass possible.
- Declare variables to count the number of duplicate items and to hold the array position (index) of the leftmost duplicate slot, i.e., the first slot whose value is stale. Initialize them to 0 and -1, respectively (-1 meaning no duplicate has been found yet).
- Compare each element with its right neighbor, starting from the left. If the first two elements are the same, the first is considered the duplicate; the leftmost duplicate slot is now position 0.
- After this, continue comparing each element with its right neighbor.
- If a unique element is found, copy its value into the leftmost duplicate slot and advance that slot's index by one. If no duplicate slot has been recorded, do nothing. A unique element is one whose right neighbor does not have the same value (the final element, having no right neighbor, always counts as unique).
- If a duplicate element is found (its right neighbor has the same value), increment the count of duplicate elements. If no duplicate slot has been recorded yet, record the current index as the leftmost one; otherwise leave the recorded index alone, since every slot from there up to the current position is already stale.
- Continue this until all elements have been compared.
- Return the size of the array minus the number of duplicate items. The array from index 0 through index (array size minus number of duplicates minus one) now contains sorted, unique values. A C sketch of all this follows.
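Putting it together, here is a sketch in C, reusing the heapsort_ints routine from above. Instead of tracking the leftmost duplicate slot explicitly, the sketch uses the equivalent observation that the next free slot is always the current index minus the number of duplicates seen so far.

```c
#include <stddef.h>

void heapsort_ints(int a[], size_t size);   /* from the sketch above */

/* Sort-then-compact: heapsort puts equal values next to each other,
   then one linear pass copies each unique value over the stale slots. */
size_t remove_duplicates(int a[], size_t size)
{
    if (size == 0)
        return 0;

    heapsort_ints(a, size);         /* O(n log n), in place */

    size_t dups = 0;                /* duplicates seen so far */
    for (size_t i = 0; i < size; i++) {
        if (i + 1 < size && a[i] == a[i + 1])
            dups++;                 /* a[i] duplicates its right neighbor */
        else
            a[i - dups] = a[i];     /* unique: copy left over the stale slots
                                       (a self-copy while dups is still 0) */
    }
    return size - dups;             /* a[0 .. size - dups - 1]: sorted, unique */
}
```

Total cost: O(n log n) for the sort plus O(n) for the pass, comfortably better than the quadratic version, and still no additional data structures.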