The Selection Problem

What is the Selection Problem?

The Selection Problem involves finding the kth smallest item in a collection of n items. This item is called an Order Statistic.

Selection Vs Sorting

If the collection is already sorted then you can just pull the order statistic out directly (assuming it's accessible by its index) and if it isn't you could sort it first and then retrieve the order statistic. Since Mergesort and Heapsort have a runtime of \(\Theta(n \log n)\) we know that this would be the runtime using this method, but since we are only choosing one item we don't need to sort the collection, so we can find the item with less computation, and thus selection is its own problem.

Familiar Cases

While you might search for any ranked item (the third item, say), there are three special cases that are common enough to have been given names.

  • the Minimum
  • the Maximum
  • the Median

Minimum and Maximum

The Minimum and Maximum are probably self-explanatory, but to be complete: the Minimum is the order statistic when \(k = 1\) and the Maximum is the order statistic when \(k=n\). As an example of the Selection Problem requiring fewer comparisons than the Sorting Problem we can look at finding the Minimum by traversing the array.

\begin{algorithm}
\caption{Minimum}
\begin{algorithmic}
\INPUT An array of comparable items
\OUTPUT The smallest item
\PROCEDURE{Minimum}{$A$}
  \STATE minimum $\gets$ A[0]
  \FOR{$i \in \{ 1 \ldots A.length - 1\}$}
   \IF {minimum > A[i]}
    \STATE minimum $\gets$ A[i]
   \ENDIF
  \ENDFOR
\ENDPROCEDURE
\end{algorithmic}
\end{algorithm}

Since we're traversing the collection once it will always have a runtime of \(\Theta(n)\). The Maximum can be found the same way but with the greater than sign in the if-statement.

The Median

While most people know what the median is (I once worked with an engineer who thought it was the middle item of the unsorted items, not the item whose value was in the middle, so it might not be true that all people know what it is) but since we're talking about Order Statistics and not Statistics like in "How to Lie With Statistics" we'll make it a little more confusing by using a similar but not exactly the same definition.

\[ k = \left \lceil \frac{n}{2} \right \rceil \]

This definition comes from Levitin. In his description he defines the median as the value with half the remaining values less than it and half the remaining values above it, which suggests that he's defining the case where there are an odd number of items only.

CLRS gives a two-part definition with the odd case:

\[ k = \frac{n + 1}{2} \]

and the case where there are an even number of items broken up into two:

\[ \textbf{Lower Median} = \left \lfloor \frac{n + 1}{2} \right \rfloor \\ \textbf{Upper Median} = \left \lceil \frac{n + 1}{2} \right \rceil \\ \]

With their median being the Lower Median. Despite the differences if you run through some numbers you can see that they're basically the same (adding 1 and then taking the floor after dividing by 2 is the same as the ceiling of dividing by 2 without adding 1).

def ceiling(a: int, b: int) -> int:
    return -(a//-b)

for n in range(1, 10):
    print(f"n={n}: \tLevitin: {ceiling(n, 2)}\tLower Median: {(n + 1)//2}")
n=1:    Levitin: 1      Lower Median: 1
n=2:    Levitin: 1      Lower Median: 1
n=3:    Levitin: 2      Lower Median: 2
n=4:    Levitin: 2      Lower Median: 2
n=5:    Levitin: 3      Lower Median: 3
n=6:    Levitin: 3      Lower Median: 3
n=7:    Levitin: 4      Lower Median: 4
n=8:    Levitin: 4      Lower Median: 4
n=9:    Levitin: 5      Lower Median: 5

So we've defined the median but actually finding the median is a little more complicated than finding the minimum or maximum so I'll save that for later.

Sources