Partial sorting
In computer science, partial sorting is a relaxed variant of the sorting problem. If sorting is the problem of permuting a list of items so that its elements all appear in order, then partial sorting amounts to finding a permutation in which some predetermined set of positions hold the same elements they would hold in the fully sorted list; in other words, for each requested index i, element i of the partially sorted list is the ith order statistic of the input list.
Selecting k smallest or largest elements
A special case of partial sorting is that of selecting the k smallest or k largest elements, which is particularly useful when we want to present just the "top k" of an unsorted list, such as the top 100 corporations by gross sales.
Direct application of the linear-time selection algorithm
A linear-time selection algorithm can be used to find the k smallest or the k largest elements in worst-case linear time O(n). To find the k smallest elements, first find the kth smallest element using the linear-time median-of-medians selection algorithm, then partition the array with that element as pivot. The k smallest elements end up as the first k elements of the array.
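A minimal Python sketch of this select-then-partition approach. For brevity it uses a randomized quickselect in place of median of medians, so the running time is expected rather than worst-case linear; the function names are illustrative.

import random

def partition(a, lo, hi, pivot_index):
    # Lomuto partition of a[lo..hi] around a[pivot_index]; returns the pivot's final index.
    a[pivot_index], a[hi] = a[hi], a[pivot_index]
    pivot = a[hi]
    store = lo
    for i in range(lo, hi):
        if a[i] < pivot:
            a[i], a[store] = a[store], a[i]
            store += 1
    a[store], a[hi] = a[hi], a[store]
    return store

def quickselect(a, lo, hi, k):
    # Rearrange a[lo..hi] so that index k holds the (k+1)-th smallest element
    # and no element before index k is larger than it.
    while lo < hi:
        p = partition(a, lo, hi, random.randint(lo, hi))
        if p == k:
            return
        elif k < p:
            hi = p - 1
        else:
            lo = p + 1

def smallest_k(items, k):
    # Return the k smallest elements of items, in no particular order.
    a = list(items)
    if k <= 0:
        return []
    quickselect(a, 0, len(a) - 1, k - 1)
    return a[:k]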
Data structure-based solutions
Another simple method is to insert each element of the list into a bounded priority queue, such as a heap or self-balancing binary search tree, holding at most k elements. Whenever the data structure has more than k elements, we remove the largest one, which can be done in O(log k) time. Each insertion also takes O(log k) time, resulting in O(n log k) time overall. If a heap is used and the k results must themselves be in order, a final sort of the retained elements adds O(k log k), giving O((n + k) log k) in total.
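The following Python sketch keeps the k smallest elements in a size-k heap; since heapq provides a min-heap, values are negated to simulate a max-heap (numeric items are assumed, and the function name is illustrative).

import heapq

def smallest_k_bounded_heap(items, k):
    # Maintain the k smallest values seen so far in a max-heap of size k,
    # simulated by negating values in heapq's min-heap.
    if k <= 0:
        return []
    heap = []
    for x in items:
        if len(heap) < k:
            heapq.heappush(heap, -x)        # O(log k)
        elif x < -heap[0]:                  # smaller than the largest retained value
            heapq.heapreplace(heap, -x)     # pop-and-push in O(log k)
    return sorted(-v for v in heap)         # optional final sort: O(k log k)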
It is also possible to transform the list into a binary heap in Θ(n) time and then traverse the heap using a modified breadth-first search that places the visited elements in a priority queue (instead of the ordinary queue normally used in BFS), terminating the scan after k elements have been extracted. Since the priority queue never holds more than O(k) elements, the traversal requires O(k log k) time, leading to a time bound of O(n + k log k) for this algorithm.
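A possible Python rendering of this idea, operating on the implicit array representation of the heap (identifiers are illustrative):

import heapq

def smallest_k_from_heap(items, k):
    # Build a binary min-heap in Θ(n), then visit its nodes in increasing order
    # using a secondary priority queue of (value, position) pairs. Each of the
    # k extractions pushes at most two children, so the queue stays O(k) in size.
    heap = list(items)
    heapq.heapify(heap)                       # Θ(n)
    if not heap or k <= 0:
        return []
    result = []
    candidates = [(heap[0], 0)]               # start at the root of the implicit tree
    while candidates and len(result) < k:
        value, i = heapq.heappop(candidates)  # O(log k)
        result.append(value)
        for child in (2 * i + 1, 2 * i + 2):  # children of node i in the array layout
            if child < len(heap):
                heapq.heappush(candidates, (heap[child], child))
    return result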
Incremental partial sorting can be achieved with skip lists, sorted data structures that allow insertion, deletion and indexed retrieval in O(log n) time. Thus, for any given percentile, we can insert a new element into (and possibly delete an old element from) the list in O(log n) time, calculate the corresponding index(es), and access the percentile value in O(log n) time. This can be used, for example, to maintain a running median.
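A sketch of the running-median idea. It uses sortedcontainers.SortedList, a third-party package, as a stand-in for an indexable skip list, since it offers comparable logarithmic-time insertion, deletion and index-based access; the window parameter and names are illustrative.

from collections import deque
from sortedcontainers import SortedList   # stand-in for an indexable skip list

def running_median(stream, window):
    # Yield the median of the most recent `window` values of `stream`.
    values = SortedList()
    recent = deque()
    for x in stream:
        values.add(x)                        # logarithmic-time insertion
        recent.append(x)
        if len(recent) > window:
            values.remove(recent.popleft())  # logarithmic-time deletion of the expired value
        n = len(values)
        mid = n // 2
        # indexed retrieval of the middle element(s)
        yield values[mid] if n % 2 else (values[mid - 1] + values[mid]) / 2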
Optimised sorting algorithms
More efficient than any of these are specialized partial sorting algorithms based on mergesort and quicksort. The simplest is the quicksort variant: there is no need to recursively sort partitions that contain only elements falling after the kth place in the final sorted order (counting from the left boundary). Thus, if the pivot falls in position k or later, we recurse only on the left partition:
function quicksortFirstK(list, left, right, k)
    if right > left
        select pivotIndex between left and right
        pivotNewIndex := partition(list, left, right, pivotIndex)
        quicksortFirstK(list, left, pivotNewIndex-1, k)
        if pivotNewIndex < left + k
            quicksortFirstK(list, pivotNewIndex+1, right, k)
The resulting algorithm requires only O(n + k log k) expected time and is quite efficient in practice, especially if we substitute selection sort when k becomes small relative to n. However, the worst-case running time is still poor when pivots are chosen badly, just as for ordinary quicksort. Pivot selection along the lines of the worst-case linear-time selection algorithm could be used to get better worst-case performance.
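As a concrete illustration, here is one possible Python rendering of the pseudocode above; the random pivot choice and the Lomuto partition are implementation choices not fixed by the pseudocode.

import random

def partition(a, left, right, pivot_index):
    # Lomuto partition of a[left..right]; returns the pivot's final position.
    a[pivot_index], a[right] = a[right], a[pivot_index]
    pivot = a[right]
    store = left
    for i in range(left, right):
        if a[i] < pivot:
            a[i], a[store] = a[store], a[i]
            store += 1
    a[store], a[right] = a[right], a[store]
    return store

def quicksort_first_k(a, left, right, k):
    # Sort only as far as needed: afterwards a[left:left+k] holds the k smallest
    # elements of a[left..right] in ascending order.
    if right > left:
        p = partition(a, left, right, random.randint(left, right))
        quicksort_first_k(a, left, p - 1, k)
        if p < left + k:            # the right partition may still contain top-k elements
            quicksort_first_k(a, p + 1, right, k)

For example, quicksort_first_k(data, 0, len(data) - 1, 10) leaves the ten smallest elements of data, in order, in data[:10].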
Even better, if we do not require those k items to be themselves sorted, we can ignore all partitions that fall entirely before or after the kth place and recurse only into the partition that actually contains the kth element itself:
function quickfindFirstK(list, left, right, k)
    if right > left
        select pivotIndex between left and right
        pivotNewIndex := partition(list, left, right, pivotIndex)
        if pivotNewIndex > left + k  // new condition
            quickfindFirstK(list, left, pivotNewIndex-1, k)
        if pivotNewIndex < left + k
            quickfindFirstK(list, pivotNewIndex+1, right, k+left-pivotNewIndex-1)
The resulting algorithm requires only O(n) expected time, which is asymptotically optimal, since every element must be examined at least once.
Tournament algorithm
Another method is the tournament algorithm. The idea is to conduct a knockout tournament with a minimal number of rounds to decide the ranks. It first organises the games (comparisons) between adjacent pairs and moves the winners on to the next round until the champion (the overall best) is decided, constructing the tournament tree along the way. The second-best element must then be among the direct losers to the champion, and these losers can be found by walking the tournament tree in O(log n) time. The algorithm organises another tournament to decide the second best among these candidates; the third best must in turn be one of the losers to the second best in either of the two tournament trees, and so on until k elements have been found. This algorithm takes O(n + k log n) time, which for any fixed k independent of n is O(n).
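One common way to realise this with a single tree is a winner (tournament) tree over the input: after extracting the current winner, only the matches on its path to the root are replayed. The Python sketch below assumes numeric keys (it uses an infinity sentinel to retire extracted elements); the layout and names are illustrative.

import math

def tournament_smallest_k(items, k):
    # Winner tree: leaves hold the items, each internal node records which
    # descendant leaf won (is smaller). Building it takes Θ(n); each of the
    # k extractions replays only the O(log n) matches on the winner's root path.
    n = len(items)
    size = 1 << max(1, (n - 1).bit_length())     # round the leaf count up to a power of two
    INF = math.inf
    keys = list(items) + [INF] * (size - n)      # pad with sentinel losers
    tree = [0] * (2 * size)                      # tree[1] is the root; leaves start at index size
    for i in range(size):
        tree[size + i] = i                       # leaves point to their own key index
    for node in range(size - 1, 0, -1):          # play the initial matches bottom-up
        l, r = tree[2 * node], tree[2 * node + 1]
        tree[node] = l if keys[l] <= keys[r] else r
    result = []
    for _ in range(min(k, n)):
        winner = tree[1]
        result.append(keys[winner])
        keys[winner] = INF                       # retire the extracted winner
        node = (size + winner) // 2
        while node >= 1:                         # replay matches along the root path
            l, r = tree[2 * node], tree[2 * node + 1]
            tree[node] = l if keys[l] <= keys[r] else r
            node //= 2
    return result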
Language/library support
- The C++ standard specifies a library function called std::partial_sort.
- The Python standard library includes the functions nlargest and nsmallest in its heapq module, as in the example below.
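For instance, the Python functions can be used directly:

import heapq

data = [9, 1, 8, 2, 7, 3]
print(heapq.nsmallest(3, data))  # [1, 2, 3]
print(heapq.nlargest(2, data))   # [9, 8]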
External links
- J.M. Chambers (1971). Partial sorting. CACM 14(5):357–358.
- Partial quicksort