123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142 |
- [/===========================================================================
- Copyright (c) 2017 Steven Ross, Francisco Tapia, Orson Peters
- Distributed under the Boost Software License, Version 1.0
- See accompanying file LICENSE_1_0.txt or copy at
- http://www.boost.org/LICENSE_1_0.txt
- =============================================================================/]
- [section:pdqsort 2.2.- pdqsort]
- [def __std_sort [@http://en.cppreference.com/w/cpp/algorithm/sort std::sort]]
- [def __std_stable_sort [@http://en.cppreference.com/w/cpp/algorithm/stable_sort std::stable_sort]]
- [def __std_less [@http://en.cppreference.com/w/cpp/utility/functional/less std::less]]
- [def __std_greater [@http://en.cppreference.com/w/cpp/utility/functional/greater std::greater]]
- [def __pdqsort_github [@https://github.com/orlp/pdqsort pdqsort]]
- [def __pdqsort [^[funcref boost::sort::pdqsort pdqsort]]]
- [def __pdqsort_branchless [^[funcref boost::sort::pdqsort_branchless pdqsort_branchless]]]
- [section:pdqsort_intro Introduction]
- Pattern-defeating quicksort (__pdqsort_github) is a novel sorting algorithm that combines the fast
- average case of randomized quicksort with the fast worst case of heapsort, while achieving linear
- time on inputs with certain patterns. pdqsort is an extension and improvement of David Mussers
- introsort. It is identical in usage to __std_sort.
- [endsect] [/section:pdqsort_intro Introduction]
- [section:pdqsort_usage Usage]
- Pattern-defeating quicksort is available through two calls, __pdqsort and __pdqsort_branchless. Both
- have identical guarantees, behavior and overloads as __std_sort, with the exception of the C++17
- parallel execution policy argument. Thus you can simply replace a call to __std_sort with __pdqsort,
- it is simply a faster implementation. It is almost always appropriate to simply call __pdqsort and
- ignore __pdqsort_branchless.
- __pdqsort_branchless runs significantly faster than __pdqsort for comparison functions that are
- branchless (for an explanation why see 'The Average Case' below), such as simple integer or float
- comparison, and slightly slower when the comparison function contains branches. When using
- an arithmetic type and a default comparison function (__std_less or __std_greater) __pdqsort
- automatically delegates to __pdqsort_branchless. If you know that your custom comparison function is
- branchless and wish to make use of this speedup call __pdqsort_branchless explicitly.
- [endsect] [/section:pdqsort_usage Usage]
- [section:pdqsort_benchmark Benchmark]
- A comparison of __pdqsort and GCC's __std_sort and __std_stable_sort with various input distributions:
- [$../images/pdqsort.png]
- Compiled with `g++ -std=c++11 -O2 -m64 -march=native`.
- [endsect] [/section:pdqsort_benchmark Benchmark]
- [section:pdqsort_best The Best Case]
- __pdqsort is designed to run in linear time for a couple of best-case patterns. Linear time is
- achieved for inputs that are in strictly ascending or descending order, only contain equal elements,
- or are strictly in ascending order followed by one out-of-place element. There are two seperate
- mechanisms at play to achieve this.
- For equal elements a smart partitioning scheme is used that always puts equal elements in the
- partition containing elements greater than the pivot. When a new pivot is chosen it's compared to
- the greatest element in the partition before it. If they compare equal we can derive that there are
- no elements smaller than the chosen pivot. When this happens we switch strategy for this partition,
- and filter out all elements equal to the pivot.
- To get linear time for the other patterns we check after every partition if any swaps were made. If
- no swaps were made and the partition was decently balanced we will optimistically attempt to use
- insertion sort. This insertion sort aborts if more than a constant amount of moves are required to
- sort.
- [endsect] [/section:pdqsort_best The Best Case]
- [section:pdqsort_avg The Average Case]
- On average case data where no patterns are detected __pdqsort is effectively a quicksort that uses
- median-of-3 pivot selection, switching to insertion sort if the number of elements to be
- (recursively) sorted is small. The overhead associated with detecting the patterns for the best case
- is so small it lies within the error of measurement.
- __pdqsort gets a great speedup over the traditional way of implementing quicksort when sorting large
- arrays (1000+ elements). This is due to a new technique described in "BlockQuicksort: How Branch
- Mispredictions don't affect Quicksort" by Stefan Edelkamp and Armin Weiss. In short, we bypass the
- branch predictor by using small buffers (entirely in L1 cache) of the indices of elements that need
- to be swapped. We fill these buffers in a branch-free way that's quite elegant (in pseudocode):
- buffer_num = 0; buffer_max_size = 64;
- for (int i = 0; i < buffer_max_size; ++i) {
- // With branch:
- if (elements[i] < pivot) { buffer[buffer_num] = i; buffer_num++; }
- // Without:
- buffer[buffer_num] = i; buffer_num += (elements[i] < pivot);
- }
- [endsect] [/section:pdqsort_avg The Average Case]
- [section:pdqsort_worst The Worst Case]
- Quicksort naturally performs bad on inputs that form patterns, due to it being a partition-based
- sort. Choosing a bad pivot will result in many comparisons that give little to no progress in the
- sorting process. If the pattern does not get broken up, this can happen many times in a row. Worse,
- real world data is filled with these patterns.
- Traditionally the solution to this is to randomize the pivot selection of quicksort. While this
- technically still allows for a quadratic worst case, the chances of it happening are astronomically
- small. Later, in introsort, pivot selection is kept deterministic, instead switching to the
- guaranteed O(n log n) heapsort if the recursion depth becomes too big. In __pdqsort we adopt a hybrid
- approach, (deterministically) shuffling some elements to break up patterns when we encounter a "bad"
- partition. If we encounter too many "bad" partitions we switch to heapsort.
- [endsect] [/section:pdqsort_worst The Worst Case]
- [section:pdqsort_bad_partitions Bad Partitions]
- A bad partition occurs when the position of the pivot after partitioning is under 12.5% (1/8th)
- percentile or over 87,5% percentile - the partition is highly unbalanced. When this happens we will
- shuffle four elements at fixed locations for both partitions. This effectively breaks up many
- patterns. If we encounter more than log(n) bad partitions we will switch to heapsort.
- The 1/8th percentile is not chosen arbitrarily. An upper bound of quicksorts worst case runtime can
- be approximated within a constant factor by the following recurrence:
- T(n, p) = n + T(p(n-1), p) + T((1-p)(n-1), p)
- Where n is the number of elements, and p is the percentile of the pivot after partitioning.
- `T(n, 1/2)` is the best case for quicksort. On modern systems heapsort is profiled to be
- approximately 1.8 to 2 times as slow as quicksort. Choosing p such that `T(n, 1/2) / T(n, p) ~= 1.9`
- as n gets big will ensure that we will only switch to heapsort if it would speed up the sorting.
- p = 1/8 is a reasonably close value and is cheap to compute on every platform using a bitshift.
- [endsect] [/section:pdqsort_bad_partitions Bad Partitions]
- [endsect] [/section:pdqsort 2.2.- pdqsort]
|