pdqsort.qbk 7.2 KB

123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142
  1. [/===========================================================================
  2. Copyright (c) 2017 Steven Ross, Francisco Tapia, Orson Peters
  3. Distributed under the Boost Software License, Version 1.0
  4. See accompanying file LICENSE_1_0.txt or copy at
  5. http://www.boost.org/LICENSE_1_0.txt
  6. =============================================================================/]
  7. [section:pdqsort 2.2.- pdqsort]
  8. [def __std_sort [@http://en.cppreference.com/w/cpp/algorithm/sort std::sort]]
  9. [def __std_stable_sort [@http://en.cppreference.com/w/cpp/algorithm/stable_sort std::stable_sort]]
  10. [def __std_less [@http://en.cppreference.com/w/cpp/utility/functional/less std::less]]
  11. [def __std_greater [@http://en.cppreference.com/w/cpp/utility/functional/greater std::greater]]
  12. [def __pdqsort_github [@https://github.com/orlp/pdqsort pdqsort]]
  13. [def __pdqsort [^[funcref boost::sort::pdqsort pdqsort]]]
  14. [def __pdqsort_branchless [^[funcref boost::sort::pdqsort_branchless pdqsort_branchless]]]
  15. [section:pdqsort_intro Introduction]
  16. Pattern-defeating quicksort (__pdqsort_github) is a novel sorting algorithm that combines the fast
  17. average case of randomized quicksort with the fast worst case of heapsort, while achieving linear
  18. time on inputs with certain patterns. pdqsort is an extension and improvement of David Mussers
  19. introsort. It is identical in usage to __std_sort.
  20. [endsect] [/section:pdqsort_intro Introduction]
  21. [section:pdqsort_usage Usage]
  22. Pattern-defeating quicksort is available through two calls, __pdqsort and __pdqsort_branchless. Both
  23. have identical guarantees, behavior and overloads as __std_sort, with the exception of the C++17
  24. parallel execution policy argument. Thus you can simply replace a call to __std_sort with __pdqsort,
  25. it is simply a faster implementation. It is almost always appropriate to simply call __pdqsort and
  26. ignore __pdqsort_branchless.
  27. __pdqsort_branchless runs significantly faster than __pdqsort for comparison functions that are
  28. branchless (for an explanation why see 'The Average Case' below), such as simple integer or float
  29. comparison, and slightly slower when the comparison function contains branches. When using
  30. an arithmetic type and a default comparison function (__std_less or __std_greater) __pdqsort
  31. automatically delegates to __pdqsort_branchless. If you know that your custom comparison function is
  32. branchless and wish to make use of this speedup call __pdqsort_branchless explicitly.
  33. [endsect] [/section:pdqsort_usage Usage]
  34. [section:pdqsort_benchmark Benchmark]
  35. A comparison of __pdqsort and GCC's __std_sort and __std_stable_sort with various input distributions:
  36. [$../images/pdqsort.png]
  37. Compiled with `g++ -std=c++11 -O2 -m64 -march=native`.
  38. [endsect] [/section:pdqsort_benchmark Benchmark]
  39. [section:pdqsort_best The Best Case]
  40. __pdqsort is designed to run in linear time for a couple of best-case patterns. Linear time is
  41. achieved for inputs that are in strictly ascending or descending order, only contain equal elements,
  42. or are strictly in ascending order followed by one out-of-place element. There are two seperate
  43. mechanisms at play to achieve this.
  44. For equal elements a smart partitioning scheme is used that always puts equal elements in the
  45. partition containing elements greater than the pivot. When a new pivot is chosen it's compared to
  46. the greatest element in the partition before it. If they compare equal we can derive that there are
  47. no elements smaller than the chosen pivot. When this happens we switch strategy for this partition,
  48. and filter out all elements equal to the pivot.
  49. To get linear time for the other patterns we check after every partition if any swaps were made. If
  50. no swaps were made and the partition was decently balanced we will optimistically attempt to use
  51. insertion sort. This insertion sort aborts if more than a constant amount of moves are required to
  52. sort.
  53. [endsect] [/section:pdqsort_best The Best Case]
  54. [section:pdqsort_avg The Average Case]
  55. On average case data where no patterns are detected __pdqsort is effectively a quicksort that uses
  56. median-of-3 pivot selection, switching to insertion sort if the number of elements to be
  57. (recursively) sorted is small. The overhead associated with detecting the patterns for the best case
  58. is so small it lies within the error of measurement.
  59. __pdqsort gets a great speedup over the traditional way of implementing quicksort when sorting large
  60. arrays (1000+ elements). This is due to a new technique described in "BlockQuicksort: How Branch
  61. Mispredictions don't affect Quicksort" by Stefan Edelkamp and Armin Weiss. In short, we bypass the
  62. branch predictor by using small buffers (entirely in L1 cache) of the indices of elements that need
  63. to be swapped. We fill these buffers in a branch-free way that's quite elegant (in pseudocode):
  64. buffer_num = 0; buffer_max_size = 64;
  65. for (int i = 0; i < buffer_max_size; ++i) {
  66. // With branch:
  67. if (elements[i] < pivot) { buffer[buffer_num] = i; buffer_num++; }
  68. // Without:
  69. buffer[buffer_num] = i; buffer_num += (elements[i] < pivot);
  70. }
  71. [endsect] [/section:pdqsort_avg The Average Case]
  72. [section:pdqsort_worst The Worst Case]
  73. Quicksort naturally performs bad on inputs that form patterns, due to it being a partition-based
  74. sort. Choosing a bad pivot will result in many comparisons that give little to no progress in the
  75. sorting process. If the pattern does not get broken up, this can happen many times in a row. Worse,
  76. real world data is filled with these patterns.
  77. Traditionally the solution to this is to randomize the pivot selection of quicksort. While this
  78. technically still allows for a quadratic worst case, the chances of it happening are astronomically
  79. small. Later, in introsort, pivot selection is kept deterministic, instead switching to the
  80. guaranteed O(n log n) heapsort if the recursion depth becomes too big. In __pdqsort we adopt a hybrid
  81. approach, (deterministically) shuffling some elements to break up patterns when we encounter a "bad"
  82. partition. If we encounter too many "bad" partitions we switch to heapsort.
  83. [endsect] [/section:pdqsort_worst The Worst Case]
  84. [section:pdqsort_bad_partitions Bad Partitions]
  85. A bad partition occurs when the position of the pivot after partitioning is under 12.5% (1/8th)
  86. percentile or over 87,5% percentile - the partition is highly unbalanced. When this happens we will
  87. shuffle four elements at fixed locations for both partitions. This effectively breaks up many
  88. patterns. If we encounter more than log(n) bad partitions we will switch to heapsort.
  89. The 1/8th percentile is not chosen arbitrarily. An upper bound of quicksorts worst case runtime can
  90. be approximated within a constant factor by the following recurrence:
  91. T(n, p) = n + T(p(n-1), p) + T((1-p)(n-1), p)
  92. Where n is the number of elements, and p is the percentile of the pivot after partitioning.
  93. `T(n, 1/2)` is the best case for quicksort. On modern systems heapsort is profiled to be
  94. approximately 1.8 to 2 times as slow as quicksort. Choosing p such that `T(n, 1/2) / T(n, p) ~= 1.9`
  95. as n gets big will ensure that we will only switch to heapsort if it would speed up the sorting.
  96. p = 1/8 is a reasonably close value and is cheap to compute on every platform using a bitshift.
  97. [endsect] [/section:pdqsort_bad_partitions Bad Partitions]
  98. [endsect] [/section:pdqsort 2.2.- pdqsort]