- [/===========================================================================
- Copyright (c) 2017 Steven Ross, Francisco Tapia, Orson Peters
- Distributed under the Boost Software License, Version 1.0
- See accompanying file LICENSE_1_0.txt or copy at
- http://www.boost.org/LICENSE_1_0.txt
- =============================================================================/]
- [section:spreadsort 2.1.-Spreadsort]
- [/ Some composite templates]
- [template super[x]'''<superscript>'''[x]'''</superscript>''']
- [template sub[x]'''<subscript>'''[x]'''</subscript>''']
- [template floor[x]'''⌊'''[x]'''⌋''']
- [template floorlr[x][lfloor][x][rfloor]]
- [template ceil[x] '''⌈'''[x]'''⌉''']
- [/ Required for autoindexing]
- [import ../../../tools/auto_index/include/auto_index_helpers.qbk]
- [/ Must be first included file!]
- [/Files containing quickbook snippets]
- [import ../example/charstringsample.cpp]
- [import ../example/stringfunctorsample.cpp]
- [import ../example/reverseintsample.cpp]
- [import ../example/rightshiftsample.cpp]
- [import ../example/int64.cpp]
- [import ../example/floatfunctorsample.cpp]
- [import ../example/generalizedstruct.cpp]
- [import html4_symbols.qbk] [/ Provides various useful squiggles.]
- [def __spreadsort [@http://en.wikipedia.org/wiki/Spreadsort spreadsort]]
- [def __introsort [@http://en.wikipedia.org/wiki/Introsort introsort]]
- [def __stl_sort [@http://www.cplusplus.com/reference/algorithm/sort/ STL std::sort]]
- [def __big_o [@http://en.wikipedia.org/wiki/Big_O_notation Big O notation]]
- [def __radix_sort[@http://en.wikipedia.org/wiki/Radix_sort radix sort]]
- [def __adaptive_left_reflex [@http://www.nik.no/2002/Maus.pdf Arne Maus, Adaptive Left Reflex]]
- [def __american_flag [@http://en.wikipedia.org/wiki/American_flag_sort American flag sort]]
- [def __overloading [link sort.overview.overloading overloading]]
- [def __lookup [link sort.rationale.lookup lookup]]
- [def __random_access_iter [@http://en.cppreference.com/w/cpp/concept/RandomAccessIterator RandomAccessIterator]]
- [def __strictweakordering [@http://en.wikipedia.org/wiki/Weak_ordering#Strict_weak_orderings strict weak ordering]]
- [/ Links to functions for use in text]
- [def __integer_sort [^[funcref boost::sort::spreadsort::integer_sort integer_sort]]]
- [def __float_sort [^[funcref boost::sort::spreadsort::float_sort float_sort]]]
- [def __string_sort [^[funcref boost::sort::spreadsort::string_sort string_sort]]]
- [def __spreadsort [^[funcref boost::sort::spreadsort::spreadsort spreadsort]]] [/Note diff from Wiki link __spreadsort]
- [def __std_sort [@http://en.cppreference.com/w/cpp/algorithm/sort std::sort]]
- [section:overview Overview]
- [section:intro Introduction]
- Spreadsort combines generic implementations of multiple high-speed sorting algorithms
- that outperform those in the C++ standard in both average and worst case performance
- when there are over 1000 elements in the list to sort.
- They fall back to __stl_sort on small data sets.
- [warning These algorithms all only work on
- [@http://www.cplusplus.com/reference/iterator/RandomAccessIterator/ random access iterators].]
- They are hybrids using both radix and comparison-based sorting,
- specialized to sorting common data types, such as integers, floats, and strings.
- These algorithms are encoded in a generic fashion and accept functors,
- enabling them to sort any object that can be processed like these basic data types.
- In the case of __string_sort, this includes anything
- with a defined strict-weak-ordering that __std_sort can sort,
- but writing efficient functors for some complex key types
- may not be worth the additional effort relative to just using __std_sort,
- depending on how important speed is to your application.
- Sample usages are available in the example directory.
- Unlike many radix-based algorithms,
- the underlying __spreadsort algorithm is designed around [*worst-case performance].
- It performs better on chunky data (where it is not widely distributed),
- so that on real data it can perform substantially better than on random data.
- Conceptually, __spreadsort can sort any data for which an absolute ordering can be determined,
- and __string_sort is sufficiently flexible that this should be possible.
- Situations where __spreadsort is fastest relative to __std_sort:
- # Large number of elements to sort (['N] >= 10000).
- # Slow comparison function (such as floating-point numbers on x86 processors or strings).
- # Large data elements (such as key + data sorted on a key).
- # Completely sorted data, as __spreadsort has an optimization to quit early in this case.
- Situations where __spreadsort is slower than __std_sort:
- # Data sorted in reverse order. Both __std_sort and __spreadsort are faster
- on reverse-ordered data than randomized data,
- but __std_sort speeds up more in this special case.
- # Very small amounts of data (< 1000 elements).
- For this reason there is a fallback in __spreadsort to __std_sort
- if the input size is less than 1000,
- so performance is identical for small amounts of data in practice.
- These functions are defined in `namespace boost::sort::spreadsort`.
- [endsect] [/section Introduction]
- [section:overloading Overloading]
- [tip In the Boost.Sort C++ Reference section, click on the appropriate overload, for example `float_sort(RandomAccessIter, RandomAccessIter, Right_shift, Compare);` to get full details of that overload.]
- Each of __integer_sort, __float_sort, and __string_sort have 3 main versions:
- The base version, which takes a first iterator and a last iterator, just like __std_sort:
- integer_sort(array.begin(), array.end());
- float_sort(array.begin(), array.end());
- string_sort(array.begin(), array.end());
- The version with an overridden shift functor, providing flexibility
- in case the `operator>>` already does something other than a bitshift.
- The rightshift functor takes two arguments: first the data element,
- and second the number of bits to shift right.
- For __string_sort this variant is slightly different;
- it needs a bracket functor equivalent to `operator`\[\],
- taking a number corresponding to the character offset,
- along with a second `getlength` functor to get the length of the string in characters.
- In all cases, this operator must return an integer type that compares with the
- `operator<` to provide the intended order
- (integers can be negated to reverse their order).
- In other words (aside from negative floats, which are inverted as ints):
- rightshift(A, n) < rightshift(B, n) -> A < B
- A < B -> rightshift(A, 0) < rightshift(B, 0)
- [rightshift_1]
- [bracket_1]
- See [@../../example/rightshiftsample.cpp rightshiftsample.cpp] for a working example of integer sorting with a rightshift functor.
- And a version with a comparison functor for maximum flexibility.
- This functor must provide the same sorting order as the integers returned by the rightshift (aside from negative floats):
- rightshift(A, n) < rightshift(B, n) -> compare(A, B)
- compare(A, B) -> rightshift(A, 0) < rightshift(B, 0)
- [reverse_int_2]
- Examples of functors are:
- [lessthan_functor]
- [bracket_functor]
- [getsize_functor]
- and these functors are used thus:
- [stringsort_functors_call]
- See [@../../example/stringfunctorsample.cpp stringfunctorsample.cpp] for a working example of sorting strings with all functors.
- [endsect] [/section:overloading Overloading]
- [section:performance Performance]
- The __spreadsort algorithm is a hybrid algorithm;
- when the number of elements being sorted is below a certain number,
- comparison-based sorting is used. Above it, radix sorting is used.
- The radix-based algorithm will thus cut up the problem into small pieces,
- and either completely sort the data based upon its radix if the data is clustered,
- or finish sorting the cut-down pieces with comparison-based sorting.
- The Spreadsort algorithm dynamically chooses
- either comparison-based or radix-based sorting when recursing,
- whichever provides better worst-case performance.
- This way worst-case performance is guaranteed to be the better of
- ['[bigo](N[sdot]log2(N))] comparisons and ['[bigo](N[sdot]log2(K/S + S))] operations where
- * ['N] is the number of elements being sorted,
- * ['K] is the length in bits of the key, and
- * ['S] is a constant.
- This results in substantially improved performance for large ['N];
- __integer_sort tends to be 50% to 2X faster than __std_sort,
- while __float_sort and __string_sort are roughly 2X faster than __std_sort.
- Performance graphs are provided for __integer_sort, __float_sort, and __string_sort in their description.
- Runtime Performance comparisons and graphs were made on a Core 2 Duo laptop
- running Windows Vista 64 with MSVC 8.0,
- and an old G4 laptop running Mac OSX with gcc.
- [@http://www.boost.org/build/doc/html/ Boost bjam/b2] was used to control compilation.
- Direct performance comparisons on a newer x86 system running Ubuntu,
- with the fallback to __std_sort at lower input sizes disabled are below.
- [note The fallback to __std_sort for smaller input sizes prevents
- the worse performance seen on the left sides of the first two graphs.]
- __integer_sort starts to become faster than __std_sort at about 1000 integers (4000 bytes),
- and __string_sort becomes faster than __std_sort at slightly fewer bytes (as few as 30 strings).
- [note The 4-threaded graph has 4 threads doing [*separate sorts simultaneously]
- (not splitting up a single sort)
- as a test for thread cache collision and other multi-threaded performance issues.]
- __float_sort times are very similar to __integer_sort times.
- [/ These provide links to the images, but currently graphs are shown - see below]
- [/@../../doc/images/single_threaded.png single_threaded.png] [/file:///I:/modular-boost/libs/sort/doc/images/single_threaded.png]
- [/@../../doc/images/4_threaded.png 4_threaded.png]
- [/@../../doc/images/entropy.png entropy.png]
- [/@../../doc/images/bits_per_byte.png bits_per_byte.png]
- [$../images/single_threaded.png] [/<img src="../images/single_threaded.png"> == file:///I:/modular-boost/libs/sort/doc/images/single_threaded.png]
- [$../images/4_threaded.png]
- [$../images/entropy.png]
- [$../images/bits_per_byte.png]
- Histogramming with a fixed maximum number of splits is used
- because it reduces the number of cache misses,
- thus improving performance relative to the approach described in detail
- in the [@http://en.wikipedia.org/wiki/Spreadsort original SpreadSort publication].
- The importance of cache-friendly histogramming is described
- in __adaptive_left_reflex,
- though without the worst-case handling described below.
- The time taken per radix iteration is:
- * ['[bigo](N)] iterations over the data,
- * ['[bigo](N)] integer-type comparisons (even for __float_sort and __string_sort),
- * ['[bigo](N)] swaps, and
- * ['[bigo](2[super S])] bin operations.
- To obtain ['[bigo](N)] worst-case performance per iteration,
- the restriction ['S <= log2(N)] is applied, and ['[bigo](2[super S])] becomes ['[bigo](N)].
- For each such iteration, the number of unsorted bits log2(range)
- (referred to as ['K]) per element is reduced by ['S].
- As ['S] decreases depending upon the amount of elements being sorted,
- it can drop from a maximum of ['S[sub max]] to the minimum of ['S[sub min]].
- Assumption: __std_sort is assumed to be ['[bigo](N*log2(N))],
- as __introsort exists and is commonly used.
- (If you have a quibble with this, please take it up with the implementor of your __std_sort;
- you're welcome to replace the recursive calls to __std_sort with calls
- to __introsort if your __std_sort library call is poorly implemented.)
- [@http://en.wikipedia.org/wiki/Introsort Introsort] is not included with this algorithm for simplicity and
- because the implementor of the __std_sort call
- is assumed to know what they're doing.
- To maintain a minimum value for ['S (S[sub min])],
- comparison-based sorting has to be used to sort when
- ['n <= log2(meanbinsize)], where ['log2(meanbinsize) (lbs)] is a small constant,
- usually between 0 and 4, used to minimize bin overhead per element.
- There is a small corner-case where if ['K < S[sub min]] and ['n >= 2[super K]],
- then the data can be sorted in a single radix-based iteration with an ['S = K]
- (this bucketsorting special case is by default only applied to __float_sort).
- So for the final recursion, worst-case performance is:
- 1 radix-based iteration if ['K <= S[sub min]],
- or ['S[sub min] + lbs] comparison-based iterations if ['K > S[sub min]] but ['n <= 2[super (S[sub min] + lbs)]].
- So for the final iteration, worst-case runtime is ['[bigo](N*(S[sub min] + lbs))] but
- if ['K > S[sub min]] and ['N > 2[super (S[sub min] + lbs)]]
- then more than 1 radix recursion will be required.
- For the second to last iteration, ['K <= S[sub min] * 2 + 1] can be handled,
- (if the data is divided into ['2[super (S[sub min] + 1)]] pieces)
- or if ['N < 2[super (S[sub min] + lbs + 1)]],
- then it is faster to fallback to __std_sort.
- In the case of a radix-based sort plus recursion, it will take
- ['[bigo](N*(S[sub min] + lbs)) + [bigo](N) = [bigo](N*(S[sub min] + lbs + 1))] worst-case time,
- as
- ['K_remaining = K_start - (S[sub min] + 1)], and ['K_start <= S[sub min] * 2 + 1].
- Alternatively, comparison-based sorting is used if ['N < 2[super (S[sub min] + lbs + 1)]],
- which will take ['[bigo](N*(S[sub min] + lbs + 1))] time.
- So either way ['[bigo](N*(S[sub min] + lbs + 1))] is the worst-case time for the second to last iteration,
- which occurs if ['K <= S[sub min] * 2 + 1] or ['N < 2[super (S[sub min] + lbs + 1)]].
- This continues as long as ['S[sub min] <= S <= S[sub max]],
- so that for ['K_m <= K_(m-1) + S[sub min] + m] where ['m]
- is the maximum number of iterations after this one has finished,
- or where ['N < 2[super (S[sub min] + lbs + m)]],
- then the worst-case runtime is ['[bigo](N*(S[sub min] + lbs + m))].
- [space][space]['K_m] at ['m <= (S[sub max] - S[sub min])] works out to:
- [space][space]['K_1 <= (S[sub min]) + S[sub min] + 1 <= 2S[sub min] + 1]
- [space][space]['K_2 <= (2S[sub min] + 1) + S[sub min] + 2]
- as the sum from 0 to ['m] is ['m(m + 1)/2]
- [space][space]['K_m <= (m + 1)S[sub min] + m(m + 1)/2 <= (S[sub min] + m/2)(m + 1)]
- substituting in S[sub max] - S[sub min] for m
- [space][space]['K_(S[sub max] - S[sub min]) <= (S[sub min] + (S[sub max] - S[sub min])/2)*(S[sub max] - S[sub min] + 1)]
- [space][space]['K_(S[sub max] - S[sub min]) <= (S[sub min] + S[sub max]) * (S[sub max] - S[sub min] + 1)/2]
- Since this involves ['S[sub max] - S[sub min] + 1] iterations,
- this works out to dividing ['K] into an average ['(S[sub min] + S[sub max])]/2
- pieces per iteration.
- To finish the problem from this point takes ['[bigo](N * (S[sub max] - S[sub min]))] for ['m] iterations,
- plus the worst-case of ['[bigo](N*(S[sub min] + lbs))] for the last iteration,
- for a total of ['[bigo](N *(S[sub max] + lbs))] time.
- When ['m > S[sub max] - S[sub min]], the problem is divided into ['S[sub max]] pieces per iteration,
- or __std_sort is called if ['N < 2[super (m + S[sub min] + lbs)]]. For this range:
- [space][space]['K_m <= K_(m - 1) + S[sub max]], providing runtime of
- [space][space]['[bigo](N *((K - K_(S[sub max] - S[sub min]))/S[sub max] + S[sub max] + lbs))] if recursive,
- or ['[bigo](N * log(2[super (m + S[sub min] + lbs)]))] if comparison-based,
- which simplifies to ['[bigo](N * (m + S[sub min] + lbs))],
- which substitutes to ['[bigo](N * ((m - (S[sub max] - S[sub min])) + S[sub max] + lbs))],
- which given that ['m - (S[sub max] - S[sub min]) <= (K - K_(S[sub max] - S[sub min]))/S[sub max]]
- (otherwise a lesser number of radix-based iterations would be used)
- also comes out to ['[bigo](N *((K - K_(S[sub max] - S[sub min]))/S[sub max] + S[sub max] + lbs))].
- Asymptotically, for large ['N] and large ['K], this simplifies to:
- [space][space]['[bigo](N * (K/S[sub max] + S[sub max] + lbs))],
- simplifying out the constants related to the ['S[sub max] - S[sub min]] range,
- providing an additional ['[bigo](N * (S[sub max] + lbs))] runtime on top of the
- ['[bigo](N * (K/S))] performance of LSD __radix_sort,
- but without the ['[bigo](N)] memory overhead.
- For simplicity, because ['lbs] is a small constant
- (0 can be used, and performs reasonably),
- it is ignored when summarizing the performance in further discussions.
- Because it checks whether comparison-based sorting would be faster at each step,
- Spreadsort is also bounded by ['[bigo](N*log(N))], taking whichever is better,
- and unlike LSD __radix_sort, can perform much better than its worst-case
- if the data is either evenly distributed or highly clustered.
- This analysis was for __integer_sort and __float_sort.
- __string_sort differs in that ['S[sub min] = S[sub max] = sizeof(Char_type) * 8],
- ['lbs] is 0, and that __std_sort's comparison is not a constant-time operation,
- so strictly speaking __string_sort runtime is
- [space][space]['[bigo](N * (K/S[sub max] + (S[sub max] comparisons)))].
- Worst-case, this ends up being ['[bigo](N * K)]
- (where ['K] is the mean string length in bytes),
- as described for __american_flag, which is better than the
- [space][space]['[bigo](N * K * log(N))]
- worst-case for comparison-based sorting.
- [endsect] [/section:performance Performance]
- [section:tuning Tuning]
- __integer_sort and __float_sort have tuning constants that control
- how the radix-sorting portion of those algorithms work.
- The ideal constant values for __integer_sort and __float_sort vary depending on
- the platform, compiler, and data being sorted.
- By far the most important constant is ['max_splits],
- which defines how many pieces the radix-sorting portion splits
- the data into per iteration.
- The ideal value of ['max_splits] depends upon the size of the L1 processor cache,
- and is between 10 and 13 on many systems.
- A default value of 11 is used. For mostly-sorted data, a much larger value is better,
- as swaps (and thus cache misses) are rare,
- but this hurts runtime severely for unsorted data, so is not recommended.
- On some x86 systems, when the total number of elements being sorted is small
- (less than 1 million or so), the ideal ['max_splits] can be substantially larger,
- such as 17. This is suspected to be because all the data fits into the L2 cache,
- and misses from L1 cache to L2 cache do not impact performance
- as severely as misses to main memory.
- Modifying tuning constants other than ['max_splits] is not recommended,
- as the performance improvement for changing other constants is usually minor.
- If you can afford to let it run for a day, and have at least 1GB of free memory,
- the perl command: `./tune.pl -large -tune` (UNIX)
- or `perl tune.pl -large -tune -windows` (Windows)
- can be used to automatically tune these constants.
- This should be run from the `libs/sort` directory inside the boost home directory.
- This will identify the ideal `constants.hpp` settings for your system,
- testing on various distributions in a 20 million element (80MB) file,
- and additionally verifies that all sorting routines sort correctly
- across various data distributions.
- Alternatively, you can test with the file size you're most concerned with
- `./tune.pl number -tune` (UNIX) or `perl tune.pl number -tune -windows` (Windows).
- Substitute the number of elements you want to test with for `number`.
- Otherwise, just use the default options; they're decent.
- With default settings `./tune.pl -tune` (UNIX) `perl tune.pl -tune -windows` (Windows),
- the script will take hours to run (less than a day),
- but may not pick the correct ['max_splits] if it is over 10.
- Alternatively, you can add the `-small` option to make it take just a few minutes,
- tuning for smaller vector sizes (one hundred thousand elements),
- but the resulting constants may not be good for large files
- (see above note about ['max_splits] on Windows).
- The tuning script can also be used just to verify that sorting works correctly
- on your system, and to see how much of a speedup it gets,
- by omitting the `-tune` option. This verification runs at the end of tuning runs.
- Default args will take about an hour to run and give accurate results
- on decent-sized test vectors. `./tune.pl -small` (UNIX) or `perl tune.pl -small -windows` (Windows)
- is a faster option that tests on smaller vectors and isn't as accurate.
- If any differences are encountered during tuning, please call `tune.pl` with `-debug > log_file_name`.
- If the resulting log file contains compilation or permissions issues,
- it is likely an issue with your setup.
- If some other type of error is encountered (or result differences),
- please send them to the library author at spreadsort@gmail.com.
- Including the zipped `input.txt` that was being used is also helpful.
- [endsect] [/section:tuning Tuning]
- [endsect] [/section Overview]
- [section:sort_hpp Spreadsort]
- [section:header_spreadsort Header `<boost/sort/spreadsort/spreadsort.hpp>`]
- __spreadsort checks whether the data-type provided is an integer,
- castable float, string, or wstring.
- * If data-type is an integer, __integer_sort is used.
- * If data-type is a float, __float_sort is used.
- * If data-type is a string or wstring, __string_sort is used.
- * Sorting other data-types requires picking between
- __integer_sort, __float_sort and __string_sort directly,
- as __spreadsort won't accept types that don't have the appropriate type traits.
- Overloading variants are provided that permit use of user-defined right-shift functors and comparison functors.
- Each function is optimized for its set of arguments; default functors are not provided to avoid the risk of any reduction of performance.
- See __overloading section.
- [h5 Rationale:]
- The __spreadsort function provides a wrapper that calls the fastest sorting algorithm
- available for a data-type, enabling faster generic programming.
- [section:spreadsort_examples Spreadsort Examples]
- See [@../../example/ example] folder for all examples.
- See [@../../example/sample.cpp sample.cpp] for a simple working example.
- For an example of 64-bit integer sorting, see [@../../example/int64.cpp int64.cpp].
- This example sets the element type of a vector to 64-bit integer
- [int64bit_1]
- and calls the sort
- [int64bit_2]
- For a simple example sorting `float`s,
- vector<float> vec;
- vec.push_back(1.0);
- vec.push_back(2.3);
- vec.push_back(1.3);
- ...
- spreadsort(vec.begin(), vec.end());
- //The sorted vector contains "1.0 1.3 2.3 ..."
- See also [@../../example/floatsample.cpp floatsample.cpp] which checks for abnormal values.
- [endsect] [/section:spreadsort_examples Spreadsort Examples]
- [endsect] [/section:header_spreadsort Header `<boost/sort/spreadsort/spreadsort.hpp>`]
- [section:integer_sort Integer Sort]
- __integer_sort is a fast templated in-place hybrid radix/comparison algorithm,
- which in testing tends to be roughly 50% to 2X faster than
- __std_sort for large tests (>=100kB).
- Worst-case performance is ['[bigo](N * (log2(range)/s + s))],
- so __integer_sort is asymptotically faster than pure comparison-based algorithms.
- ['s] is ['max_splits], which defaults to 11,
- so its worst-case with default settings for 32-bit integers is ['[bigo](N)] times
- (32/11 slow radix-based iterations plus 11 fast comparison-based iterations).
- Some performance plots of runtime vs. n and log2(range) are provided below:
- [@../../doc/graph/windows_integer_sort.htm Windows Integer Sort]
- [@../../doc/graph/osx_integer_sort.htm OSX integer Sort]
- [section:integersort_examples Integer Sort Examples]
- See [@../../example/rightshiftsample.cpp rightshiftsample.cpp] for a working example of using rightshift, using a user-defined functor:
- [rightshift_int_functor]
- Other examples:
- [@../../example/keyplusdatasample.cpp Sort structs using an integer key.]
- [@../../example/reverseintsample.cpp Sort integers in reverse order.]
- [@../../example/mostlysorted.cpp Simple sorting of integers; this case is a performance test for integers that are already mostly sorted.]
- [endsect] [/section:integersort_examples Integer Sort Examples]
- [endsect] [/section:integer_sort Integer Sort]
- [section:float_sort Float Sort]
- __float_sort is a fast templated in-place hybrid radix/comparison algorithm much like __integer_sort, but sorts IEEE floating-point numbers (positive, zero, NaN, and negative) into ascending order by casting them to integers. This works because positive IEEE floating-point numbers sort like integers with the same bits, and negative IEEE floating-point numbers sort in the reverse order of integers with the same bits. float_sort is roughly 2X as fast as std::sort.
- -0.0 vs. 0.0 and NaN are given definitive ordered positions by the radix-based portion of this algorithm, where comparison-based sorting does not guarantee their relative position. The included tests avoid creating NaN and -0.0 so that results match std::sort, which is not consistent in how it handles these numbers, as they compare as equal to numbers with different values.
- float_sort checks the size of the data type and whether it is castable, picking
- an integer type to cast to, if a casting functor isn't provided by the user.
- float_mem_cast casts IEEE floating-point numbers (positive, zero, NaN, and negative) into integers. This is an essential utility for creating a custom rightshift functor for float_sort, when one is needed. Only IEEE floating-point numbers of the same size as the integer type being cast to should be used in this specialized method call.
- Worst-case performance is ['[bigo](N * (log2(range)/s + s))],
- so __float_sort is asymptotically faster than pure comparison-based algorithms.
- ['s] is ['max_splits], which defaults to 11,
- so its worst-case with default settings for 32-bit integers is ['[bigo](N)] times
- (32/11 slow radix-based iterations plus 11 fast comparison-based iterations).
- Some performance plots of runtime vs. n and log2(range) are provided below:
- [@../../doc/graph/windows_float_sort.htm Windows Float Sort]
- [@../../doc/graph/osx_float_sort.htm OSX Float Sort]
- [section:floatsort_examples Float Sort Examples]
- See [@../../example/floatfunctorsample.cpp floatfunctorsample.cpp] for a working example of how to sort structs with a float key:
- [float_functor_types]
- [float_functor_datatypes]
- Right-shift functor:
- [float_functor_rightshift]
- Comparison lessthan `operator<` functor:
- [float_functor_lessthan]
- Other examples:
- [@../../example/double.cpp Sort doubles.]
- [@../../example/shiftfloatsample.cpp Sort floats using a rightshift functor.]
- [endsect] [/section:floatsort_examples Float Sort Examples]
- [endsect] [/section:float_sort Float Sort]
- [section:string_sort String Sort]
- __string_sort is a hybrid radix-based/comparison-based algorithm that sorts strings of characters (or arrays of binary data) in ascending order.
- The simplest version (no functors) sorts strings of items that can be cast to an unsigned data type (such as an `unsigned char`), have an `operator<`, have a size function, and have a `data()` function that returns a pointer to an array of characters, such as `std::string`. The functor version can sort any data type that has a strict weak ordering, via templating, but requires definitions of a `get_char` functor (acts like x[offset] on a string or a byte array), a `get_length` functor (returns the length of the string being sorted), and a comparison functor. Individual characters returned by `get_char` must support the `!=` operator and have an unsigned value that defines their lexicographical order.
- This algorithm is not efficient for character types larger than 2 bytes each, and is optimized for one-byte character strings. For this reason, __std_sort will be called instead if the character type is of size > 2.
- __string_sort has a special optimization for identical substrings. This adds some overhead on random data, but identical substrings are common in real strings.
- reverse_string_sort sorts strings in reverse (descending) order, but is otherwise identical. __string_sort is sufficiently flexible that it should sort any data type that __std_sort can, assuming the user provides appropriate functors that index into a key.
- [@../../doc/graph/windows_string_sort.htm Windows String Sort]
- [@../../doc/graph/osx_string_sort.htm OSX String Sort]
- [section:stringsort_examples String Sort Examples]
- See [@../../example/stringfunctorsample.cpp stringfunctorsample.cpp] for an example of how to sort structs using a string key and all functors:
- [lessthan_functor]
- [bracket_functor]
- [getsize_functor]
- and these functors are used thus:
- [stringsort_functors_call]
- See [@../../example/generalizedstruct.cpp generalizedstruct.cpp] for a working example of a generalized approach to sort structs by a sequence of integer, float, and multiple string keys using string_sort:
- [generalized_functors]
- [generalized_functors_call]
- Other examples:
- [@../../example/stringsample.cpp String sort.]
- [@../../example/reversestringsample.cpp Reverse string sort.]
- [@../../example/wstringsample.cpp Wide character string sort.]
- [@../../example/caseinsensitive.cpp Case insensitive string sort.]
- [@../../example/charstringsample.cpp Sort structs using a string key and indexing functors.]
- [@../../example/reversestringfunctorsample.cpp Sort structs using a string key and all functors in reverse order.]
- [endsect] [/section:stringsort_examples String Sort Examples]
- [endsect] [/section:string_sort String Sort]
- [section:rationale Rationale]
- [section:radix_sorting Radix Sorting]
- Radix-based sorting allows the data to be divided up into more than 2 pieces per iteration,
- and for cache-friendly versions, it normally cuts the data up into around a thousand pieces per iteration.
- This allows many fewer iterations to be used to complete sorting the data,
- enabling performance superior to the ['[bigo](N*log(N))] comparison-based sorting limit.
- [endsect] [/section:radix_sorting Radix Sorting]
- [section:hybrid_radix Hybrid Radix]
- There are two primary types of radix-based sorting:
- Most-significant-digit Radix sorting (MSD) divides the data recursively
- based upon the top-most unsorted bits.
- This approach is efficient for even distributions that divide nicely,
- and can be done in-place (limited additional memory used).
- There is substantial constant overhead for each iteration to deal
- with the splitting structure.
- The algorithms provided here use MSD Radix Sort for their radix-sorting portion.
- The main disadvantage of MSD Radix sorting is that when the data is cut up into small
- pieces, the overhead for additional recursive calls starts to dominate runtime,
- and this makes worst-case behavior substantially worse than ['[bigo](N*log(N))].
- By contrast, __integer_sort, __float_sort, and __string_sort all check to see
- whether Radix-based or Comparison-based sorting have better worst-case runtime,
- and make the appropriate recursive call.
- Because Comparison-based sorting algorithms are efficient on small pieces,
- the tendency of MSD __radix_sort to cut the problem up small is turned into
- an advantage by these hybrid sorts. It is hard to conceive of a common usage case
- where pure MSD __radix_sort would have any significant advantage
- over hybrid algorithms.
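The hybrid structure described above can be sketched in a few dozen lines. The following is a simplified illustration, not the library's implementation: one MSD radix pass splits 32-bit keys into 256 buckets by the top unsorted byte, and buckets below an (illustrative) fallback threshold are handed to a comparison sort, with std::sort standing in for __introsort. The real __spreadsort tunes bucket counts and fallback thresholds far more carefully:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified hybrid MSD radix sort: radix passes on large ranges,
// comparison-based fallback (std::sort) on small ones.
inline void msd_hybrid(std::vector<std::uint32_t> &v, std::size_t lo,
                       std::size_t hi, int shift) {
    const std::size_t fallback_size = 64; // illustrative threshold
    if (hi - lo <= fallback_size || shift < 0) {
        std::sort(v.begin() + lo, v.begin() + hi);
        return;
    }
    // Count the 256 bucket sizes for the current byte.
    std::size_t count[256] = {};
    for (std::size_t i = lo; i < hi; ++i)
        ++count[(v[i] >> shift) & 0xFF];
    // Compute bucket start offsets.
    std::size_t start[257];
    start[0] = lo;
    for (int b = 0; b < 256; ++b) start[b + 1] = start[b] + count[b];
    // Scatter into a copy, then copy back. (The real algorithm permutes
    // in place, at the cost of more bookkeeping.)
    std::vector<std::uint32_t> tmp(hi - lo);
    std::size_t next[256];
    std::copy(start, start + 256, next);
    for (std::size_t i = lo; i < hi; ++i)
        tmp[next[(v[i] >> shift) & 0xFF]++ - lo] = v[i];
    std::copy(tmp.begin(), tmp.end(), v.begin() + lo);
    // Recurse into each non-trivial bucket on the next byte.
    for (int b = 0; b < 256; ++b)
        if (start[b + 1] - start[b] > 1)
            msd_hybrid(v, start[b], start[b + 1], shift - 8);
}

inline void msd_hybrid(std::vector<std::uint32_t> &v) {
    if (!v.empty()) msd_hybrid(v, 0, v.size(), 24);
}
```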
- Least-significant-digit __radix_sort (LSD) sorts based upon
- the least-significant bits first. This requires a complete copy of the data being sorted,
- using substantial additional memory. The main advantage of LSD Radix Sort
- is that aside from some constant overhead and the memory allocation,
- it uses a fixed amount of time per element to sort, regardless of distribution or
- size of the list. This time is proportional to the length of the key in radix-sized digits.
- The other advantage of LSD Radix Sort is that it is a stable sorting algorithm,
- so elements with the same key will retain their original order.
- One disadvantage is that LSD Radix Sort uses the same amount of time
- to sort "easy" sorting problems as "hard" sorting problems,
- and this time spent may end up being greater than an efficient ['[bigo](N*log(N))]
- algorithm such as __introsort spends sorting "hard" problems on large data sets,
- depending on the length of the datatype, and relative speed of comparisons,
- memory allocation, and random accesses.
- The other main disadvantage of LSD Radix Sort is its memory overhead.
- It's only faster for large data sets, but large data sets use significant memory,
- which LSD Radix Sort needs to duplicate. LSD Radix Sort doesn't make sense for items
- of variable length, such as strings; it could be implemented by starting at the end
- of the longest element, but would be extremely inefficient.
- All that said, there are places where LSD Radix Sort is the appropriate and
- fastest solution, so it would be appropriate to create a templated LSD Radix Sort
- similar to __integer_sort and __float_sort. This would be most appropriate in cases where
- comparisons are extremely slow.
- [endsect] [/section:hybrid_radix Hybrid Radix]
- [section:why_spreadsort Why spreadsort?]
- The __spreadsort algorithm used in this library is designed to provide best possible
- worst-case performance, while still being cache-friendly.
- It provides the better of ['[bigo](N*log(K/S + S))] and ['[bigo](N*log(N))] worst-case time,
- where ['K] is the log of the range. The log of the range is normally the length in bits
- of the data type; 32 for a 32-bit integer.
- `flash_sort` (another hybrid algorithm) is, by comparison, ['[bigo](N)]
- for evenly distributed lists. The problem is that `flash_sort` is merely an MSD __radix_sort
- combined with ['[bigo](N*N)] insertion sort to deal with small subsets where
- the MSD Radix Sort is inefficient, so it is inefficient with chunks of data
- around the size at which it switches to `insertion_sort`, and ends up operating
- as an enhanced MSD Radix Sort.
- For uneven distributions this makes it especially inefficient.
- __integer_sort and __float_sort use __introsort instead, which provides ['[bigo](N*log(N))]
- performance for these medium-sized pieces. Also, `flash_sort`'s ['[bigo](N)]
- performance for even distributions comes at the cost of cache misses,
- which on modern architectures are extremely expensive, and in testing
- on modern systems ends up being slower than cutting up the data in multiple,
- cache-friendly steps. Also worth noting is that on most modern computers,
- `log2(available RAM)/log2(L1 cache size)` is around 3,
- a cache miss takes more than 3 times as long as an in-cache random access,
- and the size of ['max_splits] is tuned to the size of the cache.
- On a computer where cache misses aren't this expensive, ['max_splits]
- could be increased to a large value, or eliminated entirely,
- and `integer_sort/float_sort` would have the same ['[bigo](N)] performance
- on even distributions.
- Adaptive Left Radix (ALR) is similar to `flash_sort`, but more cache-friendly.
- It still uses insertion_sort. Because ALR uses ['[bigo](N*N)] `insertion_sort`,
- it isn't efficient to use the comparison-based fallback sort on large lists,
- and if the data is clustered in small chunks just over the fallback size
- with a few outliers, radix-based sorting iterates many times doing little sorting
- with high overhead. Asymptotically, ALR is still ['[bigo](N*log(K/S + S))],
- but with a very small ['S] (about 2 in the worst case),
- which compares unfavorably with the default ['max_splits] value of 11 used by
- Spreadsort.
- ALR also does not have the ['[bigo](N*log(N))] fallback, so for small lists
- that are not evenly distributed it is extremely inefficient.
- See the `alrbreaker` and `binaryalrbreaker` testcases for examples;
- either replace the call to sort with a call to ALR and update the ALR_THRESHOLD
- at the top, or as a quick comparison make `get_max_count` return `ALR_THRESHOLD`
- (20 by default based upon the paper).
- These small tests take 4-10 times as long with ALR as __std_sort
- in the author's testing, depending on the test system,
- because they are trying to sort a highly uneven distribution.
- Normal Spreadsort does much better with them, because `get_max_count`
- is designed around minimizing worst-case runtime.
- `burst_sort` is an efficient hybrid algorithm for strings that
- uses substantial additional memory.
- __string_sort uses minimal additional memory by comparison.
- Speed comparisons between the two haven't been made,
- but the better memory efficiency makes __string_sort more general.
- `postal_sort` and __string_sort are similar. A direct performance comparison
- would be welcome, but an efficient version of `postal_sort` was not found
- in a search for source.
- __string_sort is most similar to the __american_flag algorithm.
- The main difference is that it doesn't bother trying to optimize how empty
- buckets/piles are handled, instead just checking to see if all characters
- at the current index are equal. Other differences are using __std_sort
- as the fallback algorithm, and a larger fallback size (256 vs. 16),
- which makes empty pile handling less important.
- Another difference is not applying the stack-size restriction.
- Because of the equality check in __string_sort, it would take ['m*m] memory
- worth of strings to force __string_sort to create a stack of depth ['m].
- This problem isn't a realistic one on modern systems with multi-megabyte stacksize
- limits, where main memory would be exhausted holding the long strings necessary
- to exceed the stacksize limit. __string_sort can be thought of as modernizing
- __american_flag to take advantage of __introsort as a fallback algorithm.
- In the author's testing, __american_flag (on `std::strings`) had comparable runtime
- to __introsort, but making a hybrid of the two allows reduced overhead and
- substantially superior performance.
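The identical-substring check mentioned above (testing whether all characters at the current index are equal before doing a radix pass) can be sketched as follows. The function name and shape are illustrative, not the library's internals:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Returns true if every string in the range has the same character at
// position pos. When this holds, bucketing on pos cannot split the
// range, so the sort can simply advance to the next character index
// instead of paying for a radix pass.
inline bool all_chars_equal(const std::vector<std::string> &v,
                            std::size_t pos) {
    if (v.empty() || v.front().size() <= pos) return false;
    for (const std::string &s : v) {
        // A string shorter than pos + 1 differs from any longer one.
        if (s.size() <= pos || s[pos] != v.front()[pos])
            return false;
    }
    return true;
}
```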
- [endsect] [/section:why_spreadsort]
- [section:unstable_sort Unstable Sorting]
- Making a __radix_sort stable requires the usage of an external copy of the data.
- A stable hybrid algorithm also requires a stable comparison-based algorithm,
- and these are generally slow. LSD __radix_sort uses an external copy of the data,
- and provides stability, along with likely being faster (than a stable hybrid sort),
- so that's probably a better way to go for integer and floating-point types.
- It might make sense to make a stable version of __string_sort using external memory,
- but for simplicity this has been left out for now.
- [endsect] [/section:unstable_sort Unstable Sorting]
- [section:optimization Unused X86 optimization]
- Though the ideal ['max_splits] for `n < 1 million` (or so) on x86
- ['seems] to be substantially larger, enabling a roughly 15% speedup for such tests,
- this optimization isn't general, and doesn't apply for `n > 1 million`.
- A too large ['max_splits] can cause sort to take more than twice as long,
- so it should be set on the low end of the reasonable range, where it is right now.
- [endsect] [/section:optimization Unused X86 optimization]
- [section:lookup Lookup Table?]
- The ideal way to optimize the constants would be to have a carefully-tuned
- lookup-table instead of the `get_max_count` function, but 4 tuning variables
- is simpler, `get_max_count` enforces worst-case performance minimization rules,
- and such a lookup table would be difficult to optimize
- for cross-platform performance.
- Alternatively, `get_max_count` could be used to generate a static lookup table.
- This hasn't been done due to concerns about cross-platform compatibility
- and flexibility.
- [endsect] [/section:lookup]
- [endsect] [/section:rationale Rationale]
- [endsect] [/section:sort_hpp Spreadsort]
- [section:definitions Definitions]
- [h4 stable sort]
- A sorting approach that preserves pre-existing order.
- If there are two elements with identical keys in a list that is later stably sorted,
- whichever came first in the initial list will come first in a stably sorted list.
- The algorithms in this library make no such guarantee; items with identical keys
- will end up in arbitrary order relative to each other.
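To illustrate the definition, this small sketch sorts key/label pairs with std::stable_sort, which does make the stability guarantee: equal keys keep their original relative order. The hybrid sorts in this library do not promise this:

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Stable sort by the integer key only; labels record original order.
inline std::vector<std::pair<int, std::string>>
stable_by_key(std::vector<std::pair<int, std::string>> v) {
    std::stable_sort(v.begin(), v.end(),
                     [](const std::pair<int, std::string> &a,
                        const std::pair<int, std::string> &b) {
                         return a.first < b.first;
                     });
    return v;
}
```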
- [endsect] [/section:definitions Definitions]
- [section:faq Frequently Asked Questions]
- There are no FAQs yet.
- [endsect] [/section:faq Frequently Asked Questions]
- [section:acks Acknowledgements]
- * The author would like to thank his wife Mary for her patience and support
- during the long process of converting this from a piece of C code
- to a template library.
- * The author would also like to thank Phil Endecott and Frank Gennari
- for the improvements they've suggested and for testing.
- Without them this would have taken longer to develop or performed worse.
- * `float_mem_cast` was fixed to be safe and fast thanks to Scott McMurray.
- That fix was critical for a high-performance cross-platform __float_sort.
- * Thanks also for multiple helpful suggestions provided by Steven Watanabe,
- Edouard Alligand, and others.
- * Initial documentation was refactored to use Quickbook by Paul A. Bristow.
- [endsect] [/section:acknowledgements Acknowledgements]
- [section:bibliog Bibliography]
- [h4 Standard Template Library Sort Algorithms]
- [@http://www.cplusplus.com/reference/algorithm/sort/ C++ STL sort algorithms].
- [h4 Radix Sort]
- A type of algorithm that sorts based upon distribution instead of by comparison.
- Wikipedia has an article about Radix Sorting.
- A more detailed description of various Radix Sorting algorithms is provided here:
- Donald Knuth. The Art of Computer Programming,
- Volume 3: Sorting and Searching, Second Edition. Addison-Wesley, 1998.
- ISBN 0-201-89685-0. Section 5.2.5: Sorting by Distribution, pp.168-179.
- [h4 Introsort]
- A high-speed comparison-based sorting algorithm that takes ['[bigo](N * log(N))] time.
- See __introsort and
- Musser, David R. (1997). "Introspective Sorting and Selection Algorithms",
- Software: Practice and Experience (Wiley) 27 (8), pp 983-993,
- available at [@http://www.cs.rpi.edu/~musser/gp/introsort.ps Musser Introsort].
- [h4 American Flag Sort]
- A high-speed hybrid string sorting algorithm that __string_sort is partially based
- upon. See __american_flag and Peter M. McIlroy, Keith Bostic, M. Douglas McIlroy. Engineering Radix Sort, Computing Systems 1993.
- [h4 Adaptive Left Radix (ARL)]
- ARL (Adaptive Left Radix) is a hybrid cache-friendly integer sorting algorithm
- with speed comparable to __integer_sort on random data,
- but it lacks the optimizations for worst-case performance,
- causing it to perform poorly on certain types of unevenly distributed data.
- Arne Maus, [@http://www.nik.no/2002/Maus.pdf ARL, a faster in-place, cache friendly sorting algorithm],
- presented at NIK2002, Norwegian Informatics Conference, Kongsberg, 2002. Tapir, ISBN 82-91116-45-8.
- [h4 Original Spreadsort]
- The algorithm that __integer_sort was originally based on.
- __integer_sort uses a smaller number of key bits at a time for better cache efficiency
- than the method described in the paper.
- The importance of cache efficiency grew as CPU clock speeds increased
- while main memory latency stagnated.
- See Steven J. Ross,
- The Spreadsort High-performance General-case Sorting Algorithm,
- Parallel and Distributed Processing Techniques and Applications, Volume 3, pp. 1100-1106. Las Vegas, Nevada. 2002. See
- [@../../doc/papers/original_spreadsort06_2002.pdf Steven Ross spreadsort_2002].
- [endsect] [/section:bibliography Bibliography]
- [section:history History]
- * First release following review in Boost 1.58.
- * [@http://permalink.gmane.org/gmane.comp.lib.boost.devel/255194 Review of Boost.Sort/Spreadsort library]
- [endsect] [/section:history]
- [xinclude autodoc.xml] [/ Using Doxygen reference documentation.]
- [endsect] [/section:spreadsort]
|