[/===========================================================================
 Copyright (c) 2017 Steven Ross, Francisco Tapia, Orson Peters


 Distributed under the Boost Software License, Version 1.0
 See accompanying file LICENSE_1_0.txt or copy at
 http://www.boost.org/LICENSE_1_0.txt
=============================================================================/]

[section:block_indirect_sort 3.1- block_indirect_sort]


[section:block_description Description]
[:

[*BLOCK_INDIRECT_SORT] is a new unstable parallel sort, conceived and implemented by Francisco Jose Tapia for the Boost Library.

The most important characteristics of this algorithm are the *speed* and the *low memory consumption*.

[*[teletype]
``
                          |       |                        |                              |
    Algorithm             |Stable |   Additional memory    |Best, average, and worst case |
    ----------------------+-------+------------------------+------------------------------+
    block_indirect_sort   |  no   |block_size * num_threads|     N, N LogN, N LogN        |
                          |       |                        |                              |
``
]

The block_size is an internal parameter of the algorithm, which in order to achieve the
highest speed, changes according with the size of the objects to sort according to the next table.The strings use a block_size of 128.


[*[teletype]
``
                                |        |         |         |         |         |         |          |
object size (bytes)             | 1 - 15 | 16 - 31 | 32 - 63 | 64 - 127|128 - 255|256 - 511| 512 -    |
--------------------------------+--------+---------+---------+---------+---------+---------+----------+
block_size (number of elements) |  4096  |  2048   |   1024  |   768   |   512   |   256   |  128     |
                                |        |         |         |         |         |         |          |
``
]
]
[endsect]
[br]
[section:block_benchmark Benchmark]
[:
Sorting 100 000 000  64 bits numbers, the measured memory used was:
[*[teletype]
``
                                      |             |             |
    Algorithm                         | Time (secs) | Memory used |
    ----------------------------------+-------------+-------------+
    Open MP Parallel Sort             |   1.1990    |  1564 MB    |
    Threading Building Blocks (TBB)   |   1.6411    |   789 MB    |
    Block Indirect Sort               |   0.9270    |   790 MB    |
                                      |             |             |
``
]
]
[endsect]
[br]

[section:block_programming Programming]
[:
[h4[_Thread specification]]
[:
This algorithm has an integer parameter indicating the *number of threads* to use in the sorting process,
which always is the last value in the call. The default value (if left unspecified) is the number of HW threads of
the machine where the program is running provided by std::thread::hardware_concurrency().

If the number is 1 or 0, the algorithm runs with only 1 thread.

The number of threads is not a fixed number, but is calculated in each execution. The number of threads passed can be greater
than the number of hardware threads on the machine. We can pass 100 threads in a machine with 4 HW threads,
and in the same mode we can pass a function as (std::thread::hardware_concurrency() / 4 ).
If this value is 0, the program is executed with 1 thread.
]
[h4[_Programming]]
[:
You only need to include the file boost/sort/sort.hpp to use this algorithm

The algorithm runs in the namespace boost::sort

[c++]
``
    #include <boost/sort/sort.hpp>


    template <class iter_t>
    void block_indirect_sort (iter_t first, iter_t last);

    template <class iter_t, typename compare>
    void block_indirect_sort (iter_t first, iter_t last, compare comp);

    template <class iter_t>
    void block_indirect_sort (iter_t first, iter_t last, uint32_t num_thread);

    template <class iter_t, typename compare>
    void block_indirect_sort (iter_t first, iter_t last, compare comp, uint32_t num_thread);
``

This algorithm needs a *C++11 compliant compiler*. You don't need any other code or library. With older compilers correct operation is not guaranteed.

If the number of threads is unspecified, use the result of std::thread::hardware_concurrency()

This algorithm uses a *comparison object*, in the same way as the standard library sort
algorithms. If not defined, the comparison object is std::less, which uses
the < operator internally.

This algorithm is [*exception safe],  meaning that, the exceptions generated by the algorithm
guarantee the integrity of the objects to sort, but not their relative order. If the exception
is generated inside the objects (in the move or in the copy constructor.. ) the results can be
unpredictable.
]
]
[endsect]
[br]
[section:block_internal Internal Description]
[:

There are two primary categories of parallelization in sorting algorithms.


[h4[_Subdivision Algorithms]]


[: Filter the data and generate two or more parts. Each part obtained is
filtered and divided by other threads. This filter and division process is done
until the size of the part is smaller than a predefined size, then is sorted by a single thread.

The algorithm most frequently used in the filter and sorting is quick sort

These algorithms are fast with a small number of threads, but are inefficient
with a great number of threads. Examples of this category are

*Intel Threading Building Blocks (TBB)

*Microsoft PPL Parallel Sort.

]
[h4[_Merging Algorithms]]

[:

Divide the data in parts, and each part is sorted by a thread. When
the parts are sorted, they are merged to obtain the final results. These algorithms need additional memory for the
merge, usually the same size as the data.

With a small number of threads, these algorithms have similar speed to
the subdivision algorithms, but with many threads are much faster.

Examples of this category are

*GCC Parallel Sort (based on OpenMP)

*Microsoft PPL Parallel Buffered Sort

This generates an *undesirable duality*. With a small number of threads the optimal algorithm is not the optimal for a big number of threads.

For this reason, the SW designed for a *small machine* is *inadequate* for a *big machine* and vice versa.

Using only *merging algorithms*, has the *problem of the additional memory* used, usually of the same size as the data.
]

[h4[_New Parallel Sort Algorithm (Block Indirect Sort)]]
[:

This algorithm, named Block Indirect Sort, created for processors connected with shared memory, is a hybrid algorithm.

*With small number of threads, it is a subdivision algorithm.

*With many threads it is a merging algorithm, with a small auxiliary memory ( block_size * number of threads).

This algorithm *eliminates the duality*. The same code has *optimal performance* with a small and a big number of threads.

The number of threads to use is evaluated in each execution.
When the program runs  with a *small number of threads* the algorithm
internally uses a *subdivision algorithm* and has similar performance to  TBB, and when run with *many threads*,
it internally uses the *new algorithm* and has the performance of GCC Parallel Sort, with the additional advantage of *reduced memory consumption*.

]
]
[endsect]
[br]

[section:design_process Design Process]

[:
[h4[_Initial Idea]]
[:
The *initial idea*, was to build a *merge algorithm*, to be *fast with many threads, with a low additional memory*.

This algortihm is *not based in any other idea or algorithm*. The technique used in the algorithm (indirect blocks) *is new, and had been conceived and implemented for this algorithm*.

As sample of their results, we can see the the sorting 100 000 000  64 bits numbers, ramdomly generated,
in a machine with 12 threads.

[*[teletype]
``
                                      |             |             |
    Algorithm                         | Time (secs) | Memory used |
    ----------------------------------+-------------+-------------+
    Open MP Parallel Sort             |   1.1990    |  1564 MB    |
    Threading Building Blocks (TBB)   |   1.6411    |   789 MB    |
    Block Indirect Sort               |   0.9270    |   790 MB    |
                                      |             |             |
``
]


The best words about this algorithm are expressed by their [*[link sort.parallel.linux_parallel Benchmarks results]]

]
[h4[_Design process]]
[:
The process had been *long and very hard*, mainly, by the uncertainty about if the ideas are correct and run
so fast as need for to be useful.  This is complicated by the fact that we can’t be sure of the efficiency until the last part
of the code is done and the first benchmark has run.

But it had been a *very exciting process*, each time a problem is resolved, a new internal algorithm is designed,
tested …,  and  see, step by step, the advance of the process.

I discovered *many new problems* during this process, unknown until now, which forced me to design *new internal algorithms* to resolve them,
and divide the work in many parts to execute in parallel mode. Due to this, you can find many nice algorithms inside the sorting algorithm
written to resolve and parallelize the internal problems.


If you are interested in a detailed description of the algorithm, you can find it here : [* [@../papers/block_indirect_sort_en.pdf block_indirect_sort_en.pdf]].

]

]
[endsect]
[endsect]