[/
  Copyright Oliver Kowalke 2017.
  Distributed under the Boost Software License, Version 1.0.
  (See accompanying file LICENSE_1_0.txt or copy at
  http://www.boost.org/LICENSE_1_0.txt
]

[#tuning]
[section:tuning Tuning]

[heading Disable synchronization]

With [link cross_thread_sync `BOOST_FIBERS_NO_ATOMICS`] defined at the
compiler's command line, synchronization between fibers (in different
threads) is disabled. This is acceptable if the application is single threaded
and/or fibers are not synchronized between threads.
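
For illustration only (this is not part of the library itself), the following
minimal program runs all of its fibers in a single thread, so it could be built
with `-DBOOST_FIBERS_NO_ATOMICS`:

    // hypothetical example: all fibers share one thread, so no cross-thread
    // synchronization is needed and BOOST_FIBERS_NO_ATOMICS is acceptable
    #include <cstdio>
    #include <boost/fiber/all.hpp>

    int main() {
        boost::fibers::fiber f1( []{ std::puts( "fiber 1"); });
        boost::fibers::fiber f2( []{ std::puts( "fiber 2"); });
        f1.join();
        f2.join();
        return 0;
    }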
[heading Memory allocation]

The memory allocation algorithm is significant for performance in a
multithreaded environment, especially for __boost_fiber__ where fiber stacks
are allocated on the heap. The default user-level memory allocator (UMA) of
glibc is ptmalloc2, but it can be replaced by another UMA that fits the
concrete work-load better. For instance, Google's
[@http://goog-perftools.sourceforge.net/doc/tcmalloc.html TCmalloc] achieves
better performance in the ['skynet] microbenchmark than glibc's default memory
allocator.
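
Since every fiber stack is allocated on the heap, spawning many short-lived
fibers puts pressure on the allocator. As a sketch (using the stack allocators
shipped with __boost_fiber__; the stack size is an arbitrary example value),
a custom stack allocator can be passed when a fiber is launched:

    // sketch: pass an explicit stack allocator instead of relying on the default;
    // fixedsize_stack allocates a stack of the requested size from the heap
    #include <memory>
    #include <boost/fiber/all.hpp>
    #include <boost/fiber/fixedsize_stack.hpp>

    void launch_with_custom_stack() {
        boost::fibers::fixedsize_stack salloc( 64 * 1024); // 64 kB stack (example value)
        boost::fibers::fiber f( std::allocator_arg, salloc, []{ /* work */ });
        f.join();
    }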
[heading Scheduling strategies]

The fibers in a thread are coordinated by a fiber manager. Fibers trade control
cooperatively, rather than preemptively.
Depending on the work-load, several strategies for scheduling the fibers are
possible [footnote 1024cores.net:
[@http://www.1024cores.net/home/scalable-architecture/task-scheduling-strategies Task Scheduling Strategies]]
that can be implemented on behalf of __algo__; a sketch of installing such an
algorithm follows the list below.
* work-stealing: ready fibers are held in a local queue; when the
fiber-scheduler's local queue runs out of ready fibers, it randomly
selects another fiber-scheduler and tries to steal a ready fiber from the
victim (implemented in __work_stealing__ and __numa_work_stealing__)
* work-requesting: ready fibers are held in a local queue; when the
fiber-scheduler's local queue runs out of ready fibers, it randomly
selects another fiber-scheduler and requests a ready fiber; the victim
fiber-scheduler sends a ready fiber back
* work-sharing: ready fibers are held in a global queue; fiber-schedulers
concurrently push and pop ready fibers to/from the global queue
(implemented in __shared_work__)
* work-distribution: fibers that become ready are proactively distributed to
idle fiber-schedulers or fiber-schedulers with low load
* work-balancing: a dedicated (helper) fiber-scheduler periodically collects
information about all fiber-schedulers running in other threads and
re-distributes ready fibers among them
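
As a sketch of how such a strategy is selected (the thread count is an assumed
parameter; each participating thread must make this call), a scheduling
algorithm like __work_stealing__ is installed per thread via
`use_scheduling_algorithm()`:

    // sketch: every participating thread installs the work-stealing scheduler;
    // the algorithm needs to know how many threads take part
    #include <cstdint>
    #include <boost/fiber/all.hpp>
    #include <boost/fiber/algo/work_stealing.hpp>

    void worker( std::uint32_t thread_count) {
        // install the scheduling algorithm for this thread
        boost::fibers::use_scheduling_algorithm<
            boost::fibers::algo::work_stealing >( thread_count);
        // ... launch and join fibers ...
    }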
[heading TTAS locks]

Boost.Fiber internally uses spinlocks to protect critical regions if fibers
running on different threads interact.
Spinlocks are implemented as TTAS (test-test-and-set) locks, i.e. the spinlock
tests the lock before calling an atomic exchange. This strategy helps to reduce
the cache-line invalidations triggered by acquiring/releasing the lock.
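
The pattern looks roughly as follows (an illustrative sketch, not Boost.Fiber's
actual implementation):

    // sketch of a TTAS lock: spin on a plain load first and attempt the
    // (cache-line invalidating) atomic exchange only when the lock looks free
    #include <atomic>

    class ttas_lock {
        std::atomic< bool > locked_{ false };

    public:
        void lock() {
            for (;;) {
                // test: cheap shared read, no cache-line invalidation
                while ( locked_.load( std::memory_order_relaxed) ) {
                    // spin (relax the CPU here, see the next section)
                }
                // test-and-set: attempt the actual acquisition
                if ( ! locked_.exchange( true, std::memory_order_acquire) ) {
                    return;
                }
            }
        }

        void unlock() {
            locked_.store( false, std::memory_order_release);
        }
    };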
[heading Spin-wait loop]

A lock is considered contended if a thread repeatedly fails to acquire the
lock because some other thread was faster.
Waiting for a short time lets other threads finish before trying to enter the
critical section again. While busy waiting on the lock, relaxing the CPU (via
the pause/yield mnemonic) gives the CPU a hint that the code is in a spin-wait
loop:

* it prevents expensive pipeline flushes (speculatively executed load and
compare instructions are not pushed into the pipeline)
* another hardware thread (simultaneous multithreading) can get a time slice
* it does delay a few CPU cycles, but this is necessary to prevent starvation

Obviously this strategy is useless on single-core systems, because the lock
can only be released if the thread gives up its time slice in order to let
other threads run. The macro `BOOST_FIBERS_SPIN_SINGLE_CORE` replaces the CPU
hints (pause/yield mnemonic) by informing the operating system
(via `std::this_thread::yield()`) that the thread gives up its time slice so
that the operating system switches to another thread.
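
A possible shape of such a relax operation (the actual instruction is
platform-specific; the helper name is made up for this sketch):

    // sketch: relax the CPU while spinning; on a single-core machine the only
    // sensible choice is to give up the time slice instead
    #include <thread>

    inline void cpu_relax() {
    #if defined(BOOST_FIBERS_SPIN_SINGLE_CORE)
        std::this_thread::yield();     // let the lock holder run and release the lock
    #elif defined(__i386__) || defined(__x86_64__)
        __builtin_ia32_pause();        // x86 'pause' hint for spin-wait loops (GCC/Clang)
    #else
        std::this_thread::yield();     // portable fallback
    #endif
    }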
[heading Exponential back-off]

The macro `BOOST_FIBERS_RETRY_THRESHOLD` determines how many times the CPU
iterates in the spin-wait loop before yielding the thread or blocking in
futex-wait.
The spinlock tracks how many times the thread failed to acquire the lock.
The higher the contention, the longer the thread should back off.
A ["Binary Exponential Backoff] algorithm together with a randomized contention
window is utilized for this purpose.
`BOOST_FIBERS_CONTENTION_WINDOW_THRESHOLD` determines the upper limit of the
contention window (expressed as an exponent to base two).
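
A sketch of this scheme (names and constants are illustrative, not the
library's own):

    // sketch of binary exponential back-off with a randomized contention window:
    // after each failed lock attempt the window doubles (up to a limit) and the
    // thread spins for a random number of iterations drawn from that window
    #include <algorithm>
    #include <cstddef>
    #include <random>
    #include <thread>

    void backoff( std::size_t & collisions, std::minstd_rand & rng) {
        constexpr std::size_t max_exponent = 16;  // cf. BOOST_FIBERS_CONTENTION_WINDOW_THRESHOLD
        std::size_t const window = std::size_t{ 1 } << (std::min)( collisions, max_exponent);
        std::uniform_int_distribution< std::size_t > dist{ 0, window };
        for ( std::size_t spins = dist( rng); 0 < spins; --spins) {
            std::this_thread::yield();  // stand-in for relaxing the CPU (see above)
        }
        ++collisions;
    }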
[heading Speculative execution (hardware transactional memory)]

Boost.Fiber uses spinlocks to protect critical regions; these can be used
together with transactional memory (see section [link speculation Speculative
execution]).

[note TSX is enabled if the property `htm=tsx` is specified at the b2 command
line and `BOOST_USE_TSX` is applied to the compiler.]

[note A TSX transaction will be aborted if the floating-point state is modified
inside a critical region. As a consequence floating-point operations, e.g.
store/load of floating-point related registers during a fiber (context) switch,
are disabled.]
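
As an illustration of what lock elision with TSX means (a sketch only; it
requires a TSX-capable CPU and compiling with RTM support such as GCC/Clang's
`-mrtm`, and is not Boost.Fiber's actual code):

    // sketch of lock elision: try to run the critical region as a hardware
    // transaction; fall back to really taking the spinlock if the transaction aborts
    #include <atomic>
    #include <immintrin.h>

    void locked_region( std::atomic< bool > & locked, void (*critical_section)()) {
        if ( _XBEGIN_STARTED == _xbegin() ) {
            if ( locked.load( std::memory_order_relaxed) ) {
                _xabort( 0);            // a real lock holder exists: abort the transaction
            }
            critical_section();         // executes transactionally, lock word stays untouched
            _xend();                    // commit
            return;
        }
        // transaction aborted: take the lock for real
        while ( locked.exchange( true, std::memory_order_acquire) ) { /* spin */ }
        critical_section();
        locked.store( false, std::memory_order_release);
    }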
[heading NUMA systems]

Modern multi-socket systems are usually designed as [link numa NUMA systems].
A suitable fiber scheduler like __numa_work_stealing__ reduces
remote memory access (latency).
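
For illustration (following the NUMA-related API documented in the reference
section; the worker parameters are assumptions of this sketch), a NUMA-aware
scheduler is installed per thread together with pinning that thread to a
logical CPU:

    // sketch: pin the worker thread to a logical CPU and install the NUMA-aware
    // work-stealing scheduler; the topology would be obtained once via
    // boost::fibers::numa::topology() and passed to every worker
    #include <cstdint>
    #include <vector>
    #include <boost/fiber/all.hpp>
    #include <boost/fiber/numa/pin_thread.hpp>
    #include <boost/fiber/numa/topology.hpp>
    #include <boost/fiber/numa/algo/work_stealing.hpp>

    void numa_worker( std::uint32_t cpu_id, std::uint32_t node_id,
                      std::vector< boost::fibers::numa::node > const& topo) {
        boost::fibers::numa::pin_thread( cpu_id);   // bind this thread to its logical CPU
        boost::fibers::use_scheduling_algorithm<
            boost::fibers::numa::algo::work_stealing >( cpu_id, node_id, topo);
        // ... launch and join fibers ...
    }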
[heading Parameters]

[table Parameters that might be defined at the compiler's command line
[
    [Parameter]
    [Default value]
    [Effect on Boost.Fiber]
]
[
    [BOOST_FIBERS_NO_ATOMICS]
    [-]
    [no multithreading support, all atomics removed, no synchronization
    between fibers running in different threads]
]
[
    [BOOST_FIBERS_SPINLOCK_STD_MUTEX]
    [-]
    [`std::mutex` used inside spinlock]
]
[
    [BOOST_FIBERS_SPINLOCK_TTAS]
    [+]
    [spinlock with test-test-and-set on a shared variable]
]
[
    [BOOST_FIBERS_SPINLOCK_TTAS_ADAPTIVE]
    [-]
    [spinlock with test-test-and-set on a shared variable, adaptive retries
    while busy waiting]
]
[
    [BOOST_FIBERS_SPINLOCK_TTAS_FUTEX]
    [-]
    [spinlock with test-test-and-set on a shared variable, suspend on
    futex after a certain number of retries]
]
[
    [BOOST_FIBERS_SPINLOCK_TTAS_ADAPTIVE_FUTEX]
    [-]
    [spinlock with test-test-and-set on a shared variable, adaptive retries
    while busy waiting, suspend on futex after a certain number of retries]
]
[
    [BOOST_FIBERS_SPINLOCK_TTAS + BOOST_USE_TSX]
    [-]
    [spinlock with test-test-and-set and speculative execution (Intel TSX
    required)]
]
[
    [BOOST_FIBERS_SPINLOCK_TTAS_ADAPTIVE + BOOST_USE_TSX]
    [-]
    [spinlock with test-test-and-set on a shared variable, adaptive retries
    while busy waiting and speculative execution (Intel TSX required)]
]
[
    [BOOST_FIBERS_SPINLOCK_TTAS_FUTEX + BOOST_USE_TSX]
    [-]
    [spinlock with test-test-and-set on a shared variable, suspend on
    futex after a certain number of retries and speculative execution
    (Intel TSX required)]
]
[
    [BOOST_FIBERS_SPINLOCK_TTAS_ADAPTIVE_FUTEX + BOOST_USE_TSX]
    [-]
    [spinlock with test-test-and-set on a shared variable, adaptive retries
    while busy waiting, suspend on futex after a certain number of retries
    and speculative execution (Intel TSX required)]
]
[
    [BOOST_FIBERS_SPIN_SINGLE_CORE]
    [-]
    [on single-core machines with multiple threads, yield the thread
    (`std::this_thread::yield()`) after collisions]
]
[
    [BOOST_FIBERS_RETRY_THRESHOLD]
    [64]
    [max number of retries while busy spinning, afterwards the fallback is
    used]
]
[
    [BOOST_FIBERS_CONTENTION_WINDOW_THRESHOLD]
    [16]
    [max size of the contention window, expressed as an exponent to base
    two]
]
[
    [BOOST_FIBERS_SPIN_BEFORE_SLEEP0]
    [32]
    [max number of retries that relax the processor before the thread
    sleeps for 0s]
]
[
    [BOOST_FIBERS_SPIN_BEFORE_YIELD]
    [64]
    [max number of retries where the thread sleeps for 0s before yielding
    the thread (`std::this_thread::yield()`)]
]
]

[endsect]