[/
      Copyright Oliver Kowalke 2016.
      Distributed under the Boost Software License, Version 1.0.
      (See accompanying file LICENSE_1_0.txt or copy at
      http://www.boost.org/LICENSE_1_0.txt
]

[section:performance Performance]

Performance measurements were taken using `std::chrono::high_resolution_clock`,
with overhead corrections.
The code was compiled with gcc-6.3.1, using build options:
variant = release, optimization = speed.
Tests were executed on dual Intel XEON E5 2620v4 2.2GHz, 16C/32T, 64GB RAM,
running Linux (x86_64).
Measurements labelled 1C/1T were taken in a single-threaded process.

The [@https://github.com/atemerev/skynet microbenchmark ['skynet]] from
Alexander Temerev was ported and used for the performance measurements.
At the root the test spawns 10 threads-of-execution (ToE), e.g.
actor/goroutine/fiber etc. Each spawned ToE spawns 10 additional ToEs,
and so on, until *1,000,000* ToEs are created. ToEs return their ordinal
numbers (0 ... 999,999), which are summed on the previous level and sent
back upstream, until the sum arrives at the root. The test was run 10-20
times, producing a range of values for each measurement.
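
The structure of the ported benchmark, as used for the fiber measurements
below, is roughly the following Boost.Fiber sketch; the channel capacities,
the launch policy and the stack-allocator size are assumptions made for
illustration, not the exact benchmark source.

    // Minimal sketch of the skynet recursion with Boost.Fiber; channel
    // capacities, launch policy and stack-allocator size are assumptions.
    #include <cstdint>
    #include <cstdio>
    #include <functional>
    #include <memory>
    #include <boost/fiber/all.hpp>

    using allocator_type = boost::fibers::fixedsize_stack;
    using channel_type   = boost::fibers::buffered_channel< std::uint64_t >;

    // A leaf fiber reports its ordinal number; every other fiber spawns
    // 10 children and pushes the sum of their results upstream.
    void skynet( allocator_type & salloc, channel_type & c,
                 std::uint64_t num, std::uint64_t size, std::uint64_t div) {
        if ( 1 == size) {
            c.push( num);
        } else {
            channel_type rc{ 16 };
            for ( std::uint64_t i = 0; i < div; ++i) {
                auto sub_num = num + i * size / div;
                boost::fibers::fiber{ boost::fibers::launch::dispatch,
                                      std::allocator_arg, salloc,
                                      skynet,
                                      std::ref( salloc), std::ref( rc),
                                      sub_num, size / div, div }.detach();
            }
            std::uint64_t sum = 0;
            for ( std::uint64_t i = 0; i < div; ++i) {
                sum += rc.value_pop();
            }
            c.push( sum);
        }
    }

    int main() {
        // small per-fiber stacks keep 1,000,000 fibers affordable
        allocator_type salloc{ 2 * allocator_type::traits_type::page_size() };
        channel_type rc{ 2 };
        boost::fibers::fiber{ boost::fibers::launch::dispatch,
                              std::allocator_arg, salloc,
                              skynet,
                              std::ref( salloc), std::ref( rc),
                              std::uint64_t( 0), std::uint64_t( 1000000),
                              std::uint64_t( 10) }.detach();
        // expected: 0 + 1 + ... + 999,999 == 499999500000
        std::printf( "%ju\n", static_cast< std::uintmax_t >( rc.value_pop() ) );
        return 0;
    }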

[table time per actor/erlang process/goroutine (other languages) (average over 1,000,000)
    [
        [Haskell | stack-1.4.0/ghc-8.0.1]
        [Go | go1.8.1]
        [Erlang | erts-8.3]
    ]
    [
        [0.05 \u00b5s - 0.06 \u00b5s]
        [0.42 \u00b5s - 0.49 \u00b5s]
        [0.63 \u00b5s - 0.73 \u00b5s]
    ]
]

Pthreads are created with a stack size of 8kB, while `std::thread`s use the
system default (1MB - 2MB). The microbenchmark could *not* be *run* with
1,000,000 threads because of *resource exhaustion* (pthread and `std::thread`).
Instead, the test spawns only *10,000* threads.

[table time per thread (average over 10,000 - unable to spawn 1,000,000 threads)
    [
        [pthread]
        [`std::thread`]
        [`std::async`]
    ]
    [
        [54 \u00b5s - 73 \u00b5s]
        [52 \u00b5s - 73 \u00b5s]
        [106 \u00b5s - 122 \u00b5s]
    ]
]
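
For comparison, the `std::thread` variant measured above might be approximated
by the following sketch; the use of `std::packaged_task` and the hard-coded
leaf count are assumptions for illustration only.

    // Rough sketch of a std::thread variant, limited to 10,000
    // threads-of-execution; 1,000,000 OS threads exhausts resources.
    #include <cstdint>
    #include <cstdio>
    #include <future>
    #include <thread>
    #include <utility>
    #include <vector>

    std::uint64_t skynet( std::uint64_t num, std::uint64_t size, std::uint64_t div) {
        if ( 1 == size) {
            return num;
        }
        std::vector< std::future< std::uint64_t > > results;
        std::vector< std::thread > threads;
        for ( std::uint64_t i = 0; i < div; ++i) {
            std::packaged_task< std::uint64_t() > task{
                [=]{ return skynet( num + i * size / div, size / div, div); } };
            results.push_back( task.get_future() );
            threads.emplace_back( std::move( task) );   // one OS thread per ToE
        }
        std::uint64_t sum = 0;
        for ( auto & f : results) {
            sum += f.get();
        }
        for ( auto & t : threads) {
            t.join();
        }
        return sum;
    }

    int main() {
        // expected: 0 + 1 + ... + 9,999 == 49995000
        std::printf( "%ju\n", static_cast< std::uintmax_t >( skynet( 0, 10000, 10) ) );
        return 0;
    }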

The fiber test utilizes 16 cores with Simultaneous MultiThreading enabled
(32 logical CPUs). The fiber stacks are allocated by __fixedsize_stack__.
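
Passing a __fixedsize_stack__ to an individual fiber uses the same
`std::allocator_arg` mechanism shown in the sketch above; a minimal,
hedged example (the requested size is illustrative, not the benchmark's
value):

    // Sketch: a single fiber whose stack is provided by a fixedsize_stack.
    #include <memory>
    #include <boost/fiber/all.hpp>

    int main() {
        boost::fibers::fixedsize_stack salloc{
            boost::fibers::fixedsize_stack::traits_type::minimum_size() };
        boost::fibers::fiber f{ std::allocator_arg, salloc,
                                []{ /* fiber body */ } };
        f.join();
        return 0;
    }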

As the benchmark shows, the memory allocation algorithm is significant for
performance in a multithreaded environment. The tests use glibc's memory
allocation algorithm (based on ptmalloc2) as well as Google's
[@http://goog-perftools.sourceforge.net/doc/tcmalloc.html TCmalloc] (via
linkflags="-ltcmalloc").[footnote
Tais B. Ferreira, Rivalino Matias, Autran Macedo, Lucio B. Araujo,
["An Experimental Study on Memory Allocators in Multicore and
Multithreaded Applications"], PDCAT '11 Proceedings of the 2011 12th
International Conference on Parallel and Distributed Computing, Applications
and Technologies, pages 92-98]

In the __work_stealing__ scheduling algorithm, each thread has its own local
queue. Fibers that are ready to run are pushed to and popped from the local
queue. If the queue runs out of ready fibers, fibers are stolen from the
local queues of other participating threads.

[table time per fiber (average over 1,000,000)
    [
        [fiber (16C/32T, work stealing, tcmalloc)]
        [fiber (1C/1T, round robin, tcmalloc)]
    ]
    [
        [0.05 \u00b5s - 0.09 \u00b5s]
        [1.69 \u00b5s - 1.79 \u00b5s]
    ]
]
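
The work-stealing configuration from the first column amounts to installing
the __work_stealing__ algorithm on every participating thread. The sketch
below is modelled on the library's work-stealing examples: the worker threads
are spawned first and the main thread registers itself last; the thread count
of 32 is an assumption matching the 16C/32T machine described above.

    // Sketch: each worker thread installs the work-stealing algorithm so
    // that ready fibers can migrate between threads.
    #include <cstdint>
    #include <mutex>
    #include <thread>
    #include <vector>
    #include <boost/fiber/all.hpp>

    static boost::fibers::mutex mtx;
    static boost::fibers::condition_variable cnd;
    static bool done = false;

    void worker( std::uint32_t thread_count) {
        // join the pool of schedulers that share ready fibers
        boost::fibers::use_scheduling_algorithm<
            boost::fibers::algo::work_stealing >( thread_count);
        // suspend this thread's main fiber until shutdown, so the
        // scheduler stays available for stolen fibers
        std::unique_lock< boost::fibers::mutex > lk{ mtx };
        cnd.wait( lk, []{ return done; });
    }

    int main() {
        std::uint32_t thread_count = 32;                    // assumption: 16C/32T
        std::vector< std::thread > threads;
        for ( std::uint32_t i = 1; i < thread_count; ++i) { // main thread counts too
            threads.emplace_back( worker, thread_count);
        }
        // main thread registers itself at the work-stealing scheduler
        boost::fibers::use_scheduling_algorithm<
            boost::fibers::algo::work_stealing >( thread_count);
        // ... launch fibers, e.g. the skynet recursion shown earlier ...
        {
            std::unique_lock< boost::fibers::mutex > lk{ mtx };
            done = true;
        }
        cnd.notify_all();
        for ( auto & t : threads) {
            t.join();
        }
        return 0;
    }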

[endsect]