diff --git a/readme.md b/readme.md
index 059724fe..9f285358 100644
--- a/readme.md
+++ b/readme.md
@@ -40,12 +40,14 @@ Notable aspects of the design include:
   randomized allocation, encoded free lists, etc. to protect against
   various heap vulnerabilities. The performance penalty is only around 3%
   on average over our benchmarks.
+- __first-class heaps__: efficiently create and use multiple heaps to allocate across different regions.
+  A heap can be destroyed at once instead of deallocating each object separately (see the sketch after this list).
 - __bounded__: it does not suffer from _blowup_ \[1\], has bounded worst-case
   allocation times (_wcat_), bounded space overhead (~0.2% meta-data, with at most
   16.7% waste in allocation sizes), and has no internal points of contention
   using atomic operations almost everywhere.
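+
+For illustration, a minimal sketch of using a first-class heap (see
+`mimalloc.h` for the exact signatures; error handling omitted):
+
+```c
+#include <mimalloc.h>
+
+void work(void) {
+  // Create a fresh heap; the allocations below go into this heap only.
+  mi_heap_t* heap = mi_heap_new();
+  for (int i = 0; i < 100; i++) {
+    char* p = mi_heap_malloc(heap, 64);  // allocate in `heap`
+    p[0] = (char)i;                      // ... use the object ...
+  }
+  // Release every object in the heap at once instead of freeing each one.
+  mi_heap_destroy(heap);
+}
+```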
 
-You can read more on the design of mimalloc in the upcoming technical report.
+You can read more on the design of _mimalloc_ in the upcoming technical report.
 
 Enjoy!
 
@@ -222,53 +224,143 @@ gcc -o myprogram mimalloc-override.o myfile1.c ...
 
 # Performance
 
-_Tldr_: In our benchmarks, mimalloc always outperforms
-all other leading allocators (jemalloc, tcmalloc, hoard, and glibc), and usually
-uses less memory (with less then 25% more in the worst case) (as of Jan 2019).
-A nice property is that it does consistently well over a wide range of benchmarks.
+We tested _mimalloc_ against many other top allocators over a wide
+range of benchmarks, ranging from various real world programs to
+synthetic benchmarks that see how the allocator behaves under more
+extreme circumstances.
 
-Disclaimer: allocators are interesting as there is no optimal algorithm -- for
-a given allocator one can always construct a workload where it does not do so well.
-The goal is thus to find an allocation strategy that performs well over a wide
-range of benchmarks without suffering from underperformance in less
-common situations (which is what our second benchmark set tests for).
+Allocators are interesting as there exists no algorithm that is generally
+optimal -- for a given allocator one can usually construct a workload
+where it does not do so well. The goal is thus to find an allocation
+strategy that performs well over a wide range of benchmarks without
+suffering from underperformance in less common situations (which is what
+the second half of our benchmark set tests for).
+
+In our benchmarks, _mimalloc_ always outperforms all other leading
+allocators (_jemalloc_, _tcmalloc_, _Hoard_, etc.), and usually uses less
+memory (at most 25% more in the worst case). A nice property is that it
+does *consistently* well over the wide range of benchmarks.
+
+The benchmark suite is scripted and available separately
+as [mimalloc-bench](https://github.com/daanx/mimalloc-bench).
 
-## Benchmarking
+## Tested Allocators
 
-We tested _mimalloc_ with 5 other allocators over 11 benchmarks.
-The tested allocators are:
+We tested _mimalloc_ against 8 other leading allocators over 12 benchmarks
+and the SpecMark benchmarks. The tested allocators are:
 
-- **mi**: The mimalloc allocator (version tag `v1.0.0`).
-- **je**: [jemalloc](https://github.com/jemalloc/jemalloc), by [Jason Evans](https://www.facebook.com/notes/facebook-engineering/scalable-memory-allocation-using-jemalloc/480222803919) (Facebook);
-  currently (2018) one of the leading allocators and is widely used, for example
-  in BSD, Firefox, and at Facebook. Installed as package `libjemalloc-dev:amd64/bionic 3.6.0-11`.
-- **tc**: [tcmalloc](https://github.com/gperftools/gperftools), by Google as part of the performance tools.
-  Highly performant and used in the Chrome browser. Installed as package `libgoogle-perftools-dev:amd64/bionic 2.5-2.2ubuntu3`.
-- **jx**: A compiled version of a more recent instance of [jemalloc](https://github.com/jemalloc/jemalloc).
-  Using commit ` 7a815c1b` ([dev](https://github.com/jemalloc/jemalloc/tree/dev), 2019-01-15).
-- **hd**: [Hoard](https://github.com/emeryberger/Hoard), by Emery Berger \[1].
-  One of the first multi-thread scalable allocators.
-  ([master](https://github.com/emeryberger/Hoard), 2019-01-01, version tag `3.13`)
-- **mc**: The system allocator. Here we use the LibC allocator (which is originally based on
-  PtMalloc). Using version 2.27. (Note that version 2.26 significantly improved scalability over
-  earlier versions).
+- **mi**: The _mimalloc_ allocator, using version tag `v1.0.0`.
+  We also test a secure version of _mimalloc_ as **smi**, which uses
+  the security techniques described above (guard pages, encoded free
+  lists, and randomized allocation).
+- **tc**: The [_tcmalloc_](https://github.com/gperftools/gperftools)
+  allocator which comes as part of
+  the Google performance tools and is used in the Chrome browser.
+  Installed as package `libgoogle-perftools-dev` version
+  `2.5-2.2ubuntu3`.
+- **je**: The [_jemalloc_](https://github.com/jemalloc/jemalloc)
+  allocator by Jason Evans is developed at Facebook
+  and widely used in practice, for example in FreeBSD and Firefox.
+  Using version tag `5.2.0`.
+- **sn**: The [_snmalloc_](https://github.com/microsoft/snmalloc) allocator
+  is a recent concurrent message passing
+  allocator by Liétar et al. \[8]. Using `git-0b64536b`.
+- **rp**: The [_rpmalloc_](https://github.com/rampantpixels/rpmalloc) allocator
+  uses 32-byte aligned allocations and is developed by Mattias Jansson at Rampant Pixels.
+  Using version tag `1.3.1`.
+- **hd**: The [_Hoard_](https://github.com/emeryberger/Hoard) allocator by
+  Emery Berger \[1]. This is one of the first
+  multi-thread scalable allocators. Using version tag `3.13`.
+- **glibc**: The system allocator. Here we use the _glibc_ allocator (which is originally based on
+  _Ptmalloc2_), using version 2.27. Note that version 2.26 significantly improved scalability over
+  earlier versions.
+- **sm**: The [_SuperMalloc_](https://github.com/kuszmaul/SuperMalloc) allocator by
+  Bradley Kuszmaul uses hardware transactional memory
+  to speed up parallel operations. Using version `git-709663fb`.
+- **tbb**: The Intel [TBB](https://github.com/intel/tbb) allocator that comes with
+  the Thread Building Blocks (TBB) library \[7].
+  Installed as package `libtbb-dev`, version `2017~U7-8`.
+
+All allocators run exactly the same benchmark programs on Ubuntu 18.04.1
+and use `LD_PRELOAD` to override the default allocator. The wall-clock
+elapsed time and peak resident memory (_rss_) are measured with the
+`time` program. The average scores over 5 runs are used. Performance is
+reported relative to _mimalloc_, e.g. a time of 1.5× means that
+the program took 1.5× as long as it did with _mimalloc_.
+
+## Benchmarks
+
+The first set of benchmarks are real world programs and consist of:
+
+- __cfrac__: by Dave Barrett, an implementation of continued fraction factorization which
+  uses many small short-lived allocations -- exactly the workload
+  we are targeting for Koka and Lean.
+- __espresso__: a programmable logic array analyzer, described by
+  Grunwald, Zorn, and Henderson \[3] in the context of cache-aware memory allocation.
+- __barnes__: a hierarchical n-body particle solver \[4] which uses relatively few
+  allocations compared to `cfrac` and `espresso`. Simulates the gravitational forces
+  between 163840 particles.
+- __leanN__: The [Lean](https://github.com/leanprover/lean) compiler by
+  de Moura _et al._, version 3.4.1,
+  compiling its own standard library concurrently using N threads
+  (`./lean --make -j N`). A big real-world workload with intensive
+  allocation.
+- __redis__: running the [redis](https://redis.io/) 5.0.3 server on
+  1 million requests pushing 10 new list elements and then requesting the
+  head 10 elements. Measures the requests handled per second.
+- __larsonN__: by Larson and Krishnan \[2]. Simulates a server workload using 100 separate
+  threads which each allocate and free many objects but leave some
+  objects to be freed by other threads. Larson and Krishnan observe this
+  behavior (which they call _bleeding_) in actual server applications,
+  and the benchmark simulates this.
+
+The second set of benchmarks are stress tests and consist of:
+
+- __alloc-test__: a modern allocator test developed by
+  OLogN Technologies AG ([ITHare.com](http://ithare.com/testing-memory-allocators-ptmalloc2-tcmalloc-hoard-jemalloc-while-trying-to-simulate-real-world-loads/)).
+  Simulates intensive allocation workloads with a Pareto size
+  distribution. The _alloc-testN_ benchmark runs on N cores doing
+  100·10⁶ allocations per thread with objects up to 1KiB
+  in size. Using commit `94f6cb`
+  ([master](https://github.com/node-dot-cpp/alloc-test), 2018-07-04).
+- __sh6bench__: by [MicroQuill](http://www.microquill.com/) as part of SmartHeap. Stress test
+  for single-threaded allocation where some of the objects are freed in the
+  usual last-allocated, first-freed (LIFO) order, but others are freed
+  in reverse order. Using the
+  public [source](http://www.microquill.com/smartheap/shbench/bench.zip)
+  (retrieved 2019-01-02).
+- __sh8benchN__: by [MicroQuill](http://www.microquill.com/) as part of SmartHeap. Stress test for
+  multi-threaded allocation (with N threads) where, just as in _larson_,
+  some objects are freed by other threads, and some objects freed in
+  reverse order (as in _sh6bench_). Using the
+  public [source](http://www.microquill.com/smartheap/SH8BENCH.zip)
+  (retrieved 2019-01-02).
+- __xmalloc-testN__: by Lever and Boreham \[5] and Christian Eder. We use the updated
+  version from the SuperMalloc repository. This is a more
+  extreme version of the _larson_ benchmark with 100 purely allocating threads,
+  and 100 purely deallocating threads with objects of various sizes migrating
+  between them. This asymmetric producer/consumer pattern is usually difficult
+  for allocators with thread-local caches to handle.
+- __cache-scratch__: by Emery Berger \[1]. Introduced with the Hoard
+  allocator to test for _passive-false_ sharing of cache lines: first
+  some small objects are allocated and given to each thread; each thread
+  frees its object, immediately allocates another one, and accesses that
+  repeatedly. If an allocator allocates objects from different threads
+  close to each other this will lead to cache-line contention (a sketch
+  of the pattern follows this list).
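+
+To make the _cache-scratch_ pattern concrete, here is a minimal sketch of
+the passive-false-sharing scenario (illustrative code only, not the actual
+benchmark; the thread count and object sizes are arbitrary):
+
+```c
+#include <pthread.h>
+#include <stdlib.h>
+
+// Each thread gets a small object allocated by the main thread, frees it,
+// allocates a replacement, and writes to it repeatedly. If the allocator
+// places the replacement on a cache line that also holds another thread's
+// object, every write below contends on that shared line.
+static void* worker(void* arg) {
+  free(arg);              // free the object the main thread handed us
+  char* obj = malloc(8);  // the allocator decides where this lands
+  for (long i = 0; i < 100000000; i++) {
+    obj[0] = (char)i;     // repeated writes: false sharing shows up here
+  }
+  free(obj);
+  return NULL;
+}
+
+int main(void) {
+  pthread_t tid[4];
+  for (int i = 0; i < 4; i++) {
+    pthread_create(&tid[i], NULL, worker, malloc(8));  // hand out small objects
+  }
+  for (int i = 0; i < 4; i++) pthread_join(tid[i], NULL);
+  return 0;
+}
+```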
-All allocators run exactly the same benchmark programs and use `LD_PRELOAD` to override the system allocator.
-The wall-clock elapsed time and peak resident memory (_rss_) are
-measured with the `time` program.
-The average scores over 5 runs are used
-(variation between runs is very low though).
-Performance is reported relative to mimalloc, e.g. a time of 106% means that
-the program took 6% longer to finish than with mimalloc.
 
 ## On a 16-core AMD EPYC running Linux
 
 Testing on a big Amazon EC2 instance ([r5a.4xlarge](https://aws.amazon.com/ec2/instance-types/))
 consisting of a 16-core AMD EPYC 7000 at 2.5GHz
 with 128GB ECC memory, running Ubuntu 18.04.1 with LibC 2.27 and GCC 7.3.0.
-
-
-The first benchmark set consists of programs that allocate a lot:
+We excluded SuperMalloc here as it uses transactional memory instructions
+that are usually not supported in a virtualized environment.
 
 ![bench-r5a-1](doc/bench-r5a-1.svg)
 ![bench-r5a-2](doc/bench-r5a-2.svg)
 
@@ -278,88 +370,97 @@ Memory usage:
 
 ![bench-r5a-rss-1](doc/bench-r5a-rss-1.svg)
 ![bench-r5a-rss-1](doc/bench-r5a-rss-2.svg)
 
-The benchmarks above are (with N=16 in our case):
+In the first five benchmarks we can see _mimalloc_ outperforms the other
+allocators moderately, but we also see that all these modern allocators
+perform well -- the days of large performance differences in regular
+workloads are over. In
+_cfrac_ and _espresso_, _mimalloc_ is a tad faster than _tcmalloc_ and
+_jemalloc_, but a solid 10% faster than all other allocators on
+_espresso_. The _tbb_ allocator does not do so well here and lags more than
+20% behind _mimalloc_. The _cfrac_ and _espresso_ programs do not use much
+memory (~1.5MB) so it does not matter too much, but still _mimalloc_ uses
+about half the resident memory of _tcmalloc_.
 
-- __cfrac__: by Dave Barrett, implementation of continued fraction factorization:
-  uses many small short-lived allocations. Factorizes as `./cfrac 175451865205073170563711388363274837927895`.
-- __espresso__: a programmable logic array analyzer \[3].
-- __barnes__: a hierarchical n-body particle solver \[4]. Simulates 163840 particles.
-- __leanN__: by Leonardo de Moura _et al_, the [lean](https://github.com/leanprover/lean)
-  compiler, version 3.4.1, compiling its own standard library concurrently using N cores (`./lean --make -j N`).
-  Big real-world workload with intensive allocation, takes about 1:40s when running on a
-  single high-end core.
-- __redis__: running the [redis](https://redis.io/) 5.0.3 server on
-  1 million requests pushing 10 new list elements and then requesting the
-  head 10 elements. Measures the requests handled per second.
-- __alloc-test__: a modern [allocator test](http://ithare.com/testing-memory-allocators-ptmalloc2-tcmalloc-hoard-jemalloc-while-trying-to-simulate-real-world-loads/)
-  developed by by OLogN Technologies AG at [ITHare.com](http://ithare.com). Simulates intensive allocation workloads with a Pareto
-  size distribution. The `alloc-testN` benchmark runs on N cores doing 100×106
-  allocations per thread with objects up to 1KB in size.
-  Using commit `94f6cb` ([master](https://github.com/node-dot-cpp/alloc-test), 2018-07-04)
 
+The _leanN_ program is most interesting as a large realistic and
+concurrent workload, and there is an 8% speedup over _tcmalloc_. This is
+quite significant: if Lean spends 20% of its time in the
+allocator, it means that the allocation paths of _mimalloc_ are about
+1.67× faster than those of _tcmalloc_ here. This is surprising as such
+a difference does *not* show up in a pure allocation benchmark like
+_alloc-test_. We conjecture that we see this outsized improvement
+because _mimalloc_'s allocations have better locality, which improves
+performance of the *other* computations in a program as well.
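+
+As a quick check of that arithmetic (our assumption: the allocator
+accounts for a fraction $f = 0.20$ of total runtime, and the whole
+program runs 8% faster):
+
+$$ \text{allocator speedup} = \frac{f}{f - 0.08} = \frac{0.20}{0.12} \approx 1.67\times $$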
 
-We can see mimalloc outperforms the other allocators moderately but all
-these modern allocators perform well.
-In `cfrac`, mimalloc is about 13%
-faster than jemalloc for many small and short-lived allocations.
-The `cfrac` and `espresso` programs do not use much
-memory (~1.5MB) so it does not matter too much, but still mimalloc uses about half the resident
-memory of tcmalloc (and 4× less than Hoard on `espresso`).
+The _redis_ benchmark shows more differences between the allocators where
+_mimalloc_ is 14% faster than _jemalloc_. On this benchmark _tbb_ (and _Hoard_) do
+not do well and are over 40% slower.
 
-_The `leanN` program is most interesting as a large realistic and concurrent
-workload and there is a 6% speedup over both tcmalloc and jemalloc._ (This is
-quite significant: if Lean spends (optimistically) 20% of its time in the allocator
-that implies a 1.5× speedup with mimalloc).
-The large `redis` benchmark shows a similar speedup.
-
-The `alloc-test` is very allocation intensive and we see the largest
-diffrerences here when running with 16 cores in parallel.
-
-The second benchmark tests specific aspects of the allocators and
-shows more extreme differences between allocators:
+The _larson_ server workload which allocates and frees objects between
+many threads shows even larger differences, where _mimalloc_ is more than
+2.5× faster than _tcmalloc_ and _jemalloc_, which is quite surprising
+for these battle-tested allocators -- probably due to the object
+migration between different threads. This is a difficult benchmark for
+the other allocators too, where _mimalloc_ is still 48% faster than the next
+fastest (_snmalloc_).
 
-The benchmarks in the second set are (again with N=16):
+The second benchmark set tests specific aspects of the allocators and
+shows even more extreme differences between them.
 
-- __larson__: by Larson and Krishnan \[2]. Simulates a server workload using 100
-  separate threads where
-  they allocate and free many objects but leave some objects to
-  be freed by other threads. Larson and Krishnan observe this behavior
-  (which they call _bleeding_) in actual server applications, and the
-  benchmark simulates this.
-- __sh6bench__: by [MicroQuill](http://www.microquill.com) as part of SmartHeap. Stress test for
-  single-threaded allocation where some of the objects are freed
-  in a usual last-allocated, first-freed (LIFO) order, but others
-  are freed in reverse order. Using the public [source](http://www.microquill.com/smartheap/shbench/bench.zip) (retrieved 2019-01-02)
-- __sh8bench__: by [MicroQuill](http://www.microquill.com) as part of SmartHeap. Stress test for
-  multithreaded allocation (with N threads) where, just as in `larson`, some objects are freed
-  by other threads, and some objects freed in reverse (as in `sh6bench`).
-  Using the public [source](http://www.microquill.com/smartheap/SH8BENCH.zip) (retrieved 2019-01-02)
-- __cache-scratch__: by Emery Berger _et al_ \[1]. Introduced with the Hoard
-  allocator to test for _passive-false_ sharing of cache lines: first some
-  small objects are allocated and given to each thread; the threads free that
-  object and allocate another one and access that repeatedly. If an allocator
-  allocates objects from different threads close to each other this will
-  lead to cache-line contention.
 
+The _alloc-test_ is very allocation intensive doing millions of
+allocations in various size classes.
+The test is scaled such that when an
+allocator performs almost identically on _alloc-test1_ as on _alloc-testN_ it
+means that it scales linearly. Here, _tcmalloc_, _snmalloc_, and
+_Hoard_ seem to scale less well and do more than 10% worse on the
+multi-core version. Even the best allocators (_tcmalloc_ and _jemalloc_) are
+more than 10% slower than _mimalloc_ here.
 
-In the `larson` server workload mimalloc is 2.5× faster than
-tcmalloc and jemalloc which is quite surprising -- probably due to the object
-migration between different threads. Also in `sh6bench` mimalloc does much
-better than the others (more than 4× faster than jemalloc).
-We cannot explain this well but believe it may be
-caused in part by the "reverse" free-ing in `sh6bench`. Again in `sh8bench`
-the mimalloc allocator handles object migration between threads much better .
+Also in _sh6bench_, _mimalloc_ does much
+better than the others (more than 2× faster than _jemalloc_).
+We cannot explain this well but believe it is
+caused in part by the "reverse" freeing pattern in _sh6bench_.
 
+Again in _sh8bench_ the _mimalloc_ allocator handles object migration
+between threads much better and is over 36% faster than the next best
+allocator, _snmalloc_. Whereas _tcmalloc_ did well on _sh6bench_, the
+addition of object migration caused it to be almost 3× slower
+than before.
 
-The `cache-scratch` benchmark also demonstrates the different architectures
-of the allocators nicely. With a single thread they all perform the same, but when
-running with multiple threads the allocator induced false sharing of the
-cache lines causes large run-time differences, where mimalloc is
-20× faster than tcmalloc here. Only the original jemalloc does almost
-as well (but the most recent version, jxmalloc, regresses). The
-Hoard allocator is specifically designed to avoid this false sharing and we
-are not sure why it is not doing well here (although it still runs almost 5×
-faster than tcmalloc and jxmalloc).
 
-## Benchmarks on a 4-core Intel workstation
+The _xmalloc-testN_ benchmark simulates an asymmetric workload where
+some threads only allocate, and others only free. The _snmalloc_
+allocator was especially developed to handle this case well as it
+often occurs in concurrent message passing systems. Here we see that
+the _mimalloc_ technique of having non-contended, sharded thread-free
+lists pays off, and it even outperforms _snmalloc_ (a conceptual
+sketch of such a list follows below). Only _jemalloc_
+also handles this reasonably well, while the others underperform by
+a large margin. The optimization in _mimalloc_ to do a *delayed free*
+only once for full pages is quite important -- without it _mimalloc_
+is almost twice as slow (as then all frees contend again on the
+single heap delayed free list).
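+
+As a rough illustration of such a sharded thread-free list (a simplified
+conceptual sketch only -- the names and layout here are ours, not
+_mimalloc_'s actual implementation):
+
+```c
+#include <stdatomic.h>
+#include <stddef.h>
+
+typedef struct block_s { struct block_s* next; } block_t;
+
+// Each page keeps two free lists: a plain one that only the owning
+// thread touches, and an atomic one that other threads push freed
+// blocks onto. Cross-thread frees thus never take a lock and never
+// touch a single structure shared by all pages.
+typedef struct page_s {
+  block_t* free;                  // owner-local free list: no contention
+  _Atomic(block_t*) thread_free;  // cross-thread frees are pushed here
+} page_t;
+
+// A thread freeing a block it does not own does a lock-free push.
+static void thread_free_push(page_t* page, block_t* block) {
+  block_t* head = atomic_load_explicit(&page->thread_free, memory_order_relaxed);
+  do {
+    block->next = head;
+  } while (!atomic_compare_exchange_weak_explicit(
+      &page->thread_free, &head, block,
+      memory_order_release, memory_order_relaxed));
+}
+
+// The owning thread occasionally grabs the whole list in one atomic
+// exchange and splices it into its local free list.
+static void page_collect(page_t* page) {
+  block_t* list = atomic_exchange_explicit(&page->thread_free, NULL,
+                                           memory_order_acquire);
+  while (list != NULL) {
+    block_t* next = list->next;
+    list->next = page->free;
+    page->free = list;
+    list = next;
+  }
+}
+```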
+
+
+The _cache-scratch_ benchmark also demonstrates the different
+architectures of the allocators nicely. With a single thread they all
+perform the same, but when running with multiple threads the
+allocator-induced false sharing of the cache lines causes large run-time
+differences, where _mimalloc_ is more than 18× faster than _jemalloc_ and
+_tcmalloc_! Crundal \[6] describes in detail why the false cache line
+sharing occurs in the _tcmalloc_ design, and also discusses how this
+can be avoided with some small implementation changes.
+Only _snmalloc_ and _tbb_ avoid the
+cache line sharing like _mimalloc_ does. Kukanov and Voss \[7] describe in detail
+how the design of _tbb_ avoids the false cache line sharing.
+The _Hoard_ allocator is also specifically
+designed to avoid this false sharing and we are not sure why it is not
+doing well here (although it still runs 5× as fast as _tcmalloc_).
+
+
+
+## On a 4-core Intel Xeon workstation
+
+Below are the benchmark results on an HP
+Z4-G4 workstation with a 4-core Intel® Xeon® W-2123 at 3.6 GHz with 16GB
+ECC memory, running Ubuntu 18.04.1 with LibC 2.27 and GCC 7.3.0.
 
 ![bench-z4-1](doc/bench-z4-1.svg)
 ![bench-z4-2](doc/bench-z4-2.svg)
 
@@ -367,6 +468,23 @@ faster than tcmalloc and jxmalloc).
 
 ![bench-z4-rss-1](doc/bench-z4-rss-1.svg)
 ![bench-z4-rss-2](doc/bench-z4-rss-2.svg)
 
+This time SuperMalloc (_sm_) is included as this platform supports
+hardware transactional memory. Unfortunately,
+there are no entries for _SuperMalloc_ in the _leanN_ and _xmalloc-testN_
+benchmarks as it faulted on those. We also added the secure version of
+_mimalloc_ as **smi**.
+
+Overall, the relative results are quite similar to before. Most
+allocators fare better on the _larsonN_ benchmark now -- either due to
+architectural changes (AMD vs. Intel) or because there is just less
+concurrency.
+
+The secure version of _mimalloc_ uses guard pages around each (_mimalloc_) page,
+encodes the free lists, and uses randomized initial free lists. We
+expected it to perform quite a bit worse, but on the first benchmark set
+it performed only about 3% slower on average, and is the second best overall.
+
 
 # References
 
@@ -385,3 +503,19 @@
 [pdf](http://citeseemi.ist.psu.edu/viewdoc/download?doi=10.1.1.43.6621&rep=rep1&type=pdf)
 - \[4] J. Barnes and P. Hut. _A hierarchical O(n*log(n)) force-calculation algorithm_.
   Nature, 324:446-449, 1986.
+
+- \[5] C. Lever and D. Boreham. _Malloc() Performance in a Multithreaded Linux Environment._
+  In USENIX Annual Technical Conference, Freenix Session. San Diego, CA. Jun. 2000.
+
+- \[6] Timothy Crundal. _Reducing Active-False Sharing in TCMalloc._
+  2016. CS16S1 project at the Australian National University.
+
+- \[7] Alexey Kukanov and Michael J. Voss.
+  _The Foundations for Scalable Multi-Core Software in Intel Threading Building Blocks._
+  Intel Technology Journal 11 (4). 2007.
+
+- \[8] Paul Liétar, Theodore Butler, Sylvan Clebsch, Sophia Drossopoulou, Juliana Franco, Matthew J. Parkinson,
+  Alex Shamis, Christoph M. Wintersteiger, and David Chisnall.
+  _Snmalloc: A Message Passing Allocator._
+  In Proceedings of the 2019 ACM SIGPLAN International Symposium on Memory Management, 122–135. ACM. 2019.