mirror of
https://github.com/microsoft/mimalloc.git
synced 2025-05-06 23:39:31 +03:00
update readme with references
This commit is contained in:
parent
b30d17d250
commit
ad19dfe062
1 changed files with 243 additions and 109 deletions
352
readme.md
|
@ -40,12 +40,14 @@ Notable aspects of the design include:
|
||||||
randomized allocation, encoded free lists, etc. to protect against various
|
randomized allocation, encoded free lists, etc. to protect against various
|
||||||
heap vulnerabilities. The performance penalty is only around 3% on average
|
heap vulnerabilities. The performance penalty is only around 3% on average
|
||||||
over our benchmarks.
|
over our benchmarks.
|
||||||
|
- __first-class heaps__: efficiently create and use multiple heaps to allocate across different regions.
|
||||||
|
A heap can be destroyed at once instead of deallocating each object separately.
|
||||||
- __bounded__: it does not suffer from _blowup_ \[1\], has bounded worst-case allocation
|
- __bounded__: it does not suffer from _blowup_ \[1\], has bounded worst-case allocation
|
||||||
times (_wcat_), bounded space overhead (~0.2% meta-data, with at most 16.7% waste in allocation sizes),
|
times (_wcat_), bounded space overhead (~0.2% meta-data, with at most 16.7% waste in allocation sizes),
|
||||||
and has no internal points of contention using atomic operations almost
|
and has no internal points of contention using atomic operations almost
|
||||||
everywhere.
|
everywhere.
|
||||||
|
|
||||||
You can read more on the design of mimalloc in the upcoming technical report.
|
You can read more on the design of _mimalloc_ in the upcoming technical report.
|
||||||
|
|
||||||
Enjoy!
|
Enjoy!
|
||||||
|
|
||||||
|
@ -222,53 +224,143 @@ gcc -o myprogram mimalloc-override.o myfile1.c ...
|
||||||
|
|
||||||
# Performance
|
# Performance
|
||||||
|
|
||||||
_Tldr_: In our benchmarks, mimalloc always outperforms
|
We tested _mimalloc_ against many other top allocators over a wide
|
||||||
all other leading allocators (jemalloc, tcmalloc, hoard, and glibc), and usually
|
range of benchmarks, ranging from various real world programs to
|
||||||
uses less memory (with less then 25% more in the worst case) (as of Jan 2019).
|
synthetic benchmarks that see how the allocator behaves under more
|
||||||
A nice property is that it does consistently well over a wide range of benchmarks.
|
extreme circumstances.
|
||||||
|
|
||||||
Disclaimer: allocators are interesting as there is no optimal algorithm -- for
|
Allocators are interesting as there exists no algorithm that is generally
|
||||||
a given allocator one can always construct a workload where it does not do so well.
|
optimal -- for a given allocator one can usually construct a workload
|
||||||
The goal is thus to find an allocation strategy that performs well over a wide
|
where it does not do so well. The goal is thus to find an allocation
|
||||||
range of benchmarks without suffering from underperformance in less
|
strategy that performs well over a wide range of benchmarks without
|
||||||
common situations (which is what our second benchmark set tests for).
|
suffering from underperformance in less common situations (which is what
|
||||||
|
the second half of our benchmark set tests for).
|
||||||
|
|
||||||
|
In our benchmarks, _mimalloc_ always outperforms all other leading
|
||||||
|
allocators (_jemalloc_, _tcmalloc_, _Hoard_, etc.), and usually uses less
|
||||||
|
memory (up to 25% more in the worst case). A nice property is that it
|
||||||
|
does *consistently* well over the wide range of benchmarks.
|
||||||
|
|
||||||
|
The benchmark suite is scripted and available separately
|
||||||
|
as [mimalloc-bench](https://github.com/daanx/mimalloc-bench).
|
||||||
|
|
||||||
|
|
||||||
## Benchmarking
|
## Tested Allocators
|
||||||
|
|
||||||
We tested _mimalloc_ with 5 other allocators over 11 benchmarks.
|
We tested _mimalloc_ with 9 leading allocators over 12 benchmarks
|
||||||
The tested allocators are:
|
and the SpecMark benchmarks. The tested allocators are:
|
||||||
|
|
||||||
- **mi**: The mimalloc allocator (version tag `v1.0.0`).
|
- **mi**: The _mimalloc_ allocator, using version tag `v1.0.0`.
|
||||||
- **je**: [jemalloc](https://github.com/jemalloc/jemalloc), by [Jason Evans](https://www.facebook.com/notes/facebook-engineering/scalable-memory-allocation-using-jemalloc/480222803919) (Facebook);
|
We also test a secure version of _mimalloc_ as **smi** which uses
|
||||||
currently (2018) one of the leading allocators and is widely used, for example
|
the techniques described in Section [#sec-secure].
|
||||||
in BSD, Firefox, and at Facebook. Installed as package `libjemalloc-dev:amd64/bionic 3.6.0-11`.
|
- **tc**: The [_tcmalloc_](https://github.com/gperftools/gperftools)
|
||||||
- **tc**: [tcmalloc](https://github.com/gperftools/gperftools), by Google as part of the performance tools.
|
allocator which comes as part of
|
||||||
Highly performant and used in the Chrome browser. Installed as package `libgoogle-perftools-dev:amd64/bionic 2.5-2.2ubuntu3`.
|
the Google performance tools and is used in the Chrome browser.
|
||||||
- **jx**: A compiled version of a more recent instance of [jemalloc](https://github.com/jemalloc/jemalloc).
|
Installed as package `libgoogle-perftools-dev` version
|
||||||
Using commit ` 7a815c1b` ([dev](https://github.com/jemalloc/jemalloc/tree/dev), 2019-01-15).
|
`2.5-2.2ubuntu3`.
|
||||||
- **hd**: [Hoard](https://github.com/emeryberger/Hoard), by Emery Berger \[1].
|
- **je**: The [_jemalloc_](https://github.com/jemalloc/jemalloc)
|
||||||
One of the first multi-thread scalable allocators.
|
allocator by Jason Evans is developed at Facebook
|
||||||
([master](https://github.com/emeryberger/Hoard), 2019-01-01, version tag `3.13`)
|
and widely used in practice, for example in FreeBSD and Firefox.
|
||||||
- **mc**: The system allocator. Here we use the LibC allocator (which is originally based on
|
Using version tag 5.2.0.
|
||||||
PtMalloc). Using version 2.27. (Note that version 2.26 significantly improved scalability over
|
- **sn**: The [_snmalloc_](https://github.com/microsoft/snmalloc) allocator
|
||||||
earlier versions).
|
is a recent concurrent message passing
|
||||||
|
allocator by Liétar et al. \[8]. Using `git-0b64536b`.
|
||||||
|
- **rp**: The [_rpmalloc_](https://github.com/rampantpixels/rpmalloc) allocator
|
||||||
|
uses 32-byte aligned allocations and is developed by Mattias Jansson at Rampant Pixels.
|
||||||
|
Using version tag 1.3.1.
|
||||||
|
- **hd**: The [_Hoard_](https://github.com/emeryberger/Hoard) allocator by
|
||||||
|
Emery Berger \[1]. This is one of the first
|
||||||
|
multi-thread scalable allocators. Using version tag 3.13.
|
||||||
|
- **glibc**: The system allocator. Here we use the _glibc_ allocator (which is originally based on
|
||||||
|
_Ptmalloc2_), using version 2.27.0. Note that version 2.26 significantly improved scalability over
|
||||||
|
earlier versions.
|
||||||
|
- **sm**: The [_Supermalloc_](https://github.com/kuszmaul/SuperMalloc) allocator by
|
||||||
|
Bradley Kuszmaul uses hardware transactional memory
|
||||||
|
to speed up parallel operations. Using version `git-709663fb`.
|
||||||
|
- **tbb**: The Intel [TBB](https://github.com/intel/tbb) allocator that comes with
|
||||||
|
the Threading Building Blocks (TBB) library
|
||||||
|
[@kukanov2007foundations;@hudson2006mcrt].
|
||||||
|
Installed as package `libtbb-dev`, version `2017~U7-8`.
|
||||||
|
|
||||||
|
All allocators run exactly the same benchmark programs on Ubuntu 18.04.1
|
||||||
|
and use `LD_PRELOAD` to override the default allocator. The wall-clock
|
||||||
|
elapsed time and peak resident memory (_rss_) are measured with the
|
||||||
|
`time` program. The average scores over 5 runs are used. Performance is
|
||||||
|
reported relative to _mimalloc_, e.g. a time of 1.5× means that
|
||||||
|
the program took 1.5× longer than _mimalloc_.
|
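As a rough illustration of the scoring described above, the following Python sketch averages the wall-clock times of five runs per allocator and expresses each average relative to _mimalloc_. The function name `relative_times` and the sample timings are hypothetical, made up for this example and not taken from the benchmark results.

```python
from statistics import mean

def relative_times(runs, baseline="mi"):
    """Average each allocator's runs and divide by the baseline average.

    `runs` maps an allocator label to a list of elapsed times in seconds.
    A result of 1.5 means the program took 1.5x longer than the baseline.
    """
    averages = {name: mean(times) for name, times in runs.items()}
    base = averages[baseline]
    return {name: avg / base for name, avg in averages.items()}

# Hypothetical timings (seconds) for five runs each; not measured data.
sample = {
    "mi": [2.0, 2.1, 1.9, 2.0, 2.0],
    "tc": [3.0, 3.1, 2.9, 3.0, 3.0],
}
scores = relative_times(sample)
print(f"tc: {scores['tc']:.2f}x")
```

The same helper could be applied to peak resident memory by passing _rss_ values instead of times.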
||||||
|
|
||||||
|
[_snmalloc_]: https://github.com/Microsoft/snmalloc
|
||||||
|
[_rpmalloc_]: https://github.com/rampantpixels/rpmalloc
|
||||||
|
|
||||||
|
|
||||||
|
## Benchmarks
|
||||||
|
|
||||||
|
The first set of benchmarks are real world programs and consist of:
|
||||||
|
|
||||||
|
- __cfrac__: by Dave Barrett, implementation of continued fraction factorization which
|
||||||
|
uses many small short-lived allocations -- exactly the workload
|
||||||
|
we are targeting for Koka and Lean.
|
||||||
|
- __espresso__: a programmable logic array analyzer, described by
|
||||||
|
Grunwald, Zorn, and Henderson \[3] in the context of cache-aware memory allocation.
|
||||||
|
- __barnes__: a hierarchical n-body particle solver \[4] which uses relatively few
|
||||||
|
allocations compared to `cfrac` and `espresso`. Simulates the gravitational forces
|
||||||
|
between 163840 particles.
|
||||||
|
- __leanN__: The [Lean](https://github.com/leanprover/lean) compiler by
|
||||||
|
de Moura _et al_, version 3.4.1,
|
||||||
|
compiling its own standard library concurrently using N threads
|
||||||
|
(`./lean --make -j N`). Big real-world workload with intensive
|
||||||
|
allocation.
|
||||||
|
- __redis__: running the [redis](https://redis.io/) 5.0.3 server on
|
||||||
|
1 million requests pushing 10 new list elements and then requesting the
|
||||||
|
head 10 elements. Measures the requests handled per second.
|
||||||
|
- __larsonN__: by Larson and Krishnan \[2]. Simulates a server workload using 100 separate
|
||||||
|
threads which each allocate and free many objects but leave some
|
||||||
|
objects to be freed by other threads. Larson and Krishnan observe this
|
||||||
|
behavior (which they call _bleeding_) in actual server applications,
|
||||||
|
and the benchmark simulates this.
|
||||||
|
|
||||||
|
The second set of benchmarks are stress tests and consist of:
|
||||||
|
|
||||||
|
- __alloc-test__: a modern allocator test developed by
|
||||||
|
OLogN Technologies AG ([ITHare.com](http://ithare.com/testing-memory-allocators-ptmalloc2-tcmalloc-hoard-jemalloc-while-trying-to-simulate-real-world-loads/))
|
||||||
|
Simulates intensive allocation workloads with a Pareto size
|
||||||
|
distribution. The _alloc-testN_ benchmark runs on N cores doing
|
||||||
|
100·10<sup>6</sup> allocations per thread with objects up to 1KiB
|
||||||
|
in size. Using commit `94f6cb`
|
||||||
|
([master](https://github.com/node-dot-cpp/alloc-test), 2018-07-04)
|
||||||
|
- __sh6bench__: by [MicroQuill](http://www.microquill.com/) as part of SmartHeap. Stress test
|
||||||
|
where some of the objects are freed in a
|
||||||
|
usual last-allocated, first-freed (LIFO) order, but others are freed
|
||||||
|
in reverse order. Using the
|
||||||
|
public [source](http://www.microquill.com/smartheap/shbench/bench.zip)
|
||||||
|
(retrieved 2019-01-02)
|
||||||
|
- __sh8benchN__: by [MicroQuill](http://www.microquill.com/) as part of SmartHeap. Stress test for
|
||||||
|
multi-threaded allocation (with N threads) where, just as in _larson_,
|
||||||
|
some objects are freed by other threads, and some objects freed in
|
||||||
|
reverse (as in _sh6bench_). Using the
|
||||||
|
public [source](http://www.microquill.com/smartheap/SH8BENCH.zip)
|
||||||
|
(retrieved 2019-01-02)
|
||||||
|
- __xmalloc-testN__: by Lever and Boreham \[5] and Christian Eder. We use the updated
|
||||||
|
version from the SuperMalloc repository. This is a more
|
||||||
|
extreme version of the _larson_ benchmark with 100 purely allocating threads,
|
||||||
|
and 100 purely deallocating threads with objects of various sizes migrating
|
||||||
|
between them. This asymmetric producer/consumer pattern is usually difficult
|
||||||
|
to handle by allocators with thread-local caches.
|
||||||
|
- __cache-scratch__: by Emery Berger \[1]. Introduced with the Hoard
|
||||||
|
allocator to test for _passive-false_ sharing of cache lines: first
|
||||||
|
some small objects are allocated and given to each thread; the threads
|
||||||
|
free that object and allocate immediately another one, and access that
|
||||||
|
repeatedly. If an allocator allocates objects from different threads
|
||||||
|
close to each other this will lead to cache-line contention.
|
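The passive false sharing that _cache-scratch_ probes for depends on whether objects used by different threads land in the same cache line. Below is a minimal Python sketch of that check, assuming a 64-byte cache line (a common size, but an assumption here) and using made-up addresses. `same_cache_line` is a hypothetical helper, not part of any allocator.

```python
CACHE_LINE_SIZE = 64  # assumed line size in bytes; varies by processor

def same_cache_line(addr_a, addr_b, line_size=CACHE_LINE_SIZE):
    """Return True if two byte addresses fall within the same cache line."""
    return addr_a // line_size == addr_b // line_size

# Made-up addresses: two 8-byte objects handed out back to back,
# and a third one placed a full line further on.
obj_thread1 = 0x1000
obj_thread2 = 0x1008
obj_far = 0x1040

print(same_cache_line(obj_thread1, obj_thread2))  # objects 8 bytes apart
print(same_cache_line(obj_thread1, obj_far))      # objects 64 bytes apart
```

If the first two objects are written by different threads, each write invalidates the line in the other core's cache, which is the contention the benchmark measures.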
||||||
|
|
||||||
All allocators run exactly the same benchmark programs and use `LD_PRELOAD` to override the system allocator.
|
|
||||||
The wall-clock elapsed time and peak resident memory (_rss_) are
|
|
||||||
measured with the `time` program. The average scores over 5 runs are used
|
|
||||||
(variation between runs is very low though).
|
|
||||||
Performance is reported relative to mimalloc, e.g. a time of 106% means that
|
|
||||||
the program took 6% longer to finish than with mimalloc.
|
|
||||||
|
|
||||||
## On a 16-core AMD EPYC running Linux
|
## On a 16-core AMD EPYC running Linux
|
||||||
|
|
||||||
Testing on a big Amazon EC2 instance ([r5a.4xlarge](https://aws.amazon.com/ec2/instance-types/))
|
Testing on a big Amazon EC2 instance ([r5a.4xlarge](https://aws.amazon.com/ec2/instance-types/))
|
||||||
consisting of a 16-core AMD EPYC 7000 at 2.5GHz
|
consisting of a 16-core AMD EPYC 7000 at 2.5GHz
|
||||||
with 128GB ECC memory, running Ubuntu 18.04.1 with LibC 2.27 and GCC 7.3.0.
|
with 128GB ECC memory, running Ubuntu 18.04.1 with LibC 2.27 and GCC 7.3.0.
|
||||||
|
We excluded SuperMalloc here as it uses transactional memory instructions
|
||||||
|
that are usually not supported in a virtualized environment.
|
||||||
The first benchmark set consists of programs that allocate a lot:
|
|
||||||
|
|
||||||

|

|
||||||

|

|
||||||
|
@ -278,88 +370,97 @@ Memory usage:
|
||||||

|

|
||||||

|

|
||||||
|
|
||||||
The benchmarks above are (with N=16 in our case):
|
In the first five benchmarks we can see _mimalloc_ outperforms the other
|
||||||
|
allocators moderately, but we also see that all these modern allocators
|
||||||
|
perform well -- the times of large performance differences in regular
|
||||||
|
workloads are over. In
|
||||||
|
_cfrac_ and _espresso_, _mimalloc_ is a tad faster than _tcmalloc_ and
|
||||||
|
_jemalloc_, but a solid 10\% faster than all other allocators on
|
||||||
|
_espresso_. The _tbb_ allocator does not do so well here and lags more than
|
||||||
|
20\% behind _mimalloc_. The _cfrac_ and _espresso_ programs do not use much
|
||||||
|
memory (~1.5MB) so it does not matter too much, but still _mimalloc_ uses
|
||||||
|
about half the resident memory of _tcmalloc_.
|
||||||
|
|
||||||
- __cfrac__: by Dave Barrett, implementation of continued fraction factorization:
|
The _leanN_ program is most interesting as a large realistic and
|
||||||
uses many small short-lived allocations. Factorizes as `./cfrac 175451865205073170563711388363274837927895`.
|
concurrent workload and there is an 8% speedup over _tcmalloc_. This is
|
||||||
- __espresso__: a programmable logic array analyzer \[3].
|
quite significant: if Lean spends 20% of its time in the
|
||||||
- __barnes__: a hierarchical n-body particle solver \[4]. Simulates 163840 particles.
|
allocator that means that _mimalloc_ is 1.3× faster than _tcmalloc_
|
||||||
- __leanN__: by Leonardo de Moura _et al_, the [lean](https://github.com/leanprover/lean)
|
here. This is surprising as that is *not* measured in a pure
|
||||||
compiler, version 3.4.1, compiling its own standard library concurrently using N cores (`./lean --make -j N`).
|
allocation benchmark like _alloc-test_. We conjecture that we see this
|
||||||
Big real-world workload with intensive allocation, takes about 1:40s when running on a
|
outsized improvement here because _mimalloc_ has better locality in
|
||||||
single high-end core.
|
the allocation which improves performance for the *other* computations
|
||||||
- __redis__: running the [redis](https://redis.io/) 5.0.3 server on
|
in a program as well.
|
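The estimate above can be reproduced with a back-of-the-envelope model. The sketch below assumes that the non-allocator work takes the same time under both allocators and that the 20% share refers to the faster run. Both are simplifying assumptions, and `allocator_speedup` is a hypothetical helper. With an 8% overall difference this model gives about 1.4×, close to the figure quoted above.

```python
def allocator_speedup(overall_ratio, alloc_fraction):
    """Estimate how much faster the allocator itself is.

    overall_ratio:  slower total time / faster total time (e.g. 1.08).
    alloc_fraction: share of the faster run spent in the allocator.
    Assumes all non-allocator work costs the same in both runs.
    """
    other_work = 1.0 - alloc_fraction          # normalized to the faster run
    slow_alloc_time = overall_ratio - other_work
    return slow_alloc_time / alloc_fraction

print(round(allocator_speedup(1.08, 0.20), 2))
```

The smaller the share of time spent in the allocator, the larger the allocator-level difference that a given overall speedup implies.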
||||||
1 million requests pushing 10 new list elements and then requesting the
|
|
||||||
head 10 elements. Measures the requests handled per second.
|
|
||||||
- __alloc-test__: a modern [allocator test](http://ithare.com/testing-memory-allocators-ptmalloc2-tcmalloc-hoard-jemalloc-while-trying-to-simulate-real-world-loads/)
|
|
||||||
developed by by OLogN Technologies AG at [ITHare.com](http://ithare.com). Simulates intensive allocation workloads with a Pareto
|
|
||||||
size distribution. The `alloc-testN` benchmark runs on N cores doing 100×10<sup>6</sup>
|
|
||||||
allocations per thread with objects up to 1KB in size.
|
|
||||||
Using commit `94f6cb` ([master](https://github.com/node-dot-cpp/alloc-test), 2018-07-04)
|
|
||||||
|
|
||||||
We can see mimalloc outperforms the other allocators moderately but all
|
The _redis_ benchmark shows more differences between the allocators where
|
||||||
these modern allocators perform well.
|
_mimalloc_ is 14\% faster than _jemalloc_. On this benchmark _tbb_ (and _Hoard_) do
|
||||||
In `cfrac`, mimalloc is about 13%
|
not do well and are over 40\% slower.
|
||||||
faster than jemalloc for many small and short-lived allocations.
|
|
||||||
The `cfrac` and `espresso` programs do not use much
|
|
||||||
memory (~1.5MB) so it does not matter too much, but still mimalloc uses about half the resident
|
|
||||||
memory of tcmalloc (and 4× less than Hoard on `espresso`).
|
|
||||||
|
|
||||||
_The `leanN` program is most interesting as a large realistic and concurrent
|
The _larson_ server workload which allocates and frees objects between
|
||||||
workload and there is a 6% speedup over both tcmalloc and jemalloc._ (This is
|
many threads shows even larger differences, where _mimalloc_ is more than
|
||||||
quite significant: if Lean spends (optimistically) 20% of its time in the allocator
|
2.5× faster than _tcmalloc_ and _jemalloc_ which is quite surprising
|
||||||
that implies a 1.5× speedup with mimalloc).
|
for these battle tested allocators -- probably due to the object
|
||||||
The large `redis` benchmark shows a similar speedup.
|
migration between different threads. This is a difficult benchmark for
|
||||||
|
other allocators too where _mimalloc_ is still 48% faster than the next
|
||||||
The `alloc-test` is very allocation intensive and we see the largest
|
fastest (_snmalloc_).
|
||||||
diffrerences here when running with 16 cores in parallel.
|
|
||||||
|
|
||||||
The second benchmark tests specific aspects of the allocators and
|
|
||||||
shows more extreme differences between allocators:
|
|
||||||
|
|
||||||
|
|
||||||
The benchmarks in the second set are (again with N=16):
|
The second benchmark set tests specific aspects of the allocators and
|
||||||
|
shows even more extreme differences between them.
|
||||||
|
|
||||||
- __larson__: by Larson and Krishnan \[2]. Simulates a server workload using 100
|
The _alloc-test_ is very allocation intensive doing millions of
|
||||||
separate threads where
|
allocations in various size classes. The test is scaled such that when an
|
||||||
they allocate and free many objects but leave some objects to
|
allocator performs almost identically on _alloc-test1_ as _alloc-testN_ it
|
||||||
be freed by other threads. Larson and Krishnan observe this behavior
|
means that it scales linearly. Here, _tcmalloc_, _snmalloc_, and
|
||||||
(which they call _bleeding_) in actual server applications, and the
|
_Hoard_ seem to scale less well and do more than 10% worse on the
|
||||||
benchmark simulates this.
|
multi-core version. Even the best allocators (_tcmalloc_ and _jemalloc_) are
|
||||||
- __sh6bench__: by [MicroQuill](http://www.microquill.com) as part of SmartHeap. Stress test for
|
more than 10% slower than _mimalloc_ here.
|
||||||
single-threaded allocation where some of the objects are freed
|
|
||||||
in a usual last-allocated, first-freed (LIFO) order, but others
|
|
||||||
are freed in reverse order. Using the public [source](http://www.microquill.com/smartheap/shbench/bench.zip) (retrieved 2019-01-02)
|
|
||||||
- __sh8bench__: by [MicroQuill](http://www.microquill.com) as part of SmartHeap. Stress test for
|
|
||||||
multithreaded allocation (with N threads) where, just as in `larson`, some objects are freed
|
|
||||||
by other threads, and some objects freed in reverse (as in `sh6bench`).
|
|
||||||
Using the public [source](http://www.microquill.com/smartheap/SH8BENCH.zip) (retrieved 2019-01-02)
|
|
||||||
- __cache-scratch__: by Emery Berger _et al_ \[1]. Introduced with the Hoard
|
|
||||||
allocator to test for _passive-false_ sharing of cache lines: first some
|
|
||||||
small objects are allocated and given to each thread; the threads free that
|
|
||||||
object and allocate another one and access that repeatedly. If an allocator
|
|
||||||
allocates objects from different threads close to each other this will
|
|
||||||
lead to cache-line contention.
|
|
||||||
|
|
||||||
In the `larson` server workload mimalloc is 2.5× faster than
|
Also in _sh6bench_, _mimalloc_ does much
|
||||||
tcmalloc and jemalloc which is quite surprising -- probably due to the object
|
better than the others (more than 2× faster than _jemalloc_).
|
||||||
migration between different threads. Also in `sh6bench` mimalloc does much
|
We cannot explain this well but believe it is
|
||||||
better than the others (more than 4× faster than jemalloc).
|
caused in part by the "reverse" free-ing pattern in _sh6bench_.
|
||||||
We cannot explain this well but believe it may be
|
|
||||||
caused in part by the "reverse" free-ing in `sh6bench`. Again in `sh8bench`
|
|
||||||
the mimalloc allocator handles object migration between threads much better .
|
|
||||||
|
|
||||||
The `cache-scratch` benchmark also demonstrates the different architectures
|
Again in _sh8bench_ the _mimalloc_ allocator handles object migration
|
||||||
of the allocators nicely. With a single thread they all perform the same, but when
|
between threads much better and is over 36% faster than the next best
|
||||||
running with multiple threads the allocator induced false sharing of the
|
allocator, _snmalloc_. Whereas _tcmalloc_ did well on _sh6bench_, the
|
||||||
cache lines causes large run-time differences, where mimalloc is
|
addition of object migration caused it to be almost 3 times slower
|
||||||
20× faster than tcmalloc here. Only the original jemalloc does almost
|
than before.
|
||||||
as well (but the most recent version, jxmalloc, regresses). The
|
|
||||||
Hoard allocator is specifically designed to avoid this false sharing and we
|
|
||||||
are not sure why it is not doing well here (although it still runs almost 5×
|
|
||||||
faster than tcmalloc and jxmalloc).
|
|
||||||
|
|
||||||
## Benchmarks on a 4-core Intel workstation
|
The _xmalloc-testN_ benchmark simulates an asymmetric workload where
|
||||||
|
some threads only allocate, and others only free. The _snmalloc_
|
||||||
|
allocator was especially developed to handle this case well as it
|
||||||
|
often occurs in concurrent message passing systems. Here we see that
|
||||||
|
the _mimalloc_ technique of having non-contended sharded thread free
|
||||||
|
lists pays off and it even outperforms _snmalloc_. Only _jemalloc_
|
||||||
|
also handles this reasonably well, while the others underperform by
|
||||||
|
a large margin. The optimization in _mimalloc_ to do a *delayed free*
|
||||||
|
only once for full pages is quite important -- without it _mimalloc_
|
||||||
|
is almost twice as slow (as then all frees contend again on the
|
||||||
|
single heap delayed free list).
|
||||||
|
|
||||||
|
|
||||||
|
The _cache-scratch_ benchmark also demonstrates the different
|
||||||
|
architectures of the allocators nicely. With a single thread they all
|
||||||
|
perform the same, but when running with multiple threads the allocator
|
||||||
|
induced false sharing of the cache lines causes large run-time
|
||||||
|
differences, where _mimalloc_ is more than 18× faster than _jemalloc_ and
|
||||||
|
_tcmalloc_! Crundal \[6] describes in detail why the false cache line
|
||||||
|
sharing occurs in the _tcmalloc_ design, and also discusses how this
|
||||||
|
can be avoided with some small implementation changes.
|
||||||
|
Only _snmalloc_ and _tbb_ also avoid the
|
||||||
|
cache line sharing like _mimalloc_. Kukanov and Voss \[7] describe in detail
|
||||||
|
how the design of _tbb_ avoids the false cache line sharing.
|
||||||
|
The _Hoard_ allocator is also specifically
|
||||||
|
designed to avoid this false sharing and we are not sure why it is not
|
||||||
|
doing well here (although it still runs 5× as fast as _tcmalloc_).
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
## On a 4-core Intel Xeon workstation
|
||||||
|
|
||||||
|
Below are the benchmark results on an HP
|
||||||
|
Z4-G4 workstation with a 4-core Intel® Xeon® W2123 at 3.6 GHz with 16GB
|
||||||
|
ECC memory, running Ubuntu 18.04.1 with LibC 2.27 and GCC 7.3.0.
|
||||||
|
|
||||||

|

|
||||||

|

|
||||||
|
@ -367,6 +468,23 @@ faster than tcmalloc and jxmalloc).
|
||||||

|

|
||||||

|

|
||||||
|
|
||||||
|
This time SuperMalloc (_sm_) is included as this platform supports
|
||||||
|
hardware transactional memory. Unfortunately,
|
||||||
|
there are no entries for _SuperMalloc_ in the _leanN_ and _xmalloc-testN_ benchmarks
|
||||||
|
as it faulted on those. We also added the secure version of
|
||||||
|
_mimalloc_ as **smi**.
|
||||||
|
|
||||||
|
Overall, the relative results are quite similar to before. Most
|
||||||
|
allocators fare better on the _larsonN_ benchmark now -- either due to
|
||||||
|
architectural changes (AMD vs. Intel) or because there is just less
|
||||||
|
concurrency. Unfortunately, SuperMalloc faulted on the _leanN_
|
||||||
|
and _xmalloc-testN_ benchmarks.
|
||||||
|
|
||||||
|
The secure mimalloc version uses guard pages around each (_mimalloc_) page,
|
||||||
|
encodes the free lists and uses randomized initial free lists, and we
|
||||||
|
expected it would perform quite a bit worse -- but on the first benchmark set
|
||||||
|
it performed only about 3% slower on average, and is second best overall.
|
||||||
|
|
||||||
|
|
||||||
# References
|
# References
|
||||||
|
|
||||||
|
@ -385,3 +503,19 @@ faster than tcmalloc and jxmalloc).
|
||||||
[pdf](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.43.6621&rep=rep1&type=pdf)
|
[pdf](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.43.6621&rep=rep1&type=pdf)
|
||||||
|
|
||||||
- \[4] J. Barnes and P. Hut. _A hierarchical O(n*log(n)) force-calculation algorithm_. Nature, 324:446-449, 1986.
|
- \[4] J. Barnes and P. Hut. _A hierarchical O(n*log(n)) force-calculation algorithm_. Nature, 324:446-449, 1986.
|
||||||
|
|
||||||
|
- \[5] C. Lever, and D. Boreham. _Malloc() Performance in a Multithreaded Linux Environment._
|
||||||
|
In USENIX Annual Technical Conference, Freenix Session. San Diego, CA. Jun. 2000.
|
||||||
|
Available at <https://github.com/kuszmaul/SuperMalloc/tree/master/tests>
|
||||||
|
|
||||||
|
- \[6] Timothy Crundal. _Reducing Active-False Sharing in TCMalloc._
|
||||||
|
2016. <http://courses.cecs.anu.edu.au/courses/CSPROJECTS/16S1/Reports/Timothy_Crundal_Report.pdf>. CS16S1 project at the Australian National University.
|
||||||
|
|
||||||
|
- \[7] Alexey Kukanov, and Michael J Voss.
|
||||||
|
_The Foundations for Scalable Multi-Core Software in Intel Threading Building Blocks._
|
||||||
|
Intel Technology Journal 11 (4). 2007
|
||||||
|
|
||||||
|
- \[8] Paul Liétar, Theodore Butler, Sylvan Clebsch, Sophia Drossopoulou, Juliana Franco, Matthew J Parkinson,
|
||||||
|
Alex Shamis, Christoph M Wintersteiger, and David Chisnall.
|
||||||
|
_Snmalloc: A Message Passing Allocator._
|
||||||
|
In Proceedings of the 2019 ACM SIGPLAN International Symposium on Memory Management, 122–135. ACM. 2019.
|
||||||
|
|