Mirror of https://github.com/microsoft/mimalloc.git (synced 2025-05-06 23:39:31 +03:00)
update readme to be more short
parent 5b6e04774a
commit 5cc8ae4f43
1 changed file with 130 additions and 7 deletions
readme.md (137 lines changed)
@@ -67,6 +67,7 @@ in the entire program.
 
 We use [`cmake`](https://cmake.org)<sup>1</sup> as the build system:
 
+- `mkdir -p out/release`
 - `cd out/release`
 - `cmake ../..` (generate the make file)
 - `make` (and build)
@@ -81,16 +82,22 @@ We use [`cmake`](https://cmake.org)<sup>1</sup> as the build system:
 You can build the debug version which does many internal checks and
 maintains detailed statistics as:
 
+- `mkdir -p out/debug`
 - `cd out/debug`
 - `cmake -DCMAKE_BUILD_TYPE=Debug ../..`
 - `make`
 
 This will name the shared library as `libmimalloc-debug.so`.
 
-Or build with `clang`:
+Finally, you can build a _secure_ version that uses guard pages, encrypted
+free lists, etc, as:
 
-- `CC=clang cmake ../..`
+- `mkdir -p out/secure`
+- `cd out/secure`
+- `cmake -DSECURE=ON ../..`
+- `make`
 
+This will name the shared library as `libmimalloc-secure.so`.
 Use `ccmake`<sup>2</sup> instead of `cmake`
 to see and customize all the available build options.
 
@@ -99,6 +106,7 @@ Notes:
 2. Install CCMake: `sudo apt-get install cmake-curses-gui`
 
 
+
 # Using the library
 
 The preferred usage is including `<mimalloc.h>`, linking with
@@ -214,7 +222,7 @@ the `mimalloc-override-test` project for an example.
 
 On Unix systems, you can also statically link with _mimalloc_ to override the standard
 malloc interface. The recommended way is to link the final program with the
-_mimalloc_ single object file (`mimalloc-override.o` (or `.obj`)). We use
+_mimalloc_ single object file (`mimalloc-override.o`). We use
 an object file instead of a library file as linkers give preference to
 that over archives to resolve symbols. To ensure that the standard
 malloc interface resolves to the _mimalloc_ library, link it as the first
@@ -232,6 +240,11 @@ range of benchmarks, ranging from various real world programs to
 synthetic benchmarks that see how the allocator behaves under more
 extreme circumstances.
 
+In our benchmarks, _mimalloc_ always outperforms all other leading
+allocators (_jemalloc_, _tcmalloc_, _Hoard_, etc), and usually uses less
+memory (up to 25% more in the worst case). A nice property is that it
+does *consistently* well over the wide range of benchmarks.
+
 Allocators are interesting as there exists no algorithm that is generally
 optimal -- for a given allocator one can usually construct a workload
 where it does not do so well. The goal is thus to find an allocation
@@ -239,15 +252,124 @@ strategy that performs well over a wide range of benchmarks without
 suffering from underperformance in less common situations (which is what
 the second half of our benchmark set tests for).
 
-In our benchmarks, _mimalloc_ always outperforms all other leading
-allocators (_jemalloc_, _tcmalloc_, _Hoard_, etc), and usually uses less
-memory (up to 25% more in the worst case). A nice property is that it
-does *consistently* well over the wide range of benchmarks.
+We show here only the results on an AMD EPYC system (Apr 2019) -- for
+specific details and further benchmarks we refer to the technical report.
 
 The benchmark suite is scripted and available separately
 as [mimalloc-bench](https://github.com/daanx/mimalloc-bench).
 
 
+## On a 16-core AMD EPYC running Linux
+
+Testing on a big Amazon EC2 instance ([r5a.4xlarge](https://aws.amazon.com/ec2/instance-types/))
+consisting of a 16-core AMD EPYC 7000 at 2.5GHz
+with 128GB ECC memory, running Ubuntu 18.04.1 with LibC 2.27 and GCC 7.3.0.
+The measured allocators are _mimalloc_ (**mi**),
+Google's [_tcmalloc_](https://github.com/gperftools/gperftools) (**tc**) used in Chrome,
+[_jemalloc_](https://github.com/jemalloc/jemalloc) (**je**) by Jason Evans used in Firefox and FreeBSD,
+[_snmalloc_](https://github.com/microsoft/snmalloc) (**sn**) by Liétar et al. \[8], [_rpmalloc_](https://github.com/rampantpixels/rpmalloc) (**rp**) by Mattias Jansson at Rampant Pixels,
+[_Hoard_](https://github.com/emeryberger/Hoard) by Emery Berger \[1],
+the system allocator (**glibc**) (based on _PtMalloc2_), and the Intel thread
+building blocks [allocator](https://github.com/intel/tbb) (**tbb**).
+
+
+
+
+Memory usage:
+
+
+
+
+(note: the _xmalloc-testN_ memory usage should be disregarded is it
+allocates more the faster the program runs).
+
+In the first five benchmarks we can see _mimalloc_ outperforms the other
+allocators moderately, but we also see that all these modern allocators
+perform well -- the times of large performance differences in regular
+workloads are over :-).
+In _cfrac_ and _espresso_, _mimalloc_ is a tad faster than _tcmalloc_ and
+_jemalloc_, but a solid 10\% faster than all other allocators on
+_espresso_. The _tbb_ allocator does not do so well here and lags more than
+20\% behind _mimalloc_. The _cfrac_ and _espresso_ programs do not use much
+memory (~1.5MB) so it does not matter too much, but still _mimalloc_ uses
+about half the resident memory of _tcmalloc_.
+
+The _leanN_ program is most interesting as a large realistic and
+concurrent workload of the [Lean](https://github.com/leanprover/lean) theorem prover
+compiling its own standard library, and there is a 8% speedup over _tcmalloc_. This is
+quite significant: if Lean spends 20% of its time in the
+allocator that means that _mimalloc_ is 1.3× faster than _tcmalloc_
+here. This is surprising as that is *not* measured in a pure
+allocation benchmark like _alloc-test_. We conjecture that we see this
+outsized improvement here because _mimalloc_ has better locality in
+the allocation which improves performance for the *other* computations
+in a program as well.
+
+The _redis_ benchmark shows more differences between the allocators where
+_mimalloc_ is 14\% faster than _jemalloc_. On this benchmark _tbb_ (and _Hoard_) do
+not do well and are over 40\% slower.
+
+The _larson_ server workload allocates and frees objects between
+many threads. Larson and Krishnan \[2] observe this
+behavior (which they call _bleeding_) in actual server applications, and the
+benchmark simulates this.
+Here, _mimalloc_ is more than 2.5× faster than _tcmalloc_ and _jemalloc_
+due to the object migration between different threads. This is a difficult
+benchmark for other allocators too where _mimalloc_ is still 48% faster than the next
+fastest (_snmalloc_).
+
+
+The second benchmark set tests specific aspects of the allocators and
+shows even more extreme differences between them.
+
+The _alloc-test_, by
+[OLogN Technologies AG](http://ithare.com/testing-memory-allocators-ptmalloc2-tcmalloc-hoard-jemalloc-while-trying-to-simulate-real-world-loads/), is a very allocation intensive benchmark doing millions of
+allocations in various size classes. The test is scaled such that when an
+allocator performs almost identically on _alloc-test1_ as _alloc-testN_ it
+means that it scales linearly. Here, _tcmalloc_, _snmalloc_, and
+_Hoard_ seem to scale less well and do more than 10% worse on the
+multi-core version. Even the best allocators (_tcmalloc_ and _jemalloc_) are
+more than 10% slower as _mimalloc_ here.
+
+The _sh6bench_ and _sh8bench_ benchmarks are
+developed by [MicroQuill](http://www.microquill.com/) as part of SmartHeap.
+In _sh6bench_ _mimalloc_ does much
+better than the others (more than 2× faster than _jemalloc_).
+We cannot explain this well but believe it is
+caused in part by the "reverse" free-ing pattern in _sh6bench_.
+Again in _sh8bench_ the _mimalloc_ allocator handles object migration
+between threads much better and is over 36% faster than the next best
+allocator, _snmalloc_. Whereas _tcmalloc_ did well on _sh6bench_, the
+addition of object migration caused it to be almost 3 times slower
+than before.
+
+The _xmalloc-testN_ benchmark by Lever and Boreham \[5] and Christian Eder,
+simulates an asymmetric workload where
+some threads only allocate, and others only free. The _snmalloc_
+allocator was especially developed to handle this case well as it
+often occurs in concurrent message passing systems (like the [Pony] language
+for which _snmalloc_ was initially developed). Here we see that
+the _mimalloc_ technique of having non-contended sharded thread free
+lists pays off as it even outperforms _snmalloc_ here.
+Only _jemalloc_ also handles this reasonably well, while the
+others underperform by a large margin.
+
+The _cache-scratch_ benchmark by Emery Berger \[1], and introduced with the Hoard
+allocator to test for _passive-false_ sharing of cache lines. With a single thread they all
+perform the same, but when running with multiple threads the potential allocator
+induced false sharing of the cache lines causes large run-time
+differences, where _mimalloc_ is more than 18× faster than _jemalloc_ and
+_tcmalloc_! Crundal \[6] describes in detail why the false cache line
+sharing occurs in the _tcmalloc_ design, and also discusses how this
+can be avoided with some small implementation changes.
+Only _snmalloc_ and _tbb_ also avoid the
+cache line sharing like _mimalloc_. Kukanov and Voss \[7] describe in detail
+how the design of _tbb_ avoids the false cache line sharing.
+
+
+
+<!--
+
 ## Tested Allocators
 
 We tested _mimalloc_ with 9 leading allocators over 12 benchmarks
@@ -495,6 +617,7 @@ encodes the free lists and uses randomized initial free lists, and we
 expected it would perform quite a bit worse -- but on the first benchmark set
 it performed only about 3% slower on average, and is second best overall.
 
+-->
 
 # References
 
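
The release, debug, and secure build steps that this change documents all follow the same pattern: create `out/<name>`, enter it, run `cmake` with at most one extra option, then run `make`. The sketch below wraps that pattern in a single shell function. It is only an illustration: `build_variant` and `DRY_RUN` are hypothetical names that are not part of mimalloc, and `-DSECURE=ON` is the option name as written in this revision of the readme, which may differ in other revisions.

```shell
# Sketch: build one variant in its own out/<name> directory.
# build_variant and DRY_RUN are illustrative names, not part of mimalloc.
# Usage: build_variant <name> [extra cmake options...]
build_variant() {
  name=$1
  shift
  opts=$*
  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Only print the commands from the readme; do not run anything.
    echo "mkdir -p out/$name"
    echo "cd out/$name"
    echo "cmake ${opts:+$opts }../.."
    echo "make"
    return 0
  fi
  mkdir -p "out/$name"
  # Run in a subshell so the caller's working directory is left unchanged.
  ( cd "out/$name" && cmake "$@" ../.. && make )
}
```

Run from the repository root, `build_variant release`, `build_variant debug -DCMAKE_BUILD_TYPE=Debug`, and `build_variant secure -DSECURE=ON` would correspond to the three lists of commands in the readme; setting `DRY_RUN=1` first prints the commands instead of running them.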