diff --git a/readme.md b/readme.md
index dcfdbd53..59050790 100644
--- a/readme.md
+++ b/readme.md
@@ -67,6 +67,7 @@ in the entire program.
We use [`cmake`](https://cmake.org)<sup>1</sup> as the build system:
+- `mkdir -p out/release`
- `cd out/release`
- `cmake ../..` (generate the make file)
- `make` (and build)
@@ -81,16 +82,22 @@ We use [`cmake`](https://cmake.org)<sup>1</sup> as the build system:
You can build the debug version which does many internal checks and
maintains detailed statistics as:
+- `mkdir -p out/debug`
- `cd out/debug`
- `cmake -DCMAKE_BUILD_TYPE=Debug ../..`
- `make`
This will name the shared library as `libmimalloc-debug.so`.
-Or build with `clang`:
+Finally, you can build a _secure_ version that uses guard pages, encrypted
+free lists, etc., as:
-- `CC=clang cmake ../..`
+- `mkdir -p out/secure`
+- `cd out/secure`
+- `cmake -DSECURE=ON ../..`
+- `make`
+This will name the shared library as `libmimalloc-secure.so`.
Use `ccmake`<sup>2</sup> instead of `cmake`
to see and customize all the available build options.
@@ -99,6 +106,7 @@ Notes:
2. Install CCMake: `sudo apt-get install cmake-curses-gui`
+
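+As a quick sanity check of a build, here is a minimal program that calls the
+`mi_` API directly (a sketch: it assumes the header and library from the build
+above are on the compiler's search paths):
+
+```c
+#include <stdio.h>
+#include <mimalloc.h>
+
+int main(void) {
+  // allocate from mimalloc explicitly and release it again
+  char* msg = (char*)mi_malloc(32);
+  snprintf(msg, 32, "hello, mimalloc");
+  puts(msg);
+  mi_free(msg);
+  return 0;
+}
+```
+
+Compile and link against the built shared library, for example with
+`gcc test.c -lmimalloc` (adding `-I`/`-L` paths for the build directory as
+needed).
+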
# Using the library
The preferred usage is including `mimalloc.h`, linking with
@@ -214,7 +222,7 @@ the `mimalloc-override-test` project for an example.
On Unix systems, you can also statically link with _mimalloc_ to override the standard
malloc interface. The recommended way is to link the final program with the
-_mimalloc_ single object file (`mimalloc-override.o` (or `.obj`)). We use
+_mimalloc_ single object file (`mimalloc-override.o`). We use
an object file instead of a library file as linkers give preference to
that over archives to resolve symbols. To ensure that the standard
malloc interface resolves to the _mimalloc_ library, link it as the first
@@ -232,6 +240,11 @@ range of benchmarks, ranging from various real world programs to
synthetic benchmarks that see how the allocator behaves under more
extreme circumstances.
+In our benchmarks, _mimalloc_ always outperforms all other leading
+allocators (_jemalloc_, _tcmalloc_, _Hoard_, etc), and usually uses less
+memory (though up to 25% more in the worst case). A nice property is that it
+does *consistently* well over the wide range of benchmarks.
+
Allocators are interesting as there exists no algorithm that is generally
optimal -- for a given allocator one can usually construct a workload
where it does not do so well. The goal is thus to find an allocation
@@ -239,15 +252,124 @@ strategy that performs well over a wide range of benchmarks without
suffering from underperformance in less common situations (which is what
the second half of our benchmark set tests for).
-In our benchmarks, _mimalloc_ always outperforms all other leading
-allocators (_jemalloc_, _tcmalloc_, _Hoard_, etc), and usually uses less
-memory (up to 25% more in the worst case). A nice property is that it
-does *consistently* well over the wide range of benchmarks.
+We show here only the results on an AMD EPYC system (Apr 2019) -- for
+specific details and further benchmarks we refer to the technical report.
The benchmark suite is scripted and available separately
as [mimalloc-bench](https://github.com/daanx/mimalloc-bench).
+## On a 16-core AMD EPYC running Linux
+
+Testing was done on a big Amazon EC2 instance ([r5a.4xlarge](https://aws.amazon.com/ec2/instance-types/))
+with a 16-core AMD EPYC 7000 at 2.5GHz
+and 128GB ECC memory, running Ubuntu 18.04.1 with glibc 2.27 and GCC 7.3.0.
+The measured allocators are _mimalloc_ (**mi**),
+Google's [_tcmalloc_](https://github.com/gperftools/gperftools) (**tc**) used in Chrome,
+[_jemalloc_](https://github.com/jemalloc/jemalloc) (**je**) by Jason Evans used in Firefox and FreeBSD,
+[_snmalloc_](https://github.com/microsoft/snmalloc) (**sn**) by Liétar et al. \[8],
+[_rpmalloc_](https://github.com/rampantpixels/rpmalloc) (**rp**) by Mattias Jansson at Rampant Pixels,
+[_Hoard_](https://github.com/emeryberger/Hoard) by Emery Berger \[1],
+the system allocator (**glibc**, based on _PtMalloc2_), and the Intel thread
+building blocks [allocator](https://github.com/intel/tbb) (**tbb**).
+
+_(figure: benchmark timings)_
+
+Memory usage:
+
+_(figure: benchmark memory usage)_
+
+(note: the _xmalloc-testN_ memory usage should be disregarded as it
+allocates more the faster the program runs).
+
+In the first five benchmarks we can see that _mimalloc_ outperforms the other
+allocators moderately, but we also see that all these modern allocators
+perform well -- the times of large performance differences in regular
+workloads are over :-).
+In _cfrac_ and _espresso_, _mimalloc_ is a tad faster than _tcmalloc_ and
+_jemalloc_, but a solid 10% faster than all other allocators on
+_espresso_. The _tbb_ allocator does not do so well here and lags more than
+20% behind _mimalloc_. The _cfrac_ and _espresso_ programs do not use much
+memory (~1.5MB) so it does not matter too much, but still _mimalloc_ uses
+about half the resident memory of _tcmalloc_.
+
+The _leanN_ program is the most interesting as a large realistic and
+concurrent workload of the [Lean](https://github.com/leanprover/lean) theorem prover
+compiling its own standard library, and there is an 8% speedup over _tcmalloc_. This is
+quite significant: if Lean spends 20% of its time in the
+allocator, that means _mimalloc_ is about 1.7× faster than _tcmalloc_
+here (0.20 / (0.20 - 0.08) ≈ 1.7). This is surprising, as such a gain is
+*not* measured in a pure allocation benchmark like _alloc-test_. We
+conjecture that we see this outsized improvement because _mimalloc_ has
+better locality in its allocations, which improves performance of the
+*other* computations in a program as well.
+
+The _redis_ benchmark shows more differences between the allocators, where
+_mimalloc_ is 14% faster than _jemalloc_. On this benchmark _tbb_ (and _Hoard_) do
+not do well and are over 40% slower.
+
+The _larson_ server workload allocates and frees objects between
+many threads. Larson and Krishnan \[2] observe this
+behavior (which they call _bleeding_) in actual server applications, and the
+benchmark simulates this.
+Here, _mimalloc_ is more than 2.5× faster than _tcmalloc_ and _jemalloc_
+due to its handling of object migration between different threads. This is a difficult
+benchmark for other allocators too, and _mimalloc_ is still 48% faster than the next
+fastest (_snmalloc_).
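+
+To make the pattern concrete, here is a small sketch of such bleeding (an
+illustration of the access pattern only, not the _larson_ benchmark itself):
+one thread allocates blocks that a second thread later frees.
+
+```c
+#include <pthread.h>
+#include <stdlib.h>
+
+#define N 1000
+
+static void* blocks[N];            // handed from producer to consumer
+
+static void* producer(void* arg) {
+  (void)arg;
+  for (int i = 0; i < N; i++)
+    blocks[i] = malloc(64);        // allocated on this thread...
+  return NULL;
+}
+
+static void* consumer(void* arg) {
+  (void)arg;
+  for (int i = 0; i < N; i++)
+    free(blocks[i]);               // ...freed on another thread
+  return NULL;
+}
+
+int main(void) {
+  pthread_t p, c;
+  pthread_create(&p, NULL, producer, NULL);
+  pthread_join(p, NULL);           // join first to keep the sketch race-free
+  pthread_create(&c, NULL, consumer, NULL);
+  pthread_join(c, NULL);
+  return 0;
+}
+```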
+
+
+The second benchmark set tests specific aspects of the allocators and
+shows even more extreme differences between them.
+
+The _alloc-test_, by
+[OLogN Technologies AG](http://ithare.com/testing-memory-allocators-ptmalloc2-tcmalloc-hoard-jemalloc-while-trying-to-simulate-real-world-loads/), is a very allocation-intensive benchmark doing millions of
+allocations in various size classes. The test is scaled such that an
+allocator that performs almost identically on _alloc-test1_ and _alloc-testN_
+scales linearly. Here, _tcmalloc_, _snmalloc_, and
+_Hoard_ seem to scale less well and do more than 10% worse on the
+multi-core version. Even the best allocators (_tcmalloc_ and _jemalloc_) are
+more than 10% slower than _mimalloc_ here.
+
+The _sh6bench_ and _sh8bench_ benchmarks are
+developed by [MicroQuill](http://www.microquill.com/) as part of SmartHeap.
+In _sh6bench_, _mimalloc_ does much
+better than the others (more than 2× faster than _jemalloc_).
+We cannot explain this well, but believe it is
+caused in part by the "reverse" freeing pattern in _sh6bench_.
+In _sh8bench_, the _mimalloc_ allocator again handles object migration
+between threads much better and is over 36% faster than the next best
+allocator, _snmalloc_. Whereas _tcmalloc_ did well on _sh6bench_, the
+addition of object migration in _sh8bench_ caused it to be almost 3×
+slower than it was before.
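+
+For illustration, the "reverse" freeing pattern mentioned above frees blocks
+in the opposite order from which they were allocated; a minimal sketch of
+that pattern (not the actual _sh6bench_ source):
+
+```c
+#include <stdlib.h>
+
+int main(void) {
+  enum { N = 1000 };
+  void* blocks[N];
+  for (int i = 0; i < N; i++)
+    blocks[i] = malloc(64);        // allocate in order 0..N-1
+  for (int i = N - 1; i >= 0; i--)
+    free(blocks[i]);               // free in reverse order N-1..0
+  return 0;
+}
+```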
+
+The _xmalloc-testN_ benchmark, by Lever and Boreham \[5] and Christian Eder,
+simulates an asymmetric workload where
+some threads only allocate, and others only free. The _snmalloc_
+allocator was designed especially to handle this case well, as it
+often occurs in concurrent message-passing systems (like the [Pony] language
+for which _snmalloc_ was initially developed). Here we see that
+the _mimalloc_ technique of having non-contended, sharded thread free
+lists pays off, as it even outperforms _snmalloc_ here.
+Only _jemalloc_ also handles this reasonably well, while the
+others underperform by a large margin.
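+
+As a rough sketch of that technique (a simplified model for illustration,
+not mimalloc's actual data structures): each page keeps a free list that only
+its owning thread touches, plus a separate atomic list that other threads
+push freed blocks onto, so cross-thread frees never contend with the owner's
+allocations:
+
+```c
+#include <stdatomic.h>
+#include <stddef.h>
+
+typedef struct block_s {
+  struct block_s* next;
+} block_t;
+
+typedef struct page_s {
+  block_t* local_free;            // only ever touched by the owning thread
+  _Atomic(block_t*) thread_free;  // frees arriving from other threads
+} page_t;
+
+// Non-owning threads: lock-free push of a freed block onto the shared list.
+static void page_thread_free(page_t* page, block_t* block) {
+  block_t* head = atomic_load_explicit(&page->thread_free, memory_order_relaxed);
+  do {
+    block->next = head;
+  } while (!atomic_compare_exchange_weak_explicit(
+      &page->thread_free, &head, block,
+      memory_order_release, memory_order_relaxed));
+}
+
+// Owning thread: when its local list runs dry, take the whole shared list
+// with one exchange and continue allocating from it without contention.
+static void page_collect(page_t* page) {
+  page->local_free = atomic_exchange_explicit(&page->thread_free, NULL,
+                                              memory_order_acquire);
+}
+
+int main(void) {
+  static block_t blocks[3];
+  page_t page = { NULL, NULL };
+  for (int i = 0; i < 3; i++)
+    page_thread_free(&page, &blocks[i]);  // as if freed by other threads
+  page_collect(&page);                    // owner drains the shared list
+  return page.local_free == &blocks[2] ? 0 : 1;
+}
+```
+
+Cross-thread frees thus cost one compare-and-swap each, while the owner's
+allocation fast path never takes a lock.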
+
+The _cache-scratch_ benchmark by Emery Berger \[1] was introduced with the Hoard
+allocator to test for _passive_ false sharing of cache lines. With a single thread all
+allocators perform the same, but when running with multiple threads, potential
+allocator-induced false sharing of cache lines causes large run-time
+differences, where _mimalloc_ is more than 18× faster than _jemalloc_ and
+_tcmalloc_! Crundal \[6] describes in detail why the false cache line
+sharing occurs in the _tcmalloc_ design, and also discusses how this
+can be avoided with some small implementation changes.
+Only _snmalloc_ and _tbb_ avoid the false
+cache line sharing like _mimalloc_ does. Kukanov and Voss \[7] describe in detail
+how the design of _tbb_ avoids false cache line sharing.
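+
+To illustrate allocator-induced false sharing (a simplified sketch in the
+spirit of _cache-scratch_, not the benchmark itself): each thread frees an
+object given to it by the parent and allocates a replacement; an allocator
+without protection against this may return memory on a cache line that
+another thread is writing to, so the hot loops ping-pong that line between
+cores:
+
+```c
+#include <pthread.h>
+#include <stdlib.h>
+
+static void* worker(void* arg) {
+  // free the parent's object, then allocate a replacement; a naive
+  // allocator may place this fresh 8-byte object on a cache line it
+  // also hands out to another thread
+  free(arg);
+  volatile char* p = (volatile char*)malloc(8);
+  for (long i = 0; i < 100000000; i++)
+    p[0]++;                        // hot writes: false sharing shows up here
+  free((void*)p);
+  return NULL;
+}
+
+int main(void) {
+  pthread_t t[2];
+  for (int i = 0; i < 2; i++)
+    pthread_create(&t[i], NULL, worker, malloc(8));  // parent allocates
+  for (int i = 0; i < 2; i++)
+    pthread_join(t[i], NULL);
+  return 0;
+}
+```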
+
# References