ptmalloc3 - a multi-thread malloc implementation
================================================

Wolfram Gloger (wg@malloc.de)

Jan 2006


Thanks
======

This release was partly funded by Pixar Animation Studios. I would
like to thank David Baraff of Pixar for his support and Doug Lea
(dl@cs.oswego.edu) for the great original malloc implementation.


Introduction
============

This package is a modified version of Doug Lea's malloc-2.8.3
implementation (available separately from ftp://g.oswego.edu/pub/misc)
that I adapted for multiple threads, while trying to avoid lock
contention as much as possible.

As part of the GNU C library, the source files may be available under
the GNU Library General Public License (see the comments in the
files). But as part of this stand-alone package, the code is also
available under the (probably less restrictive) conditions described
in the file 'COPYRIGHT'. In any case, there is no warranty whatsoever
for this package.

The current distribution should be available from:

http://www.malloc.de/malloc/ptmalloc3.tar.gz


Compilation
===========

It should be possible to build ptmalloc3 on any UN*X-like system that
implements the sbrk(), mmap(), munmap() and mprotect() calls. Since
there are now several source files, a library (libptmalloc3.a) is
generated.
See the Makefile for examples of the compile-time options.

Note that support for non-ANSI compilers is no longer there.

Several example targets are provided in the Makefile:

 o Posix threads (pthreads), compile with "make posix"

 o Posix threads with explicit initialization, compile with
   "make posix-explicit" (known to be required on HPUX)

 o Posix threads without "tsd data hack" (see below), compile with
   "make posix-with-tsd"

 o Solaris threads, compile with "make solaris"

 o SGI sproc() threads, compile with "make sproc"

 o no threads, compile with "make nothreads" (currently out of order?)

For Linux:

 o make "linux-pthread" (almost the same as "make posix")
   or make "linux-shared"

Note that some compilers need special flags for multi-threaded code,
e.g. with Solaris cc with Posix threads, one should use:

% make posix SYS_FLAGS='-mt'

Some additional targets, ending in `-libc', are also provided in the
Makefile, to compare performance of the test programs to the case when
linking with the standard malloc implementation in libc.

A potential problem remains: If any of the system-specific functions
for getting/setting thread-specific data or for locking a mutex call
one of the malloc-related functions internally, the implementation
cannot work at all due to infinite recursion. One example seems to be
Solaris 2.4. I would like to hear if this problem occurs on other
systems, and whether similar workarounds could be applied.

For Posix threads, too, an optional hack like that has been integrated
(activated when defining USE_TSD_DATA_HACK) which depends on
`pthread_t' being convertible to an integral type (which is of course
not generally guaranteed). USE_TSD_DATA_HACK is now the default
because I haven't yet found a non-glibc pthreads system where this
hack is _not_ needed.

*NEW* and _important_: In (currently) one place in the ptmalloc3
source, a write memory barrier is needed, named
atomic_write_barrier(). This macro needs to be defined at the end of
malloc-machine.h.
For gcc, a fallback in the form of a full memory
barrier is already defined, but you may need to add another definition
if you don't use gcc.


Usage
=====

Just link libptmalloc3 into your application.

Some wicked systems (e.g. HPUX apparently) won't let malloc call _any_
thread-related functions before main(). On these systems,
USE_STARTER=2 must be defined during compilation (see
"make posix-explicit" above) and the global initialization function
ptmalloc_init() must be called explicitly, preferably at the start of
main().

Otherwise, when using ptmalloc3, no special precautions are necessary.


Link order is important
=======================

On some systems, when overriding malloc and linking against shared
libraries, the link order becomes very important. E.g., when linking
C++ programs on Solaris with Solaris threads [this is probably now
obsolete], don't rely on libC being included by default, but instead
put `-lthread' behind `-lC' on the command line:

  CC ... libptmalloc3.a -lC -lthread

This is because there are global constructors in libC that need
malloc/ptmalloc, which in turn needs the thread library to be
already initialized.


Debugging hooks
===============

All calls to malloc(), realloc(), free() and memalign() are routed
through the global function pointers __malloc_hook, __realloc_hook,
__free_hook and __memalign_hook if they are not NULL (see the malloc.h
header file for declarations of these pointers). Therefore the malloc
implementation can be changed at runtime, if care is taken not to call
free() or realloc() on pointers obtained with a different
implementation than the one currently in effect.
(The easiest way to
guarantee this is to set up the hooks before any malloc call, e.g.
with a function pointed to by the global variable
__malloc_initialize_hook).

You can now also tune other malloc parameters (normally adjusted via
mallopt() calls from the application) with environment variables:

    MALLOC_TRIM_THRESHOLD_    for deciding to shrink the heap (in bytes)

    MALLOC_GRANULARITY_       The unit for allocating and deallocating
    MALLOC_TOP_PAD_           memory from the system. The default
                              is 64k and this parameter _must_ be a
                              power of 2.

    MALLOC_MMAP_THRESHOLD_    min. size for chunks allocated via
                              mmap() (in bytes)


Tests
=====

Two testing applications, t-test1 and t-test2, are included in this
source distribution. Both perform pseudo-random sequences of
allocations/frees, and can be given numeric arguments (all arguments
are optional):

% t-test[12] <n-total> <n-parallel> <n-allocs> <size-max> <bins>

    n-total    = total number of threads executed (default 10)
    n-parallel = number of threads running in parallel (2)
    n-allocs   = number of malloc()'s / free()'s per thread (10000)
    size-max   = max. size requested with malloc() in bytes (10000)
    bins       = number of bins to maintain

The first test `t-test1' maintains a completely separate pool of
allocated bins for each thread, and should therefore show full
parallelism. On the other hand, `t-test2' creates only a single pool
of bins, and each thread randomly allocates/frees any bin. Some lock
contention is to be expected in this case, as the threads frequently
cross each other's arenas.

Performance results from t-test1 should be quite repeatable, while the
behaviour of t-test2 depends on scheduling variations.


Conclusion
==========

I'm always interested in performance data and feedback, just send mail
to ptmalloc@malloc.de.

Good luck!