malloc() Performance in a
Multithreaded Linux Environment
Chuck Lever, Netscape Communications Corp.
David Boreham, Netscape Communications Corp.
linux-scalability@umich.edu
$Id: malloc.html,v 1.9 1999/11/12 20:12:54 cel Exp $
Abstract
This report describes a simple malloc() benchmark that can
drive multithreaded loads in order to test the ability of
malloc() to divide its work efficiently among multiple
threads and processors.
We discuss initial results of the benchmarks, and analyze the results
with an eye towards identifying areas of the implementation of
malloc() and Linux itself that can be improved.
This document is Copyright © 1999
Netscape Communications Corp.,
all rights reserved.
Trademarked material referenced in this document is copyright
by its respective owner.
Introduction
Many modern network servers use multithreading to take advantage of
I/O concurrency and multiple CPUs.
As network services scale up to tens of thousands of clients per server,
their architecture depends more and more on the ability of the underlying
operating system to support multithreading efficiently, especially in
library routines that are provided not by the applications' designers,
but by the OS.
An example of a heavily-used programming interface that will need to
scale well with the number of threads is the memory allocator, known
in Unix as malloc().
Malloc() makes use of several important system facilities,
including mutex locking and virtual memory page allocation.
Thus, analyzing the performance of malloc() in a multithreaded
and multi-CPU environment can provide important information about
potential system inefficiency.
Finding ways to
improve the performance of malloc() can
benefit the performance of any sophisticated multithreaded
application, such as network servers.
This report describes a simple malloc() benchmark that can
drive multithreaded loads in order to test the ability of
malloc() to divide its work efficiently among multiple
threads and processors.
We discuss initial results of the benchmarks, and analyze the results
with an eye towards identifying areas of the implementation of
malloc() and Linux itself that can be improved.
A look at glibc's malloc()
Most modern distributions of Linux use glibc version 2.0
as their C library.
glibc's implementors have adopted Wolfram Gloger's
ptmalloc as the glibc implementation of
malloc().
ptmalloc has many desirable properties,
including multiple heaps to reduce heap contention
among threads sharing a single C library invocation.
ptmalloc is based on Doug Lea's original implementation
of malloc(). Lea's malloc() had several goals,
including improving portability, space and time utilization, and
adding tunable parameters to control allocation behavior.
Lea was also greatly concerned about software re-use: too often,
application developers, frustrated by the inappropriate behavior
of existing memory allocators, would write yet another specialized
memory allocation scheme rather than re-use one that already existed.
Gloger's update to Lea's original retains these desirable behaviors,
and adds multithreading ability and some nice debugging extensions.
As the C library is built on most Linux distributions, the
debugging extensions and tunability are compiled out, so it is
necessary to rebuild the C library, or pre-load a separate version
of malloc(), in order to take advantage of these features.
ptmalloc also makes use of both mmap() and
sbrk() on Linux when allocating arenas.
Of course, both of these system calls are essentially the
same under the covers, using anonymous maps to provide large
pageable areas of virtual memory to processes.
Optimizing the allocation of anonymous maps and reducing the
overhead of these calls by having malloc() ask for
larger chunks at a time are two possible ways of helping
performance in this area.
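To make this concrete, here is a minimal sketch, not code taken from
ptmalloc itself, of how a program can obtain a large pageable region
through either interface; the 4 MB region size is an arbitrary choice
for illustration.

/*
 * Sketch: obtaining a large region of anonymous memory two ways.
 * Illustrative only; ptmalloc's actual arena code differs.
 */
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define ARENA_SIZE (4 * 1024 * 1024)	/* arbitrary 4 MB region */

int main(void)
{
	/*
	 * Anonymous, private mapping: the kernel supplies zero-filled,
	 * demand-paged memory that is not backed by any file.
	 */
	void *arena = mmap(NULL, ARENA_SIZE, PROT_READ | PROT_WRITE,
			   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (arena == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/*
	 * Extending the traditional heap with sbrk() ends up using the
	 * same kind of anonymous, demand-paged memory under the covers.
	 */
	void *brk_chunk = sbrk(ARENA_SIZE);
	if (brk_chunk == (void *) -1) {
		perror("sbrk");
		return 1;
	}

	printf("mmap region at %p, sbrk chunk at %p\n", arena, brk_chunk);
	return 0;
}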
Benchmark description
We've written a simple multithreaded program that invokes
malloc() and
free() in a loop, and times
the results.
For a listing of the benchmark program, see
Appendix B.
Benchmark data is contained in tables in
Appendix C.
The benchmark host is a dual-processor 200 MHz Pentium Pro with 128 MB
of RAM and an Intel i440FX mainboard.
The operating system is Red Hat's 5.1 Linux distribution,
which comes with glibc 2.0.6.
We replaced the 5.1 distribution's kernel with kernel version 2.2.0-pre4.
gettimeofday()'s resolution on this
hardware is 2-3 microseconds.
During the tests, the machine was at runlevel 5, but was
otherwise quiescent.
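The 2-3 microsecond resolution figure was determined empirically.
A minimal sketch of one way to estimate it (an illustration, not the
exact program we used) is to call gettimeofday() in a tight loop and
record the smallest non-zero difference between successive readings:

/*
 * Sketch: estimate gettimeofday() resolution by finding the smallest
 * non-zero step between consecutive readings.
 */
#include <stdio.h>
#include <sys/time.h>

int main(void)
{
	struct timeval prev, cur;
	long min_step = 1000000;	/* start at one full second */
	int i;

	gettimeofday(&prev, NULL);
	for (i = 0; i < 1000000; i++) {
		long step;

		gettimeofday(&cur, NULL);
		step = (cur.tv_sec - prev.tv_sec) * 1000000L
		     + (cur.tv_usec - prev.tv_usec);
		if (step > 0 && step < min_step)
			min_step = step;
		prev = cur;
	}
	printf("smallest observed step: %ld microseconds\n", min_step);
	return 0;
}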
To measure the effects of multithreading on heap accesses,
we'll compare the results of running this program in a single
process with the results of two processes running the program
concurrently on a dual processor, and of one process running
the test in two threads on a dual processor.
This tells us several things:
-
How well thread scheduling compares with process scheduling
-
How well more than one thread in a process can utilize multiple CPUs
-
How well malloc() scales with multiple threads
accessing the same library and heaps
-
How heavyweight thread mutexes are
We expect that, if a
malloc() implementation is
efficient, the two-thread run will perform as well as
the two-process run.
If it's not efficient, the two-process run may perform well,
but the two-thread run will perform badly.
Typically we've found that a poorly performing implementation
will use a significant amount of kernel time with a high
context switch count as a result of contention for mutexes
protecting the heap and other shared resources.
We are also interested in the behavior of malloc()
and the system on which it's running as we increase the number
of threads past the number of physical CPUs present in the system.
It is conjectured that the most efficient way to run heavily
loaded servers is to keep the ratio of busy threads to physical CPUs
as close to 1:1 as possible.
We'd like to know what the penalty is as the ratio increases.
For each test, the benchmark makes 10 million balanced malloc()
and free() requests.
It does this because:
-
Increasing the sample size increases the statistical significance
of the average results.
-
Running the test over a longer time allows elapsed time measurements
with greater precision, since short timings are hard to measure precisely.
-
Start-up costs (e.g. library initialization) are amortized
over a larger number of requests, and thus disappear into the noise.
Specific tests and results
Series One compares the performance of two threads sharing the same
C library with the performance of two threads using their own
separate instances of the C library.
As discussed above, we hope to find out if sharing a C library (and
thus "sharing" the heap) scales as well as using separate instances
of the C library.
On our host, the threaded test did almost as well as the process
test, losing only about 12% in elapsed time (roughly 26.05 versus
23.31 seconds; see Appendix C).
We therefore expect
malloc() to scale well as the number
of threads sharing the same C library increases.
For Series Two, we'll be examining the behavior of malloc()
as we increase the number of working threads past the number of physical
CPUs in the system.
We used the same hardware configuration and operating system version
for this series as for Series One above.
The results are summarized in the following graph:
Figure one. Series Two benchmark results
So far, so good: the average elapsed time is increasing linearly
with the number of threads, with a constant slope of 1/N * M,
where N is the number of processors (N = 2 in our case)
and M is the number of seconds for a single thread run
(23 seconds in our case).
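As a rough worked check against the Series Two data in Appendix C:
once the number of threads T reaches the number of CPUs, this model
predicts an elapsed time of about (T / N) * M. For T = 4, N = 2, and
M = 23 seconds, that gives (4 / 2) * 23 = 46 seconds, against a
measured four-thread average of about 48.5 seconds; the few extra
seconds are scheduling and locking overhead.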
Now, for fun, let's try something a little more intense.
In Series Three, we measure the linearity of the relationship
we discovered in Series Two over a much greater number of
threads.
This will tell us how the library will scale with increasing
thread count.
The results are summarized in the following graph:
Figure two. Series Three benchmark results
As the graph shows, the increase in elapsed time is fairly linear
with increasing thread count, for counts much larger than the
number of configured physical CPUs on the system.
Conclusion and Future Work
We are happy to report that, so far,
the
malloc() implementation used in glibc 2.0
has effectively handled increasing numbers of threads while
adding little overhead, even for a comparatively large number of
threads.
We found expected performance curves as offered load increased.
In future work,
we'd like to compare this implementation of malloc()
with a commercial vendor's implementation
(for instance, Solaris 2.6 or Solaris 7) in terms of its multithread
capabilities.
Several quick tries on single and dual processor Ultra 2s
show that the Solaris pthread implementation serializes
all of the threads created by the malloc-test program.
This makes all results we've obtained on Solaris valueless.
Gloger, Wilson, Lea, and others have already spent considerable
effort optimizing the basic allocation algorithms.
Thus, we believe that another paper measuring these would be a waste of time.
However, a close examination of the performance relationship between
the C library's memory allocator and OS primitives such as mutexes
and sbrk() could have some interesting pay-off.
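As a starting point, one could time those primitives in isolation, in
the same style as the benchmark in Appendix B. The sketch below (an
illustrative follow-on test, not one run for this report) measures an
uncontended pthread mutex lock/unlock pair:

/*
 * Sketch: time an uncontended pthread mutex lock/unlock pair.
 */
#include <stdio.h>
#include <pthread.h>
#include <sys/time.h>

#define ITERATIONS 10000000

int main(void)
{
	pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
	struct timeval start, end;
	long usec;
	int i;

	gettimeofday(&start, NULL);
	for (i = 0; i < ITERATIONS; i++) {
		pthread_mutex_lock(&m);
		pthread_mutex_unlock(&m);
	}
	gettimeofday(&end, NULL);

	usec = (end.tv_sec - start.tv_sec) * 1000000L
	     + (end.tv_usec - start.tv_usec);
	printf("%d lock/unlock pairs in %ld usec (%.3f usec each)\n",
	       ITERATIONS, usec, (double) usec / ITERATIONS);
	return 0;
}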
We should also study the effects of allocating in one thread while
freeing the same area in another.
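One way to build such a test, sketched below with arbitrary queue
depth, request count, and block size, is to hand each allocated block
from a producer thread to a consumer thread through a small
mutex-protected queue, so that every block is freed by a different
thread than the one that allocated it:

/*
 * Sketch: one thread allocates, another frees, to expose any
 * cross-thread cost in the allocator.  Hypothetical test, not part
 * of the benchmark in Appendix B.
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define QUEUE_SIZE 1024
#define REQUESTS   1000000
#define BLOCK_SIZE 512

static void *queue[QUEUE_SIZE];
static int head, tail, count;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t not_empty = PTHREAD_COND_INITIALIZER;
static pthread_cond_t not_full = PTHREAD_COND_INITIALIZER;

static void *producer(void *arg)
{
	int i;

	for (i = 0; i < REQUESTS; i++) {
		void *buf = malloc(BLOCK_SIZE);

		pthread_mutex_lock(&lock);
		while (count == QUEUE_SIZE)
			pthread_cond_wait(&not_full, &lock);
		queue[tail] = buf;
		tail = (tail + 1) % QUEUE_SIZE;
		count++;
		pthread_cond_signal(&not_empty);
		pthread_mutex_unlock(&lock);
	}
	return NULL;
}

static void *consumer(void *arg)
{
	int i;

	for (i = 0; i < REQUESTS; i++) {
		void *buf;

		pthread_mutex_lock(&lock);
		while (count == 0)
			pthread_cond_wait(&not_empty, &lock);
		buf = queue[head];
		head = (head + 1) % QUEUE_SIZE;
		count--;
		pthread_cond_signal(&not_full);
		pthread_mutex_unlock(&lock);

		free(buf);	/* freed by a different thread */
	}
	return NULL;
}

int main(void)
{
	pthread_t p, c;

	pthread_create(&p, NULL, producer, NULL);
	pthread_create(&c, NULL, consumer, NULL);
	pthread_join(p, NULL);
	pthread_join(c, NULL);
	printf("%d blocks allocated in one thread, freed in another\n",
	       REQUESTS);
	return 0;
}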
Finally, memory allocator performance might benefit from some special
knowledge about level-2 cache and main memory characteristics.
Page coloring or alignment refinements could help heap storage
behave in cache-friendly ways.
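As one small illustration of the alignment idea, glibc's memalign()
already lets an application place buffers on cache-line boundaries;
whether doing something similar inside the allocator pays off is
exactly the open question. The 32-byte line size below is an
assumption for Pentium Pro-class level-2 caches:

/*
 * Sketch: cache-line-aligned allocation with glibc's memalign().
 * The 32-byte line size is an assumption for Pentium Pro-class CPUs.
 */
#include <stdio.h>
#include <stdlib.h>
#include <malloc.h>

#define CACHE_LINE 32

int main(void)
{
	/*
	 * The buffer starts on its own cache line, so a structure
	 * placed here never straddles a line boundary unnecessarily
	 * and does not share its first line with an unrelated object.
	 */
	void *buf = memalign(CACHE_LINE, 512);
	if (buf == NULL) {
		fprintf(stderr, "memalign failed\n");
		return 1;
	}
	printf("buffer at %p (aligned to %d bytes)\n", buf, CACHE_LINE);
	free(buf);
	return 0;
}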
Appendix A: Bibliography
-
W. Gloger, "Dynamic memory allocator implementations in
Linux system libraries",
http://www.dent.med.uni-muenchen.de/~wmglo/malloc-slides.html.
The ptmalloc implementation can be found at
ftp://ftp.dent.med.uni-muenchen.de/pub/wmglo/ptmalloc.tar.gz.
-
K. Lai, M. Baker,
"A Performance Comparison of UNIX Operating Systems on the Pentium,"
Proceedings of USENIX Technical Conference, January 1996.
-
D. Lea, "A Memory Allocator,"
unix/mail, December 1996.
See also
http://g.oswego.edu/dl/html/malloc.html.
-
X. Leroy is the author of the Linuxthreads implementation, which
can be found at
http://pauillac.inria.fr/~xleroy/linuxthreads/.
-
C. Staelin, L. McVoy, "lmbench: Portable tools for performance analysis,"
Proceedings of USENIX Technical Conference, pp. 279-284, January 1996.
-
S. Walton, "Linux Threads Frequently Asked Questions (FAQ),"
http://metalab.unc.edu/pub/Linux/docs/faqs/Threads-FAQ/html/,
January, 1997.
-
P. Wilson, M. Johnstone, M. Neely and D. Boles,
"Dynamic Storage Allocation: A Survey and Critical Review,"
Proc. 1995 Int'l Workshop on Memory Management, Springer LNCS, 1995.
See also
ftp://ftp.cs.utexas.edu/pub/garbage/allocsrv.ps.
-
B. Zorn, "Malloc/Free and GC Implementations,"
http://www.cs.colorado.edu/homes/zorn/public_html/Malloc.html, March 1998.
-
B. Zorn and D. Grunwald,
"Empirical measurements of six allocation-intensive C programs,"
Tech. Rep. CU-CS-604-92, University of Colorado at Boulder,
Dept. of CS, July 1992.
Appendix B: Benchmark code
/*
 * malloc-test
 * cel - Thu Jan 7 15:49:16 EST 1999
 *
 * Benchmark libc's malloc, and check how well it
 * can handle malloc requests from multiple threads.
 *
 * Syntax:
 * malloc-test [ size [ iterations [ thread count ]]]
 *
 */

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <unistd.h>
#include <pthread.h>

#define USECSPERSEC 1000000
#define pthread_attr_default NULL
#define MAX_THREADS 50

void *run_test(void *unused);
void *dummy(unsigned i);

static unsigned size = 512;
static unsigned iteration_count = 1000000;

int main(int argc, char *argv[])
{
	unsigned i;
	unsigned thread_count = 1;
	pthread_t thread[MAX_THREADS];

	/*
	 * Parse our arguments
	 */
	switch (argc) {
	case 4:
		/* size, iteration count, and thread count were specified */
		thread_count = atoi(argv[3]);
		if (thread_count > MAX_THREADS)
			thread_count = MAX_THREADS;
		/* fall through */
	case 3:
		/* size and iteration count were specified; others default */
		iteration_count = atoi(argv[2]);
		/* fall through */
	case 2:
		/* size was specified; others default */
		size = atoi(argv[1]);
		/* fall through */
	case 1:
		/* use default values */
		break;
	default:
		printf("Unrecognized arguments.\n");
		exit(1);
	}

	/*
	 * Invoke the tests
	 */
	printf("Starting test...\n");
	for (i = 0; i < thread_count; i++)
		if (pthread_create(&thread[i], pthread_attr_default,
				   run_test, NULL))
			printf("failed.\n");

	/*
	 * Wait for tests to finish
	 */
	for (i = 0; i < thread_count; i++)
		pthread_join(thread[i], NULL);

	exit(0);
}

void *run_test(void *unused)
{
	register unsigned int i;
	register unsigned request_size = size;
	register unsigned total_iterations = iteration_count;
	struct timeval start, end, null, elapsed, adjusted;

	/*
	 * Time a null loop.  We'll subtract this from the final
	 * malloc loop results to get a more accurate value.
	 */
	gettimeofday(&start, NULL);

	for (i = 0; i < total_iterations; i++) {
		register void *buf;

		buf = dummy(i);
		buf = dummy(i);
	}

	gettimeofday(&end, NULL);

	null.tv_sec = end.tv_sec - start.tv_sec;
	null.tv_usec = end.tv_usec - start.tv_usec;
	if (null.tv_usec < 0) {
		null.tv_sec--;
		null.tv_usec += USECSPERSEC;
	}

	/*
	 * Run the real malloc test
	 */
	gettimeofday(&start, NULL);

	for (i = 0; i < total_iterations; i++) {
		register void *buf;

		buf = malloc(request_size);
		free(buf);
	}

	gettimeofday(&end, NULL);

	elapsed.tv_sec = end.tv_sec - start.tv_sec;
	elapsed.tv_usec = end.tv_usec - start.tv_usec;
	if (elapsed.tv_usec < 0) {
		elapsed.tv_sec--;
		elapsed.tv_usec += USECSPERSEC;
	}

	/*
	 * Adjust elapsed time by null loop time
	 */
	adjusted.tv_sec = elapsed.tv_sec - null.tv_sec;
	adjusted.tv_usec = elapsed.tv_usec - null.tv_usec;
	if (adjusted.tv_usec < 0) {
		adjusted.tv_sec--;
		adjusted.tv_usec += USECSPERSEC;
	}

	printf("Thread %lu adjusted timing: %ld.%06ld seconds for %u requests"
	       " of %u bytes.\n", (unsigned long) pthread_self(),
	       (long) adjusted.tv_sec, (long) adjusted.tv_usec,
	       total_iterations, request_size);

	pthread_exit(NULL);
}

void *dummy(unsigned i)
{
	return NULL;
}
Appendix C: Raw Benchmark results
Series One
Each table reports elapsed real seconds.
Single process, 10,000,000 requests of 512 bytes each
Test | seconds |
1 | 23.278426 |
2 | 23.276260 |
3 | 23.289882 |
4 | 23.277055 |
5 | 23.280160 |
Average | 23.280357, s=0.005543 |
Two processes, 10,000,000 requests of 512 bytes each
Test | process 1, seconds | process 2, seconds |
1 | 23.316156 | 23.310087 |
2 | 23.292935 | 23.330363 |
3 | 23.319814 | 23.302842 |
Average | 23.309635, s=0.014586 | 23.314431, s=0.014267 |
Two threads, 10,000,000 requests of 512 bytes each
Test | thread 1, seconds | thread 2, seconds |
1 | 26.037229 | 26.061006 |
2 | 26.054782 | 26.058438 |
3 | 26.029145 | 26.070780 |
Average | 26.040385, s=0.013097 | 26.063408, s=0.006530 |
During this test, top showed that both threads were using
between 98% and 99.9% of both CPUs.
System (kernel) time was between 0.7% and 1.1% total.
Series Two
One thread, 10,000,000 requests of 8192 bytes
Test | elapsed seconds |
1 | 23.129978 |
2 | 23.124513 |
3 | 23.123466 |
4 | 23.137457 |
5 | 23.132442 |
Average: 23.129571, s=0.005778 |
Two threads, 10,000,000 requests of 8192 bytes each
Test | thread 1, seconds | thread 2, seconds |
1 | 25.831591 | 25.850179 |
2 | 25.825378 | 25.827982 |
3 | 25.784374 | 25.829218 |
4 | 25.786692 | 25.807974 |
5 | 25.786227 | 25.808034 |
Average: 25.813765, s=0.022708 |
Three threads, 10,000,000 requests of 8192 bytes each
Test | thread 1, seconds | thread 2, seconds | thread 3, seconds |
1 | 36.718330 | 36.813648 | 37.948649 |
2 | 36.741720 | 37.003884 | 37.804076 |
3 | 36.737925 | 36.810934 | 38.019140 |
4 | 36.614051 | 36.580804 | 37.983132 |
5 | 36.626645 | 36.655177 | 37.966717 |
Average: 37.134989, s=0.602440 |
Four threads, 10,000,000 requests of 8192 bytes each
Test | thread 1, seconds | thread 2, seconds | thread 3, seconds | thread 4, seconds |
1 | 47.885661 | 47.817898 | 48.158845 | 49.863222 |
2 | 47.799143 | 48.086551 | 48.209226 | 49.873930 |
3 | 47.653690 | 47.636704 | 48.018065 | 49.970975 |
4 | 48.115246 | 48.132953 | 48.316587 | 50.064879 |
5 | 48.149170 | 48.315691 | 48.423492 | 49.648395 |
Average: 48.507016, s=0.844653 |
Five threads, 10,000,000 requests of 8192 bytes each
Test | thread 1, seconds | thread 2, seconds | thread 3, seconds | thread 4, seconds | thread 5, seconds |
1 | 59.294351 | 59.681399 | 59.638248 | 60.130041 | 61.130962 |
2 | 59.101132 | 59.284428 | 59.527498 | 59.654372 | 61.693474 |
3 | 59.441826 | 59.897381 | 59.666740 | 59.945119 | 61.501687 |
4 | 59.222743 | 59.477798 | 59.603993 | 59.459370 | 61.546250 |
5 | 59.619357 | 60.110973 | 59.642299 | 59.868556 | 61.076883 |
Average: 59.968675, s=0.774664 |
Series Three
These tables contain the elapsed time measurements (in seconds)
for each thread in a single run.
Sixteen threads, 10,000,000 requests of 4100 bytes each
186.847741 | 190.101531 |
189.110768 | 192.143251 |
192.820830 | 193.165536 |
194.002894 | 191.507111 |
190.628776 | 192.121089 |
196.755606 | 193.220287 |
192.280556 | 193.925397 |
195.803901 | 193.118177 |
Average: 192.347091, s=2.427385 |
Twenty-four threads, 10,000,000 requests of 4100 bytes each
305.479192 | 306.739658 |
315.626822 | 308.897943 |
314.975798 | 309.208889 |
312.627709 | 314.094961 |
324.449772 | 321.174608 |
317.025200 | 317.345414 |
323.886614 | 316.348474 |
325.910042 | 325.329843 |
317.232358 | 331.110539 |
331.088659 | 326.364597 |
330.608197 | 327.970896 |
322.656336 | 324.981732 |
Average: 319.630594, s=7.768588 |
Thirty-two threads, 10,000,000 requests of 4100 bytes each
378.857499 | 390.409060 |
396.637062 | 396.073810 |
400.157624 | 405.399191 |
406.583160 | 406.142377 |
391.563927 | 395.862463 |
406.918681 | 409.891841 |
400.812766 | 393.261308 |
409.043509 | 401.297264 |
401.300762 | 408.055787 |
407.696035 | 413.025337 |
403.367946 | 416.398790 |
400.039313 | 413.406837 |
407.783664 | 398.993050 |
411.955450 | 407.682566 |
417.766406 | 410.327972 |
397.859388 | 402.840581 |
Average: 403.356607, s=8.294892 |
Forty threads, 10,000,000 requests of 4100 bytes each
543.494076 | 552.833343 |
550.587209 | 552.684517 |
567.072301 | 573.329704 |
552.832942 | 570.176946 |
551.715868 | 559.759884 |
554.505711 | 573.372131 |
586.304418 | 565.601347 |
558.429182 | 569.034294 |
574.561367 | 581.152444 |
575.698860 | 576.706009 |
565.202699 | 582.473824 |
588.982964 | 590.090940 |
569.908917 | 573.056255 |
591.941499 | 575.474809 |
572.731687 | 595.023077 |
572.121306 | 582.524946 |
570.483694 | 597.779803 |
593.384067 | 572.900727 |
590.233710 | 581.866072 |
598.763704 | 591.009196 |
Average: 573.645162, s=14.495399 |
Forty-eight threads, 10,000,000 requests of 4100 bytes each
607.424984 | 626.862756 |
627.084647 | 610.601752 |
610.647375 | 621.802457 |
628.814327 | 623.344699 |
636.146533 | 639.063616 |
622.256981 | 628.183262 |
626.409378 | 644.212525 |
639.892038 | 646.027143 |
624.227660 | 642.779141 |
646.969559 | 655.019415 |
646.873835 | 653.604088 |
644.001442 | 632.586474 |
647.710626 | 633.816685 |
662.139883 | 647.110876 |
661.555068 | 644.822349 |
672.967624 | 670.518977 |
664.599045 | 643.017935 |
648.041360 | 644.519162 |
677.222297 | 645.295884 |
654.493092 | 678.516325 |
665.871288 | 666.093872 |
657.123928 | 684.276825 |
657.243944 | 664.540376 |
648.818057 | 681.661745 |
Average: 645.975278, s=18.994390 |
This document was written as part of the Linux Scalability Project.
For more information, see
our home page.
If you have comments or suggestions, email
linux-scalability@citi.umich.edu