Linux Kernel Hash Table Behavior
Chuck Lever, Netscape Communications Corp.

Abstract

Hash tables are used in the Linux kernel for storing many high-usage data objects, such as pages, buffers, inodes, and others. The author found a significant performance boost with careful analysis and tuning of these data structures.

This document is Copyright © 1999 Netscape Communications Corp., all rights reserved. Trademarked material referenced in this document is copyright by its respective owner.
Why worry about the kernel's hash tables, you may ask? Linux performance depends on the efficiency and scalability of these tables. On a small machine with only 32M of physical RAM, a page cache hash table with 2048 buckets is probably enough to hold all the pages that could be hashed, in chains of less than 3. However, this hash table couldn't possibly hold all the pages on a large machine with, say, 512M of physical RAM, while maintaining short chains to keep lookup times quick. In fact, as we shall see, tuning the kernel's hash tables does result in significant changes in system performance.
Hash tables depend on good average behavior in order to perform well. This average behavior relies on the actual input data more often than we like to admit, especially if simple shift-add hash functions are used. Therefore, statistical examination of specific hash functions used, in combination with specific real world data, can reveal surprising behavior, and can expose opportunities for performance improvement.
It is also important to understand why hash tables are used in preference to a more sophisticated data structure, such as a tree. Insertion into and deletion from a hash table is O(1) if the hashed objects are simply maintained in LIFO order in each bucket. A tree insertion or deletion is O(log(n)). Hash table lookup operations are often O(n/m) (where n is the number of objects in the table and m is the number of buckets), which is close to O(1), especially when the hash function has spread the hashed objects evenly through the hash table, and there are more hash buckets than objects to be stored. Finally, if we are careful about our hash table design, we can keep the average lookup time for both successful and unsuccessful lookups low (i.e. less than O(log(n)) ) by using a large hash table and a hash function that does a good job randomizing the key.
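To make the cost argument concrete, here is a minimal user-space sketch of the structure being described: an array of singly linked chains with LIFO insertion. This is an illustration, not kernel code (the names and the placeholder hash function are ours); the kernel's own tables chain objects on doubly linked lists so that removal is O(1) as well.

#include <stddef.h>

#define NR_BUCKETS 2048   /* m: the number of buckets, a power of two */

struct obj {
	unsigned long key;
	struct obj *next;   /* chain of objects hashing to the same bucket */
};

static struct obj *table[NR_BUCKETS];

/* Placeholder hash; a real one mixes the key's bits first. */
static unsigned long hashfn(unsigned long key)
{
	return key & (NR_BUCKETS - 1);
}

/* O(1): push the new object onto the front of its chain (LIFO order). */
static void insert_obj(struct obj *p)
{
	struct obj **head = &table[hashfn(p->key)];

	p->next = *head;
	*head = p;
}

/* O(n/m) on average: only one chain is searched, never the whole table. */
static struct obj *find_obj(unsigned long key)
{
	struct obj *p;

	for (p = table[hashfn(key)]; p != NULL; p = p->next)
		if (p->key == key)
			return p;
	return NULL;
}

With more buckets than objects and a hash function that spreads keys evenly, most chains hold zero or one object, which is exactly what the bucket size histograms below measure.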
In this report, we analyze several critical hash tables in the Linux kernel, and describe minor tuning changes that can improve Linux performance by a significant margin.
As with several of our earlier performance studies, we are using the SPEC SDM benchmark suite to drive our systems. SDM emulates a multi-tasking software development workload. It employs a fixed script of commands that are often used by software developers, such as cc, ed, nroff, and spell. Offered load is adjusted by varying the number of concurrent instances of the script that are running during the benchmark. The throughput values generated by the benchmark are in units of "scripts per hour": each value is calculated by measuring the elapsed time for a given benchmark run, then dividing the number of concurrent scripts by that elapsed time, expressed in hours (for example, 64 scripts completing in two minutes would score 1920 scripts per hour). The elapsed time is measured down to hundredths of a second.
We tested on two different hardware bases: a dual Pentium Pro system with two disks, and a four-way 512M Dell PowerEdge.
These machines were running the Red Hat 5.2 distribution with a 2.2.5 Linux kernel built with egcs 1.1.1, and glibc 2.0 (as installed with Red Hat).
The dual Pentium Pro machine workloads varied from 16 to 64 concurrent scripts. The 16 script workload fits entirely in RAM and is CPU bound. The 64 script workload does not fit into RAM, and so is bound by swap and file I/O. The four-way system ran up to 128 scripts before exhausting the system file descriptor limit (the plain 2.2.5 kernels used in this report do not yet contain large fdset support). The entire 128 script benchmark fits easily into that machine's 512M of physical memory, so this workload is designed to show how well the hash tables scale on large-memory systems when they are unconstrained by I/O bottlenecks.
On the dual Pentium Pro, both disks were used for benchmark data and swap partitions. The swap partitions were of equal priority and size. The benchmark data was stored on file systems mounted with the "noatime" and "nosync" options for best performance. Likewise, on the four-way, a single swap partition and a single benchmark file system were used, with the file system mounted with the "noatime" and "nosync" options.
Hash performance depends directly on the laws of probability, so we are quite interested in the statistical behavior of the hash. First, we use special kernel instrumentation to generate a bucket size distribution histogram for each hash table. This tells us things such as how large the largest bucket grows, how many buckets sit empty, what fraction of the hashed objects end up in oversized buckets, and how many iterations an average lookup requires.
Finally, we are interested in how long it takes to compute the hash function itself. We estimate this value from a table of memory and CPU cycle times, an estimate of the function's memory footprint and access rate, and a guess at how well the CPU will schedule the instructions that compute the hash.
We will provide some tuning recommendations based on maximizing system performance on all machine sizes. Therefore, we will take care to use expensive hash functions and oversized hash tables only as a last resort.
The page cache hash function in 2.2.5 looks like this:

#define PAGE_HASH_BITS 11
#define PAGE_HASH_SIZE (1 << PAGE_HASH_BITS)

static inline unsigned long _page_hashfn(struct inode * inode, unsigned long offset)
{
#define i (((unsigned long) inode)/(sizeof(struct inode) & ~ (sizeof(struct inode) - 1)))
#define o (offset >> PAGE_SHIFT)
#define s(x) ((x)+((x)>>PAGE_HASH_BITS))
	return s(i+o) & (PAGE_HASH_SIZE-1);
#undef i
#undef o
#undef s
}

This is a simple, fast shift-add hash function, and it is surprisingly effective, thanks to the pre-existing randomness of the inode address and offset arguments. Our tests also revealed that the bucket size distribution remains good as PAGE_HASH_BITS is varied from 11 to 16.
Stephen Tweedie suggested that adding the offset, unshifted, before computing "s()" would mitigate bucket size distribution spikes caused when hashing swap cache pages (which use a non-page-aligned offset). Normally, the offset value is page-aligned, but when the page cache is doubling as the swap cache, the offset can contain important index-randomizing information in its lower bits. Our tests showed that adding the unshifted offset did indeed reduce the spikiness of the bucket size distribution, but at a slight, measurable, across-the-board performance cost.
The following tables show relative throughput results for kernels built with minor hash table tuning modifications. The "reference" kernel is a plain 2.2.5 kernel with a 4000 entry process table. The "13-bit", "14-bit", and "15-bit" kernels are plain 2.2.5 kernels with a 4000 entry process table and a 13, 14, or 15-bit (8192, 16384, or 32768 bucket) page cache hash table, respectively. The "offset" kernel is just like the "14-bit" kernel, but its page cache hash function looks like this:
return s(i+o+offset) & (PAGE_HASH_SIZE-1);

The "mult" kernel is the same as the "14-bit" kernel, but with a multiplicative hash function instead of the plain additive one:
return ((((unsigned long)inode + offset) * 2654435761UL) >>
        (32 - PAGE_HASH_BITS)) & (PAGE_HASH_SIZE-1);

See the section below on multiplicative hashing for more on how we derived this function.
Finally, the "rbtree" kernel was derived from a clean 2.2.5 kernel with a special patch applied. This patch was extracted from Andrea Archangeli's 2.2.5-arca10, which implements the page cache with per-inode red-black trees instead of a hash table. A red-black tree is a form of balanced binary tree.
We ran each workload seven times, and took the results from the middle five runs. The results in Table 1 are averages and standard deviations (reported as "s=") over those middle five benchmark runs for each workload. The timing result is the total elapsed time for all seven runs for that kernel, including the two runs that were excluded from the averages. Each set of runs for a given kernel was benchmarked on a freshly rebooted system. These tests were run on our dual Pentium Pro using 16, 32, 48, and 64 concurrent script workloads to show how performance changed between CPU bound and I/O bound workloads. We also wanted to push the machine into swap to see how performance changes when the page cache is used as a swap cache.
kernel | table size (buckets) | 16 scripts | 32 scripts | 48 scripts | 64 scripts | total elapsed |
reference | 2048 | 1864.7 s=3.77 | 1800.8 s=8.51 | 1739.9 s=3.61 | 1644.6 s=29.35 | 50 min 25 sec |
13-bit | 8192 | 1875.8 s=5.59 | 1834.0 s=3.71 | 1765.5 s=3.01 | 1683.3 s=17.39 | 49 min 43 sec |
14-bit | 16384 | 1877.2 s=5.35 | 1830.8 s=3.81 | 1770.5 s=3.84 | 1694.3 s=41.42 | 49 min 35 sec |
15-bit | 32768 | 1875.4 s=10.72 | 1832.4 s=3.97 | 1770.3 s=3.97 | 1691.2 s=20.05 | 49 min 36 sec |
offset | 16384 | 1880.0 s=2.78 | 1843.7 s=14.65 | 1774.5 s=4.30 | 1685.4 s=33.46 | 49 min 40 sec |
mult | 16384 | 1876.4 s=6.45 | 1836.8 s=6.45 | 1773.7 s=5.20 | 1691.7 s=25.32 | 49 min 29 sec |
rbtree | N/A | 1874.9 s=6.57 | 1817.0 s=5.59 | 1755.3 s=3.01 | 1670.8 s=17.26 | 50 min 3 sec |
According to kernel profiling results, defining PAGE_HASH_BITS as 13 bits is enough to take find_page() out of the top kernel CPU users during most heavy VM loads on large-memory machines. However, increasing it further can help reduce the real elapsed time required for an average lookup, thus improving system performance even more. As one might expect, increasing the hash table size had little effect on smaller workloads. To show the effects of increased table size on a high-end machine, we ran 128 script benchmarks on our four-way 512M Dell PowerEdge. The kernels used in this test are otherwise unchanged reference kernels compiled with 4000 process slots. The results are averages of five runs on each kernel.
table size, in buckets | average throughput | average throughput, minus first run | maximum throughput | elapsed time |
2048 | 4282.8 s=29.96 | 4295.2 s=11.10 | 4313.0 | 12 min 57 sec |
8192 | 4387.3 s=23.10 | 4398.5 s=5.88 | 4407.5 | 12 min 40 sec |
32768 | 4405.3 s=5.59 | 4407.4 s=4.14 | 4413.8 | 12 min 49 sec |
The reduction in inter-run variance is significant for larger memory machines. It is also clear that overall performance continues to improve for tables larger than 8192 buckets, although not to the same degree as it does when going from 2048 to 8192 buckets.
The "rbtree" kernel performed better than the "reference" kernel. It also scored very well in inter-run variance. A big advantage of this implementation is that it is more space efficient, especially on small machines, since it doesn't require many contiguous pages for the hash table. The "offset" kernel was predicted to perform better when the system was swapping, but appears to perform worse than both the "mult" and the "14 bit" kernel on the heaviest workload. Finally, the "mult" kernel appears to have the smoothest overall results, and the shortest overall elapsed time by one second.
The biggest gain occurs when the hash table size is increased. Because of the overall goodness of the existing hash function, the only improvement necessary for the page cache hash is to increase the number of buckets in the hash table. This will have performance benefits for machines of all memory sizes, because as hash table size increases, more pages will be hashed into buckets that only contain a single page, thereby decreasing average lookup time.
Increasing the page cache hash table's bucket count even further will continue to improve performance, especially for large memory machines. However, for use on generic hardware, 13 bits accounts for 8 pages worth of hash table (2^13 buckets times 4 bytes per bucket pointer is 32K, or eight 4K pages), which is probably the practical upper limit for small memory machines.
The buffer cache hash function in 2.2.5 is:

#define _hashfn(dev,block) (((unsigned)(HASHDEV(dev)^block)) & bh_hash_mask)

This function adds no randomness to either argument; it simply xors them together and truncates the result. Histogram output tells the whole story:
Apr 27 17:17:51 pillbox kernel: Buffer cache total lookups: 296481 (hit rate: 54%)
Apr 27 17:17:51 pillbox kernel: hash table size is 16384 buckets
Apr 27 17:17:51 pillbox kernel: hash table contains 37256 objects
Apr 27 17:17:51 pillbox kernel: largest bucket contains 116 buffers
Apr 27 17:17:51 pillbox kernel: find_buffer() iterations/lookup: 2155/1000
Apr 27 17:17:51 pillbox kernel: hash table histogram:
Apr 27 17:17:51 pillbox kernel:  size  buckets  buffers  sum-pct
Apr 27 17:17:51 pillbox kernel:     0    12047        0        0
Apr 27 17:17:51 pillbox kernel:     1     1037     1037        2
Apr 27 17:17:51 pillbox kernel:     2      381      762        4
Apr 27 17:17:51 pillbox kernel:     3      295      885        7
Apr 27 17:17:51 pillbox kernel:     4      325     1300       10
Apr 27 17:17:51 pillbox kernel:     5      399     1995       16
Apr 27 17:17:51 pillbox kernel:     6      188     1128       19
Apr 27 17:17:51 pillbox kernel:     7      303     2121       24
Apr 27 17:17:51 pillbox kernel:     8      160     1280       28
Apr 27 17:17:51 pillbox kernel:     9      169     1521       32
Apr 27 17:17:51 pillbox kernel:    10      224     2240       38
Apr 27 17:17:51 pillbox kernel:    11       64      704       40
Apr 27 17:17:51 pillbox kernel:    12       49      588       41
Apr 27 17:17:51 pillbox kernel:    13       15      195       42
Apr 27 17:17:51 pillbox kernel:    14        3       42       42
Apr 27 17:17:51 pillbox kernel:    15        4       60       42
Apr 27 17:17:51 pillbox kernel:   >15      721    21398      100
The average bucket size for the 37,000+ buffers stored in this 16384 bucket table should be about 3 (that is, O(n/m)). Instead, the largest bucket contains 116 buffers, almost two orders of magnitude more than the expected average, even though the hash table is less than ((16384 - 12047) / 16384) = 27% utilized. At one point during the benchmark, the author observed buckets containing more than 340 buffers. As well, over half the buffers in the cache are stored in buckets that already hold more than 15 buffers. (A small user-space demonstration of why block numbers cluster this way under the existing hash follows the next histogram.) Even after the benchmark is over, most of the buffers reside in large buckets:
Apr 27 17:30:49 pillbox kernel: Buffer cache total lookups: 3548568 (hit rate: 78%)
Apr 27 17:30:49 pillbox kernel: hash table size is 16384 buckets
Apr 27 17:30:49 pillbox kernel: hash table contains 2644 objects
Apr 27 17:30:49 pillbox kernel: largest bucket contains 80 buffers
Apr 27 17:30:49 pillbox kernel: find_buffer() iterations/lookup: 1379/1000
Apr 27 17:30:49 pillbox kernel: hash table histogram:
Apr 27 17:30:49 pillbox kernel:  size  buckets  buffers  sum-pct
Apr 27 17:30:49 pillbox kernel:     0    16167        0        0
Apr 27 17:30:49 pillbox kernel:     1      110      110        4
Apr 27 17:30:49 pillbox kernel:     2       10       20        4
Apr 27 17:30:49 pillbox kernel:     3        3        9        5
Apr 27 17:30:49 pillbox kernel:     4        1        4        5
Apr 27 17:30:49 pillbox kernel:     5        0        0        5
Apr 27 17:30:49 pillbox kernel:     6        3       18        6
Apr 27 17:30:49 pillbox kernel:     7        1        7        6
Apr 27 17:30:49 pillbox kernel:     8        6       48        8
Apr 27 17:30:49 pillbox kernel:     9        2       18        8
Apr 27 17:30:49 pillbox kernel:    10        1       10        9
Apr 27 17:30:49 pillbox kernel:    11        2       22       10
Apr 27 17:30:49 pillbox kernel:    12        3       36       11
Apr 27 17:30:49 pillbox kernel:    13        3       39       12
Apr 27 17:30:49 pillbox kernel:    14        3       42       14
Apr 27 17:30:49 pillbox kernel:    15        1       15       15
Apr 27 17:30:49 pillbox kernel:   >15       68     2246      100
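This clustering is easy to reproduce outside the kernel. Below is a minimal user-space sketch, not the kernel instrumentation used above: it feeds the 2.2.5-style xor hash a hypothetical but representative regularity pattern, block numbers that differ only in their high-order bits (for example, the same offset within successive block groups), and compares the result with the multiplicative hash introduced below. The device number 0x0301 and the block spacing are illustrative assumptions, not the benchmark's actual data.

#include <stdio.h>

#define NR_BUCKETS 16384
#define HASH_MASK  (NR_BUCKETS - 1)

/* 2.2.5-style buffer cache hash: xor device and block, keep the low bits. */
static unsigned int xor_hash(unsigned int dev, unsigned int block)
{
	return (dev ^ block) & HASH_MASK;
}

/* Multiplicative alternative (32-bit arithmetic, SHIFT of 11 assumed). */
static unsigned int mult_hash(unsigned int block)
{
	return ((block * 2654435761U) >> 11) & HASH_MASK;
}

int main(void)
{
	unsigned int i;

	for (i = 0; i < 8; i++) {
		/* Eight blocks that share their low-order fourteen bits. */
		unsigned int block = i * 32768 + 100;

		printf("block %7u:  xor bucket %5u   mult bucket %5u\n",
		       block, xor_hash(0x0301, block), mult_hash(block));
	}
	return 0;
}

Because the xor hash simply truncates to the table's low-order bits, every block in this pattern lands in the same bucket; the multiplicative hash scatters them across the table.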
Clearly, a better hash function is needed for the buffer cache hash table. The following table compares benchmark throughput results from the reference kernel (unmodified 2.2.5 kernel with 4000 process slots, as above) to results obtained after replacing the buffer cache hash function with several different hash functions. Here is our multiplicative hash function:
#define _hashfn(dev,block) ((((block) * 2654435761UL) >> SHIFT) & bh_hash_mask)

We tested variations of this function (the SHIFT value is either fixed at 11, or varies depending on the table size).
We also tried a shift-add hash function to see if the multiplicative hash was really best. The shift-add function comes from Peter Steiner, and uses a shift and subtract ( (block << 7) - block ) to effectively multiply by a Mersenne prime ( block * 127 ). This type of prime multiplication is easy to calculate, since we can use shifting and subtraction.
#define _hashfn(dev,block) \
	(((block << 7) - block + (block >> 10) + (block >> 18)) & bh_hash_mask)

This series of tests consists of five runs of 128 concurrent scripts on the four-way Dell PowerEdge system. We report an average result for all five runs, and an average result without the first run. The five-run average and the total elapsed time show how good or bad the first run, which warms the system caches after a reboot, can be. The four-run average indicates the steady-state operation of the buffer cache.
kernel | table size | average throughput | avg throughput, minus first run | maximum throughput | elapsed time |
reference | 32768 | 4282.8 s=29.96 | 4295.2 s=11.10 | 4313.0 | 12 min 57 sec |
mult, shift 16 | 32768 | 4369.3 s=19.35 | 4376.4 s=14.53 | 4393.2 | 12 min 45 sec |
mult, shift 11 | 32768 | 4380.8 s=12.09 | 4382.8 s=11.21 | 4394.0 | 12 min 50 sec |
shift-add | 32768 | 4388.9 s=21.90 | 4397.2 s=11.70 | 4415.5 | 12 min 31 sec |
mult, shift 11 | 16384 | 4350.5 s=99.75 | 4394.6 s=15.59 | 4417.2 | 12 min 41 sec |
mult, shift 17 | 16384 | 4343.7 s=61.17 | 4369.9 s=17.39 | 4390.2 | 12 min 46 sec |
shift-add | 16384 | 4390.2 s=22.55 | 4399.6 s=8.52 | 4408.3 | 12 min 37 sec |
mult, shift 18 | 8192 | 4328.9 s=16.61 | 4333.7 s=15.05 | 4349.6 | 12 min 41 sec |
shift-add | 8192 | 4362.5 s=13.37 | 4362.8 s=14.90 | 4382.3 | 12 min 45 sec |
On a Pentium II with 512K of level 2 cache, the shift-add hash shows a higher average throughput than the multiplicative variants. On CPUs with less pipelining, the race is somewhat closer, probably because the shift-add function, when performed serially, can take nearly as long as a multiplication. However, the shift-add function also has the lowest variance in this test, and the highest first-run throughput, making it a clear choice for the buffer cache hash function.
We also tested with smaller hash table sizes to demonstrate that buffer cache throughput can be maintained using fewer buckets. Our test results bear this out; in fact, these functions often appear to work better with fewer buckets. Reducing the size of the buffer cache hash table will save more than a dozen contiguous pages, since in the existing kernel this hash table already consumes 32 contiguous pages.
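The footprint argument is simple arithmetic: each bucket is one chain-head pointer. Here is a quick sketch of the numbers, assuming 4-byte pointers and 4K pages as on ia32:

#include <stdio.h>

int main(void)
{
	/* One 4-byte chain-head pointer per bucket, 4096-byte pages (ia32). */
	const unsigned long ptr_size = 4, page_size = 4096;
	unsigned long buckets;

	for (buckets = 8192; buckets <= 32768; buckets *= 2)
		printf("%6lu buckets -> %3lu KB = %2lu contiguous pages\n",
		       buckets, buckets * ptr_size / 1024,
		       buckets * ptr_size / page_size);
	return 0;
}

Dropping from 32768 buckets to 8192 therefore frees 24 contiguous pages, which is where the "more than a dozen" figure above comes from.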
Here is what a preferred bucket size distribution looks like. These runs were made with the mult-11 hash function and a 16384 bucket hash table. The histogram snapshots were taken at approximately the same points during the benchmark as the examples above.
Apr 27 18:14:50 pillbox kernel: Buffer cache total lookups: 287696 (hit rate: 54%)
Apr 27 18:14:50 pillbox kernel: hash table size is 16384 buckets
Apr 27 18:14:50 pillbox kernel: hash table contains 37261 objects
Apr 27 18:14:50 pillbox kernel: largest bucket contains 11 buffers
Apr 27 18:14:50 pillbox kernel: find_buffer() iterations/lookup: 242/1000
Apr 27 18:14:50 pillbox kernel: hash table histogram:
Apr 27 18:14:50 pillbox kernel:  size  buckets  buffers  sum-pct
Apr 27 18:14:50 pillbox kernel:     0     2034        0        0
Apr 27 18:14:50 pillbox kernel:     1     3317     3317        8
Apr 27 18:14:50 pillbox kernel:     2     4034     8068       30
Apr 27 18:14:50 pillbox kernel:     3     3833    11499       61
Apr 27 18:14:50 pillbox kernel:     4     2082     8328       83
Apr 27 18:14:50 pillbox kernel:     5      712     3560       93
Apr 27 18:14:50 pillbox kernel:     6      222     1332       96
Apr 27 18:14:50 pillbox kernel:     7       78      546       98
Apr 27 18:14:50 pillbox kernel:     8       46      368       99
Apr 27 18:14:50 pillbox kernel:     9       19      171       99
Apr 27 18:14:50 pillbox kernel:    10        5       50       99
Apr 27 18:14:50 pillbox kernel:    11        2       22      100
Apr 27 18:14:50 pillbox kernel:    12        0        0      100
Apr 27 18:14:50 pillbox kernel:    13        0        0      100
Apr 27 18:14:50 pillbox kernel:    14        0        0      100
Apr 27 18:14:50 pillbox kernel:    15        0        0      100
Apr 27 18:14:50 pillbox kernel:   >15        0        0      100

Histogram 3
Apr 27 18:27:19 pillbox kernel: Buffer cache total lookups: 3530977 (hit rate: 78%)
Apr 27 18:27:19 pillbox kernel: hash table size is 16384 buckets
Apr 27 18:27:19 pillbox kernel: hash table contains 2717 objects
Apr 27 18:27:19 pillbox kernel: largest bucket contains 6 buffers
Apr 27 18:27:19 pillbox kernel: find_buffer() iterations/lookup: 215/1000
Apr 27 18:27:19 pillbox kernel: hash table histogram:
Apr 27 18:27:19 pillbox kernel:  size  buckets  buffers  sum-pct
Apr 27 18:27:19 pillbox kernel:     0    14302        0        0
Apr 27 18:27:19 pillbox kernel:     1     1555     1555       57
Apr 27 18:27:19 pillbox kernel:     2      442      884       89
Apr 27 18:27:19 pillbox kernel:     3       73      219       97
Apr 27 18:27:19 pillbox kernel:     4        5       20       98
Apr 27 18:27:19 pillbox kernel:     5        3       15       99
Apr 27 18:27:19 pillbox kernel:     6        4       24      100
Apr 27 18:27:19 pillbox kernel:     7        0        0      100
Apr 27 18:27:19 pillbox kernel:     8        0        0      100
Apr 27 18:27:19 pillbox kernel:     9        0        0      100
Apr 27 18:27:19 pillbox kernel:    10        0        0      100
Apr 27 18:27:19 pillbox kernel:    11        0        0      100
Apr 27 18:27:19 pillbox kernel:    12        0        0      100
Apr 27 18:27:19 pillbox kernel:    13        0        0      100
Apr 27 18:27:19 pillbox kernel:    14        0        0      100
Apr 27 18:27:19 pillbox kernel:    15        0        0      100
Apr 27 18:27:19 pillbox kernel:   >15        0        0      100

Histogram 4
First, the largest bucket in Histogram 3 holds only 11 buffers, and find_buffer() averages fewer than 0.25 iterations per lookup. Second, in both Histogram 3 and Histogram 4, about 68% of all buffers contained in the hash table are stored in buckets holding the expected average number of buffers, or fewer; sixty-eight percent of samples within one standard deviation of the mean is just what a normal distribution predicts. And third, the number of empty buckets in Histogram 3 is only 12.4%, meaning more than 87% of all buckets in the table are in use.
The dentry cache hash function in 2.2.5 looks like this:

#define D_HASHBITS 10
#define D_HASHSIZE (1UL << D_HASHBITS)
#define D_HASHMASK (D_HASHSIZE-1)

static inline struct list_head * d_hash(struct dentry * parent, unsigned long hash)
{
	hash += (unsigned long) parent;
	hash = hash ^ (hash >> D_HASHBITS) ^ (hash >> D_HASHBITS*2);
	return dentry_hashtable + (hash & D_HASHMASK);
}

The arguments to this function are the address of the parent directory's dentry structure and a hash value obtained by a simplified CRC algorithm on the target entry's name. This appears to work fairly well, but we wanted to see if we could improve on it.
Andrea Arcangeli has suggested that shrinking the dcache slightly more aggressively would reduce the number of objects in the table enough to improve dcache hash lookup times. We tested this idea by adding a couple of lines from his 2.2.5-arca10 patch: in fs/dcache.c, in the function shrink_dcache_memory(), we replace prune_dcache(found) with:
prune_dcache(dentry_stat.nr_unused / (priority+1));

and we move the shrink_dcache_memory() call in do_try_to_free_pages() up close to the top of the loop (meaning it will be invoked more often). In the following table, we show results from several different kernels. First, results from the reference 2.2.5 kernel are repeated. Next is the "14 bit" kernel, which is like the reference kernel except that the dcache hash table is increased to 16384 buckets and the xor operations are replaced with addition when computing the hash function. The "shrink" kernel is a 2.2.5 kernel like that one, except that it shrinks the dcache more aggressively, as explained above. The "mult" kernels use a multiplicative hash function, similar to the buffer cache hash function, instead of the existing dcache hash function:
static inline struct list_head * d_hash(struct dentry * parent, unsigned long hash)
{
	hash += (unsigned long) parent;
	hash = (hash * 2654435761UL) >> SHIFT;
	return dentry_hashtable + (hash & D_HASHMASK);
}

where SHIFT is either 11 or 17. The "shrink + mult" kernels combine the effects of both multiplicative hashing and shrinking the dcache.
The following results are average results from five benchmark runs of 128 concurrent scripts on the four-way Dell PowerEdge. The timing results are the elapsed time for all five runs on each kernel.
kernel | average throughput | maximum throughput | elapsed time |
reference | 4282.8 s=29.96 | 4313.0 | 12 min 57 sec |
14 bit | 4375.2 s=25.92 | 4397.4 | 12 min 42 sec |
mult, shift 11 | 4368.7 s=62.65 | 4406.2 | 12 min 39 sec |
mult, shift 17 | 4375.9 s=10.40 | 4389.0 | 12 min 40 sec |
shrink | 4368.7 s=33.36 | 4390.7 | 12 min 40 sec |
shrink + mult 11 | 4380.4 s=13.53 | 4396.5 | 12 min 35 sec |
shrink + mult 17 | 4368.5 s=16.21 | 4383.6 | 12 min 42 sec |
Some may argue that shrinking the dcache unnecessarily might degrade performance by lowering the overall effectiveness of the cache. The author believes, however, that shrinking the cache more aggressively than plain 2.2.5 does will help, rather than hurt, overall system performance, because a smaller cache allows faster lookups. In combination with an appropriate multiplicative hash function, as in the "shrink + mult 11" kernel, elapsed time and average throughput stay high enough to make it the fastest kernel we benchmarked in this series.
Clearly increasing the hash table size is a significant performance win. Replacing the current hash function is probably not necessary, but could improve inter-run variance and provide a slight performance edge.
The inode cache hash function is a shift-add function similar to the dentry cache hash function.
#define HASH_BITS 8
#define HASH_SIZE (1UL << HASH_BITS)
#define HASH_MASK (HASH_SIZE-1)

static inline unsigned long hash(struct super_block *sb, unsigned long i_ino)
{
	unsigned long tmp = i_ino | (unsigned long) sb;
	tmp = tmp + (tmp >> HASH_BITS) + (tmp >> HASH_BITS*2);
	return tmp & HASH_MASK;
}

A histogram of this hash table implementation shows why this table is too small.
Apr 27 17:23:31 pillbox kernel: Inode cache total lookups: 189321 (hit rate: 3%)
Apr 27 17:23:31 pillbox kernel: hash table size is 256 buckets
Apr 27 17:23:31 pillbox kernel: hash table contains 9785 objects
Apr 27 17:23:31 pillbox kernel: largest bucket contains 54 inodes
Apr 27 17:23:31 pillbox kernel: find_inode() iterations/lookup: 38978/1000
Apr 27 17:23:31 pillbox kernel: hash table histogram:
Apr 27 17:23:31 pillbox kernel:  size  buckets  inodes  sum-pct
Apr 27 17:23:31 pillbox kernel:     0        0       0        0
Apr 27 17:23:31 pillbox kernel:     1        0       0        0
Apr 27 17:23:31 pillbox kernel:     2        0       0        0
Apr 27 17:23:31 pillbox kernel:     3        0       0        0
Apr 27 17:23:31 pillbox kernel:     4        0       0        0
Apr 27 17:23:31 pillbox kernel:     5        0       0        0
Apr 27 17:23:31 pillbox kernel:     6        0       0        0
Apr 27 17:23:31 pillbox kernel:     7        0       0        0
Apr 27 17:23:31 pillbox kernel:     8        0       0        0
Apr 27 17:23:31 pillbox kernel:     9        0       0        0
Apr 27 17:23:31 pillbox kernel:    10        0       0        0
Apr 27 17:23:31 pillbox kernel:    11        0       0        0
Apr 27 17:23:31 pillbox kernel:    12        0       0        0
Apr 27 17:23:31 pillbox kernel:    13        0       0        0
Apr 27 17:23:31 pillbox kernel:    14        0       0        0
Apr 27 17:23:31 pillbox kernel:    15        0       0        0
Apr 27 17:23:31 pillbox kernel:   >15      256    9785      100

The hash chains here are extremely long. As well, the hit rate shows that most lookups are unsuccessful, meaning that almost every lookup request has to traverse an entire bucket. The number of iterations per lookup is almost 40 (with 9785 inodes in only 256 buckets, the expected chain length n/m is about 38, and an unsuccessful lookup must walk the whole chain). Even though there are an order of magnitude fewer lookups in the inode cache than there are in the other caches, it is still clearly a performance bottleneck.
We ran tests on four different configurations. Our reference kernel results appear in this table for convenience. The "12 bit" kernel is the same as the reference kernel except that the inode cache hash table size has been increased to 4096 buckets. The "mult" kernel has 4096 inode cache hash table buckets as well, and uses the multiplicative hash function introduced above. The "14 bit" kernel is the same as the reference kernel except that the hash table size has been increased to 16384 buckets.
kernel | average throughput | elapsed time |
reference | 4282.8 s=29.96 | 12 min 57 sec |
12 bit | 4361.3 s=11.15 | 12 min 36 sec |
mult | 4346.0 s=20.87 | 12 min 52 sec |
14 bit | 4368.3 s=20.41 | 12 min 54 sec |
The 12 bit hash table is the clear winner. Increasing the hash table size further helps performance slightly, but also increases inter-run variance to such an extent that total elapsed time was longer than for the "12 bit" kernel. Adding multiplicative hashing doesn't help much here because the table is already fairly full and well-balanced.
We selected optimizations among the best results shown above, then tried them in combination. We found that there were some performance relationships among the various caches, so we show the results for the best combinations that we tried.
kernel | average throughput | maximum throughput | elapsed time |
reference | 4300.7 s=15.73 | 4321.1 | 26 min 41 sec |
kernel A | 4582.9 s=12.55 | 4592.8 | 25 min 24 sec |
kernel B | 4577.9 s=16.22 | 4602.0 | 25 min 18 sec |
kernel C | 4596.2 s=22.30 | 4619.5 | 25 min 18 sec |
kernel D | 4591.3 s=10.98 | 4608.9 | 25 min 15 sec |
We'd like to select a combination that reduces inter-run variance and elapsed time, as well as maximizes throughput and minimizes hash table memory footprint. While kernel "C" offers the highest maximum throughput, its inter-run variance is also largest. On the other hand, kernel "D" has the second highest average throughput, the shortest elapsed time, and the best inter-run variance. This seems like a reasonable compromise. A patch against Linux 2.2.5 is included below for kernel "D".
The usual objection to multiplicative hashing is the cost of the multiplication itself. However, it turns out that several of the alternatives are just as expensive as, or even more expensive than, multiplicative hashing. Random table-driven hash functions require several table lookups, as well as several shifts, logical AND operations, and additions. This e-mail from the linux-kernel mailing list explains the problem:
Date: Thu, 15 Apr 1999 15:01:54 -0700
From: Iain McClatchie
To: Paul F. Dietz
Cc: linux-kernel@vger.rutgers.edu
Subject: Re: more on hash functions

I got a few suggestions about how to use multiple lookups with a single table. All the suggestions make the hash function itself slower, and attempt to fix an issue -- hash distribution -- that doesn't appear to be a problem.

I thought I should explain why the table lookup function is slow. A multiplication has a scheduling latency of either 5 or 9 cycles on a P6. Four memory accesses take four cycles on that same P6. So the core operations for the two hash function are actually very similar in delay, and the table lookup appears to have a slight edge.

The difference is in the overhead. A multiplicative hash, at minimum, requires the loading of a constant, a multiplication, and a shift. Egcs actually transforms some constant multiplications into a sequence of shifts and adds which may have shorter latency, but essentially, the shift (and nothing else) goes in series with the multiplication and as a result the hash function has very little latency overhead.

A table lookup hash spends quite a lot of time unpacking the bytes from the key, and furthurmore uses a load slot to unpack each byte. This makes for 8 load slots, which take 1 cycle each. Even if fully parallelized with unpacking, we end up with a fair bit of latency. Worse yet, egcs runs out of registers and ends up shifting the key value in place on the stack twice, which gobbles two load and two store slots.

Bottom line: CPUs really suck at bit-shuffling and even byte-shuffling. If there is some clever way to code the byte unpacking in the table lookup hash function, perhaps using the x86's trick register file, it might end up faster than the multiplicative hash.

-Iain

And, as it turns out, on our example 68030, shifting requires between 4 and 10 cycles, and addition operations aren't free either. If the instructions that implement the hash function are many, they will likely cause instruction cache contention that will be worse for performance than a multiplication operation. In general, a proper shift-add hash function will be almost as expensive in CPU cycles as a multiplicative hash. On a modern superscalar processor, the shifting and addition operations can happen in parallel as long as there are no address generation interlocks (AGIs). AGIs are much more likely for a table-driven hash function. An address generation interlock occurs when the results of one operation are required to form an address in a later operation that might otherwise have been parallelized by superscalar CPU hardware.
Multiplicative hash functions are often very brief. The hash functions we tried above, for example, compile to three instructions on ia32, comprising 15 bytes. Included in the 15 bytes are all the constants involved in the calculation, leaving only the key itself to be loaded as data. In other words, the whole hash function fits into a single Level 1 cache line on contemporary CPUs. The shift-add hash functions are generally more expensive: they take several cache lines to hold, require multiple loads of the key, and create register allocation contention.
The question becomes, finally, how much, in terms of CPU cycles, do you need to spend on the hash function to get a reasonable bucket size distribution? In most practical situations, a simple shift-add function suffices. However, one should always test with actual data before deciding on a hash function implementation. Hashing on block numbers, as the Linux buffer cache does, turns out to require a particularly good hash function, as disk block numbers exhibit a great deal of regularity.
We selected 2654435761 as our multiplier. It is prime, and its value divided by 2 to the 32nd power is a very good approximation of the fractional part of the golden ratio:
(sqrt(5) - 1) / 2         = 0.618033989
2654435761 / 4294967296   = 0.618033987
To obtain the best effects of this "division," we need to choose the correct shift value. This is usually the word size, in bits, minus the number of bits used to index the hash table (the base-2 logarithm of the table size). This shifts the most significant bits of the result of the "division" down to where they can be used as the hash table index, preserving the greatest effects of the golden ratio. However, sometimes experimentation reveals a better shift value for a given set of input data.
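The derivation can be sketched in a few lines of user-space C. The program below is illustrative only: it assumes a 32-bit word and the page cache's default 11-bit table, and simply prints the scaled golden ratio next to the nearby prime we actually use.

#include <math.h>
#include <stdio.h>

int main(void)
{
	/* Fractional part of the golden ratio, scaled to a 32-bit word. */
	double golden = (sqrt(5.0) - 1.0) / 2.0;   /* 0.6180339887... */
	double scaled = golden * 4294967296.0;     /* about 2654435769.5 */

	/* 2654435761 is a prime close to the scaled value. */
	unsigned long multiplier = 2654435761UL;

	/* Shift = word size minus the number of index bits, so the top bits
	 * of the product become the table index.  For an 11-bit (2048
	 * bucket) table this is 32 - 11 = 21. */
	unsigned int table_bits = 11;
	unsigned int shift = 32 - table_bits;

	printf("scaled golden ratio: %.1f\n", scaled);
	printf("multiplier: %lu  shift: %u\n", multiplier, shift);
	return 0;
}

As noted above, the shift this rule predicts is only a starting point; the buffer cache experiments earlier in this report did better with a shift of 11 than with the larger predicted values.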
We could also investigate the performance difference between inlining the page cache management routines (which eliminates subroutine call overhead) and leaving them as stand-alone routines (which gives them a smaller L1 cache footprint). A separate swap cache hash function might also better serve the two distinct uses of the page cache hash table.
As well, there are still open questions about why shrinking the dentry cache more aggressively can help performance. A study could focus on the cost of a dentry cache miss versus the cost of a page fault or buffer cache miss. Discovering alternative ways of triggering a dentry cache prune operation, or alternate ways of calculating the prune priority, may also be prudent.
Finally, there is still opportunity to analyze even more carefully the real keys and hash functions in use in several of the tables we've analyzed here, as well as several tables we didn't visit in this report, such as the uid and pid hash tables, and the vma data structures.