Brief: Linux 2.2.3 bforget() bug fix
Chuck Lever, Netscape Communications Corp.
chuckl@netscape.com
Andrea Arcangeli
andrea@e-mind.com
$Id: bforget.html,v 1.5 1999/11/12 20:12:54 cel Exp $
Abstract
|
Linux 2.2.x sports a redesigned VM and buffer cache system that
can self-tune based on offered system load.
This brief describes a bug in the buffer cache that was identified
and fixed by the authors.
|
This document is Copyright © 1999
Netscape Communications Corp.,
all rights reserved.
Trademarked material referenced in this document is copyright
by its respective owner.
|
Introduction
Linux is an open-source POSIX-compliant operating system that runs
on commodity hardware.
A feature of recent releases of Linux is its ability to self-tune
system parameters, such as buffer cache size, according to offered
system load.
Part of this self-tuning ability is implemented in a varying
partitioning of physical RAM between the traditional-style
buffer cache and the virtual memory system.
However, allowing the buffer cache to grow and shrink on-demand
introduces some interesting design problems, and, in this case,
a significant bug.
In this brief, we describe the bug, and how the bug was identified.
We provide performance measurements to show the significance of
the problem.
Finally, we detail the fix as we proposed it and as it was
adopted by the Linux kernel, and report on the performance
improvement.
Identifying the bug
Several developers, including the authors, had noticed that
early releases of Linux 2.2.x were experiencing a memory leak
of some kind under heavy file system and VM loads.
Symptoms included poor system performance after copying or
removing large files, sporadic "out of memory" errors, and
performance degradation on long-running jobs.
In order to stress the system and perhaps identify the cause
of the problem, the SPEC S-DET benchmark suite was used to
generate significant loads on a 4-way Dell PowerEdge 6300-450 with
512M of RAM and an 18G Ultra2 LVD SCSI hard drive.
The SPEC S-DET benchmark consists of a script that is designed
to emulate a software developer by running programs such as
cc, nroff, cpio, and ed.
The software developer workload stresses many aspects of an
operating system, including the file and VM subsystems.
Multiple concurrent instances of the script can run to simulate
reproducible and increasing amounts of system load.
The S-DET benchmark is carefully designed to provide a consistent
workload across runs, and to error-check the output of each
script to quickly catch problems.
It became quickly apparent that running S-DET with a large
number of scripts could reproduce the performance degradation
scenario quickly.
Figure 1. Consecutive runs of 128 scripts on Linux 2.2.3
[ raw data ]
Notice below how the buffer cache continues to grow without bounds during
the benchmark runs.
Eventually, the buffer cache causes the system
to begin flushing pages unnecessarily.
This pressure on the VM system results in, among other things,
pages being stolen back from the buffer cache.
Since the buffer cache doesn't have any aging or replacement policy,
any buffer can be stolen, even buffers for heavily-used data.
The system doesn't usually recover from this condition until it is rebooted.
procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
1 0 0 0 413404 74064 7940 0 0 0 759 178 1852 14 13 73
0 0 0 0 412948 74508 7940 0 0 1 79 119 1754 14 14 72
1 0 0 0 412572 74944 7940 0 0 1 89 117 1749 13 14 72
161 0 0 0 375724 75916 10572 0 0 69 88 140 1529 19 30 50
149 3 0 0 319992 97796 34716 0 0 697 144 256 793 15 85 0
116 0 0 0 255888 136664 49480 0 0 284 2184 1217 3828 37 53 10
119 4 0 0 232072 153492 46196 0 0 17 138 277 688 15 85 0
139 0 0 0 234232 161756 49000 0 0 25 408 228 638 21 79 0
...
procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
15 107 0 88 2892 388056 42528 0 0 480 81 320 10401 24 43 33
29 92 1 88 3092 387488 43832 0 0 376 116 310 13246 9 14 77
10 111 0 88 1584 387376 45524 0 0 360 94 307 13222 2 9 89
13 108 1 88 6076 385548 42692 0 0 359 224 310 12467 3 9 88
2 122 1 88 4812 383928 44592 0 0 444 67 304 10600 6 12 82
9 111 0 88 1504 381636 48740 0 0 400 143 303 10244 3 13 83
3 117 1 88 3040 380332 47396 0 0 392 122 299 7481 4 6 91
43 78 1 88 3936 380432 50056 0 0 427 180 301 10556 19 28 52
21 98 1 88 2724 382424 48456 0 0 455 55 311 12743 9 10 81
19 100 1 88 7484 379236 46612 0 0 336 84 302 11554 7 11 83
Figure 2. "vmstat" output excerpts of benchmark runs on Linux 2.2.3
Other features of this scenario include low CPU utilization and a large
number of blocked processes. The size of the buffer cache and the
size of the free memory list are continually fluctuating,
suggesting a high flow of pages into and out of the buffer cache.
After some number of unsuccessful guesses at what might be the
problem, it was noticed
that the rate at which blocks were being read was low during
benchmark runs with good performance, but increased significantly
during poorly performing runs.
This suggested that the buffer cache was somehow becoming
ineffective over time.
The elevated read rate interferes with disk write bandwidth,
since Linux prioritizes read requests over write requests, since
usually a read request is more time-critical.
The original bforget()
The authors, independently of one another, discovered that buffers
were being orphaned. In other words, many buffers were ending up
unlinked from the hash, such that they would not be found during
a
find_buffer() request.
Each time a buffer is orphaned, another buffer is allocated
for the same logical block of data, and another read operation
is requested to pull the same data in from the disk.
This also causes the buffer cache to grow in size as more and more
copies of the same data blocks appear in the cache.
A review of the source that manages buffers (linux/fs/buffer.c)
revealed that the bforget() function, used by ext2 when files are
truncated or deleted, removes buffers from the hash table,
but then it abandons them without recycling them.
/*
* bforget() is like brelse(), except it removes the buffer
* from the hash-queues (so that it won't be re-used if it's
* shared).
*/
void __bforget(struct buffer_head * buf)
{
mark_buffer_clean(buf);
clear_bit(BH_Protected, &buf->b_state);
remove_from_hash_queue(buf);
buf->b_dev = NODEV;
refile_buffer(buf);
if (!--buf->b_count)
return;
printk("VFS: forgot an in-use buffer! (count=%d)\n",
buf->b_count);
}
Figure 3. The original bforget()
In Linux 2.0,
refile_buffer() was probably responsible for ensuring
that the buffer was properly added to the free list.
However, rewrites of the VM and buffer cache subsystems in Linux
have since removed that functionality from
refile_buffer().
Proposed and final versions
The authors proposed the following replacement to correct this
problem.
void __bforget(struct buffer_head * buf)
{
clear_bit(BH_Protected, &buf->b_state);
remove_from_queues(buf);
put_last_free(buf);
if (!--buf->b_count)
return;
printk("VFS: forgot an in-use buffer! (count=%d)\n",
buf->b_count);
}
Figure 4. Proposed fix
After more analysis and testing, Linus Torvalds' final version of the
routine looks like this (from Linux 2.2.5):
/*
* bforget() is like brelse(), except it puts the buffer on the
* free list if it can.. We can NOT free the buffer if:
* - there are other users of it
* - it is locked and thus can have active IO
*/
void __bforget(struct buffer_head * buf)
{
if (buf->b_count != 1 || buffer_locked(buf)) {
__brelse(buf);
return;
}
buf->b_count = 0;
remove_from_queues(buf);
put_last_free(buf);
}
Figure 5. Final fix
The "if" construct is simply defensive programming.
Studying the current implementation of ext2, it appears unlikely
that
bforget() will be called with a buffer that has a usage
count other than one.
However, it was thought that eventually, other file systems may want
to use
bforget(), or ext2 may change how it uses
bforget(), so leaving the safety check in there would help
catch software problems that may be introduced later.
The rest
of the routine ensures that the buffer's usage count is zero
before finally inserting it into the free list.
In general, a non-zero usage count prevents try_to_free_buffers()
from releasing pages containing buffers.
If buffers with non-zero usage counts appear in the buffer free lists,
they will be skipped over by try_to_free_buffers() in favor of
buffers that may still have useful data.
Having a large number of these buffers in the free list can even
cause a severe system-wide memory shortage.
procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
1 0 0 0 414080 73704 7636 0 0 1 123 130 1759 13 14 73
0 0 0 0 413516 74140 7636 0 0 1 89 117 1745 13 15 72
0 0 0 0 413136 74576 7636 0 0 1 90 117 1749 13 15 72
159 0 0 0 372752 75372 11088 0 0 130 88 145 1379 19 40 41
134 1 0 0 332664 89228 32172 0 0 688 75 259 749 13 87 0
110 30 0 0 322096 89232 38256 0 0 79 466 337 746 22 78 0
132 0 0 0 303544 89232 47352 0 0 104 68 166 751 54 46 0
144 0 0 0 301464 89232 45012 0 0 36 285 244 777 39 61 0
139 1 0 0 305996 89232 49180 0 0 6 211 235 666 36 65 0
142 0 0 0 307616 89232 52908 0 0 15 141 161 606 31 69 0
140 0 0 0 303904 89232 49568 0 0 27 199 249 669 27 73 0
139 3 0 0 299460 89232 51920 0 0 5 317 310 924 31 69 0
139 0 0 0 307004 89232 51676 0 0 33 61 167 458 26 74 0
138 2 0 0 304716 89232 49880 0 0 5 415 287 728 34 66 0
...
138 0 0 0 286980 91856 53820 0 0 10 20 126 510 33 67 0
137 7 0 0 291312 91856 56588 0 0 8 374 293 716 31 69 0
147 0 0 0 293088 91856 58432 0 0 7 78 144 593 38 62 0
141 0 0 0 287608 91856 57548 0 0 0 290 230 667 36 64 0
142 1 0 0 306696 91856 51956 0 0 0 178 203 643 35 65 0
135 0 0 0 303112 91856 54436 0 0 1 127 144 480 27 73 0
Figure 6. "vmstat" output excerpts of benchmark runs on Linux 2.2.3
with final bforget()
The final
bforget() has corrected the
performance degradation scenario.
CPU utilization is maximized, and few processes are blocked.
Block read rate is low, and the buffer cache size expands to a
reasonable working set, then remains at a steady size.
The following graph shows that performance is flat for each
consecutive run of the 128 script benchmark.
Figure 7. Consecutive runs of 128 scripts on Linux 2.2.3 with final
bforget()
[ raw data ]
The fixed version of
bforget() is contained in Linux 2.2.5
and later.
This document was written as part of the Linux Scalability Project.
For more information, see
our home page.
If you have comments or suggestions, email
linux-scalability@umich.edu