summaryrefslogtreecommitdiffstats
path: root/fs/fs-writeback.c
Commit message (Collapse)AuthorAgeFilesLines
* writeback: pass in super_block to bdi_start_writeback()Jens Axboe2009-09-261-2/+4
| | | | | | | | | | | Sometimes we only want to write pages from a specific super_block, so allow that to be passed in. This fixes a problem with commit 56a131dcf7ed36c3c6e36bea448b674ea85ed5bb causing writeback on all super_blocks on a bdi, where we only really want to sync a specific sb from writeback_inodes_sb(). Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* writeback: writeback_inodes_sb() should use bdi_start_writeback()Jens Axboe2009-09-251-1/+1
| | | | | | | Pointless to iterate other devices looking for a super, when we have a bdi mapping. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* writeback: don't delay inodes redirtied by a fast dirtierWu Fengguang2009-09-251-3/+9
| | | | | | | | | | | | | | | | | | | | | | | | | Debug traces show that in per-bdi writeback, the inode under writeback almost always get redirtied by a busy dirtier. We used to call redirty_tail() in this case, which could delay inode for up to 30s. This is unacceptable because it now happens so frequently for plain cp/dd, that the accumulated delays could make writeback of big files very slow. So let's distinguish between data redirty and metadata only redirty. The first one is caused by a busy dirtier, while the latter one could happen in XFS, NFS, etc. when they are doing delalloc or updating isize. The inode being busy dirtied will now be requeued for next io, while the inode being redirtied by fs will continue to be delayed to avoid repeated IO. CC: Jan Kara <jack@suse.cz> CC: Theodore Ts'o <tytso@mit.edu> CC: Dave Chinner <david@fromorbit.com> CC: Chris Mason <chris.mason@oracle.com> CC: Christoph Hellwig <hch@infradead.org> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* writeback: make the super_block pinning more efficientJens Axboe2009-09-251-17/+29
| | | | | | | | | | Currently we pin the inode->i_sb for every single inode. This increases cache traffic on sb->s_umount sem. Lets instead cache the inode sb pin state and keep the super_block pinned for as long as keep writing out inodes from the same super_block. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* writeback: don't resort for a single super_block in move_expired_inodes()Jens Axboe2009-09-251-1/+11
| | | | | | | If we only moved inodes from a single super_block to the temporary list, there's no point in doing a resort for multiple super_blocks. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* writeback: move inodes from one super_block togetherShaohua Li2009-09-251-3/+18
| | | | | | | | | | | | | | | | | | | __mark_inode_dirty adds inode to wb dirty list in random order. If a disk has several partitions, writeback might keep spindle moving between partitions. To reduce the move, better write big chunk of one partition and then move to another. Inodes from one fs usually are in one partion, so idealy move indoes from one fs together should reduce spindle move. This patch tries to address this. Before per-bdi writeback is added, the behavior is write indoes from one fs first and then another, so the patch restores previous behavior. The loop in the patch is a bit ugly, should we add a dirty list for each superblock in bdi_writeback? Test in a two partition disk with attached fio script shows about 3% ~ 6% improvement. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Reviewed-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* writeback: get rid to incorrect references to pdflush in commentsJens Axboe2009-09-251-4/+1
| | | | Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* writeback: improve readability of the wb_writeback() continue/break logicJens Axboe2009-09-251-20/+23
| | | | | | | And throw some comments in there, too. Reviewed-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* writeback: cleanup writeback_single_inode()Wu Fengguang2009-09-251-8/+7
| | | | | | | | | | | Make the if-else straight in writeback_single_inode(). No behavior change. Cc: Jan Kara <jack@suse.cz> Cc: Michael Rubin <mrubin@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* writeback: kupdate writeback shall not stop when more io is possibleWu Fengguang2009-09-251-2/+2
| | | | | | | | | | | | | | | | | Fix the kupdate case, which disregards wbc.more_io and stop writeback prematurely even when there are more inodes to be synced. wbc.more_io should always be respected. Also remove the pages_skipped check. It will set when some page(s) of some inode(s) cannot be written for now. Such inodes will be delayed for a while. This variable has nothing to do with whether there are other writeable inodes. CC: Jan Kara <jack@suse.cz> CC: Dave Chinner <david@fromorbit.com> CC: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* writeback: stop background writeback when below background thresholdWu Fengguang2009-09-251-11/+17
| | | | | | | | | | | | | | | Treat bdi_start_writeback(0) as a special request to do background write, and stop such work when we are below the background dirty threshold. Also simplify the (nr_pages <= 0) checks. Since we already pass in nr_pages=LONG_MAX for WB_SYNC_ALL and background writes, we don't need to worry about it being decreased to zero. Reported-by: Richard Kennedy <richard@rsk.demon.co.uk> CC: Jan Kara <jack@suse.cz> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* fs: Fix busyloop in wb_writeback()Jan Kara2009-09-251-1/+18
| | | | | | | | | | | | If all inodes are under writeback (e.g. in case when there's only one inode with dirty pages), wb_writeback() with WB_SYNC_NONE work basically degrades to busylooping until I_SYNC flags of the inode is cleared. Fix the problem by waiting on I_SYNC flags of an inode on b_more_io list in case we failed to write anything. Tested-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* writeback: fix possible bdi writeback refcounting problemNick Piggin2009-09-161-7/+6
| | | | | | | | | | | | | | | wb_clear_pending AFAIKS should not be called after the item has been put on the list, except by the worker threads. It could lead to the situation where the refcount is decremented below 0 and cause lots of problems. Presumably the !wb_has_dirty_io case is not a common one, so it can be discovered when the thread wakes up to check? Also add a comment in bdi_work_clear. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* writeback: Fix bdi use after free in wb_work_complete()Nick Piggin2009-09-161-2/+3
| | | | | | | | | | By the time bdi_work_on_stack gets evaluated again in bdi_work_free, it can already have been deallocated and used for something else in the !on stack case, giving a false positive in this test and causing corruption. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* writeback: improve scalability of bdi writeback work queuesNick Piggin2009-09-161-1/+2
| | | | | | | | | If you're going to do an atomic RMW on each list entry, there's not much point in all the RCU complexities of the list walking. This is only going to help the multi-thread case I guess, but it doesn't hurt to do now. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* writeback: remove smp_mb(), it's not needed with list_add_tail_rcu()Nick Piggin2009-09-161-3/+3
| | | | | | | list_add_tail_rcu contains required barriers. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* writeback: use schedule_timeout_interruptible()Jens Axboe2009-09-161-2/+1
| | | | | | Gets rid of a manual set_current_state(). Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* writeback: add comments to bdi_work structureJens Axboe2009-09-161-7/+11
| | | | | | | And document its retriever, get_next_work_item(). Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* writeback: separate starting of sync vs opportunistic writebackJens Axboe2009-09-161-65/+67
| | | | | | | | | | | | | bdi_start_writeback() is currently split into two paths, one for WB_SYNC_NONE and one for WB_SYNC_ALL. Add bdi_sync_writeback() for WB_SYNC_ALL writeback and let bdi_start_writeback() handle only WB_SYNC_NONE. Push down the writeback_control allocation and only accept the parameters that make sense for each function. This cleans up the API considerably. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* writeback: inline allocation failure handling in bdi_alloc_queue_work()Jens Axboe2009-09-161-22/+27
| | | | | | | | This gets rid of work == NULL in bdi_queue_work() and puts the OOM handling where it belongs. Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* writeback: use RCU to protect bdi_listJens Axboe2009-09-161-3/+3
| | | | | | | | Now that bdi_writeback_all() no longer handles integrity writeback, it doesn't have to block anymore. This means that we can switch bdi_list reader side protection to RCU. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* writeback: only use bdi_writeback_all() for WB_SYNC_NONE writeoutJens Axboe2009-09-161-56/+14
| | | | | | | | Data integrity writeback must use bdi_start_writeback() and ensure that wbc->sb and wbc->bdi are set. Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* writeback: make wb_writeback() take an argument structureJens Axboe2009-09-161-29/+49
| | | | | | | | | | | We need to be able to pass in range_cyclic as well, so instead of growing yet another argument, split the arguments into a struct wb_writeback_args structure that we can use internally. Also makes it easier to just copy all members to an on-stack struct, since we can't access work after clearing the pending bit. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* writeback: merely wakeup flusher thread if work allocation fails for ↵Christoph Hellwig2009-09-161-32/+14
| | | | | | | | | | | | WB_SYNC_NONE Since it's an opportunistic writeback and not a data integrity action, don't punt to blocking writeback. Just wakeup the thread and it will flush old data. Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* vfs: Remove generic_osync_inode() and sync_page_range{_nolock}()Jan Kara2009-09-141-54/+0
| | | | | | Remove these three functions since nobody uses them anymore. Signed-off-by: Jan Kara <jack@suse.cz>
* writeback: check for registered bdi in flusher add and inode dirtyJens Axboe2009-09-111-0/+8
| | | | | | | Also a debugging aid. We want to catch dirty inodes being added to backing devices that don't do writeback. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* writeback: get rid of pdflush completelyJens Axboe2009-09-111-0/+5
| | | | | | It is now unused, so kill it off. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* writeback: switch to per-bdi threads for flushing dataJens Axboe2009-09-111-289/+710
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This gets rid of pdflush for bdi writeout and kupdated style cleaning. pdflush writeout suffers from lack of locality and also requires more threads to handle the same workload, since it has to work in a non-blocking fashion against each queue. This also introduces lumpy behaviour and potential request starvation, since pdflush can be starved for queue access if others are accessing it. A sample ffsb workload that does random writes to files is about 8% faster here on a simple SATA drive during the benchmark phase. File layout also seems a LOT more smooth in vmstat: r b swpd free buff cache si so bi bo in cs us sy id wa 0 1 0 608848 2652 375372 0 0 0 71024 604 24 1 10 48 42 0 1 0 549644 2712 433736 0 0 0 60692 505 27 1 8 48 44 1 0 0 476928 2784 505192 0 0 4 29540 553 24 0 9 53 37 0 1 0 457972 2808 524008 0 0 0 54876 331 16 0 4 38 58 0 1 0 366128 2928 614284 0 0 4 92168 710 58 0 13 53 34 0 1 0 295092 3000 684140 0 0 0 62924 572 23 0 9 53 37 0 1 0 236592 3064 741704 0 0 4 58256 523 17 0 8 48 44 0 1 0 165608 3132 811464 0 0 0 57460 560 21 0 8 54 38 0 1 0 102952 3200 873164 0 0 4 74748 540 29 1 10 48 41 0 1 0 48604 3252 926472 0 0 0 53248 469 29 0 7 47 45 where vanilla tends to fluctuate a lot in the creation phase: r b swpd free buff cache si so bi bo in cs us sy id wa 1 1 0 678716 5792 303380 0 0 0 74064 565 50 1 11 52 36 1 0 0 662488 5864 319396 0 0 4 352 302 329 0 2 47 51 0 1 0 599312 5924 381468 0 0 0 78164 516 55 0 9 51 40 0 1 0 519952 6008 459516 0 0 4 78156 622 56 1 11 52 37 1 1 0 436640 6092 541632 0 0 0 82244 622 54 0 11 48 41 0 1 0 436640 6092 541660 0 0 0 8 152 39 0 0 51 49 0 1 0 332224 6200 644252 0 0 4 102800 728 46 1 13 49 36 1 0 0 274492 6260 701056 0 0 4 12328 459 49 0 7 50 43 0 1 0 211220 6324 763356 0 0 0 106940 515 37 1 10 51 39 1 0 0 160412 6376 813468 0 0 0 8224 415 43 0 6 49 45 1 1 0 85980 6452 886556 0 0 4 113516 575 39 1 11 54 34 0 2 0 85968 6452 886620 0 0 0 1640 158 211 0 0 46 54 A 10 disk test with btrfs performs 26% faster with per-bdi flushing. A SSD based writeback test on XFS performs over 20% better as well, with the throughput being very stable around 1GB/sec, where pdflush only manages 750MB/sec and fluctuates wildly while doing so. Random buffered writes to many files behave a lot better as well, as does random mmap'ed writes. A separate thread is added to sync the super blocks. In the long term, adding sync_supers_bdi() functionality could get rid of this thread again. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* writeback: move dirty inodes from super_block to backing_dev_infoJens Axboe2009-09-111-70/+127
| | | | | | | | | This is a first step at introducing per-bdi flusher threads. We should have no change in behaviour, although sb_has_dirty_inodes() is now ridiculously expensive, as there's no easy way to answer that question. Not a huge problem, since it'll be deleted in subsequent patches. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* writeback: get rid of generic_sync_sb_inodes() exportJens Axboe2009-09-111-28/+42
| | | | | | | | | | | | This adds two new exported functions: - writeback_inodes_sb(), which only attempts to writeback dirty inodes on this super_block, for WB_SYNC_NONE writeout. - sync_inodes_sb(), which writes out all dirty inodes on this super_block and also waits for the IO to complete. Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* cleanup __writeback_single_inodeChristoph Hellwig2009-06-241-50/+50
| | | | | | | | | | | | There is no reason to for the split between __writeback_single_inode and __sync_single_inode, the former just does a couple of checks before tail-calling the latter. So merge the two, and while we're at it split out the I_SYNC waiting case for data integrity writers, as it's logically separate function. Finally rename __writeback_single_inode to writeback_single_inode. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* writeback: skip new or to-be-freed inodesWu Fengguang2009-06-161-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 1) I_FREEING tests should be coupled with I_CLEAR The two I_FREEING tests are racy because clear_inode() can set i_state to I_CLEAR between the clear of I_SYNC and the test of I_FREEING. 2) skip I_WILL_FREE inodes in generic_sync_sb_inodes() to avoid possible races with generic_forget_inode() generic_forget_inode() sets I_WILL_FREE call writeback on its own, so generic_sync_sb_inodes() shall not try to step in and create possible races: generic_forget_inode inode->i_state |= I_WILL_FREE; spin_unlock(&inode_lock); generic_sync_sb_inodes() spin_lock(&inode_lock); __iget(inode); __writeback_single_inode // see non zero i_count may WARN here ==> WARN_ON(inode->i_state & I_WILL_FREE); spin_unlock(&inode_lock); may call generic_forget_inode again ==> iput(inode); The above race and warning didn't turn up because writeback_inodes() holds the s_umount lock, so generic_forget_inode() finds MS_ACTIVE and returns early. But we are not sure the UBIFS calls and future callers will guarantee that. So skip I_WILL_FREE inodes for the sake of safety. Cc: Eric Sandeen <sandeen@sandeen.net> Acked-by: Jeff Layton <jlayton@redhat.com> Cc: Masayoshi MIZUMA <m.mizuma@jp.fujitsu.com> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Artem Bityutskiy <dedekind1@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Acked-by: Jan Kara <jack@suse.cz> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* fs: block_dump missing dentry lockingNick Piggin2009-06-111-17/+24
| | | | | | | | | I think the block_dump output in __mark_inode_dirty is missing dentry locking. Surely the i_dentry list can change any time, so we may not even *get* a dentry there. If we do get one by chance, then it would appear to be able to go away or get renamed at any time... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* fs: remove incorrect I_NEW warningsNick Piggin2009-06-111-2/+0
| | | | | | | | | | | | | | | | Some filesystems can call in to sync an inode that is still in the I_NEW state (eg. ext family, when mounted with -osync). This is OK because the filesystem has sole access to the new inode, so it can modify i_state without races (because no other thread should be modifying it, by definition of I_NEW). Ie. a false positive, so remove the warnings. The races are described here 7ef0d7377cb287e08f3ae94cebc919448e1f5dff, which is also where the warnings were introduced. Reported-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* vfs: Make sys_sync() use fsync_super() (version 4)Jan Kara2009-06-111-49/+0
| | | | | | | | | | | | | | | | | | It is unnecessarily fragile to have two places (fsync_super() and do_sync()) doing data integrity sync of the filesystem. Alter __fsync_super() to accommodate needs of both callers and use it. So after this patch __fsync_super() is the only place where we gather all the calls needed to properly send all data on a filesystem to disk. Nice bonus is that we get a complete livelock avoidance and write_supers() is now only used for periodic writeback of superblocks. sync_blockdevs() introduced a couple of patches ago is gone now. [build fixes folded] Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* Merge branch 'for-linus' of ↵Linus Torvalds2009-04-031-1/+1
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (28 commits) trivial: Update my email address trivial: NULL noise: drivers/mtd/tests/mtd_*test.c trivial: NULL noise: drivers/media/dvb/frontends/drx397xD_fw.h trivial: Fix misspelling of "Celsius". trivial: remove unused variable 'path' in alloc_file() trivial: fix a pdlfush -> pdflush typo in comment trivial: jbd header comment typo fix for JBD_PARANOID_IOFAIL trivial: wusb: Storage class should be before const qualifier trivial: drivers/char/bsr.c: Storage class should be before const qualifier trivial: h8300: Storage class should be before const qualifier trivial: fix where cgroup documentation is not correctly referred to trivial: Give the right path in Documentation example trivial: MTD: remove EOL from MODULE_DESCRIPTION trivial: Fix typo in bio_split()'s documentation trivial: PWM: fix of #endif comment trivial: fix typos/grammar errors in Kconfig texts trivial: Fix misspelling of firmware trivial: cgroups: documentation typo and spelling corrections trivial: Update contact info for Jochen Hein trivial: fix typo "resgister" -> "register" ...
| * trivial: fix a pdlfush -> pdflush typo in commentMasatake YAMATO2009-03-301-1/+1
| | | | | | | | | | Signed-off-by: Masatake YAMATO <yamato@redhat.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
* | writeback: guard against jiffies wraparound on inode->dirtied_when checks ↵Jeff Layton2009-04-021-4/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | (try #3) The dirtied_when value on an inode is supposed to represent the first time that an inode has one of its pages dirtied. This value is in units of jiffies. It's used in several places in the writeback code to determine when to write out an inode. The problem is that these checks assume that dirtied_when is updated periodically. If an inode is continuously being used for I/O it can be persistently marked as dirty and will continue to age. Once the time compared to is greater than or equal to half the maximum of the jiffies type, the logic of the time_*() macros inverts and the opposite of what is needed is returned. On 32-bit architectures that's just under 25 days (assuming HZ == 1000). As the least-recently dirtied inode, it'll end up being the first one that pdflush will try to write out. sync_sb_inodes does this check: /* Was this inode dirtied after sync_sb_inodes was called? */ if (time_after(inode->dirtied_when, start)) break; ...but now dirtied_when appears to be in the future. sync_sb_inodes bails out without attempting to write any dirty inodes. When this occurs, pdflush will stop writing out inodes for this superblock. Nothing can unwedge it until jiffies moves out of the problematic window. This patch fixes this problem by changing the checks against dirtied_when to also check whether it appears to be in the future. If it does, then we consider the value to be far in the past. This should shrink the problematic window of time to such a small period (30s) as not to matter. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Acked-by: Ian Kent <raven@themaw.net> Cc: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | vfs: skip I_CLEAR state inodesWu Fengguang2009-04-021-1/+2
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | clear_inode() will switch inode state from I_FREEING to I_CLEAR, and do so _outside_ of inode_lock. So any I_FREEING testing is incomplete without a coupled testing of I_CLEAR. So add I_CLEAR tests to drop_pagecache_sb(), generic_sync_sb_inodes() and add_dquot_ref(). Masayoshi MIZUMA discovered the bug in drop_pagecache_sb() and Jan Kara reminds fixing the other two cases. Masayoshi MIZUMA has a nice panic flow: ===================================================================== [process A] | [process B] | | | prune_icache() | drop_pagecache() | spin_lock(&inode_lock) | drop_pagecache_sb() | inode->i_state |= I_FREEING; | | | spin_unlock(&inode_lock) | V | | | spin_lock(&inode_lock) | V | | | dispose_list() | | | list_del() | | | clear_inode() | | | inode->i_state = I_CLEAR | | | | | V | | | if (inode->i_state & (I_FREEING|I_WILL_FREE)) | | | continue; <==== NOT MATCH | | | | | | (DANGER from here on! Accessing disposing inode!) | | | | | | __iget() | | | list_move() <===== PANIC on poisoned list !! V V | (time) ===================================================================== Reported-by: Masayoshi MIZUMA <m.mizuma@jp.fujitsu.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* fs: new inode i_state corruption fixNick Piggin2009-03-121-1/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There was a report of a data corruption http://lkml.org/lkml/2008/11/14/121. There is a script included to reproduce the problem. During testing, I encountered a number of strange things with ext3, so I tried ext2 to attempt to reduce complexity of the problem. I found that fsstress would quickly hang in wait_on_inode, waiting for I_LOCK to be cleared, even though instrumentation showed that unlock_new_inode had already been called for that inode. This points to memory scribble, or synchronisation problme. i_state of I_NEW inodes is not protected by inode_lock because other processes are not supposed to touch them until I_LOCK (and I_NEW) is cleared. Adding WARN_ON(inode->i_state & I_NEW) to sites where we modify i_state revealed that generic_sync_sb_inodes is picking up new inodes from the inode lists and passing them to __writeback_single_inode without waiting for I_NEW. Subsequently modifying i_state causes corruption. In my case it would look like this: CPU0 CPU1 unlock_new_inode() __sync_single_inode() reg <- inode->i_state reg -> reg & ~(I_LOCK|I_NEW) reg <- inode->i_state reg -> inode->i_state reg -> reg | I_SYNC reg -> inode->i_state Non-atomic RMW on CPU1 overwrites CPU0 store and sets I_LOCK|I_NEW again. Fix for this is rather than wait for I_NEW inodes, just skip over them: inodes concurrently being created are not subject to data integrity operations, and should not significantly contribute to dirty memory either. After this change, I'm unable to reproduce any of the added warnings or hangs after ~1hour of running. Previously, the new warnings would start immediately and hang would happen in under 5 minutes. I'm also testing on ext3 now, and so far no problems there either. I don't know whether this fixes the problem reported above, but it fixes a real problem for me. Cc: "Jorge Boncompte [DTI2]" <jorge@dti2.net> Reported-by: Adrian Hunter <ext-adrian.hunter@nokia.com> Cc: Jan Kara <jack@suse.cz> Cc: <stable@kernel.org> Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* fs: sys_sync fixNick Piggin2009-01-061-19/+1
| | | | | | | | | | | | | | s_syncing livelock avoidance was breaking data integrity guarantee of sys_sync, by allowing sys_sync to skip writing or waiting for superblocks if there is a concurrent sys_sync happening. This livelock avoidance is much less important now that we don't have the get_super_to_sync() call after every sb that we sync. This was replaced by __put_super_and_need_restart. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* fs: sync_sb_inodes fixNick Piggin2009-01-061-7/+53
| | | | | | | | | | Fix data integrity semantics required by sys_sync, by iterating over all inodes and waiting for any writeback pages after the initial writeout. Comments explain the exact problem. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* fs: remove WB_SYNC_HOLDNick Piggin2009-01-061-10/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | Remove WB_SYNC_HOLD. The primary motiviation is the design of my anti-starvation code for fsync. It requires taking an inode lock over the sync operation, so we could run into lock ordering problems with multiple inodes. It is possible to take a single global lock to solve the ordering problem, but then that would prevent a future nice implementation of "sync multiple inodes" based on lock order via inode address. Seems like a backward step to remove this, but actually it is busted anyway: we can't use the inode lists for data integrity wait: an inode can be taken off the dirty lists but still be under writeback. In order to satisfy data integrity semantics, we should wait for it to finish writeback, but if we only search the dirty lists, we'll miss it. It would be possible to have a "writeback" list, for sys_sync, I suppose. But why complicate things by prematurely optimise? For unmounting, we could avoid the "livelock avoidance" code, which would be easier, but again premature IMO. Fixing the existing data integrity problem will come next. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Remove Andrew Morton's old email accountsFrancois Cami2008-10-161-1/+1
| | | | | | | | | People can use the real name an an index into MAINTAINERS to find the current email address. Signed-off-by: Francois Cami <francois.cami@free.fr> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* VFS: export sync_sb_inodesArtem Bityutskiy2008-07-141-2/+9
| | | | | | | | | | This patch exports the 'sync_sb_inodes()' which is needed for UBIFS because it has to force write-back from time to time. Namely, the UBIFS budgeting subsystem forces write-back when its pessimistic callculations show that there is no free space on the media. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
* VFS: move inode_lock into sync_sb_inodesHans Reiser2008-07-141-8/+3
| | | | | | | | | | This patch makes 'sync_sb_inodes()' lock 'inode_lock', rather than expect that the caller will do this. This change was previously done by Hans Reiser <reiser@namesys.com> and sat in the -mm tree. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
* fs/fs-writeback.c: make 2 functions staticAdrian Bunk2008-04-291-39/+39
| | | | | | | | | | | Make the following needlessly global functions static: - writeback_acquire() - writeback_release() Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* fs: fix kernel-doc notation warningsRandy Dunlap2008-03-191-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix kernel-doc notation warnings in fs/. Warning(mmotm-2008-0314-1449//fs/super.c:560): missing initial short description on line: * mark_files_ro Warning(mmotm-2008-0314-1449//fs/locks.c:1277): missing initial short description on line: * lease_get_mtime Warning(mmotm-2008-0314-1449//fs/locks.c:1277): missing initial short description on line: * lease_get_mtime Warning(mmotm-2008-0314-1449//fs/namei.c:1368): missing initial short description on line: * lookup_one_len: filesystem helper to lookup single pathname component Warning(mmotm-2008-0314-1449//fs/buffer.c:3221): missing initial short description on line: * bh_uptodate_or_lock: Test whether the buffer is uptodate Warning(mmotm-2008-0314-1449//fs/buffer.c:3240): missing initial short description on line: * bh_submit_read: Submit a locked buffer for reading Warning(mmotm-2008-0314-1449//fs/fs-writeback.c:30): missing initial short description on line: * writeback_acquire: attempt to get exclusive writeback access to a device Warning(mmotm-2008-0314-1449//fs/fs-writeback.c:47): missing initial short description on line: * writeback_in_progress: determine whether there is writeback in progress Warning(mmotm-2008-0314-1449//fs/fs-writeback.c:58): missing initial short description on line: * writeback_release: relinquish exclusive writeback access against a device. Warning(mmotm-2008-0314-1449//include/linux/jbd.h:351): contents before sections Warning(mmotm-2008-0314-1449//include/linux/jbd.h:561): contents before sections Warning(mmotm-2008-0314-1449//fs/jbd/transaction.c:1935): missing initial short description on line: * void journal_invalidatepage() Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* write_inode_now(): avoid unnecessary synchronous writeMike Galbraith2008-02-081-1/+1
| | | | | | | | | | We shouldn't use WB_SYNC_ALL if the caller is asking for asynchronous treatment. Signed-off-by: Mike Galbraith <efault@gmx.de> Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* fs: use list_for_each_entry_reverse and kill sb_entryAkinobu Mita2008-02-061-5/+2
| | | | | | | | | Use list_for_each_entry_reverse for super_blocks list and remove unused sb_entry macro. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>