summaryrefslogtreecommitdiffstats
path: root/fs/sync.c
Commit message (Collapse)AuthorAgeFilesLines
* introduce sys_syncfs to sync a single file systemSage Weil2011-03-211-0/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It is frequently useful to sync a single file system, instead of all mounted file systems via sync(2): - On machines with many mounts, it is not at all uncommon for some of them to hang (e.g. unresponsive NFS server). sync(2) will get stuck on those and may never get to the one you do care about (e.g., /). - Some applications write lots of data to the file system and then want to make sure it is flushed to disk. Calling fsync(2) on each file introduces unnecessary ordering constraints that result in a large amount of sub-optimal writeback/flush/commit behavior by the file system. There are currently two ways (that I know of) to sync a single super_block: - BLKFLSBUF ioctl on the block device: That also invalidates the bdev mapping, which isn't usually desirable, and doesn't work for non-block file systems. - 'mount -o remount,rw' will call sync_filesystem as an artifact of the current implemention. Relying on this little-known side effect for something like data safety sounds foolish. Both of these approaches require root privileges, which some applications do not have (nor should they need?) given that sync(2) is an unprivileged operation. This patch introduces a new system call syncfs(2) that takes an fd and syncs only the file system it references. Maybe someday we can $ sync /some/path and not get sync: ignoring all arguments The syscall is motivated by comments by Al and Christoph at the last LSF. syncfs(2) seems like an appropriate name given statfs(2). A similar ioctl was also proposed a while back, see http://marc.info/?l=linux-fsdevel&m=127970513829285&w=2 Signed-off-by: Sage Weil <sage@newdream.net> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* get rid of file_fsync()Al Viro2010-08-091-25/+0
| | | | | | Copy and simplify in the only two users remaining. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* Merge branch 'master' into for-linusJens Axboe2010-06-011-5/+3
|\ | | | | | | | | | | | | Conflicts: fs/pipe.c Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
| * drop unused dentry argument to ->fsyncChristoph Hellwig2010-05-271-5/+3
| | | | | | | | | | Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | Revert "writeback: fix WB_SYNC_NONE writeback from umount"Jens Axboe2010-06-011-1/+1
|/ | | | | | | | | | | | | | This reverts commit e913fc825dc685a444cb4c1d0f9d32f372f59861. We are investigating a hang associated with the WB_SYNC_NONE changes, so revert them for now. Conflicts: fs/fs-writeback.c mm/page-writeback.c Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
* Merge branch 'for-linus' of ↵Linus Torvalds2010-05-211-72/+14
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (69 commits) fix handling of offsets in cris eeprom.c, get rid of fake on-stack files get rid of home-grown mutex in cris eeprom.c switch ecryptfs_write() to struct inode *, kill on-stack fake files switch ecryptfs_get_locked_page() to struct inode * simplify access to ecryptfs inodes in ->readpage() and friends AFS: Don't put struct file on the stack Ban ecryptfs over ecryptfs logfs: replace inode uid,gid,mode initialization with helper function ufs: replace inode uid,gid,mode initialization with helper function udf: replace inode uid,gid,mode init with helper ubifs: replace inode uid,gid,mode initialization with helper function sysv: replace inode uid,gid,mode initialization with helper function reiserfs: replace inode uid,gid,mode initialization with helper function ramfs: replace inode uid,gid,mode initialization with helper function omfs: replace inode uid,gid,mode initialization with helper function bfs: replace inode uid,gid,mode initialization with helper function ocfs2: replace inode uid,gid,mode initialization with helper function nilfs2: replace inode uid,gid,mode initialization with helper function minix: replace inode uid,gid,mode init with helper ext4: replace inode uid,gid,mode init with helper ... Trivial conflict in fs/fs-writeback.c (mark bitfields unsigned)
| * sanitize vfs_fsync calling conventionsChristoph Hellwig2010-05-211-34/+8
| | | | | | | | | | | | | | | | | | | | | | | | Now that the last user passing a NULL file pointer is gone we can remove the redundant dentry argument and associated hacks inside vfs_fsynmc_range. The next step will be removig the dentry argument from ->fsync, but given the luck with the last round of method prototype changes I'd rather defer this until after the main merge window. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * new helper: iterate_supers()Al Viro2010-05-211-19/+6
| | | | | | | | | | | | | | ... and switch the simple "loop over superblocks and do something" loops to it. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * get rid of restarts in sync_filesystems()Al Viro2010-05-211-25/+3
| | | | | | | | | | | | | | At the same time we can kill s_need_restart and local mutex in there. __put_super() made public for a while; will be gone later. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * Leave superblocks on s_list until the endAl Viro2010-05-211-1/+4
| | | | | | | | | | | | | | | | | | | | | | We used to remove from s_list and s_instances at the same time. So let's *not* do the former and skip superblocks that have empty s_instances in the loops over s_list. The next step, of course, will be to get rid of rescan logics in those loops. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | writeback: fix WB_SYNC_NONE writeback from umountJens Axboe2010-05-171-1/+1
|/ | | | | | | | | | | | | When umount calls sync_filesystem(), we first do a WB_SYNC_NONE writeback to kick off writeback of pending dirty inodes, then follow that up with a WB_SYNC_ALL to wait for it. Since umount already holds the sb s_umount mutex, WB_SYNC_NONE ends up doing nothing and all writeback happens as WB_SYNC_ALL. This can greatly slow down umount, since WB_SYNC_ALL writeback is a data integrity operation and thus a bigger hammer than simple WB_SYNC_NONE. For barrier aware file systems it's a lot slower. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* Catch filesystems lacking s_bdiJörn Engel2010-04-251-1/+2
| | | | | | | | | | | | | | noop_backing_dev_info is used only as a flag to mark filesystems that don't have any backing store, like tmpfs, procfs, spufs, etc. Signed-off-by: Joern Engel <joern@logfs.org> Changed the BUG_ON() to a WARN_ON(). Note that adding dirty inodes to the noop_backing_dev_info is not legal and will not result in them being flushed, but we already catch this condition in __mark_inode_dirty() when checking for a registered bdi. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* include cleanup: Update gfp.h and slab.h includes to prepare for breaking ↵Tejun Heo2010-03-301-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
* quota: move code from sync_quota_sb into vfs_quota_syncChristoph Hellwig2010-03-051-7/+7
| | | | | | | | | | | | | Currenly sync_quota_sb does a lot of sync and truncate action that only applies to "VFS" style quotas and is actively harmful for the sync performance in XFS. Move it into vfs_quota_sync and add a wait parameter to ->quota_sync to tell if we need it or not. My audit of the GFS2 code says it's also not needed given the way GFS2 implements quotas, but I'd be happy if this can get a detailed review. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>
* fold do_sync_file_range into sys_sync_file_rangeChristoph Hellwig2009-12-171-36/+23
| | | | | | | | | | | | We recently go rid of all callers of do_sync_file_range as they're better served with vfs_fsync or the filemap_write_and_wait. Now that do_sync_file_range is down to a single caller fold it into it so that people don't start using it again accidentally. While at it also switch it from using __filemap_fdatawrite_range(..., WB_SYNC_ALL) to the more clear filemap_fdatawrite_range(). Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* kill wait_on_page_writeback_rangeChristoph Hellwig2009-12-101-6/+2
| | | | | | | | | All callers really want the more logical filemap_fdatawait_range interface, so convert them to use it and merge wait_on_page_writeback_range into filemap_fdatawait_range. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>
* vfs: Implement proper O_SYNC semanticsChristoph Hellwig2009-12-101-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While Linux provided an O_SYNC flag basically since day 1, it took until Linux 2.4.0-test12pre2 to actually get it implemented for filesystems, since that day we had generic_osync_around with only minor changes and the great "For now, when the user asks for O_SYNC, we'll actually give O_DSYNC" comment. This patch intends to actually give us real O_SYNC semantics in addition to the O_DSYNC semantics. After Jan's O_SYNC patches which are required before this patch it's actually surprisingly simple, we just need to figure out when to set the datasync flag to vfs_fsync_range and when not. This patch renames the existing O_SYNC flag to O_DSYNC while keeping it's numerical value to keep binary compatibility, and adds a new real O_SYNC flag. To guarantee backwards compatiblity it is defined as expanding to both the O_DSYNC and the new additional binary flag (__O_SYNC) to make sure we are backwards-compatible when compiled against the new headers. This also means that all places that don't care about the differences can just check O_DSYNC and get the right behaviour for O_SYNC, too - only places that actuall care need to check __O_SYNC in addition. Drivers and network filesystems have been updated in a fail safe way to always do the full sync magic if O_DSYNC is set. The few places setting O_SYNC for lower layers are kept that way for now to stay failsafe. We enforce that O_DSYNC is set when __O_SYNC is set early in the open path to make sure we always get these sane options. Note that parisc really screwed up their headers as they already define a O_DSYNC that has always been a no-op. We try to repair it by using it for the new O_DSYNC and redefinining O_SYNC to send both the traditional O_SYNC numerical value _and_ the O_DSYNC one. Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Grant Grundler <grundler@parisc-linux.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andreas Dilger <adilger@sun.com> Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com> Acked-by: Kyle McMartin <kyle@mcmartin.ca> Acked-by: Ulrich Drepper <drepper@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jan Kara <jack@suse.cz>
* fs/buffer.c: clean up EXPORT* macrosH Hartley Sweeten2009-09-231-0/+1
| | | | | | | | | | | | | | | According to Documentation/CodingStyle the EXPORT* macro should follow immediately after the closing function brace line. Also, mark_buffer_async_write_endio() and do_thaw_all() are not used elsewhere so they should be marked as static. In addition, file_fsync() is actually in fs/sync.c so move the EXPORT* to that file. Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* fs: Assign bdi in super_blockJens Axboe2009-09-161-1/+8
| | | | | | | | | | | We do this automatically in get_sb_bdev() from the set_bdev_super() callback. Filesystems that have their own private backing_dev_info must assign that in ->fill_super(). Note that ->s_bdi assignment is required for proper writeback! Acked-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* fsync: wait for data writeout completion before calling ->fsyncChristoph Hellwig2009-09-141-4/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currenly vfs_fsync(_range) first calls filemap_fdatawrite to write out the data, the calls into ->fsync to write out the metadata and then finally calls filemap_fdatawait to wait for the data I/O to complete. What sounds like a clever micro-optimization actually is nast trap for many filesystems. For many modern filesystems i_size or other inode information is only updated on I/O completion and we need to wait for I/O to finish before we can write out the metadata. For old fashionen filesystems that instanciate blocks during the actual write and also update the metadata at that point it opens up a large window were we could expose uninitialized blocks after a crash. While a few filesystems that need it already wait for the I/O to finish inside their ->fsync methods it is rather suboptimal as it is done under the i_mutex and also always for the whole file instead of just a part as we could do for O_SYNC handling. Here is a small audit of all fsync instances in the tree: - spufs_mfc_fsync: - ps3flash_fsync: - vol_cdev_fsync: - printer_fsync: - fb_deferred_io_fsync: - bad_file_fsync: - simple_sync_file: don't care - filesystems/drivers do't use the page cache or are purely in-memory. - simple_fsync: - file_fsync: - affs_file_fsync: - fat_file_fsync: - jfs_fsync: - ubifs_fsync: - reiserfs_dir_fsync: - reiserfs_sync_file: never touch pagecache themselves. We need to wait before if we do not want to expose stale data after an allocation. - afs_fsync: - fuse_fsync_common: do the waiting writeback itself in awkward ways, would benefit from proper semantics - block_fsync: Does a filemap_write_and_wait on the block device inode. Because we now have f_mapping that is the same inode we call it on in vfs_fsync. So just removing it and letting the VFS do the work in one go would be an improvement. - btrfs_sync_file: - cifs_fsync: - xfs_file_fsync: need the wait first and currently do it themselves. would benefit from doing it outside i_mutex. - coda_fsync: - ecryptfs_fsync: - exofs_file_fsync: - shm_fsync: only passes the fsync through to the lower layer - ext3_sync_file: doesn't seem to care, comments are confusing. - ext4_sync_file: would need the wait to work correctly for delalloc mode with late i_size updates. Otherwise the ext3 comment applies. currently implemens it's own writeback and wait in an odd way, could benefit from doing it properly. - gfs2_fsync: not needed for journaled data mode, but probably harmless there. Currently writes back data asynchronously itself. Needs some major audit. - hostfs_fsync: just calls fsync/datasync on the host FD. Without the wait before data might not even be inflight yet if we're unlucky. - hpfs_file_fsync: - ncp_fsync: no-ops. Dangerous before and after. - jffs2_fsync: just calls jffs2_flush_wbuf_gc, not sure how this relates to data. - nfs_fsync_dir: just increments stats, claims all directory operations are synchronous - nfs_file_fsync: only writes out data??? Looks very odd. - nilfs_sync_file: looks like it expects all data done, but not sure from the code - ntfs_dir_fsync: - ntfs_file_fsync: appear to do their own data writeback. Very convoluted code. - ocfs2_sync_file: does it's own data writeback, but no wait. probably needs the wait. - smb_fsync: according to a comment expects all pages written already, probably needs the wait before. This patch only changes vfs_fsync_range, removal of the wait in the methods that have it is left to the filesystem maintainers. Note that most filesystems really do need an audit for their fsync methods given the gems found in this very brief audit. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>
* vfs: Introduce new helpers for syncing after writing to O_SYNC file or ↵Jan Kara2009-09-141-7/+48
| | | | | | | | | | | | | | | | | | | | | | | | | | IS_SYNC inode Introduce new function for generic inode syncing (vfs_fsync_range) and use it from fsync() path. Introduce also new helper for syncing after a sync write (generic_write_sync) using the generic function. Use these new helpers for syncing from generic VFS functions. This makes O_SYNC writes to block devices acquire i_mutex for syncing. If we really care about this, we can make block_fsync() drop the i_mutex and reacquire it before it returns. CC: Evgeniy Polyakov <zbr@ioremap.net> CC: ocfs2-devel@oss.oracle.com CC: Joel Becker <joel.becker@oracle.com> CC: Felix Blyakher <felixb@sgi.com> CC: xfs@oss.sgi.com CC: Anton Altaparmakov <aia21@cantab.net> CC: linux-ntfs-dev@lists.sourceforge.net CC: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> CC: linux-ext4@vger.kernel.org CC: tytso@mit.edu Acked-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>
* writeback: switch to per-bdi threads for flushing dataJens Axboe2009-09-111-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This gets rid of pdflush for bdi writeout and kupdated style cleaning. pdflush writeout suffers from lack of locality and also requires more threads to handle the same workload, since it has to work in a non-blocking fashion against each queue. This also introduces lumpy behaviour and potential request starvation, since pdflush can be starved for queue access if others are accessing it. A sample ffsb workload that does random writes to files is about 8% faster here on a simple SATA drive during the benchmark phase. File layout also seems a LOT more smooth in vmstat: r b swpd free buff cache si so bi bo in cs us sy id wa 0 1 0 608848 2652 375372 0 0 0 71024 604 24 1 10 48 42 0 1 0 549644 2712 433736 0 0 0 60692 505 27 1 8 48 44 1 0 0 476928 2784 505192 0 0 4 29540 553 24 0 9 53 37 0 1 0 457972 2808 524008 0 0 0 54876 331 16 0 4 38 58 0 1 0 366128 2928 614284 0 0 4 92168 710 58 0 13 53 34 0 1 0 295092 3000 684140 0 0 0 62924 572 23 0 9 53 37 0 1 0 236592 3064 741704 0 0 4 58256 523 17 0 8 48 44 0 1 0 165608 3132 811464 0 0 0 57460 560 21 0 8 54 38 0 1 0 102952 3200 873164 0 0 4 74748 540 29 1 10 48 41 0 1 0 48604 3252 926472 0 0 0 53248 469 29 0 7 47 45 where vanilla tends to fluctuate a lot in the creation phase: r b swpd free buff cache si so bi bo in cs us sy id wa 1 1 0 678716 5792 303380 0 0 0 74064 565 50 1 11 52 36 1 0 0 662488 5864 319396 0 0 4 352 302 329 0 2 47 51 0 1 0 599312 5924 381468 0 0 0 78164 516 55 0 9 51 40 0 1 0 519952 6008 459516 0 0 4 78156 622 56 1 11 52 37 1 1 0 436640 6092 541632 0 0 0 82244 622 54 0 11 48 41 0 1 0 436640 6092 541660 0 0 0 8 152 39 0 0 51 49 0 1 0 332224 6200 644252 0 0 4 102800 728 46 1 13 49 36 1 0 0 274492 6260 701056 0 0 4 12328 459 49 0 7 50 43 0 1 0 211220 6324 763356 0 0 0 106940 515 37 1 10 51 39 1 0 0 160412 6376 813468 0 0 0 8224 415 43 0 6 49 45 1 1 0 85980 6452 886556 0 0 4 113516 575 39 1 11 54 34 0 2 0 85968 6452 886620 0 0 0 1640 158 211 0 0 46 54 A 10 disk test with btrfs performs 26% faster with per-bdi flushing. A SSD based writeback test on XFS performs over 20% better as well, with the throughput being very stable around 1GB/sec, where pdflush only manages 750MB/sec and fluctuates wildly while doing so. Random buffered writes to many files behave a lot better as well, as does random mmap'ed writes. A separate thread is added to sync the super blocks. In the long term, adding sync_supers_bdi() functionality could get rid of this thread again. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* writeback: get rid of generic_sync_sb_inodes() exportJens Axboe2009-09-111-8/+10
| | | | | | | | | | | | This adds two new exported functions: - writeback_inodes_sb(), which only attempts to writeback dirty inodes on this super_block, for WB_SYNC_NONE writeout. - sync_inodes_sb(), which writes out all dirty inodes on this super_block and also waits for the IO to complete. Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* sys_sync(): fix 16% performance regression in ffsb create_4k testZhang, Yanmin2009-07-061-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | I run many ffsb test cases on JBODs (typically 13/12 disks). Comparing with kernel 2.6.30, 2.6.31-rc1 has about 16% regression with ffsb_create_4k. The sub test case creates files continuously for 10 minitues and every file is 1MB. Bisect located below patch. 5cee5815d1564bbbd505fea86f4550f1efdb5cd0 is first bad commit commit 5cee5815d1564bbbd505fea86f4550f1efdb5cd0 Author: Jan Kara <jack@suse.cz> Date: Mon Apr 27 16:43:51 2009 +0200 vfs: Make sys_sync() use fsync_super() (version 4) It is unnecessarily fragile to have two places (fsync_super() and do_sync()) doing data integrity sync of the filesystem. Alter __fsync_super() to accommodate needs of both callers and use it. So after this patch __fsync_super() is the only place where we gather all the calls needed to properly send all data on a filesystem to disk. As a matter of fact, ffsb calls sys_sync in the end to make sure all data is flushed to disks and the flushing is counted into the result. vmstat shows ffsb is blocked when syncing for a long time. With 2.6.30, ffsb is blocked for a short time. I checked the patch and did experiments to recover the original methods. Eventually, the root cause is the patch deletes the calling to wakeup_pdflush when syncing, so only ffsb is blocked on disk I/O. wakeup_pdflush could ask pdflush to write back pages with ffsb at the same time. [akpm@linux-foundation.org: restore comment too] Signed-off-by: Zhang Yanmin <yanmin_zhang@linux.intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Al Viro <viro@zeniv.linux.org.uk> Acked-by: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* remove the call to ->write_super in __sync_filesystemChristoph Hellwig2009-06-111-2/+0
| | | | | | | | | | | | Now that all filesystems provide ->sync_fs methods we can change __sync_filesystem to only call ->sync_fs. This gives us a clear separation between periodic writeouts which are driven by ->write_super and data integrity syncs that go through ->sync_fs. (modulo file_fsync which is also going away) Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* ->write_super lock_super pushdownChristoph Hellwig2009-06-111-4/+0
| | | | | | | | | | | | | | | | | | | Push down lock_super into ->write_super instances and remove it from the caller. Following filesystem don't need ->s_lock in ->write_super and are skipped: * bfs, nilfs2 - no other uses of s_lock and have internal locks in ->write_super * ext2 - uses BKL in ext2_write_super and has internal calls without s_lock * reiserfs - no other uses of s_lock as has reiserfs_write_lock (BKL) in ->write_super * xfs - no other uses of s_lock and uses internal lock (buffer lock on superblock buffer) to serialize ->write_super. Also xfs_fs_write_super is superflous and will go away in the next merge window Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* enforce ->sync_fs is only called for rw superblockChristoph Hellwig2009-06-111-7/+16
| | | | | | | | | | | | | | Make sure a superblock really is writeable by checking MS_RDONLY under s_umount. sync_filesystems needed some re-arragement for that, but all but one sync_filesystem caller had the correct locking already so that we could add that check there. cachefiles grew s_umount locking. I've also added a WARN_ON to sync_filesystem to assert this for future callers. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* quota: Introduce writeout_quota_sb() (version 4)Jan Kara2009-06-111-1/+5
| | | | | | | | | Introduce this function which just writes all the quota structures but avoids all the syncing and cache pruning work to expose quota structures to userspace. Use this function from __sync_filesystem when wait == 0. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* quota: cleanup dquota sync functions (version 4)Christoph Hellwig2009-06-111-1/+1
| | | | | | | | | | | | | | | | | | | | Currently the VFS calls vfs_dq_sync to sync out disk quotas for a given superblock. This is a small wrapper around sync_dquots which for the case of a non-NULL superblock is a small wrapper around quota_sync_sb. Just make quota_sync_sb global (rename it to sync_quota_sb) and call it directly. Also call it directly for those cases in quota.c that have a superblock and leave sync_dquots purely an iterator over sync_quota_sb and remove it's superblock argument. To make this nicer move the check for the lack of a quota_sync method from the callers into sync_quota_sb. [folded build fix from Alexander Beregalov <a.beregalov@gmail.com>] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* vfs: Rename fsync_super() to sync_filesystem() (version 4)Jan Kara2009-06-111-6/+6
| | | | | | | | Rename the function so that it better describe what it really does. Also remove the unnecessary include of buffer_head.h. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* vfs: Move syncing code from super.c to sync.c (version 4)Jan Kara2009-06-111-0/+85
| | | | | | | | | | Move sync_filesystems(), __fsync_super(), fsync_super() from super.c to sync.c where it fits better. [build fixes folded] Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* vfs: Make sys_sync() use fsync_super() (version 4)Jan Kara2009-06-111-21/+10
| | | | | | | | | | | | | | | | | | It is unnecessarily fragile to have two places (fsync_super() and do_sync()) doing data integrity sync of the filesystem. Alter __fsync_super() to accommodate needs of both callers and use it. So after this patch __fsync_super() is the only place where we gather all the calls needed to properly send all data on a filesystem to disk. Nice bonus is that we get a complete livelock avoidance and write_supers() is now only used for periodic writeback of superblocks. sync_blockdevs() introduced a couple of patches ago is gone now. [build fixes folded] Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* vfs: Fix sys_sync() and fsync_super() reliability (version 4)Jan Kara2009-06-111-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | So far, do_sync() called: sync_inodes(0); sync_supers(); sync_filesystems(0); sync_filesystems(1); sync_inodes(1); This ordering makes it kind of hard for filesystems as sync_inodes(0) need not submit all the IO (for example it skips inodes with I_SYNC set) so e.g. forcing transaction to disk in ->sync_fs() is not really enough. Therefore sys_sync has not been completely reliable on some filesystems (ext3, ext4, reiserfs, ocfs2 and others are hit by this) when racing e.g. with background writeback. A similar problem hits also other filesystems (e.g. ext2) because of write_supers() being called before the sync_inodes(1). Change the ordering of calls in do_sync() - this requires a new function sync_blockdevs() to preserve the property that block devices are always synced after write_super() / sync_fs() call. The same issue is fixed in __fsync_super() function used on umount / remount read-only. [AV: build fixes] Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* Merge branch 'for_linus' of ↵Linus Torvalds2009-03-271-1/+1
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-quota-2.6 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-quota-2.6: (27 commits) ext2: Zero our b_size in ext2_quota_read() trivial: fix typos/grammar errors in fs/Kconfig quota: Coding style fixes quota: Remove superfluous inlines quota: Remove uppercase aliases for quota functions. nfsd: Use lowercase names of quota functions jfs: Use lowercase names of quota functions udf: Use lowercase names of quota functions ufs: Use lowercase names of quota functions reiserfs: Use lowercase names of quota functions ext4: Use lowercase names of quota functions ext3: Use lowercase names of quota functions ext2: Use lowercase names of quota functions ramfs: Remove quota call vfs: Use lowercase names of quota functions quota: Remove dqbuf_t and other cleanups quota: Remove NODQUOT macro quota: Make global quota locks cacheline aligned quota: Move quota files into separate directory ext4: quota reservation for delayed allocation ...
| * vfs: Use lowercase names of quota functionsJan Kara2009-03-261-1/+1
| | | | | | | | | | | | | | Use lowercase names of quota functions instead of old uppercase ones. Signed-off-by: Jan Kara <jack@suse.cz> CC: Alexander Viro <viro@zeniv.linux.org.uk>
* | Get rid of pdflush_operation() in emergency sync and remountJens Axboe2009-03-261-1/+13
|/ | | | | | | | | | | Opencode a cheasy approach with kevent. The idea here is that we'll add some generic delayed work infrastructure, which probably wont be based on pdflush (or maybe it will, in which case we can just add it back). This is in preparation for getting rid of pdflush completely. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* [CVE-2009-0029] System call wrappers part 09Heiko Carstens2009-01-141-3/+3
| | | | Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
* [CVE-2009-0029] System call wrapper special casesHeiko Carstens2009-01-141-4/+22
| | | | | | | | | | | | System calls with an unsigned long long argument can't be converted with the standard wrappers since that would include a cast to long, which in turn means that we would lose the upper 32 bit on 32 bit architectures. Also semctl can't use the standard wrapper since it has a 'union' parameter. So we handle them as special case and add some extra wrappers instead. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
* mm: do_sync_mapping_range integrity fixNick Piggin2009-01-061-1/+1
| | | | | | | | | | | | | | Chris Mason notices do_sync_mapping_range didn't actually ask for data integrity writeout. Unfortunately, it is advertised as being usable for data integrity operations. This is a data integrity bug. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Chris Mason <chris.mason@oracle.com> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* add a vfs_fsync helperChristoph Hellwig2009-01-051-11/+37
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Fsync currently has a fdatawrite/fdatawait pair around the method call, and a mutex_lock/unlock of the inode mutex. All callers of fsync have to duplicate this, but we have a few and most of them don't quite get it right. This patch adds a new vfs_fsync that takes care of this. It's a little more complicated as usual as ->fsync might get a NULL file pointer and just a dentry from nfsd, but otherwise gets afile and we want to take the mapping and file operations from it when it is there. Notes on the fsync callers: - ecryptfs wasn't calling filemap_fdatawrite / filemap_fdatawait on the lower file - coda wasn't calling filemap_fdatawrite / filemap_fdatawait on the host file, and returning 0 when ->fsync was missing - shm wasn't calling either filemap_fdatawrite / filemap_fdatawait nor taking i_mutex. Now given that shared memory doesn't have disk backing not doing anything in fsync seems fine and I left it out of the vfs_fsync conversion for now, but in that case we might just not pass it through to the lower file at all but just call the no-op simple_sync_file directly. [and now actually export vfs_fsync] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* SYNC_FILE_RANGE_WRITE may and will block. Document that.Pavel Machek2008-07-241-1/+2
| | | | | | | [akpm@linux-foundation.org: fix comment text] Signed-off-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* vfs: fix unconditional write_super() call in file_fsync()OGAWA Hirofumi2008-04-291-1/+1
| | | | | | | | | | | | | We need to check ->s_dirt before calling write_super(). It became the cause of an unneeded write. This bug was noticed by Sudhanshu Saxena. Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Introduce fixed sys_sync_file_range2() syscall, implement on PowerPC and ARMDavid Woodhouse2007-06-281-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | | Not all the world is an i386. Many architectures need 64-bit arguments to be aligned in suitable pairs of registers, and the original sys_sync_file_range(int, loff_t, loff_t, int) was therefore wasting an argument register for padding after the first integer. Since we don't normally have more than 6 arguments for system calls, that left no room for the final argument on some architectures. Fix this by introducing sys_sync_file_range2(int, int, loff_t, loff_t) which all fits nicely. In fact, ARM already had that, but called it sys_arm_sync_file_range. Move it to fs/sync.c and rename it, then implement the needed compatibility routine. And stop the missing syscall check from bitching about the absence of sys_sync_file_range() if we've implemented sys_sync_file_range2() instead. Tested on PPC32 and with 32-bit and 64-bit userspace on PPC64. Signed-off-by: David Woodhouse <dwmw2@infradead.org> Acked-by: Russell King <rmk+kernel@arm.linux.org.uk> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Paul Mackerras <paulus@samba.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Remove do_sync_file_range()Mark Fasheh2007-05-081-1/+1
| | | | | | | | | | Remove do_sync_file_range() and convert callers to just use do_sync_mapping_range(). Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* [PATCH] Turn do_sync_file_range() into do_sync_mapping_range()Mark Fasheh2007-04-261-5/+3
| | | | | | | | | | | | | | | | | do_sync_file_range() accepts a file * from which it takes an address_space to sync. Abstract out the bulk of the function into do_sync_mapping_range() which takes the address_space directly. This way callers who want to sync an address_space directly can take advantage of the functionality provided. do_sync_file_range() is preserved as a small wrapper around do_sync_mapping_range(). Ocfs2 in particular would like to use this to initiate a sync of a specific inode range during truncate, where a file * may not be available. Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* [PATCH] VFS: change struct file to use struct pathJosef "Jeff" Sipek2006-12-081-2/+2
| | | | | | | | | | | | | This patch changes struct file to use struct path instead of having independent pointers to struct dentry and struct vfsmount, and converts all users of f_{dentry,vfsmnt} in fs/ to use f_path.{dentry,mnt}. Additionally, it adds two #define's to make the transition easier for users of the f_dentry and f_vfsmnt. Signed-off-by: Josef "Jeff" Sipek <jsipek@cs.sunysb.edu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] severing fs.h, radix-tree.h -> sched.hAl Viro2006-12-041-0/+1
| | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* [PATCH] BLOCK: Move functions out of buffer code [try #6]David Howells2006-09-301-0/+113
| | | | | | | | | | | | | | | | | | | | | | | | | Move some functions out of the buffering code that aren't strictly buffering specific. This is a precursor to being able to disable the block layer. (*) Moved some stuff out of fs/buffer.c: (*) The file sync and general sync stuff moved to fs/sync.c. (*) The superblock sync stuff moved to fs/super.c. (*) do_invalidatepage() moved to mm/truncate.c. (*) try_to_release_page() moved to mm/filemap.c. (*) Moved some related declarations between header files: (*) declarations for do_invalidatepage() and try_to_release_page() moved to linux/mm.h. (*) __set_page_dirty_buffers() moved to linux/buffer_head.h. Signed-Off-By: David Howells <dhowells@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* [PATCH] writeback: fix range handlingOGAWA Hirofumi2006-06-231-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a writeback_control's `start' and `end' fields are used to indicate a one-byte-range starting at file offset zero, the required values of .start=0,.end=0 mean that the ->writepages() implementation has no way of telling that it is being asked to perform a range request. Because we're currently overloading (start == 0 && end == 0) to mean "this is not a write-a-range request". To make all this sane, the patch changes range of writeback_control. So caller does: If it is calling ->writepages() to write pages, it sets range (range_start/end or range_cyclic) always. And if range_cyclic is true, ->writepages() thinks the range is cyclic, otherwise it just uses range_start and range_end. This patch does, - Add LLONG_MAX, LLONG_MIN, ULLONG_MAX to include/linux/kernel.h -1 is usually ok for range_end (type is long long). But, if someone did, range_end += val; range_end is "val - 1" u64val = range_end >> bits; u64val is "~(0ULL)" or something, they are wrong. So, this adds LLONG_MAX to avoid nasty things, and uses LLONG_MAX for range_end. - All callers of ->writepages() sets range_start/end or range_cyclic. - Fix updates of ->writeback_index. It seems already bit strange. If it starts at 0 and ended by check of nr_to_write, this last index may reduce chance to scan end of file. So, this updates ->writeback_index only if range_cyclic is true or whole-file is scanned. Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Cc: Nathan Scott <nathans@sgi.com> Cc: Anton Altaparmakov <aia21@cantab.net> Cc: Steven French <sfrench@us.ibm.com> Cc: "Vladimir V. Saveliev" <vs@namesys.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [PATCH] sync_file_range(): use unsigned for flagsAndrew Morton2006-04-111-2/+2
| | | | | | | | Ulrich suggested that the `flags' arg to sync_file_range() become unsigned. Cc: Ulrich Drepper <drepper@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>