summaryrefslogtreecommitdiffstats
path: root/fs/ocfs2/file.c
Commit message (Collapse)AuthorAgeFilesLines
* ocfs2: _really_ sync the right rangeAl Viro2015-04-091-4/+10
| | | | | | | | | | | | | | | "ocfs2 syncs the wrong range" had been broken; prior to it the code was doing the wrong thing in case of O_APPEND, all right, but _after_ it we were syncing the wrong range in 100% cases. *ppos, aka iocb->ki_pos is incremented prior to that point, so we are always doing sync on the area _after_ the one we'd written to. Spotted by Joseph Qi <joseph.qi@huawei.com> back in January; unfortunately, I'd missed his mail back then ;-/ Cc: stable@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* ocfs2_file_write_iter: keep return value and current position update in syncAl Viro2015-04-081-1/+1
| | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* [regression] ocfs2: do *not* increment ->ki_pos twiceAl Viro2015-04-081-1/+0
| | | | | | | generic_file_direct_write() already does that. Broken by "ocfs2: do not fallback to buffer I/O write if appending" Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* ocfs2: set append dio as a ro compat featureJoseph Qi2015-02-161-1/+16
| | | | | | | | | | | | | | | Intruduce a bit OCFS2_FEATURE_RO_COMPAT_APPEND_DIO and check it in write flow. If the bit is not set, fall back to the old way. Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Cc: Weiwei Wang <wangww631@huawei.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Xuejiufei <xuejiufei@huawei.com> Cc: alex chen <alex.chen@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ocfs2: complete the rest request through buffer ioJoseph Qi2015-02-161-1/+42
| | | | | | | | | | | | | | Complte the rest request thourgh buffer io after direct write performed. Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Cc: Weiwei Wang <wangww631@huawei.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Xuejiufei <xuejiufei@huawei.com> Cc: alex chen <alex.chen@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ocfs2: do not fallback to buffer I/O write if appendingJoseph Qi2015-02-161-1/+4
| | | | | | | | | | | | | | | Now we can do direct io and do not fallback to buffered IO any more in case of append O_DIRECT write. Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Cc: Weiwei Wang <wangww631@huawei.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Xuejiufei <xuejiufei@huawei.com> Cc: alex chen <alex.chen@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ocfs2: prepare some interfaces used in append direct ioJoseph Qi2015-02-161-2/+9
| | | | | | | | | | | | | | | | | | | | | | | | | Currently in case of append O_DIRECT write (block not allocated yet), ocfs2 will fall back to buffered I/O. This has some disadvantages. Firstly, it is not the behavior as expected. Secondly, it will consume huge page cache, e.g. in mass backup scenario. Thirdly, modern filesystems such as ext4 support this feature. In this patch set, the direct I/O write doesn't fallback to buffer I/O write any more because the allocate blocks are enabled in direct I/O now. This patch (of 9): Prepare some interfaces which will be used in append O_DIRECT write. Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Cc: Weiwei Wang <wangww631@huawei.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Xuejiufei <xuejiufei@huawei.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: alex chen <alex.chen@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge branch 'for-3.20/bdi' of git://git.kernel.dk/linux-blockLinus Torvalds2015-02-121-1/+1
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull backing device changes from Jens Axboe: "This contains a cleanup of how the backing device is handled, in preparation for a rework of the life time rules. In this part, the most important change is to split the unrelated nommu mmap flags from it, but also removing a backing_dev_info pointer from the address_space (and inode), and a cleanup of other various minor bits. Christoph did all the work here, I just fixed an oops with pages that have a swap backing. Arnd fixed a missing export, and Oleg killed the lustre backing_dev_info from staging. Last patch was from Al, unexporting parts that are now no longer needed outside" * 'for-3.20/bdi' of git://git.kernel.dk/linux-block: Make super_blocks and sb_lock static mtd: export new mtd_mmap_capabilities fs: make inode_to_bdi() handle NULL inode staging/lustre/llite: get rid of backing_dev_info fs: remove default_backing_dev_info fs: don't reassign dirty inodes to default_backing_dev_info nfs: don't call bdi_unregister ceph: remove call to bdi_unregister fs: remove mapping->backing_dev_info fs: export inode_to_bdi and use it in favor of mapping->backing_dev_info nilfs2: set up s_bdi like the generic mount_bdev code block_dev: get bdev inode bdi directly from the block device block_dev: only write bdev inode on close fs: introduce f_op->mmap_capabilities for nommu mmap support fs: kill BDI_CAP_SWAP_BACKED fs: deduplicate noop_backing_dev_info
| * fs: export inode_to_bdi and use it in favor of mapping->backing_dev_infoChristoph Hellwig2015-01-201-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | Now that we got rid of the bdi abuse on character devices we can always use sb->s_bdi to get at the backing_dev_info for a file, except for the block device special case. Export inode_to_bdi and replace uses of mapping->backing_dev_info with it to prepare for the removal of mapping->backing_dev_info. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Tejun Heo <tj@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com>
* | ocfs2: fix uninitialized variable accessJunxiao Bi2015-02-101-1/+1
|/ | | | | | | | | | Variable "why" is not yet initialized at line 615, fix it. Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ocfs2: reflink: fix slow unlink for refcounted fileJunxiao Bi2014-12-181-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | When running ocfs2 test suite multiple nodes reflink stress test, for a 4 nodes cluster, every unlink() for refcounted file needs about 700s. The slow unlink is caused by the contention of refcount tree lock since all nodes are unlink files using the same refcount tree. When the unlinking file have many extents(over 1600 in our test), most of the extents has refcounted flag set. In ocfs2_commit_truncate(), it will execute the following call trace for every extents. This means it needs get and released refcount tree lock about 1600 times. And when several nodes are do this at the same time, the performance will be very low. ocfs2_remove_btree_range() -- ocfs2_lock_refcount_tree() ---- ocfs2_refcount_lock() ------ __ocfs2_cluster_lock() ocfs2_refcount_lock() is costly, move it to ocfs2_commit_truncate() to do lock/unlock once can improve a lot performance. Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Wengang <wen.gang.wang@oracle.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ocfs2: remove filesize checks for sync I/O journal commitGoldwyn Rodrigues2014-12-101-3/+1
| | | | | | | | | | | | | | | | | | | | Filesize is not a good indication that the file needs to be synced. An example where this breaks is: 1. Open the file in O_SYNC|O_RDWR 2. Read a small portion of the file (say 64 bytes) 3. Lseek to starting of the file 4. Write 64 bytes If the node crashes, it is not written out to disk because this was not committed in the journal and the other node which reads the file after recovery reads stale data (even if the write on the other node was successful) Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.de> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge branch 'for_linus' of ↵Linus Torvalds2014-10-111-1/+1
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull UDF and quota updates from Jan Kara: "A few UDF fixes and also a few patches which are preparing filesystems for support of project quotas in VFS" * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: udf: Fix loading of special inodes ocfs2: Back out change to use OCFS2_MAXQUOTAS in ocfs2_setattr() udf: remove redundant sys_tz declaration ocfs2: Don't use MAXQUOTAS value reiserfs: Don't use MAXQUOTAS value ext3: Don't use MAXQUOTAS value udf: Fix race between write(2) and close(2)
| * ocfs2: Back out change to use OCFS2_MAXQUOTAS in ocfs2_setattr()Jan Kara2014-09-301-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | ocfs2_setattr() actually needs to really use MAXQUOTAS and not OCFS2_MAXQUOTAS since it will pass the array over to VFS. Currently this isn't a problem since MAXQUOTAS == OCFS2_MAXQUOTAS but it would be once we introduce project quotas. CC: Mark Fasheh <mfasheh@suse.com> CC: Joel Becker <jlbec@evilplan.org> CC: ocfs2-devel@oss.oracle.com Signed-off-by: Jan Kara <jack@suse.cz>
| * ocfs2: Don't use MAXQUOTAS valueJan Kara2014-09-171-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | MAXQUOTAS value defines maximum number of quota types VFS supports. This isn't necessarily the number of types ocfs2 supports and with addition of project quotas these two numbers stop matching. So make ocfs2 use its private definition. CC: Mark Fasheh <mfasheh@suse.com> CC: Joel Becker <jlbec@evilplan.org> CC: ocfs2-devel@oss.oracle.com Signed-off-by: Jan Kara <jack@suse.cz>
* | ocfs2: fix deadlock due to wrong locking orderJunxiao Bi2014-10-091-24/+23
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For commit ocfs2 journal, ocfs2 journal thread will acquire the mutex osb->journal->j_trans_barrier and wake up jbd2 commit thread, then it will wait until jbd2 commit thread done. In order journal mode, jbd2 needs flushing dirty data pages first, and this needs get page lock. So osb->journal->j_trans_barrier should be got before page lock. But ocfs2_write_zero_page() and ocfs2_write_begin_inline() obey this locking order, and this will cause deadlock and hung the whole cluster. One deadlock catched is the following: PID: 13449 TASK: ffff8802e2f08180 CPU: 31 COMMAND: "oracle" #0 [ffff8802ee3f79b0] __schedule at ffffffff8150a524 #1 [ffff8802ee3f7a58] schedule at ffffffff8150acbf #2 [ffff8802ee3f7a68] rwsem_down_failed_common at ffffffff8150cb85 #3 [ffff8802ee3f7ad8] rwsem_down_read_failed at ffffffff8150cc55 #4 [ffff8802ee3f7ae8] call_rwsem_down_read_failed at ffffffff812617a4 #5 [ffff8802ee3f7b50] ocfs2_start_trans at ffffffffa0498919 [ocfs2] #6 [ffff8802ee3f7ba0] ocfs2_zero_start_ordered_transaction at ffffffffa048b2b8 [ocfs2] #7 [ffff8802ee3f7bf0] ocfs2_write_zero_page at ffffffffa048e9bd [ocfs2] #8 [ffff8802ee3f7c80] ocfs2_zero_extend_range at ffffffffa048ec83 [ocfs2] #9 [ffff8802ee3f7ce0] ocfs2_zero_extend at ffffffffa048edfd [ocfs2] #10 [ffff8802ee3f7d50] ocfs2_extend_file at ffffffffa049079e [ocfs2] #11 [ffff8802ee3f7da0] ocfs2_setattr at ffffffffa04910ed [ocfs2] #12 [ffff8802ee3f7e70] notify_change at ffffffff81187d29 #13 [ffff8802ee3f7ee0] do_truncate at ffffffff8116bbc1 #14 [ffff8802ee3f7f50] sys_ftruncate at ffffffff8116bcbd #15 [ffff8802ee3f7f80] system_call_fastpath at ffffffff81515142 RIP: 00007f8de750c6f7 RSP: 00007fffe786e478 RFLAGS: 00000206 RAX: 000000000000004d RBX: ffffffff81515142 RCX: 0000000000000000 RDX: 0000000000000200 RSI: 0000000000028400 RDI: 000000000000000d RBP: 00007fffe786e040 R8: 0000000000000000 R9: 000000000000000d R10: 0000000000000000 R11: 0000000000000206 R12: 000000000000000d R13: 00007fffe786e710 R14: 00007f8de70f8340 R15: 0000000000028400 ORIG_RAX: 000000000000004d CS: 0033 SS: 002b crash64> bt PID: 7610 TASK: ffff88100fd56140 CPU: 1 COMMAND: "ocfs2cmt" #0 [ffff88100f4d1c50] __schedule at ffffffff8150a524 #1 [ffff88100f4d1cf8] schedule at ffffffff8150acbf #2 [ffff88100f4d1d08] jbd2_log_wait_commit at ffffffffa01274fd [jbd2] #3 [ffff88100f4d1d98] jbd2_journal_flush at ffffffffa01280b4 [jbd2] #4 [ffff88100f4d1dd8] ocfs2_commit_cache at ffffffffa0499b14 [ocfs2] #5 [ffff88100f4d1e38] ocfs2_commit_thread at ffffffffa0499d38 [ocfs2] #6 [ffff88100f4d1ee8] kthread at ffffffff81090db6 #7 [ffff88100f4d1f48] kernel_thread_helper at ffffffff81516284 crash64> bt PID: 7609 TASK: ffff88100f2d4480 CPU: 0 COMMAND: "jbd2/dm-20-86" #0 [ffff88100def3920] __schedule at ffffffff8150a524 #1 [ffff88100def39c8] schedule at ffffffff8150acbf #2 [ffff88100def39d8] io_schedule at ffffffff8150ad6c #3 [ffff88100def39f8] sleep_on_page at ffffffff8111069e #4 [ffff88100def3a08] __wait_on_bit_lock at ffffffff8150b30a #5 [ffff88100def3a58] __lock_page at ffffffff81110687 #6 [ffff88100def3ab8] write_cache_pages at ffffffff8111b752 #7 [ffff88100def3be8] generic_writepages at ffffffff8111b901 #8 [ffff88100def3c48] journal_submit_data_buffers at ffffffffa0120f67 [jbd2] #9 [ffff88100def3cf8] jbd2_journal_commit_transaction at ffffffffa0121372[jbd2] #10 [ffff88100def3e68] kjournald2 at ffffffffa0127a86 [jbd2] #11 [ffff88100def3ee8] kthread at ffffffff81090db6 #12 [ffff88100def3f48] kernel_thread_helper at ffffffff81516284 Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Alex Chen <alex.chen@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge branch 'for-linus' of ↵Linus Torvalds2014-06-121-114/+24
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs updates from Al Viro: "This the bunch that sat in -next + lock_parent() fix. This is the minimal set; there's more pending stuff. In particular, I really hope to get acct.c fixes merged this cycle - we need that to deal sanely with delayed-mntput stuff. In the next pile, hopefully - that series is fairly short and localized (kernel/acct.c, fs/super.c and fs/namespace.c). In this pile: more iov_iter work. Most of prereqs for ->splice_write with sane locking order are there and Kent's dio rewrite would also fit nicely on top of this pile" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (70 commits) lock_parent: don't step on stale ->d_parent of all-but-freed one kill generic_file_splice_write() ceph: switch to iter_file_splice_write() shmem: switch to iter_file_splice_write() nfs: switch to iter_splice_write_file() fs/splice.c: remove unneeded exports ocfs2: switch to iter_file_splice_write() ->splice_write() via ->write_iter() bio_vec-backed iov_iter optimize copy_page_{to,from}_iter() bury generic_file_aio_{read,write} lustre: get rid of messing with iovecs ceph: switch to ->write_iter() ceph_sync_direct_write: stop poking into iov_iter guts ceph_sync_read: stop poking into iov_iter guts new helper: copy_page_from_iter() fuse: switch to ->write_iter() btrfs: switch to ->write_iter() ocfs2: switch to ->write_iter() xfs: switch to ->write_iter() ...
| * ocfs2: switch to iter_file_splice_write()Al Viro2014-06-121-80/+2
| | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * ocfs2: switch to ->write_iter()Al Viro2014-05-061-18/+12
| | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * ocfs2: switch to ->read_iter()Al Viro2014-05-061-11/+10
| | | | | | | | | | | | tracepoints are evil, exhibit #6969... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * iov_iter_truncate()Al Viro2014-05-061-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | Now It Can Be Done(tm) - we don't need to do iov_shorten() in generic_file_direct_write() anymore, now that all ->direct_IO() instances are converted to proper iov_iter methods and honour iter->count and iter->iov_offset properly. Get rid of count/ocount arguments of generic_file_direct_write(), while we are at it. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * start adding the tag to iov_iterAl Viro2014-05-061-1/+1
| | | | | | | | | | | | | | | | | | For now, just use the same thing we pass to ->direct_IO() - it's all iovec-based at the moment. Pass it explicitly to iov_iter_init() and account for kvec vs. iovec in there, by the same kludge NFS ->direct_IO() uses. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * kill generic_segment_checks()Al Viro2014-05-061-6/+1
| | | | | | | | | | | | | | | | all callers of ->aio_read() and ->aio_write() have iov/nr_segs already checked - generic_segment_checks() done after that is just an odd way to spell iov_length(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * generic_file_direct_write(): switch to iov_iterAl Viro2014-05-061-3/+3
| | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | fs/buffer.c: remove block_write_full_page_endio()Matthew Wilcox2014-06-041-1/+1
|/ | | | | | | | | | | | | | | The last in-tree caller of block_write_full_page_endio() was removed in January 2013. It's time to remove the EXPORT_SYMBOL, which leaves block_write_full_page() as the only caller of block_write_full_page_endio(), so inline block_write_full_page_endio() into block_write_full_page(). Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Dheeraj Reddy <dheeraj.reddy@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge branch 'for-linus' of ↵Linus Torvalds2014-04-121-3/+6
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs updates from Al Viro: "The first vfs pile, with deep apologies for being very late in this window. Assorted cleanups and fixes, plus a large preparatory part of iov_iter work. There's a lot more of that, but it'll probably go into the next merge window - it *does* shape up nicely, removes a lot of boilerplate, gets rid of locking inconsistencie between aio_write and splice_write and I hope to get Kent's direct-io rewrite merged into the same queue, but some of the stuff after this point is having (mostly trivial) conflicts with the things already merged into mainline and with some I want more testing. This one passes LTP and xfstests without regressions, in addition to usual beating. BTW, readahead02 in ltp syscalls testsuite has started giving failures since "mm/readahead.c: fix readahead failure for memoryless NUMA nodes and limit readahead pages" - might be a false positive, might be a real regression..." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits) missing bits of "splice: fix racy pipe->buffers uses" cifs: fix the race in cifs_writev() ceph_sync_{,direct_}write: fix an oops on ceph_osdc_new_request() failure kill generic_file_buffered_write() ocfs2_file_aio_write(): switch to generic_perform_write() ceph_aio_write(): switch to generic_perform_write() xfs_file_buffered_aio_write(): switch to generic_perform_write() export generic_perform_write(), start getting rid of generic_file_buffer_write() generic_file_direct_write(): get rid of ppos argument btrfs_file_aio_write(): get rid of ppos kill the 5th argument of generic_file_buffered_write() kill the 4th argument of __generic_file_aio_write() lustre: don't open-code kernel_recvmsg() ocfs2: don't open-code kernel_recvmsg() drbd: don't open-code kernel_recvmsg() constify blk_rq_map_user_iov() and friends lustre: switch to kernel_sendmsg() ocfs2: don't open-code kernel_sendmsg() take iov_iter stuff to mm/iov_iter.c process_vm_access: tidy up a bit ...
| * ocfs2_file_aio_write(): switch to generic_perform_write()Al Viro2014-04-011-2/+5
| | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * generic_file_direct_write(): get rid of ppos argumentAl Viro2014-04-011-1/+1
| | | | | | | | | | | | always equal to &iocb->ki_pos. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * kill the 5th argument of generic_file_buffered_write()Al Viro2014-04-011-1/+1
| | | | | | | | | | | | same story - it's &iocb->ki_pos in all cases Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | ocfs2: call ocfs2_update_inode_fsync_trans when updating any inodeDarrick J. Wong2014-04-031-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Ensure that ocfs2_update_inode_fsync_trans() is called any time we touch an inode in a given transaction. This is a follow-on to the previous patch to reduce lock contention and deadlocking during an fsync operation. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Wengang <wen.gang.wang@oracle.com> Cc: Greg Marsden <greg.marsden@oracle.com> Cc: Srinivas Eeda <srinivas.eeda@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | ocfs2: llseek requires ocfs2 inode lock for the file in SEEK_ENDJensen2014-04-031-1/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | llseek requires ocfs2 inode lock for updating the file size in SEEK_END. because the file size maybe update on another node. This bug can be reproduce the following scenario: at first, we dd a test fileA, the file size is 10k. on NodeA: --------- 1) open the test fileA, lseek the end of file. and print the position. 2) close the test fileA on NodeB: 1) open the test fileA, append the 5k data to test FileA. 2) lseek the end of file. and print the position. 3) close file. At first we run the test program1 on NodeA , the result is 10k. And then run the test program2 on NodeB, the result is 15k. At last, we run the test program1 on NodeA again, the result is 10k. After applying this patch the three step result is 15k. test result: 1000000 times lseek call; index lseek with inode lock (unit:us) lseek without inode lock (unit:us) 1 1168162 555383 2 1168011 549504 3 1170538 549396 4 1170375 551685 5 1170444 556719 6 1174364 555307 7 1163294 551552 8 1170080 549350 9 1162464 553700 10 1165441 552594 avg 1168317 552519 avg with lock - avg without lock = 615798 (avg with lock - avg without lock)/1000000=0.615798 us Signed-off-by: Jensen <shencanquan@huawei.com> Cc: Jie Liu <jeff.liu@oracle.com> Acked-by: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Sunil Mushran <sunil.mushran@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | ocfs2: improve fsync efficiency and fix deadlock between aio_write and sync_fileDarrick J. Wong2014-04-031-21/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, ocfs2_sync_file grabs i_mutex and forces the current journal transaction to complete. This isn't terribly efficient, since sync_file really only needs to wait for the last transaction involving that inode to complete, and this doesn't require i_mutex. Therefore, implement the necessary bits to track the newest tid associated with an inode, and teach sync_file to wait for that instead of waiting for everything in the journal to commit. Furthermore, only issue the flush request to the drive if jbd2 hasn't already done so. This also eliminates the deadlock between ocfs2_file_aio_write() and ocfs2_sync_file(). aio_write takes i_mutex then calls ocfs2_aiodio_wait() to wait for unaligned dio writes to finish. However, if that dio completion involves calling fsync, then we can get into trouble when some ocfs2_sync_file tries to take i_mutex. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | ocfs2: change ip_unaligned_aio to of type mutex from atomit_tWengang Wang2014-04-031-12/+3
|/ | | | | | | | | | | | | | | There is a problem that waitqueue_active() may check stale data thus miss a wakeup of threads waiting on ip_unaligned_aio. The valid value of ip_unaligned_aio is only 0 and 1 so we can change it to be of type mutex thus the above prolem is avoid. Another benifit is that mutex which works as FIFO is fairer than wake_up_all(). Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ocfs2 syncs the wrong range...Al Viro2014-03-101-4/+4
| | | | | Cc: stable@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* ocfs2: update inode size after zeroing the holeJunxiao Bi2014-02-101-8/+32
| | | | | | | | | | | | | | | | | | | | | fs-writeback will release the dirty pages without page lock whose offset are over inode size, the release happens at block_write_full_page_endio(). If not update, dirty pages in file holes may be released before flushed to the disk, then file holes will contain some non-zero data, this will cause sparse file md5sum error. To reproduce the bug, find a big sparse file with many holes, like vm image file, its actual size should be bigger than available mem size to make writeback work more frequently, tar it with -S option, then keep untar it and check its md5sum again and again until you get a wrong md5sum. Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Younger Liu <younger.liu@huawei.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ocfs2: fix issue that ocfs2_setattr() does not deal with new_i_size==i_sizeYounger Liu2014-02-101-7/+2
| | | | | | | | | | | | | | | | | | | | | | | | | The issue scenario is as following: - Create a small file and fallocate a large disk space for a file with FALLOC_FL_KEEP_SIZE option. - ftruncate the file back to the original size again. but the disk free space is not changed back. This is a real bug that be fixed in this patch. In order to solve the issue above, we modified ocfs2_setattr(), if attr->ia_size != i_size_read(inode), It calls ocfs2_truncate_file(), and truncate disk space to attr->ia_size. Signed-off-by: Younger Liu <younger.liu@huawei.com> Reviewed-by: Jie Liu <jeff.liu@oracle.com> Tested-by: Jie Liu <jeff.liu@oracle.com> Cc: Joel Becker <jlbec@evilplan.org> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Cc: Sunil Mushran <sunil.mushran@gmail.com> Reviewed-by: Jensen <shencanquan@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ocfs2: fix ocfs2_sync_file() if filesystem is readonlyYounger Liu2014-02-101-0/+3
| | | | | | | | | | | | If filesystem is readonly, there is no need to flush drive's caches or force any uncommitted transactions. [akpm@linux-foundation.org: return -EROFS, not 0] Signed-off-by: Younger Liu <younger.liucn@gmail.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge branch 'for-linus' of ↵Linus Torvalds2014-01-281-1/+3
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs updates from Al Viro: "Assorted stuff; the biggest pile here is Christoph's ACL series. Plus assorted cleanups and fixes all over the place... There will be another pile later this week" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (43 commits) __dentry_path() fixes vfs: Remove second variable named error in __dentry_path vfs: Is mounted should be testing mnt_ns for NULL or error. Fix race when checking i_size on direct i/o read hfsplus: remove can_set_xattr nfsd: use get_acl and ->set_acl fs: remove generic_acl nfs: use generic posix ACL infrastructure for v3 Posix ACLs gfs2: use generic posix ACL infrastructure jfs: use generic posix ACL infrastructure xfs: use generic posix ACL infrastructure reiserfs: use generic posix ACL infrastructure ocfs2: use generic posix ACL infrastructure jffs2: use generic posix ACL infrastructure hfsplus: use generic posix ACL infrastructure f2fs: use generic posix ACL infrastructure ext2/3/4: use generic posix ACL infrastructure btrfs: use generic posix ACL infrastructure fs: make posix_acl_create more useful fs: make posix_acl_chmod more useful ...
| * ocfs2: use generic posix ACL infrastructureChristoph Hellwig2014-01-251-1/+3
| | | | | | | | | | | | | | | | | | This contains some major refactoring for the create path so that inodes are created with the right mode to start with instead of fixing it up later. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | ocfs2: punch hole should return EINVAL if the length argument in ioctl is ↵Tariq Saeed2014-01-211-1/+2
|/ | | | | | | | | | | | | | | | negative An unreserve space ioctl OCFS2_IOC_UNRESVSP/64 should reject a negative length. Orabug:14789508 Signed-off-by: Tariq Saseed <tariq.x.saeed@oracle.com> Signed-off-by: Srinivas Eeda <srinivas.eeda@oracle.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* fs/ocfs2/file.c: fix wrong commentJunxiao Bi2013-11-131-1/+1
| | | | | | | | | | | | Unwritten extent only exists for file systems which support holes. But the comment said was opposite meaning and also the comment is not very clear, so rephase it. Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* fs/ocfs2: remove unnecessary variable bits_wanted from ocfs2_calc_extend_creditsGoldwyn Rodrigues2013-11-131-2/+1
| | | | | | | | | | | Code cleanup to remove unnecessary variable passed but never used to ocfs2_calc_extend_credits. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge git://git.kvack.org/~bcrl/aio-nextLinus Torvalds2013-09-131-3/+3
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull aio changes from Ben LaHaise: "First off, sorry for this pull request being late in the merge window. Al had raised a couple of concerns about 2 items in the series below. I addressed the first issue (the race introduced by Gu's use of mm_populate()), but he has not provided any further details on how he wants to rework the anon_inode.c changes (which were sent out months ago but have yet to be commented on). The bulk of the changes have been sitting in the -next tree for a few months, with all the issues raised being addressed" * git://git.kvack.org/~bcrl/aio-next: (22 commits) aio: rcu_read_lock protection for new rcu_dereference calls aio: fix race in ring buffer page lookup introduced by page migration support aio: fix rcu sparse warnings introduced by ioctx table lookup patch aio: remove unnecessary debugging from aio_free_ring() aio: table lookup: verify ctx pointer staging/lustre: kiocb->ki_left is removed aio: fix error handling and rcu usage in "convert the ioctx list to table lookup v3" aio: be defensive to ensure request batching is non-zero instead of BUG_ON() aio: convert the ioctx list to table lookup v3 aio: double aio_max_nr in calculations aio: Kill ki_dtor aio: Kill ki_users aio: Kill unneeded kiocb members aio: Kill aio_rw_vect_retry() aio: Don't use ctx->tail unnecessarily aio: io_cancel() no longer returns the io_event aio: percpu ioctx refcount aio: percpu reqs_available aio: reqs_active -> reqs_available aio: fix build when migration is disabled ...
| * aio: Kill aio_rw_vect_retry()Kent Overstreet2013-07-301-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This code doesn't serve any purpose anymore, since the aio retry infrastructure has been removed. This change should be safe because aio_read/write are also used for synchronous IO, and called from do_sync_read()/do_sync_write() - and there's no looping done in the sync case (the read and write syscalls). Signed-off-by: Kent Overstreet <koverstreet@google.com> Cc: Zach Brown <zab@redhat.com> Cc: Felipe Balbi <balbi@ti.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jens Axboe <axboe@kernel.dk> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
* | ocfs2: free path in ocfs2_remove_inode_range()Younger Liu2013-09-111-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | In ocfs2_remove_inode_range(), there is a memory leak. The variable path has allocated memory with ocfs2_new_path_from_et(), but it is not free. Signed-off-by: Younger Liu <younger.liu@huawei.com> Reviewed-by: Jie Liu <jeff.liu@oracle.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | ocfs2: lighten up allocate transactionYounger Liu2013-09-111-5/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The issue scenario is as following: When fallocating a very large disk space for a small file, __ocfs2_extend_allocation attempts to get a very large transaction. For some journal sizes, there may be not enough room for this transaction, and the fallocate will fail. The patch below extends & restarts the transaction as necessary while allocating space, and should work with even the smallest journal. This patch refers ext4 resize. Test: # mkfs.ocfs2 -b 4K -C 32K -T datafiles /dev/sdc ...(jounral size is 32M) # mount.ocfs2 /dev/sdc /mnt/ocfs2/ # touch /mnt/ocfs2/1.log # fallocate -o 0 -l 400G /mnt/ocfs2/1.log fallocate: /mnt/ocfs2/1.log: fallocate failed: Cannot allocate memory # tail -f /var/log/messages [ 7372.278591] JBD: fallocate wants too many credits (2051 > 2048) [ 7372.278597] (fallocate,6438,0):__ocfs2_extend_allocation:709 ERROR: status = -12 [ 7372.278603] (fallocate,6438,0):ocfs2_allocate_unwritten_extents:1504 ERROR: status = -12 [ 7372.278607] (fallocate,6438,0):__ocfs2_change_file_space:1955 ERROR: status = -12 ^C With this patch, the test works well. Signed-off-by: Younger Liu <younger.liu@huawei.com> Cc: Jie Liu <jeff.liu@oracle.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | ocfs2: fix NULL pointer dereference in ocfs2_duplicate_clusters_by_pageTiger Yang2013-08-131-3/+3
|/ | | | | | | | | | | | | | | | | | | Since ocfs2_cow_file_pos will invoke ocfs2_refcount_icow with a NULL as the struct file pointer, it finally result in a null pointer dereference in ocfs2_duplicate_clusters_by_page. This patch replace file pointer with inode pointer in cow_duplicate_clusters to fix this issue. [jeff.liu@oracle.com: rebased patch against linux-next tree] Signed-off-by: Tiger Yang <tiger.yang@oracle.com> Signed-off-by: Jie Liu <jeff.liu@oracle.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Acked-by: Tao Ma <tm@tao.ma> Tested-by: David Weber <wb@munzinger.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* vfs: export lseek_execute() to modulesJie Liu2013-07-031-11/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For those file systems(btrfs/ext4/ocfs2/tmpfs) that support SEEK_DATA/SEEK_HOLE functions, we end up handling the similar matter in lseek_execute() to update the current file offset to the desired offset if it is valid, ceph also does the simliar things at ceph_llseek(). To reduce the duplications, this patch make lseek_execute() public accessible so that we can call it directly from the underlying file systems. Thanks Dave Chinner for this suggestion. [AV: call it vfs_setpos(), don't bring the removed 'inode' argument back] v2->v1: - Add kernel-doc comments for lseek_execute() - Call lseek_execute() in ceph->llseek() Signed-off-by: Jie Liu <jeff.liu@oracle.com> Cc: Dave Chinner <dchinner@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andi Kleen <andi@firstfloor.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Chris Mason <chris.mason@fusionio.com> Cc: Josef Bacik <jbacik@fusionio.com> Cc: Ben Myers <bpm@sgi.com> Cc: Ted Tso <tytso@mit.edu> Cc: Hugh Dickins <hughd@google.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Sage Weil <sage@inktank.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* [readdir] convert ocfs2Al Viro2013-06-291-2/+2
| | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* ocfs2: unlock rw lock if inode lock failedJoseph Qi2013-05-241-1/+1
| | | | | | | | | | | | | | | | | | | In ocfs2_file_aio_write(), it does ocfs2_rw_lock() first and then ocfs2_inode_lock(). But if ocfs2_inode_lock() failed, it goes to out_sems without unlocking rw lock. This will cause a bug in ocfs2_lock_res_free() when testing res->l_ex_holders, which is increased in __ocfs2_cluster_lock() and decreased in __ocfs2_cluster_unlock(). Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Li Zefan <lizefan@huawei.com> Cc: "Duyongfeng (B)" <du.duyongfeng@huawei.com> Acked-by: Sunil Mushran <sunil.mushran@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>