summaryrefslogtreecommitdiffstats
path: root/fs
Commit message (Collapse)AuthorAgeFilesLines
* ovl: pass ovl_fs down to functions accessing private xattrsMiklos Szeredi2020-09-029-86/+107
| | | | | | | This paves the way for optionally using the "user.overlay." xattr namespace. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
* ovl: drop flags argument from ovl_do_setxattr()Miklos Szeredi2020-09-026-9/+9
| | | | | | All callers pass zero flags to ovl_do_setxattr(). So drop this argument. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
* ovl: adhere to the vfs_ vs. ovl_do_ conventions for xattrsMiklos Szeredi2020-09-022-4/+4
| | | | | | | | | | Call ovl_do_*xattr() when accessing an overlay private xattr, vfs_*xattr() otherwise. This has an effect on debug output, which is made more consistent by this patch. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
* ovl: use ovl_do_getxattr() for private xattrMiklos Szeredi2020-09-024-8/+15
| | | | | | | | | | Use the convention of calling ovl_do_foo() for operations which are overlay specific. This patch is a no-op, and will have significance for supporting "user.overlay." xattr namespace. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
* ovl: fold ovl_getxattr() into ovl_get_redirect_xattr()Miklos Szeredi2020-09-021-36/+17
| | | | | | | | This is a partial revert (with some cleanups) of commit 993a0b2aec52 ("ovl: Do not lose security.capability xattr over metadata file copy-up"), which introduced ovl_getxattr() in the first place. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
* ovl: clean up ovl_getxattr() in copy_up.cMiklos Szeredi2020-09-021-21/+11
| | | | | | | Lose the padding and the failure message (in line with other parts of the copy up process). Return zero for both nonexistent or empty xattr. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
* duplicate ovl_getxattr()Miklos Szeredi2020-09-023-4/+35
| | | | | | | | | | | | | ovl_getattr() returns the value of an xattr in a kmalloced buffer. There are two callers: ovl_copy_up_meta_inode_data() (copy_up.c) ovl_get_redirect_xattr() (util.c) This patch just copies ovl_getxattr() to copy_up.c, the following patches will deal with the differences in idividual callers. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
* ovl: provide a mount option "volatile"Vivek Goyal2020-09-025-8/+97
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Container folks are complaining that dnf/yum issues too many sync while installing packages and this slows down the image build. Build requirement is such that they don't care if a node goes down while build was still going on. In that case, they will simply throw away unfinished layer and start new build. So they don't care about syncing intermediate state to the disk and hence don't want to pay the price associated with sync. So they are asking for mount options where they can disable sync on overlay mount point. They primarily seem to have two use cases. - For building images, they will mount overlay with nosync and then sync upper layer after unmounting overlay and reuse upper as lower for next layer. - For running containers, they don't seem to care about syncing upper layer because if node goes down, they will simply throw away upper layer and create a fresh one. So this patch provides a mount option "volatile" which disables all forms of sync. Now it is caller's responsibility to throw away upper if system crashes or shuts down and start fresh. With "volatile", I am seeing roughly 20% speed up in my VM where I am just installing emacs in an image. Installation time drops from 31 seconds to 25 seconds when nosync option is used. This is for the case of building on top of an image where all packages are already cached. That way I take out the network operations latency out of the measurement. Giuseppe is also looking to cut down on number of iops done on the disk. He is complaining that often in cloud their VMs are throttled if they cross the limit. This option can help them where they reduce number of iops (by cutting down on frequent sync and writebacks). Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com> Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
* ovl: check for incompatible features in work dirAmir Goldstein2020-09-022-11/+46
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | An incompatible feature is marked by a non-empty directory nested 2 levels deep under "work" dir, e.g.: workdir/work/incompat/volatile. This commit checks for marked incompat features, warns about them and fails to mount the overlay, for example: overlayfs: overlay with incompat feature 'volatile' cannot be mounted Very old kernels (i.e. v3.18) will fail to remove a non-empty "work" dir and fail the mount. Newer kernels will fail to remove a "work" dir with entries nested 3 levels and fall back to read-only mount. User mounting with old kernel will see a warning like these in dmesg: overlayfs: cleanup of 'incompat/...' failed (-39) overlayfs: cleanup of 'work/incompat' failed (-39) overlayfs: cleanup of 'ovl-work/work' failed (-39) overlayfs: failed to create directory /vdf/ovl-work/work (errno: 17); mounting read-only These warnings should give the hint to the user that: 1. mount failure is caused by backward incompatible features 2. mount failure can be resolved by manually removing the "work" directory There is nothing preventing users on old kernels from manually removing workdir entirely or mounting overlay with a new workdir, so this is in no way a full proof backward compatibility enforcement, but only a best effort. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
* Merge tag '5.9-rc2-smb-fix' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds2020-08-302-1/+16
|\ | | | | | | | | | | | | | | Pull cfis fix from Steve French: "DFS fix for referral problem when using SMB1" * tag '5.9-rc2-smb-fix' of git://git.samba.org/sfrench/cifs-2.6: cifs: fix check of tcon dfs in smb1
| * cifs: fix check of tcon dfs in smb1Paulo Alcantara2020-08-282-1/+16
| | | | | | | | | | | | | | | | | | | | For SMB1, the DFS flag should be checked against tcon->Flags rather than tcon->share_flags. While at it, add an is_tcon_dfs() helper to check for DFS capability in a more generic way. Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Shyam Prasad N <nspmangalore@gmail.com>
* | Merge tag 'fallthrough-fixes-5.9-rc3' of ↵Linus Torvalds2020-08-291-1/+0
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux Pull fallthrough fixes from Gustavo A. R. Silva: "Fix some minor issues introduced by the recent treewide fallthrough conversions: - Fix identation issue - Fix erroneous fallthrough annotation - Remove unnecessary fallthrough annotation - Fix code comment changed by fallthrough conversion" * tag 'fallthrough-fixes-5.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux: arm64/cpuinfo: Remove unnecessary fallthrough annotation media: dib0700: Fix identation issue in dib8096_set_param_override() afs: Remove erroneous fallthough annotation iio: dpot-dac: fix code comment in dpot_dac_read_raw()
| * | afs: Remove erroneous fallthough annotationDan Carpenter2020-08-271-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | The fall through annotation comes after a return statement so it's not reachable. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
* | | Merge tag 'io_uring-5.9-2020-08-28' of git://git.kernel.dk/linux-blockLinus Torvalds2020-08-282-44/+76
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull io_uring fixes from Jens Axboe: "A few fixes in here, all based on reports and test cases from folks using it. Most of it is stable material as well: - Hashed work cancelation fix (Pavel) - poll wakeup signalfd fix - memlock accounting fix - nonblocking poll retry fix - ensure we never return -ERESTARTSYS for reads - ensure offset == -1 is consistent with preadv2() as documented - IOPOLL -EAGAIN handling fixes - remove useless task_work bounce for block based -EAGAIN retry" * tag 'io_uring-5.9-2020-08-28' of git://git.kernel.dk/linux-block: io_uring: don't bounce block based -EAGAIN retry off task_work io_uring: fix IOPOLL -EAGAIN retries io_uring: clear req->result on IOPOLL re-issue io_uring: make offset == -1 consistent with preadv2/pwritev2 io_uring: ensure read requests go through -ERESTART* transformation io_uring: don't use poll handler if file can't be nonblocking read/written io_uring: fix imbalanced sqo_mm accounting io_uring: revert consumed iov_iter bytes on error io-wq: fix hang after cancelling pending hashed work io_uring: don't recurse on tsk->sighand->siglock with signalfd
| * | | io_uring: don't bounce block based -EAGAIN retry off task_workJens Axboe2020-08-271-20/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | These events happen inline from submission, so there's no need to bounce them through the original task. Just set them up for retry and issue retry directly instead of going over task_work. Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring: fix IOPOLL -EAGAIN retriesJens Axboe2020-08-271-5/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This normally isn't hit, as polling is mostly done on NVMe with deep queue depths. But if we do run into request starvation, we need to ensure that retries are properly serialized. Reported-by: Andres Freund <andres@anarazel.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring: clear req->result on IOPOLL re-issueJens Axboe2020-08-261-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Make sure we clear req->result, which was set to -EAGAIN for retry purposes, when moving it to the reissue list. Otherwise we can end up retrying a request more than once, which leads to weird results in the io-wq handling (and other spots). Cc: stable@vger.kernel.org Reported-by: Andres Freund <andres@anarazel.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring: make offset == -1 consistent with preadv2/pwritev2Jens Axboe2020-08-261-4/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The man page for io_uring generally claims were consistent with what preadv2 and pwritev2 accept, but turns out there's a slight discrepancy in how offset == -1 is handled for pipes/streams. preadv doesn't allow it, but preadv2 does. This currently causes io_uring to return -EINVAL if that is attempted, but we should allow that as documented. This change makes us consistent with preadv2/pwritev2 for just passing in a NULL ppos for streams if the offset is -1. Cc: stable@vger.kernel.org # v5.7+ Reported-by: Benedikt Ames <wisp3rwind@posteo.eu> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring: ensure read requests go through -ERESTART* transformationJens Axboe2020-08-251-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We need to call kiocb_done() for any ret < 0 to ensure that we always get the proper -ERESTARTSYS (and friends) transformation done. At some point this should be tied into general error handling, so we can get rid of the various (mostly network) related commands that check and perform this substitution. Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring: don't use poll handler if file can't be nonblocking read/writtenJens Axboe2020-08-251-1/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There's no point in using the poll handler if we can't do a nonblocking IO attempt of the operation, since we'll need to go async anyway. In fact this is actively harmful, as reading from eg pipes won't return 0 to indicate EOF. Cc: stable@vger.kernel.org # v5.7+ Reported-by: Benedikt Ames <wisp3rwind@posteo.eu> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring: fix imbalanced sqo_mm accountingJens Axboe2020-08-251-7/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We do the initial accounting of locked_vm and pinned_vm before we have setup ctx->sqo_mm, which means we can end up having not accounted the memory at setup time, but still decrement it when we exit. This causes an imbalance in the accounting. Setup ctx->sqo_mm earlier in io_uring_create(), before we do the first accounting of mm->{locked,pinned}_vm. This also unifies the state grabbing for the ctx, and eliminates a failure case in io_sq_offload_start(). Fixes: f74441e6311a ("io_uring: account locked memory before potential error case") Reported-by: Robert M. Muncrief <rmuncrief@humanavance.com> Reported-by: Niklas Schnelle <schnelle@linux.ibm.com> Tested-by: Niklas Schnelle <schnelle@linux.ibm.com> Tested-by: Robert M. Muncrief <rmuncrief@humanavance.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring: revert consumed iov_iter bytes on errorJens Axboe2020-08-251-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Some consumers of the iov_iter will return an error, but still have bytes consumed in the iterator. This is an issue for -EAGAIN, since we rely on a sane iov_iter state across retries. Fix this by ensuring that we revert consumed bytes, if any, if the file operations have consumed any bytes from iterator. This is similar to what generic_file_read_iter() does, and is always safe as we have the previous bytes count handy already. Fixes: ff6165b2d7f6 ("io_uring: retain iov_iter state over io_read/io_write calls") Reported-by: Dmitry Shulyak <yashulyak@gmail.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io-wq: fix hang after cancelling pending hashed workPavel Begunkov2020-08-231-2/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Don't forget to update wqe->hash_tail after cancelling a pending work item, if it was hashed. Cc: stable@vger.kernel.org # 5.7+ Reported-by: Dmitry Shulyak <yashulyak@gmail.com> Fixes: 86f3cd1b589a1 ("io-wq: handle hashed writes in chains") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring: don't recurse on tsk->sighand->siglock with signalfdJens Axboe2020-08-231-6/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If an application is doing reads on signalfd, and we arm the poll handler because there's no data available, then the wakeup can recurse on the tasks sighand->siglock as the signal delivery from task_work_add() will use TWA_SIGNAL and that attempts to lock it again. We can detect the signalfd case pretty easily by comparing the poll->head wait_queue_head_t with the target task signalfd wait queue. Just use normal task wakeup for this case. Cc: stable@vger.kernel.org # v5.7+ Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | | | Merge tag 'writeback_for_v5.9-rc3' of ↵Linus Torvalds2020-08-283-53/+56
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull writeback fixes from Jan Kara: "Fixes for writeback code occasionally skipping writeback of some inodes or livelocking sync(2)" * tag 'writeback_for_v5.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: writeback: Drop I_DIRTY_TIME_EXPIRE writeback: Fix sync livelock due to b_dirty_time processing writeback: Avoid skipping inode writeback writeback: Protect inode->i_io_list with inode->i_lock
| * | | | writeback: Drop I_DIRTY_TIME_EXPIREJan Kara2020-06-153-20/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The only use of I_DIRTY_TIME_EXPIRE is to detect in __writeback_single_inode() that inode got there because flush worker decided it's time to writeback the dirty inode time stamps (either because we are syncing or because of age). However we can detect this directly in __writeback_single_inode() and there's no need for the strange propagation with I_DIRTY_TIME_EXPIRE flag. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>
| * | | | writeback: Fix sync livelock due to b_dirty_time processingJan Kara2020-06-151-27/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When we are processing writeback for sync(2), move_expired_inodes() didn't set any inode expiry value (older_than_this). This can result in writeback never completing if there's steady stream of inodes added to b_dirty_time list as writeback rechecks dirty lists after each writeback round whether there's more work to be done. Fix the problem by using sync(2) start time is inode expiry value when processing b_dirty_time list similarly as for ordinarily dirtied inodes. This requires some refactoring of older_than_this handling which simplifies the code noticeably as a bonus. Fixes: 0ae45f63d4ef ("vfs: add support for a lazytime mount option") CC: stable@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>
| * | | | writeback: Avoid skipping inode writebackJan Kara2020-06-151-5/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Inode's i_io_list list head is used to attach inode to several different lists - wb->{b_dirty, b_dirty_time, b_io, b_more_io}. When flush worker prepares a list of inodes to writeback e.g. for sync(2), it moves inodes to b_io list. Thus it is critical for sync(2) data integrity guarantees that inode is not requeued to any other writeback list when inode is queued for processing by flush worker. That's the reason why writeback_single_inode() does not touch i_io_list (unless the inode is completely clean) and why __mark_inode_dirty() does not touch i_io_list if I_SYNC flag is set. However there are two flaws in the current logic: 1) When inode has only I_DIRTY_TIME set but it is already queued in b_io list due to sync(2), concurrent __mark_inode_dirty(inode, I_DIRTY_SYNC) can still move inode back to b_dirty list resulting in skipping writeback of inode time stamps during sync(2). 2) When inode is on b_dirty_time list and writeback_single_inode() races with __mark_inode_dirty() like: writeback_single_inode() __mark_inode_dirty(inode, I_DIRTY_PAGES) inode->i_state |= I_SYNC __writeback_single_inode() inode->i_state |= I_DIRTY_PAGES; if (inode->i_state & I_SYNC) bail if (!(inode->i_state & I_DIRTY_ALL)) - not true so nothing done We end up with I_DIRTY_PAGES inode on b_dirty_time list and thus standard background writeback will not writeback this inode leading to possible dirty throttling stalls etc. (thanks to Martijn Coenen for this analysis). Fix these problems by tracking whether inode is queued in b_io or b_more_io lists in a new I_SYNC_QUEUED flag. When this flag is set, we know flush worker has queued inode and we should not touch i_io_list. On the other hand we also know that once flush worker is done with the inode it will requeue the inode to appropriate dirty list. When I_SYNC_QUEUED is not set, __mark_inode_dirty() can (and must) move inode to appropriate dirty list. Reported-by: Martijn Coenen <maco@android.com> Reviewed-by: Martijn Coenen <maco@android.com> Tested-by: Martijn Coenen <maco@android.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Fixes: 0ae45f63d4ef ("vfs: add support for a lazytime mount option") CC: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz>
| * | | | writeback: Protect inode->i_io_list with inode->i_lockJan Kara2020-06-151-5/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, operations on inode->i_io_list are protected by wb->list_lock. In the following patches we'll need to maintain consistency between inode->i_state and inode->i_io_list so change the code so that inode->i_lock protects also all inode's i_io_list handling. Reviewed-by: Martijn Coenen <maco@android.com> Reviewed-by: Christoph Hellwig <hch@lst.de> CC: stable@vger.kernel.org # Prerequisite for "writeback: Avoid skipping inode writeback" Signed-off-by: Jan Kara <jack@suse.cz>
* | | | | Merge tag 'gfs2-v5.9-rc2-fixes' of ↵Linus Torvalds2020-08-282-0/+32
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2 Pull gfs2 fix from Andreas Gruenbacher: "Fix a memory leak on filesystem withdraw. We didn't detect this bug because we have slab merging on by default (CONFIG_SLAB_MERGE_DEFAULT). Adding 'slub_nomerge' to the kernel command line exposed the problem" * tag 'gfs2-v5.9-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2: gfs2: add some much needed cleanup for log flushes that fail
| * | | | | gfs2: add some much needed cleanup for log flushes that failBob Peterson2020-08-242-0/+32
| | |_|_|/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a log flush fails due to io errors, it signals the failure but does not clean up after itself very well. This is because buffers are added to the transaction tr_buf and tr_databuf queue, but the io error causes gfs2_log_flush to bypass the "after_commit" functions responsible for dequeueing the bd elements. If the bd elements are added to the ail list before the error, function ail_drain takes care of dequeueing them. But if they haven't gotten that far, the elements are forgotten and make the transactions unable to be freed. This patch introduces new function trans_drain which drains the bd elements from the transaction so they can be freed properly. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
* | | | | Merge tag 'ceph-for-5.9-rc3' of git://github.com/ceph/ceph-clientLinus Torvalds2020-08-288-77/+75
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull ceph fixes from Ilya Dryomov: "We have an inode number handling change, prompted by s390x which is a 64-bit architecture with a 32-bit ino_t, a patch to disallow leases to avoid potential data integrity issues when CephFS is re-exported via NFS or CIFS and a fix for the bulk of W=1 compilation warnings" * tag 'ceph-for-5.9-rc3' of git://github.com/ceph/ceph-client: ceph: don't allow setlease on cephfs ceph: fix inode number handling on arches with 32-bit ino_t libceph: add __maybe_unused to DEFINE_CEPH_FEATURE
| * | | | | ceph: don't allow setlease on cephfsJeff Layton2020-08-241-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Leases don't currently work correctly on kcephfs, as they are not broken when caps are revoked. They could eventually be implemented similarly to how we did them in libcephfs, but for now don't allow them. [ idryomov: no need for simple_nosetlease() in ceph_dir_fops and ceph_snapdir_fops ] Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * | | | | ceph: fix inode number handling on arches with 32-bit ino_tJeff Layton2020-08-248-77/+74
| |/ / / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Tuan and Ulrich mentioned that they were hitting a problem on s390x, which has a 32-bit ino_t value, even though it's a 64-bit arch (for historical reasons). I think the current handling of inode numbers in the ceph driver is wrong. It tries to use 32-bit inode numbers on 32-bit arches, but that's actually not a problem. 32-bit arches can deal with 64-bit inode numbers just fine when userland code is compiled with LFS support (the common case these days). What we really want to do is just use 64-bit numbers everywhere, unless someone has mounted with the ino32 mount option. In that case, we want to ensure that we hash the inode number down to something that will fit in 32 bits before presenting the value to userland. Add new helper functions that do this, and only do the conversion before presenting these values to userland in getattr and readdir. The inode table hashvalue is changed to just cast the inode number to unsigned long, as low-order bits are the most likely to vary anyway. While it's not strictly required, we do want to put something in inode->i_ino. Instead of basing it on BITS_PER_LONG, however, base it on the size of the ino_t type. NOTE: This is a user-visible change on 32-bit arches: 1/ inode numbers will be seen to have changed between kernel versions. 32-bit arches will see large inode numbers now instead of the hashed ones they saw before. 2/ any really old software not built with LFS support may start failing stat() calls with -EOVERFLOW on inode numbers >2^32. Nothing much we can do about these, but hopefully the intersection of people running such code on ceph will be very small. The workaround for both problems is to mount with "-o ino32". [ idryomov: changelog tweak ] URL: https://tracker.ceph.com/issues/46828 Reported-by: Ulrich Weigand <Ulrich.Weigand@de.ibm.com> Reported-and-Tested-by: Tuan Hoang1 <Tuan.Hoang1@ibm.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* | | | | Merge tag 'nfsd-5.9-1' of git://git.linux-nfs.org/projects/cel/cel-2.6Linus Torvalds2020-08-251-0/+2
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull nfs server fixes from Chuck Lever: - Eliminate an oops introduced in v5.8 - Remove a duplicate #include added by nfsd-5.9 * tag 'nfsd-5.9-1' of git://git.linux-nfs.org/projects/cel/cel-2.6: SUNRPC: remove duplicate include nfsd: fix oops on mixed NFSv4/NFSv3 client access
| * | | | | nfsd: fix oops on mixed NFSv4/NFSv3 client accessJ. Bruce Fields2020-08-161-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If an NFSv2/v3 client breaks an NFSv4 client's delegation, it will hit a NULL dereference in nfsd_breaker_owns_lease(). Easily reproduceable with for example mount -overs=4.2 server:/export /mnt/ sleep 1h </mnt/file & mount -overs=3 server:/export /mnt2/ touch /mnt2/file Reported-by: Robert Dinse <nanook@eskimo.com> Fixes: 28df3d1539de50 ("nfsd: clients don't need to break their own delegations") BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=208807 Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
* | | | | | Merge tag 'm68knommu-for-v5.9-rc3' of ↵Linus Torvalds2020-08-251-8/+12
|\ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/gerg/m68knommu Pull m68knommu fix from Greg Ungerer: "Only a single fix for the binfmt_flat loader (reverting a recent change)" * tag 'm68knommu-for-v5.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gerg/m68knommu: binfmt_flat: revert "binfmt_flat: don't offset the data start"
| * | | | | | binfmt_flat: revert "binfmt_flat: don't offset the data start"Max Filippov2020-08-241-8/+12
| | |/ / / / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | binfmt_flat loader uses the gap between text and data to store data segment pointers for the libraries. Even in the absence of shared libraries it stores at least one pointer to the executable's own data segment. Text and data can go back to back in the flat binary image and without offsetting data segment last few instructions in the text segment may get corrupted by the data segment pointer. Fix it by reverting commit a2357223c50a ("binfmt_flat: don't offset the data start"). Cc: stable@vger.kernel.org Fixes: a2357223c50a ("binfmt_flat: don't offset the data start") Signed-off-by: Max Filippov <jcmvbkbc@gmail.com> Signed-off-by: Greg Ungerer <gerg@linux-m68k.org>
* | | | | | Merge tag 'for-5.9-rc2-tag' of ↵Linus Torvalds2020-08-249-31/+39
|\ \ \ \ \ \ | |_|_|_|_|/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - fix swapfile activation on subvolumes with deleted snapshots - error value mixup when removing directory entries from tree log - fix lzo compression level reset after previous level setting - fix space cache memory leak after transaction abort - fix const function attribute - more error handling improvements * tag 'for-5.9-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: detect nocow for swap after snapshot delete btrfs: check the right error variable in btrfs_del_dir_entries_in_log btrfs: fix space cache memory leak after transaction abort btrfs: use the correct const function attribute for btrfs_get_num_csums btrfs: reset compression level for lzo on remount btrfs: handle errors from async submission
| * | | | | btrfs: detect nocow for swap after snapshot deleteBoris Burkov2020-08-214-16/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | can_nocow_extent and btrfs_cross_ref_exist both rely on a heuristic for detecting a must cow condition which is not exactly accurate, but saves unnecessary tree traversal. The incorrect assumption is that if the extent was created in a generation smaller than the last snapshot generation, it must be referenced by that snapshot. That is true, except the snapshot could have since been deleted, without affecting the last snapshot generation. The original patch claimed a performance win from this check, but it also leads to a bug where you are unable to use a swapfile if you ever snapshotted the subvolume it's in. Make the check slower and more strict for the swapon case, without modifying the general cow checks as a compromise. Turning swap on does not seem to be a particularly performance sensitive operation, so incurring a possibly unnecessary btrfs_search_slot seems worthwhile for the added usability. Note: Until the snapshot is competely cleaned after deletion, check_committed_refs will still cause the logic to think that cow is necessary, so the user must until 'btrfs subvolu sync' finished before activating the swapfile swapon. CC: stable@vger.kernel.org # 5.4+ Suggested-by: Omar Sandoval <osandov@osandov.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | | | btrfs: check the right error variable in btrfs_del_dir_entries_in_logJosef Bacik2020-08-211-4/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With my new locking code dbench is so much faster that I tripped over a transaction abort from ENOSPC. This turned out to be because btrfs_del_dir_entries_in_log was checking for ret == -ENOSPC, but this function sets err on error, and returns err. So instead of properly marking the inode as needing a full commit, we were returning -ENOSPC and aborting in __btrfs_unlink_inode. Fix this by checking the proper variable so that we return the correct thing in the case of ENOSPC. The ENOENT needs to be checked, because btrfs_lookup_dir_item_index() can return -ENOENT if the dir item isn't in the tree log (which would happen if we hadn't fsync'ed this guy). We actually handle that case in __btrfs_unlink_inode, so it's an expected error to get back. Fixes: 4a500fd178c8 ("Btrfs: Metadata ENOSPC handling for tree log") CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add note and comment about ENOENT ] Signed-off-by: David Sterba <dsterba@suse.com>
| * | | | | btrfs: fix space cache memory leak after transaction abortFilipe Manana2020-08-192-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If a transaction aborts it can cause a memory leak of the pages array of a block group's io_ctl structure. The following steps explain how that can happen: 1) Transaction N is committing, currently in state TRANS_STATE_UNBLOCKED and it's about to start writing out dirty extent buffers; 2) Transaction N + 1 already started and another task, task A, just called btrfs_commit_transaction() on it; 3) Block group B was dirtied (extents allocated from it) by transaction N + 1, so when task A calls btrfs_start_dirty_block_groups(), at the very beginning of the transaction commit, it starts writeback for the block group's space cache by calling btrfs_write_out_cache(), which allocates the pages array for the block group's io_ctl with a call to io_ctl_init(). Block group A is added to the io_list of transaction N + 1 by btrfs_start_dirty_block_groups(); 4) While transaction N's commit is writing out the extent buffers, it gets an IO error and aborts transaction N, also setting the file system to RO mode; 5) Task A has already returned from btrfs_start_dirty_block_groups(), is at btrfs_commit_transaction() and has set transaction N + 1 state to TRANS_STATE_COMMIT_START. Immediately after that it checks that the filesystem was turned to RO mode, due to transaction N's abort, and jumps to the "cleanup_transaction" label. After that we end up at btrfs_cleanup_one_transaction() which calls btrfs_cleanup_dirty_bgs(). That helper finds block group B in the transaction's io_list but it never releases the pages array of the block group's io_ctl, resulting in a memory leak. In fact at the point when we are at btrfs_cleanup_dirty_bgs(), the pages array points to pages that were already released by us at __btrfs_write_out_cache() through the call to io_ctl_drop_pages(). We end up freeing the pages array only after waiting for the ordered extent to complete through btrfs_wait_cache_io(), which calls io_ctl_free() to do that. But in the transaction abort case we don't wait for the space cache's ordered extent to complete through a call to btrfs_wait_cache_io(), so that's why we end up with a memory leak - we wait for the ordered extent to complete indirectly by shutting down the work queues and waiting for any jobs in them to complete before returning from close_ctree(). We can solve the leak simply by freeing the pages array right after releasing the pages (with the call to io_ctl_drop_pages()) at __btrfs_write_out_cache(), since we will never use it anymore after that and the pages array points to already released pages at that point, which is currently not a problem since no one will use it after that, but not a good practice anyway since it can easily lead to use-after-free issues. So fix this by freeing the pages array right after releasing the pages at __btrfs_write_out_cache(). This issue can often be reproduced with test case generic/475 from fstests and kmemleak can detect it and reports it with the following trace: unreferenced object 0xffff9bbf009fa600 (size 512): comm "fsstress", pid 38807, jiffies 4298504428 (age 22.028s) hex dump (first 32 bytes): 00 a0 7c 4d 3d ed ff ff 40 a0 7c 4d 3d ed ff ff ..|M=...@.|M=... 80 a0 7c 4d 3d ed ff ff c0 a0 7c 4d 3d ed ff ff ..|M=.....|M=... backtrace: [<00000000f4b5cfe2>] __kmalloc+0x1a8/0x3e0 [<0000000028665e7f>] io_ctl_init+0xa7/0x120 [btrfs] [<00000000a1f95b2d>] __btrfs_write_out_cache+0x86/0x4a0 [btrfs] [<00000000207ea1b0>] btrfs_write_out_cache+0x7f/0xf0 [btrfs] [<00000000af21f534>] btrfs_start_dirty_block_groups+0x27b/0x580 [btrfs] [<00000000c3c23d44>] btrfs_commit_transaction+0xa6f/0xe70 [btrfs] [<000000009588930c>] create_subvol+0x581/0x9a0 [btrfs] [<000000009ef2fd7f>] btrfs_mksubvol+0x3fb/0x4a0 [btrfs] [<00000000474e5187>] __btrfs_ioctl_snap_create+0x119/0x1a0 [btrfs] [<00000000708ee349>] btrfs_ioctl_snap_create_v2+0xb0/0xf0 [btrfs] [<00000000ea60106f>] btrfs_ioctl+0x12c/0x3130 [btrfs] [<000000005c923d6d>] __x64_sys_ioctl+0x83/0xb0 [<0000000043ace2c9>] do_syscall_64+0x33/0x80 [<00000000904efbce>] entry_SYSCALL_64_after_hwframe+0x44/0xa9 CC: stable@vger.kernel.org # 4.9+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | | | btrfs: use the correct const function attribute for btrfs_get_num_csumsDavid Sterba2020-08-192-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The build robot reports compiler: h8300-linux-gcc (GCC) 9.3.0 In file included from fs/btrfs/tests/extent-map-tests.c:8: >> fs/btrfs/tests/../ctree.h:2166:8: warning: type qualifiers ignored on function return type [-Wignored-qualifiers] 2166 | size_t __const btrfs_get_num_csums(void); | ^~~~~~~ The function attribute for const does not follow the expected scheme and in this case is confused with a const type qualifier. Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | | | btrfs: reset compression level for lzo on remountMarcos Paulo de Souza2020-08-191-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently a user can set mount "-o compress" which will set the compression algorithm to zlib, and use the default compress level for zlib (3): relatime,compress=zlib:3,space_cache If the user remounts the fs using "-o compress=lzo", then the old compress_level is used: relatime,compress=lzo:3,space_cache But lzo does not expose any tunable compression level. The same happens if we set any compress argument with different level, also with zstd. Fix this by resetting the compress_level when compress=lzo is specified. With the fix applied, lzo is shown without compress level: relatime,compress=lzo,space_cache CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | | | btrfs: handle errors from async submissionJohannes Thumshirn2020-08-191-8/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Btrfs' async submit mechanism is able to handle errors in the submission path and the meta-data async submit function correctly passes the error code to the caller. In btrfs_submit_bio_start() and btrfs_submit_bio_start_direct_io() we're not handling the errors returned by btrfs_csum_one_bio() correctly though and simply call BUG_ON(). This is unnecessary as the caller of these two functions - run_one_async_start - correctly checks for the return values and sets the status of the async_submit_bio. The actual bio submission will be handled later on by run_one_async_done only if async_submit_bio::status is 0, so the data won't be written if we encountered an error in the checksum process. Simply return the error from btrfs_csum_one_bio() to the async submitters, like it's done in btree_submit_bio_start(). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | | | | | treewide: Use fallthrough pseudo-keywordGustavo A. R. Silva2020-08-2377-226/+232
| |/ / / / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Replace the existing /* fall through */ comments and its variants with the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary fall-through markings when it is the case. [1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
* | | | | Merge branch 'work.epoll' of ↵Linus Torvalds2020-08-221-13/+13
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull epoll fixes from Al Viro: "Fix reference counting and clean up exit paths" * 'work.epoll' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: do_epoll_ctl(): clean the failure exits up a bit epoll: Keep a reference on files added to the check list
| * | | | | do_epoll_ctl(): clean the failure exits up a bitAl Viro2020-08-221-13/+6
| | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | | | | epoll: Keep a reference on files added to the check listMarc Zyngier2020-08-221-2/+9
| | |/ / / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When adding a new fd to an epoll, and that this new fd is an epoll fd itself, we recursively scan the fds attached to it to detect cycles, and add non-epool files to a "check list" that gets subsequently parsed. However, this check list isn't completely safe when deletions can happen concurrently. To sidestep the issue, make sure that a struct file placed on the check list sees its f_count increased, ensuring that a concurrent deletion won't result in the file disapearing from under our feet. Cc: stable@vger.kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | | | | Merge tag 'io_uring-5.9-2020-08-21' of git://git.kernel.dk/linux-blockLinus Torvalds2020-08-211-94/+79
|\ \ \ \ \ | | |_|_|/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull io_uring fixes from Jens Axboe: - Make sure the head link cancelation includes async work - Get rid of kiocb_wait_page_queue_init(), makes no sense to have it as a separate function since you moved it into io_uring itself - io_import_iovec cleanups (Pavel, me) - Use system_unbound_wq for ring exit work, to avoid spawning tons of these if we have tons of rings exiting at the same time - Fix req->flags overflow flag manipulation (Pavel) * tag 'io_uring-5.9-2020-08-21' of git://git.kernel.dk/linux-block: io_uring: kill extra iovec=NULL in import_iovec() io_uring: comment on kfree(iovec) checks io_uring: fix racy req->flags modification io_uring: use system_unbound_wq for ring exit work io_uring: cleanup io_import_iovec() of pre-mapped request io_uring: get rid of kiocb_wait_page_queue_init() io_uring: find and cancel head link async work on files exit