summaryrefslogtreecommitdiffstats
path: root/fs/ceph
Commit message (Collapse)AuthorAgeFilesLines
* ceph: fix handling of "meta" errorsJeff Layton2021-10-275-27/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 1bd85aa65d0e7b5e4d09240f492f37c569fdd431 upstream. Currently, we check the wb_err too early for directories, before all of the unsafe child requests have been waited on. In order to fix that we need to check the mapping->wb_err later nearer to the end of ceph_fsync. We also have an overly-complex method for tracking errors after blocklisting. The errors recorded in cleanup_session_requests go to a completely separate field in the inode, but we end up reporting them the same way we would for any other error (in fsync). There's no real benefit to tracking these errors in two different places, since the only reporting mechanism for them is in fsync, and we'd need to advance them both every time. Given that, we can just remove i_meta_err, and convert the places that used it to instead just use mapping->wb_err instead. That also fixes the original problem by ensuring that we do a check_and_advance of the wb_err at the end of the fsync op. Cc: stable@vger.kernel.org URL: https://tracker.ceph.com/issues/52864 Reported-by: Patrick Donnelly <pdonnell@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ceph: lockdep annotations for try_nonblocking_invalidateJeff Layton2021-09-261-0/+2
| | | | | | | | | [ Upstream commit 3eaf5aa1cfa8c97c72f5824e2e9263d6cc977b03 ] Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* ceph: request Fw caps before updating the mtime in ceph_write_iterJeff Layton2021-09-261-15/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit b11ed50346683a749632ea664959b28d524d7395 ] The current code will update the mtime and then try to get caps to handle the write. If we end up having to request caps from the MDS, then the mtime in the cap grant will clobber the updated mtime and it'll be lost. This is most noticable when two clients are alternately writing to the same file. Fw caps are continually being granted and revoked, and the mtime ends up stuck because the updated mtimes are always being overwritten with the old one. Fix this by changing the order of operations in ceph_write_iter to get the caps before updating the times. Also, make sure we check the pool full conditions before even getting any caps or uninlining. URL: https://tracker.ceph.com/issues/46574 Reported-by: Jozef Kováč <kovac@firma.zoznam.sk> Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Luis Henriques <lhenriques@suse.de> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* ceph: take snap_empty_lock atomically with snaprealm refcount changeJeff Layton2021-08-181-17/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 8434ffe71c874b9c4e184b88d25de98c2bf5fe3f upstream. There is a race in ceph_put_snap_realm. The change to the nref and the spinlock acquisition are not done atomically, so you could decrement nref, and before you take the spinlock, the nref is incremented again. At that point, you end up putting it on the empty list when it shouldn't be there. Eventually __cleanup_empty_realms runs and frees it when it's still in-use. Fix this by protecting the 1->0 transition with atomic_dec_and_lock, and just drop the spinlock if we can get the rwsem. Because these objects can also undergo a 0->1 refcount transition, we must protect that change as well with the spinlock. Increment locklessly unless the value is at 0, in which case we take the spinlock, increment and then take it off the empty list if it did the 0->1 transition. With these changes, I'm removing the dout() messages from these functions, as well as in __put_snap_realm. They've always been racy, and it's better to not print values that may be misleading. Cc: stable@vger.kernel.org URL: https://tracker.ceph.com/issues/46419 Reported-by: Mark Nelson <mnelson@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Luis Henriques <lhenriques@suse.de> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ceph: clean up locking annotation for ceph_get_snap_realm and ↵Jeff Layton2021-08-181-4/+4
| | | | | | | | | | | | | | | | | | | | __lookup_snap_realm commit df2c0cb7f8e8c83e495260ad86df8c5da947f2a7 upstream. They both say that the snap_rwsem must be held for write, but I don't see any real reason for it, and it's not currently always called that way. The lookup is just walking the rbtree, so holding it for read should be fine there. The "get" is bumping the refcount and (possibly) removing it from the empty list. I see no need to hold the snap_rwsem for write for that. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ceph: add some lockdep assertions around snaprealm handlingJeff Layton2021-08-181-0/+16
| | | | | | | | | | | commit a6862e6708c15995bc10614b2ef34ca35b4b9078 upstream. Turn some comments into lockdep asserts. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ceph: reduce contention in ceph_check_delayed_caps()Luis Henriques2021-08-183-11/+33
| | | | | | | | | | | | | | | | | | | | | | | | commit bf2ba432213fade50dd39f2e348085b758c0726e upstream. Function ceph_check_delayed_caps() is called from the mdsc->delayed_work workqueue and it can be kept looping for quite some time if caps keep being added back to the mdsc->cap_delay_list. This may result in the watchdog tainting the kernel with the softlockup flag. This patch breaks this loop if the caps have been recently (i.e. during the loop execution). Any new caps added to the list will be handled in the next run. Also, allow schedule_delayed() callers to explicitly set the delay value instead of defaulting to 5s, so we can ensure that it runs soon afterward if it looks like there is more work. Cc: stable@vger.kernel.org URL: https://tracker.ceph.com/issues/46284 Signed-off-by: Luis Henriques <lhenriques@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ceph: remove bogus checks and WARN_ONs from ceph_set_page_dirtyJeff Layton2021-07-201-9/+1
| | | | | | | | | | | | | | | | | [ Upstream commit 22d41cdcd3cfd467a4af074165357fcbea1c37f5 ] The checks for page->mapping are odd, as set_page_dirty is an address_space operation, and I don't see where it would be called on a non-pagecache page. The warning about the page lock also seems bogus. The comment over set_page_dirty() says that it can be called without the page lock in some rare cases. I don't think we want to warn if that's the case. Reported-by: Matthew Wilcox <willy@infradead.org> Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* ceph: fix fscache invalidationJeff Layton2021-05-222-0/+2
| | | | | | | | | | | [ Upstream commit 10a7052c7868bc7bc72d947f5aac6f768928db87 ] Ensure that we invalidate the fscache whenever we invalidate the pagecache. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* ceph: fix inode leak on getattr error in __fh_to_dentryJeff Layton2021-05-191-1/+3
| | | | | | | | | | [ Upstream commit 1775c7ddacfcea29051c67409087578f8f4d751b ] Fixes: 878dabb64117 ("ceph: don't return -ESTALE if there's still an open file") Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* ceph: fix race in concurrent __ceph_remove_cap invocationsLuis Henriques2020-12-301-2/+9
| | | | | | | | | | | | | | | | | | | | | commit e5cafce3ad0f8652d6849314d951459c2bff7233 upstream. A NULL pointer dereference may occur in __ceph_remove_cap with some of the callbacks used in ceph_iterate_session_caps, namely trim_caps_cb and remove_session_caps_cb. Those callers hold the session->s_mutex, so they are prevented from concurrent execution, but ceph_evict_inode does not. Since the callers of this function hold the i_ceph_lock, the fix is simply a matter of returning immediately if caps->ci is NULL. Cc: stable@vger.kernel.org URL: https://tracker.ceph.com/issues/43272 Suggested-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Luis Henriques <lhenriques@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ceph: promote to unsigned long long before shiftingMatthew Wilcox (Oracle)2020-11-051-1/+1
| | | | | | | | | | | | | | commit c403c3a2fbe24d4ed33e10cabad048583ebd4edf upstream. On 32-bit systems, this shift will overflow for files larger than 4GB. Cc: stable@vger.kernel.org Fixes: 61f68816211e ("ceph: check caps in filemap_fault and page_mkwrite") Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ceph: fix potential race in ceph_check_capsJeff Layton2020-10-011-1/+13
| | | | | | | | | | | | | | [ Upstream commit dc3da0461cc4b76f2d0c5b12247fcb3b520edbbf ] Nothing ensures that session will still be valid by the time we dereference the pointer. Take and put a reference. In principle, we should always be able to get a reference here, but throw a warning if that's ever not the case. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* ceph: ensure we have a new cap before continuing in fill_inodeJeff Layton2020-10-011-1/+4
| | | | | | | | | | | [ Upstream commit 9a6bed4fe0c8bf57785cbc4db9f86086cb9b193d ] If the caller passes in a NULL cap_reservation, and we can't allocate one then ensure that we fail gracefully. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* ceph: don't allow setlease on cephfsJeff Layton2020-09-091-0/+1
| | | | | | | | | | | | | | | | [ Upstream commit 496ceaf12432b3d136dcdec48424312e71359ea7 ] Leases don't currently work correctly on kcephfs, as they are not broken when caps are revoked. They could eventually be implemented similarly to how we did them in libcephfs, but for now don't allow them. [ idryomov: no need for simple_nosetlease() in ceph_dir_fops and ceph_snapdir_fops ] Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* ceph: do not access the kiocb after aio requestsXiubo Li2020-09-031-2/+3
| | | | | | | | | | | | | | | | | [ Upstream commit d1d9655052606fd9078e896668ec90191372d513 ] In aio case, if the completion comes very fast just before the ceph_read_iter() returns to fs/aio.c, the kiocb will be freed in the completion callback, then if ceph_read_iter() access again we will potentially hit the use-after-free bug. [ jlayton: initialize direct_lock early, and use it everywhere ] URL: https://tracker.ceph.com/issues/45649 Signed-off-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* ceph: fix potential mdsc use-after-free crashXiubo Li2020-09-031-1/+13
| | | | | | | | | | | | | | | | | | | | | [ Upstream commit fa9967734227b44acb1b6918033f9122dc7825b9 ] Make sure the delayed work stopped before releasing the resources. cancel_delayed_work_sync() will only guarantee that the work finishes executing if the work is already in the ->worklist. That means after the cancel_delayed_work_sync() returns, it will leave the work requeued if it was rearmed at the end. That can lead to a use after free once the work struct is freed. Fix it by flushing the delayed work instead of trying to cancel it, and ensure that the work doesn't rearm if the mdsc is stopping. URL: https://tracker.ceph.com/issues/46293 Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* ceph: fix use-after-free for fsc->mdscXiubo Li2020-08-261-1/+2
| | | | | | | | | | | | [ Upstream commit a7caa88f8b72c136f9a401f498471b8a8e35370d ] If the ceph_mdsc_init() fails, it will free the mdsc already. Reported-by: syzbot+b57f46d8d6ea51960b8c@syzkaller.appspotmail.com Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* ceph: handle zero-length feature mask in session messagesJeff Layton2020-08-211-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | commit 02e37571f9e79022498fd0525c073b07e9d9ac69 upstream. Most session messages contain a feature mask, but the MDS will routinely send a REJECT message with one that is zero-length. Commit 0fa8263367db ("ceph: fix endianness bug when handling MDS session feature bits") fixed the decoding of the feature mask, but failed to account for the MDS sending a zero-length feature mask. This causes REJECT message decoding to fail. Skip trying to decode a feature mask if the word count is zero. Cc: stable@vger.kernel.org URL: https://tracker.ceph.com/issues/46823 Fixes: 0fa8263367db ("ceph: fix endianness bug when handling MDS session feature bits") Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Tested-by: Patrick Donnelly <pdonnell@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ceph: set sec_context xattr on symlink creationJeff Layton2020-08-211-0/+4
| | | | | | | | | | | | | | | | | commit b748fc7a8763a5b3f8149f12c45711cd73ef8176 upstream. Symlink inodes should have the security context set in their xattrs on creation. We already set the context on creation, but we don't attach the pagelist. The effect is that symlink inodes don't get an SELinux context set on them at creation, so they end up unlabeled instead of inheriting the proper context. Make it do so. Cc: stable@vger.kernel.org Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ceph: don't return -ESTALE if there's still an open fileLuis Henriques2020-06-241-1/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 878dabb64117406abd40977b87544d05bb3031fc ] Similarly to commit 03f219041fdb ("ceph: check i_nlink while converting a file handle to dentry"), this fixes another corner case with name_to_handle_at/open_by_handle_at. The issue has been detected by xfstest generic/467, when doing: - name_to_handle_at("/cephfs/myfile") - open("/cephfs/myfile") - unlink("/cephfs/myfile") - sync; sync; - drop caches - open_by_handle_at() The call to open_by_handle_at should not fail because the file hasn't been deleted yet (only unlinked) and we do have a valid handle to it. -ESTALE shall be returned only if i_nlink is 0 *and* i_count is 1. This patch also makes sure we have LINK caps before checking i_nlink. Signed-off-by: Luis Henriques <lhenriques@suse.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Acked-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* ceph: flush release queue when handling caps for unknown inodeJeff Layton2020-06-031-1/+1
| | | | | | | | | | | | | | | | | | | | | [ Upstream commit fb33c114d3ed5bdac230716f5b0a93b56b92a90d ] It's possible for the VFS to completely forget about an inode, but for it to still be sitting on the cap release queue. If the MDS sends the client a cap message for such an inode, it just ignores it today, which can lead to a stall of up to 5s until the cap release queue is flushed. If we get a cap message for an inode that can't be located, then go ahead and flush the cap release queue. Cc: stable@vger.kernel.org URL: https://tracker.ceph.com/issues/45532 Fixes: 1e9c2eb6811e ("ceph: delete stale dentry when last reference is dropped") Reported-and-Tested-by: Andrej Filipčič <andrej.filipcic@ijs.si> Suggested-by: Yan, Zheng <zyan@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* ceph: fix double unlock in handle_cap_export()Wu Bo2020-05-271-0/+1
| | | | | | | | | | | | | | [ Upstream commit 4d8e28ff3106b093d98bfd2eceb9b430c70a8758 ] If the ceph_mdsc_open_export_target_session() return fails, it will do a "goto retry", but the session mutex has already been unlocked. Re-lock the mutex in that case to ensure that we don't unlock it twice. Signed-off-by: Wu Bo <wubo40@huawei.com> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* ceph: demote quotarealm lookup warning to a debug messageLuis Henriques2020-05-141-2/+2
| | | | | | | | | | | | | | | | | | | commit 12ae44a40a1be891bdc6463f8c7072b4ede746ef upstream. A misconfigured cephx can easily result in having the kernel client flooding the logs with: ceph: Can't lookup inode 1 (err: -13) Change this message to debug level. Cc: stable@vger.kernel.org URL: https://tracker.ceph.com/issues/44546 Signed-off-by: Luis Henriques <lhenriques@suse.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ceph: fix endianness bug when handling MDS session feature bitsJeff Layton2020-05-141-5/+3
| | | | | | | | | | | | | | | | commit 0fa8263367db9287aa0632f96c1a5f93cc478150 upstream. Eduard reported a problem mounting cephfs on s390 arch. The feature mask sent by the MDS is little-endian, so we need to convert it before storing and testing against it. Cc: stable@vger.kernel.org Reported-and-Tested-by: Eduard Shishkin <edward6@linux.ibm.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ceph: don't skip updating wanted caps when cap is staleYan, Zheng2020-04-291-2/+6
| | | | | | | | | | | | | | | | | | | [ Upstream commit 0aa971b6fd3f92afef6afe24ef78d9bb14471519 ] 1. try_get_cap_refs() fails to get caps and finds that mds_wanted does not include what it wants. It returns -ESTALE. 2. ceph_get_caps() calls ceph_renew_caps(). ceph_renew_caps() finds that inode has cap, so it calls ceph_check_caps(). 3. ceph_check_caps() finds that issued caps (without checking if it's stale) already includes caps wanted by open file, so it skips updating wanted caps. Above events can cause an infinite loop inside ceph_get_caps(). Signed-off-by: "Yan, Zheng" <zyan@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* ceph: return ceph_mdsc_do_request() errors from __get_parent()Qiujun Huang2020-04-291-0/+5
| | | | | | | | | | | | | [ Upstream commit c6d50296032f0b97473eb2e274dc7cc5d0173847 ] Return the error returned by ceph_mdsc_do_request(). Otherwise, r_target_inode ends up being NULL this ends up returning ENOENT regardless of the error. Signed-off-by: Qiujun Huang <hqjagain@gmail.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* ceph: canonicalize server path in placeIlya Dryomov2020-04-132-92/+28
| | | | | | | | | | | | | | | | | | | | | | commit b27a939e8376a3f1ed09b9c33ef44d20f18ec3d0 upstream. syzbot reported that 4fbc0c711b24 ("ceph: remove the extra slashes in the server path") had caused a regression where an allocation could be done under a spinlock -- compare_mount_options() is called by sget_fc() with sb_lock held. We don't really need the supplied server path, so canonicalize it in place and compare it directly. To make this work, the leading slash is kept around and the logic in ceph_real_mount() to skip it is restored. CEPH_MSG_CLIENT_SESSION now reports the same (i.e. canonicalized) path, with the leading slash of course. Fixes: 4fbc0c711b24 ("ceph: remove the extra slashes in the server path") Reported-by: syzbot+98704a51af8e3d9425a9@syzkaller.appspotmail.com Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Luis Henriques <lhenriques@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ceph: remove the extra slashes in the server pathXiubo Li2020-04-131-19/+101
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 4fbc0c711b2464ee1551850b85002faae0b775d5 upstream. It's possible to pass the mount helper a server path that has more than one contiguous slash character. For example: $ mount -t ceph 192.168.195.165:40176:/// /mnt/cephfs/ In the MDS server side the extra slashes of the server path will be treated as snap dir, and then we can get the following debug logs: ceph: mount opening path // ceph: open_root_inode opening '//' ceph: fill_trace 0000000059b8a3bc is_dentry 0 is_target 1 ceph: alloc_inode 00000000dc4ca00b ceph: get_inode created new inode 00000000dc4ca00b 1.ffffffffffffffff ino 1 ceph: get_inode on 1=1.ffffffffffffffff got 00000000dc4ca00b And then when creating any new file or directory under the mount point, we can hit the following BUG_ON in ceph_fill_trace(): BUG_ON(ceph_snap(dir) != dvino.snap); Have the client ignore the extra slashes in the server path when mounting. This will also canonicalize the path, so that identical mounts can be consilidated. 1) "//mydir1///mydir//" 2) "/mydir1/mydir" 3) "/mydir1/mydir/" Regardless of the internal treatment of these paths, the kernel still stores the original string including the leading '/' for presentation to userland. URL: https://tracker.ceph.com/issues/42771 Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Luis Henriques <lhenriques@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ceph: fix memory leak in ceph_cleanup_snapid_map()Luis Henriques2020-04-011-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit c8d6ee01449cd0d2f30410681cccb616a88f50b1 upstream. kmemleak reports the following memory leak: unreferenced object 0xffff88821feac8a0 (size 96): comm "kworker/1:0", pid 17, jiffies 4294896362 (age 20.512s) hex dump (first 32 bytes): a0 c8 ea 1f 82 88 ff ff 00 c9 ea 1f 82 88 ff ff ................ 00 00 00 00 00 00 00 00 00 01 00 00 00 00 ad de ................ backtrace: [<00000000b3ea77fb>] ceph_get_snapid_map+0x75/0x2a0 [<00000000d4060942>] fill_inode+0xb26/0x1010 [<0000000049da6206>] ceph_readdir_prepopulate+0x389/0xc40 [<00000000e2fe2549>] dispatch+0x11ab/0x1521 [<000000007700b894>] ceph_con_workfn+0xf3d/0x3240 [<0000000039138a41>] process_one_work+0x24d/0x590 [<00000000eb751f34>] worker_thread+0x4a/0x3d0 [<000000007e8f0d42>] kthread+0xfb/0x130 [<00000000d49bd1fa>] ret_from_fork+0x3a/0x50 A kfree is missing while looping the 'to_free' list of ceph_snapid_map objects. Cc: stable@vger.kernel.org Fixes: 75c9627efb72 ("ceph: map snapid to anonymous bdev ID") Signed-off-by: Luis Henriques <lhenriques@suse.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ceph: check POOL_FLAG_FULL/NEARFULL in addition to OSDMAP_FULL/NEARFULLIlya Dryomov2020-04-011-3/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 7614209736fbc4927584d4387faade4f31444fce upstream. CEPH_OSDMAP_FULL/NEARFULL aren't set since mimic, so we need to consult per-pool flags as well. Unfortunately the backwards compatibility here is lacking: - the change that deprecated OSDMAP_FULL/NEARFULL went into mimic, but was guarded by require_osd_release >= RELEASE_LUMINOUS - it was subsequently backported to luminous in v12.2.2, but that makes no difference to clients that only check OSDMAP_FULL/NEARFULL because require_osd_release is not client-facing -- it is for OSDs Since all kernels are affected, the best we can do here is just start checking both map flags and pool flags and send that to stable. These checks are best effort, so take osdc->lock and look up pool flags just once. Remove the FIXME, since filesystem quotas are checked above and RADOS quotas are reflected in POOL_FLAG_FULL: when the pool reaches its quota, both POOL_FLAG_FULL and POOL_FLAG_FULL_QUOTA are set. Cc: stable@vger.kernel.org Reported-by: Yanhu Cao <gmayyyha@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Acked-by: Sage Weil <sage@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ceph: do not execute direct write in parallel if O_APPEND is specifiedXiubo Li2020-03-051-6/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | [ Upstream commit 8e4473bb50a1796c9c32b244e5dbc5ee24ead937 ] In O_APPEND & O_DIRECT mode, the data from different writers will be possibly overlapping each other since they take the shared lock. For example, both Writer1 and Writer2 are in O_APPEND and O_DIRECT mode: Writer1 Writer2 shared_lock() shared_lock() getattr(CAP_SIZE) getattr(CAP_SIZE) iocb->ki_pos = EOF iocb->ki_pos = EOF write(data1) write(data2) shared_unlock() shared_unlock() The data2 will overlap the data1 from the same file offset, the old EOF. Switch to exclusive lock instead when O_APPEND is specified. Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* ceph: check availability of mds cluster on mount after wait timeoutXiubo Li2020-02-242-2/+6
| | | | | | | | | | | | | | | | | [ Upstream commit 97820058fb2831a4b203981fa2566ceaaa396103 ] If all the MDS daemons are down for some reason, then the first mount attempt will fail with EIO after the mount request times out. A mount attempt will also fail with EIO if all of the MDS's are laggy. This patch changes the code to return -EHOSTUNREACH in these situations and adds a pr_info error message to help the admin determine the cause. URL: https://tracker.ceph.com/issues/4386 Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
* ceph: hold extra reference to r_parent over life of requestJeff Layton2020-01-291-2/+6
| | | | | | | | | | | | | | | | | | | | | | | | commit 9c1c2b35f1d94de8325344c2777d7ee67492db3b upstream. Currently, we just assume that it will stick around by virtue of the submitter's reference, but later patches will allow the syscall to return early and we can't rely on that reference at that point. While I'm not aware of any reports of it, Xiubo pointed out that this may fix a use-after-free. If the wait for a reply times out or is canceled via signal, and then the reply comes in after the syscall returns, the client can end up trying to access r_parent without a reference. Take an extra reference to the inode when setting r_parent and release it when releasing the request. Cc: stable@vger.kernel.org Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ceph: fix compat_ioctl for ceph_dir_operationsArnd Bergmann2019-12-172-1/+2
| | | | | | | | | | | | | | | | | | | | | commit 18bd6caaef4021803dd0d031dc37c2d001d18a5b upstream. The ceph_ioctl function is used both for files and directories, but only the files support doing that in 32-bit compat mode. On the s390 architecture, there is also a problem with invalid 31-bit pointers that need to be passed through compat_ptr(). Use the new compat_ptr_ioctl() to address both issues. Note: When backporting this patch to stable kernels, "compat_ioctl: add compat_ptr_ioctl()" is needed as well. Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* ceph: increment/decrement dio counter on async requestsJeff Layton2019-11-141-0/+4
| | | | | | | | | | | | | | Ceph can in some cases issue an async DIO request, in which case we can end up calling ceph_end_io_direct before the I/O is actually complete. That may allow buffered operations to proceed while DIO requests are still in flight. Fix this by incrementing the i_dio_count when issuing an async DIO request, and decrement it when tearing down the aio_req. Fixes: 321fe13c9398 ("ceph: add buffered/direct exclusionary locking for reads and writes") Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* ceph: take the inode lock before acquiring cap refsJeff Layton2019-11-141-7/+18
| | | | | | | | | | | | | | | | | | | | Most of the time, we (or the vfs layer) takes the inode_lock and then acquires caps, but ceph_read_iter does the opposite, and that can lead to a deadlock. When there are multiple clients treading over the same data, we can end up in a situation where a reader takes caps and then tries to acquire the inode_lock. Another task holds the inode_lock and issues a request to the MDS which needs to revoke the caps, but that can't happen until the inode_lock is unwedged. Fix this by having ceph_read_iter take the inode_lock earlier, before attempting to acquire caps. Fixes: 321fe13c9398 ("ceph: add buffered/direct exclusionary locking for reads and writes") Link: https://tracker.ceph.com/issues/36348 Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* ceph: return -EINVAL if given fsc mount option on kernel w/o supportJeff Layton2019-11-071-1/+10
| | | | | | | | | | | If someone requests fscache on the mount, and the kernel doesn't support it, it should fail the mount. [ Drop ceph prefix -- it's provided by pr_err. ] Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* ceph: don't allow copy_file_range when stripe_count != 1Luis Henriques2019-11-051-2/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | copy_file_range tries to use the OSD 'copy-from' operation, which simply performs a full object copy. Unfortunately, the implementation of this system call assumes that stripe_count is always set to 1 and doesn't take into account that the data may be striped across an object set. If the file layout has stripe_count different from 1, then the destination file data will be corrupted. For example: Consider a 8 MiB file with 4 MiB object size, stripe_count of 2 and stripe_size of 2 MiB; the first half of the file will be filled with 'A's and the second half will be filled with 'B's: 0 4M 8M Obj1 Obj2 +------+------+ +----+ +----+ file: | AAAA | BBBB | | AA | | AA | +------+------+ |----| |----| | BB | | BB | +----+ +----+ If we copy_file_range this file into a new file (which needs to have the same file layout!), then it will start by copying the object starting at file offset 0 (Obj1). And then it will copy the object starting at file offset 4M -- which is Obj1 again. Unfortunately, the solution for this is to not allow remote object copies to be performed when the file layout stripe_count is not 1 and simply fallback to the default (VFS) copy_file_range implementation. Cc: stable@vger.kernel.org Signed-off-by: Luis Henriques <lhenriques@suse.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* ceph: don't try to handle hashed dentries in non-O_CREAT atomic_openJeff Layton2019-11-051-0/+3
| | | | | | | | | | | | | | | | | | | | If ceph_atomic_open is handed a !d_in_lookup dentry, then that means that it already passed d_revalidate so we *know* that it's negative (or at least was very recently). Just return -ENOENT in that case. This also addresses a subtle bug in dentry handling. Non-O_CREAT opens call atomic_open with the parent's i_rwsem shared, but calling d_splice_alias on a hashed dentry requires the exclusive lock. If ceph_atomic_open receives a hashed, negative dentry on a non-O_CREAT open, and another client were to race in and create the file before we issue our OPEN, ceph_fill_trace could end up calling d_splice_alias on the dentry with the new inode with insufficient locks. Cc: stable@vger.kernel.org Reported-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* ceph: add missing check in d_revalidate snapdir handlingAl Viro2019-10-291-0/+1
| | | | | | | | | We should not play with dcache without parent locked... Cc: stable@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* ceph: fix RCU case handling in ceph_d_revalidate()Al Viro2019-10-291-7/+8
| | | | | | | | | | | | | | For RCU case ->d_revalidate() is called with rcu_read_lock() and without pinning the dentry passed to it. Which means that it can't rely upon ->d_inode remaining stable; that's the reason for d_inode_rcu(), actually. Make sure we don't reload ->d_inode there. Cc: stable@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* ceph: fix use-after-free in __ceph_remove_cap()Luis Henriques2019-10-291-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | KASAN reports a use-after-free when running xfstest generic/531, with the following trace: [ 293.903362] kasan_report+0xe/0x20 [ 293.903365] rb_erase+0x1f/0x790 [ 293.903370] __ceph_remove_cap+0x201/0x370 [ 293.903375] __ceph_remove_caps+0x4b/0x70 [ 293.903380] ceph_evict_inode+0x4e/0x360 [ 293.903386] evict+0x169/0x290 [ 293.903390] __dentry_kill+0x16f/0x250 [ 293.903394] dput+0x1c6/0x440 [ 293.903398] __fput+0x184/0x330 [ 293.903404] task_work_run+0xb9/0xe0 [ 293.903410] exit_to_usermode_loop+0xd3/0xe0 [ 293.903413] do_syscall_64+0x1a0/0x1c0 [ 293.903417] entry_SYSCALL_64_after_hwframe+0x44/0xa9 This happens because __ceph_remove_cap() may queue a cap release (__ceph_queue_cap_release) which can be scheduled before that cap is removed from the inode list with rb_erase(&cap->ci_node, &ci->i_caps); And, when this finally happens, the use-after-free will occur. This can be fixed by removing the cap from the inode list before being removed from the session list, and thus eliminating the risk of an UAF. Cc: stable@vger.kernel.org Signed-off-by: Luis Henriques <lhenriques@suse.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* ceph: just skip unrecognized info in ceph_reply_info_extraJeff Layton2019-10-151-10/+11
| | | | | | | | | | | | | In the future, we're going to want to extend the ceph_reply_info_extra for create replies. Currently though, the kernel code doesn't accept an extra blob that is larger than the expected data. Change the code to skip over any unrecognized fields at the end of the extra blob, rather than returning -EIO. Cc: stable@vger.kernel.org Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* Merge tag 'ceph-for-5.4-rc1' of git://github.com/ceph/ceph-clientLinus Torvalds2019-09-2516-335/+596
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull ceph updates from Ilya Dryomov: "The highlights are: - automatic recovery of a blacklisted filesystem session (Zheng Yan). This is disabled by default and can be enabled by mounting with the new "recover_session=clean" option. - serialize buffered reads and O_DIRECT writes (Jeff Layton). Care is taken to avoid serializing O_DIRECT reads and writes with each other, this is based on the exclusion scheme from NFS. - handle large osdmaps better in the face of fragmented memory (myself) - don't limit what security.* xattrs can be get or set (Jeff Layton). We were overly restrictive here, unnecessarily preventing things like file capability sets stored in security.capability from working. - allow copy_file_range() within the same inode and across different filesystems within the same cluster (Luis Henriques)" * tag 'ceph-for-5.4-rc1' of git://github.com/ceph/ceph-client: (41 commits) ceph: call ceph_mdsc_destroy from destroy_fs_client libceph: use ceph_kvmalloc() for osdmap arrays libceph: avoid a __vmalloc() deadlock in ceph_kvmalloc() ceph: allow object copies across different filesystems in the same cluster ceph: include ceph_debug.h in cache.c ceph: move static keyword to the front of declarations rbd: pull rbd_img_request_create() dout out into the callers ceph: reconnect connection if session hang in opening state libceph: drop unused con parameter of calc_target() ceph: use release_pages() directly rbd: fix response length parameter for encoded strings ceph: allow arbitrary security.* xattrs ceph: only set CEPH_I_SEC_INITED if we got a MAC label ceph: turn ceph_security_invalidate_secctx into static inline ceph: add buffered/direct exclusionary locking for reads and writes libceph: handle OSD op ceph_pagelist_append() errors ceph: don't return a value from void function ceph: don't freeze during write page faults ceph: update the mtime when truncating up ceph: fix indentation in __get_snap_name() ...
| * ceph: call ceph_mdsc_destroy from destroy_fs_clientJeff Layton2019-09-161-4/+1
| | | | | | | | | | | | | | They're always called in succession. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * ceph: allow object copies across different filesystems in the same clusterLuis Henriques2019-09-161-4/+13
| | | | | | | | | | | | | | | | | | | | | | OSDs are able to perform object copies across different pools. Thus, there's no need to prevent copy_file_range from doing remote copies if the source and destination superblocks are different. Only return -EXDEV if they have different fsid (the cluster ID). Signed-off-by: Luis Henriques <lhenriques@suse.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * ceph: include ceph_debug.h in cache.cIlya Dryomov2019-09-161-0/+2
| | | | | | | | | | | | Any file that uses dout() should include ceph_debug.h at the top. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * ceph: move static keyword to the front of declarationsKrzysztof Wilczynski2019-09-161-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Move the static keyword to the front of declarations of snap_handle_length, handle_length and connected_handle_length, and resolve the following compiler warnings that can be seen when building with warnings enabled (W=1): fs/ceph/export.c:38:2: warning: ‘static’ is not at beginning of declaration [-Wold-style-declaration] fs/ceph/export.c:88:2: warning: ‘static’ is not at beginning of declaration [-Wold-style-declaration] fs/ceph/export.c:90:2: warning: ‘static’ is not at beginning of declaration [-Wold-style-declaration] Signed-off-by: Krzysztof Wilczynski <kw@linux.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * ceph: reconnect connection if session hang in opening stateErqi Chen2019-09-161-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If client mds session is evicted in CEPH_MDS_SESSION_OPENING state, mds won't send session msg to client, and delayed_work skip CEPH_MDS_SESSION_OPENING state session, the session hang forever. Allow ceph_con_keepalive to reconnect a session in OPENING to avoid session hang. Also, ensure that we skip sessions in RESTARTING and REJECTED states since those states can't be resurrected by issuing a keepalive. Link: https://tracker.ceph.com/issues/41551 Signed-off-by: Erqi Chen chenerqi@gmail.com Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>