summaryrefslogtreecommitdiffstats
path: root/fs/ceph/mds_client.c
Commit message (Collapse)AuthorAgeFilesLines
* libceph: msg signing callouts don't need con argumentIlya Dryomov2015-11-021-6/+8
| | | | | | | | | | We can use msg->con instead - at the point we sign an outgoing message or check the signature on the incoming one, msg->con is always set. We wouldn't know how to sign a message without an associated session (i.e. msg->con == NULL) and being able to sign a message using an explicitly provided authorizer is of no use. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* ceph: make fsync() wait unsafe requests that created/modified inodeYan, Zheng2015-11-021-0/+14
| | | | | | | | If we get a unsafe reply for request that created/modified inode, add the unsafe request to a list in the newly created/modified inode. So we can make fsync() wait these unsafe requests. Signed-off-by: Yan, Zheng <zyan@redhat.com>
* ceph: add request to i_unsafe_dirops when getting unsafe replyYan, Zheng2015-11-021-7/+11
| | | | | | | | | Previously we add request to i_unsafe_dirops when registering request. So ceph_fsync() also waits for imcomplete requests. This is unnecessary, ceph_fsync() only needs to wait unsafe requests. Signed-off-by: Yan, Zheng <zyan@redhat.com>
* ceph: don't invalidate page cache when inode is no longer usedYan, Zheng2015-11-021-1/+8
| | | | | | | | ceph_check_caps() invalidate page cache when inode is not used by any open file. This behaviour is not friendly for workload that repeatly read files. Signed-off-by: Yan, Zheng <zyan@redhat.com>
* ceph: fix message length computationArnd Bergmann2015-11-021-1/+1
| | | | | | | | | | | | | | create_request_message() computes the maximum length of a message, but uses the wrong type for the time stamp: sizeof(struct timespec) may be 8 or 16 depending on the architecture, while sizeof(struct ceph_timespec) is always 8, and that is what gets put into the message. Found while auditing the uses of timespec for y2038 problems. Fixes: b8e69066d8af ("ceph: include time stamp in every MDS request") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Yan, Zheng <zyan@redhat.com>
* ceph: cleanup use of ceph_msg_getJianpeng Ma2015-09-081-2/+1
| | | | | Signed-off-by: Jianpeng Ma <jianpeng.ma@intel.com> Signed-off-by: Yan, Zheng <zyan@redhat.com>
* ceph: remove redundant test of head->safe and silence static analysis warningsBrad Hubbard2015-09-081-1/+1
| | | | | Signed-off-by: Brad Hubbard <bhubbard@redhat.com> Signed-off-by: Yan, Zheng <zyan@redhat.com>
* ceph: EIO all operations after forced umountYan, Zheng2015-09-081-11/+43
| | | | | | | | | This patch makes try_get_cap_refs() and __do_request() check if the file system was forced umount, and return -EIO if it was. This patch also adds a helper function to drops dirty caps and wakes up blocking operation. Signed-off-by: Yan, Zheng <zyan@redhat.com>
* ceph: switch some GFP_NOFS memory allocation to GFP_KERNELYan, Zheng2015-06-251-1/+2
| | | | | | | | GFP_NOFS memory allocation is required for page writeback path. But there is no need to use GFP_NOFS in syscall path and readpage path Signed-off-by: Yan, Zheng <zyan@redhat.com>
* ceph: pre-allocate data structure that tracks caps flushingYan, Zheng2015-06-251-1/+5
| | | | Signed-off-by: Yan, Zheng <zyan@redhat.com>
* ceph: re-send flushing caps (which are revoked) in reconnect stageYan, Zheng2015-06-251-0/+3
| | | | | | | | if flushing caps were revoked, we should re-send the cap flush in client reconnect stage. This guarantees that MDS processes the cap flush message before issuing the flushing caps to other client. Signed-off-by: Yan, Zheng <zyan@redhat.com>
* ceph: track pending caps flushing globallyYan, Zheng2015-06-251-45/+48
| | | | | | | | | | So we know TID of the oldest pending caps flushing. Later patch will send this information to MDS, so that MDS can trim its completed caps flush list. Tracking pending caps flushing globally also simplifies syncfs code. Signed-off-by: Yan, Zheng <zyan@redhat.com>
* ceph: track pending caps flushing accuratelyYan, Zheng2015-06-251-0/+20
| | | | | | | | | | | | | | | Previously we do not trace accurate TID for flushing caps. when MDS failovers, we have no choice but to re-send all flushing caps with a new TID. This can cause problem because MDS can has already flushed some caps and has issued the same caps to other client. The re-sent cap flush has a new TID, which makes MDS unable to detect if it has already processed the cap flush. This patch adds code to track pending caps flushing accurately. When re-sending cap flush is needed, we use its original flush TID. Signed-off-by: Yan, Zheng <zyan@redhat.com>
* ceph: ratelimit warn messages for MDS closes sessionYan, Zheng2015-06-251-3/+7
| | | | Signed-off-by: Yan, Zheng <zyan@redhat.com>
* ceph: simplify two mount_timeout sitesIlya Dryomov2015-06-251-8/+10
| | | | | | | | No need to bifurcate wait now that we've got ceph_timeout_jiffies(). Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Alex Elder <elder@linaro.org> Reviewed-by: Yan, Zheng <zyan@redhat.com>
* libceph: store timeouts in jiffies, verify user inputIlya Dryomov2015-06-251-6/+6
| | | | | | | | | | | | | | | | | | | | | | There are currently three libceph-level timeouts that the user can specify on mount: mount_timeout, osd_idle_ttl and osdkeepalive. All of these are in seconds and no checking is done on user input: negative values are accepted, we multiply them all by HZ which may or may not overflow, arbitrarily large jiffies then get added together, etc. There is also a bug in the way mount_timeout=0 is handled. It's supposed to mean "infinite timeout", but that's not how wait.h APIs treat it and so __ceph_open_session() for example will busy loop without much chance of being interrupted if none of ceph-mons are there. Fix all this by verifying user input, storing timeouts capped by msecs_to_jiffies() in jiffies and using the new ceph_timeout_jiffies() helper for all user-specified waits to handle infinite timeouts correctly. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Alex Elder <elder@linaro.org>
* ceph: exclude setfilelock requests when calculating oldest tidYan, Zheng2015-06-251-7/+23
| | | | | | | setfilelock requests can block for a long time, which can prevent client from advancing its oldest tid. Signed-off-by: Yan, Zheng <zyan@redhat.com>
* ceph: don't pre-allocate space for cap release messagesYan, Zheng2015-06-251-137/+86
| | | | | | | | | | | Previously we pre-allocate cap release messages for each caps. This wastes lots of memory when there are large amount of caps. This patch make the code not pre-allocate the cap release messages. Instead, we add the corresponding ceph_cap struct to a list when releasing a cap. Later when flush cap releases is needed, we allocate the cap release messages dynamically. Signed-off-by: Yan, Zheng <zyan@redhat.com>
* ceph: make sure syncfs flushes all cap snapsYan, Zheng2015-06-251-24/+62
| | | | Signed-off-by: Yan, Zheng <zyan@redhat.com>
* ceph: don't trim auth cap when there are cap snapsYan, Zheng2015-06-251-1/+2
| | | | Signed-off-by: Yan, Zheng <zyan@redhat.com>
* ceph: check OSD caps before read/writeYan, Zheng2015-06-251-0/+4
| | | | Signed-off-by: Yan, Zheng <zyan@redhat.com>
* Merge branch 'for-linus' of ↵Linus Torvalds2015-04-261-12/+12
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull fourth vfs update from Al Viro: "d_inode() annotations from David Howells (sat in for-next since before the beginning of merge window) + four assorted fixes" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: RCU pathwalk breakage when running into a symlink overmounting something fix I_DIO_WAKEUP definition direct-io: only inc/dec inode->i_dio_count for file systems fs/9p: fix readdir() VFS: assorted d_backing_inode() annotations VFS: fs/inode.c helpers: d_inode() annotations VFS: fs/cachefiles: d_backing_inode() annotations VFS: fs library helpers: d_inode() annotations VFS: assorted weird filesystems: d_inode() annotations VFS: normal filesystems (and lustre): d_inode() annotations VFS: security/: d_inode() annotations VFS: security/: d_backing_inode() annotations VFS: net/: d_inode() annotations VFS: net/unix: d_backing_inode() annotations VFS: kernel/: d_inode() annotations VFS: audit: d_backing_inode() annotations VFS: Fix up some ->d_inode accesses in the chelsio driver VFS: Cachefiles should perform fs modifications on the top layer only VFS: AF_UNIX sockets should call mknod on the top layer only
| * VFS: normal filesystems (and lustre): d_inode() annotationsDavid Howells2015-04-151-12/+12
| | | | | | | | | | | | | | that's the bulk of filesystem drivers dealing with inodes of their own Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | ceph: fix null pointer dereference in send_mds_reconnect()Yan, Zheng2015-04-221-1/+2
| | | | | | | | | | | | sb->s_root can be null when umounting Signed-off-by: Yan, Zheng <zyan@redhat.com>
* | ceph: cleanup unsafe requests when reconnecting is deniedYan, Zheng2015-04-201-0/+28
| | | | | | | | Signed-off-by: Yan, Zheng <zyan@redhat.com>
* | ceph: don't zero i_wrbuffer_ref when reconnecting is deniedYan, Zheng2015-04-201-7/+0
| | | | | | | | | | | | | | | | remove_session_caps_cb() does not truncate dirty data in page cache, but zeros i_wrbuffer_ref/i_wrbuffer_ref_head. This will result negtive i_wrbuffer_ref/i_wrbuffer_ref_head Signed-off-by: Yan, Zheng <zyan@redhat.com>
* | ceph: don't mark dirty caps when there is no auth capYan, Zheng2015-04-201-1/+1
| | | | | | | | | | | | | | No i_auth_cap means reconnecting to MDS was denied. So don't add new dirty caps. Signed-off-by: Yan, Zheng <zyan@redhat.com>
* | ceph: use msecs_to_jiffies for time conversionNicholas Mc Guire2015-04-201-1/+1
| | | | | | | | | | | | | | | | This is only an API consolidation and should make things more readable it replaces var * HZ / 1000 by msecs_to_jiffies(var). Signed-off-by: Nicholas Mc Guire <hofrat@osadl.org> Signed-off-by: Yan, Zheng <zyan@redhat.com>
* | ceph: drop cap releases in requests composed before cap reconnectYan, Zheng2015-04-201-6/+13
|/ | | | | | | | | | These cap releases are stale because MDS will re-establish client caps according to the cap reconnect messages. Note: MDS can detect stale cap messages, so these stale cap releases are harmless even we don't drop them. Signed-off-by: Yan, Zheng <zyan@redhat.com>
* Merge branch 'for-linus' of ↵Linus Torvalds2015-02-191-34/+93
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull Ceph changes from Sage Weil: "On the RBD side, there is a conversion to blk-mq from Christoph, several long-standing bug fixes from Ilya, and some cleanup from Rickard Strandqvist. On the CephFS side there is a long list of fixes from Zheng, including improved session handling, a few IO path fixes, some dcache management correctness fixes, and several blocking while !TASK_RUNNING fixes. The core code gets a few cleanups and Chaitanya has added support for TCP_NODELAY (which has been used on the server side for ages but we somehow missed on the kernel client). There is also an update to MAINTAINERS to fix up some email addresses and reflect that Ilya and Zheng are doing most of the maintenance for RBD and CephFS these days. Do not be surprised to see a pull request come from one of them in the future if I am unavailable for some reason" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (27 commits) MAINTAINERS: update Ceph and RBD maintainers libceph: kfree() in put_osd() shouldn't depend on authorizer libceph: fix double __remove_osd() problem rbd: convert to blk-mq ceph: return error for traceless reply race ceph: fix dentry leaks ceph: re-send requests when MDS enters reconnecting stage ceph: show nocephx_require_signatures and notcp_nodelay options libceph: tcp_nodelay support rbd: do not treat standalone as flatten ceph: fix atomic_open snapdir ceph: properly mark empty directory as complete client: include kernel version in client metadata ceph: provide seperate {inode,file}_operations for snapdir ceph: fix request time stamp encoding ceph: fix reading inline data when i_size > PAGE_SIZE ceph: avoid block operation when !TASK_RUNNING (ceph_mdsc_close_sessions) ceph: avoid block operation when !TASK_RUNNING (ceph_get_caps) ceph: avoid block operation when !TASK_RUNNING (ceph_mdsc_sync) rbd: fix error paths in rbd_dev_refresh() ...
| * ceph: re-send requests when MDS enters reconnecting stageYan, Zheng2015-02-191-3/+26
| | | | | | | | | | | | | | | | | | So that MDS can check if any request is already completed and process completed requests in clientreplay stage. When completed requests are processed in clientreplay stage, MDS can avoid sending traceless replies. Signed-off-by: Yan, Zheng <zyan@redhat.com>
| * client: include kernel version in client metadataYan, Zheng2015-02-191-1/+2
| | | | | | | | Signed-off-by: Yan, Zheng <zyan@redhat.com>
| * ceph: fix request time stamp encodingYan, Zheng2015-02-191-2/+10
| | | | | | | | | | | | | | | | | | struct timespec uses 'long' to present second and nanosecond. 'long' is 64 bits on 64bits machine. ceph MDS expects time stamp to be encoded as struct ceph_timespec, which uses 'u32' to present second and nanosecond. Signed-off-by: Yan, Zheng <zyan@redhat.com>
| * ceph: avoid block operation when !TASK_RUNNING (ceph_mdsc_close_sessions)Yan, Zheng2015-02-191-9/+4
| | | | | | | | | | | | | | use an atomic variable to track number of sessions, this can avoid block operation inside wait loops. Signed-off-by: Yan, Zheng <zyan@redhat.com>
| * ceph: avoid block operation when !TASK_RUNNING (ceph_mdsc_sync)Yan, Zheng2015-02-191-17/+34
| | | | | | | | | | | | | | check_cap_flush() calls mutex_lock(), which may block. So we can't use it as condition check function for wait_event(); Signed-off-by: Yan, Zheng <zyan@redhat.com>
| * ceph: improve reference tracking for snaprealmYan, Zheng2015-02-191-2/+7
| | | | | | | | | | | | | | | | | | | | | | | | When snaprealm is created, its initial reference count is zero. But in some rare cases, the newly created snaprealm is not referenced by anyone. This causes snaprealm with zero reference count not freed. The fix is set reference count of newly snaprealm to 1. The reference is return the function who requests to create the snaprealm. When the function finishes its job, it releases the reference. Signed-off-by: Yan, Zheng <zyan@redhat.com>
| * ceph: handle SESSION_FORCE_RO messageYan, Zheng2015-02-191-0/+10
| | | | | | | | | | | | mark session as readonly and wake up all cap waiters. Signed-off-by: Yan, Zheng <zyan@redhat.com>
* | ceph: move spinlocking into ceph_encode_locks_to_buffer and ceph_count_locksJeff Layton2015-01-161-4/+0
|/ | | | | | | | | There is only a single call site for each of these functions, and the caller takes the i_lock prior to calling them and drops it just afterward. Move the spinlocking into the functions instead. Signed-off-by: Jeff Layton <jlayton@primarydata.com> Acked-by: Christoph Hellwig <hch@lst.de>
* ceph: parse inline data in MClientReply and MClientCapsYan, Zheng2014-12-171-0/+10
| | | | Signed-off-by: Yan, Zheng <zyan@redhat.com>
* ceph: message versioning fixesJohn Spray2014-12-171-2/+5
| | | | | | | | | | | | | There were two places we were assigning version in host byte order instead of network byte order. Also in MSG_CLIENT_SESSION we weren't setting compat_version in the header to reflect continued compatability with older MDSs. Fixes: http://tracker.ceph.com/issues/9945 Signed-off-by: John Spray <john.spray@redhat.com> Reviewed-by: Sage Weil <sage@redhat.com>
* libceph: message signature supportYan, Zheng2014-12-171-0/+16
| | | | Signed-off-by: Yan, Zheng <zyan@redhat.com>
* ceph, rbd: delete unnecessary checks before two function callsSF Markus Elfring2014-12-171-4/+2
| | | | | | | | | | | | The functions ceph_put_snap_context() and iput() test whether their argument is NULL and then return immediately. Thus the test around the call is not needed. This issue was detected by using the Coccinelle software. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> [idryomov@redhat.com: squashed rbd.c hunk, changelog] Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
* ceph: fix file lock interruptionYan, Zheng2014-12-171-0/+2
| | | | | | | | | | | | | When a lock operation is interrupted, current code sends a unlock request to MDS to undo the lock operation. This method does not work as expected because the unlock request can drop locks that have already been acquired. The fix is use the newly introduced CEPH_LOCK_FCNTL_INTR/CEPH_LOCK_FLOCK_INTR requests to interrupt blocked file lock request. These requests do not drop locks that have alread been acquired, they only interrupt blocked file lock request. Signed-off-by: Yan, Zheng <zyan@redhat.com>
* ceph: export ceph_session_state_name functionJohn Spray2014-10-141-7/+7
| | | | | | | ...so that it can be used from the ceph debugfs code when dumping session info. Signed-off-by: John Spray <john.spray@redhat.com>
* ceph: use pagelist to present MDS request dataYan, Zheng2014-10-141-5/+9
| | | | | | | | | | | | | Current code uses page array to present MDS request data. Pages in the array are allocated/freed by caller of ceph_mdsc_do_request(). If request is interrupted, the pages can be freed while they are still being used by the request message. The fix is use pagelist to present MDS request data. Pagelist is reference counted. Signed-off-by: Yan, Zheng <zyan@redhat.com> Reviewed-by: Sage Weil <sage@redhat.com>
* libceph: reference counting pagelistYan, Zheng2014-10-141-1/+0
| | | | | | | this allow pagelist to present data that may be sent multiple times. Signed-off-by: Yan, Zheng <zyan@redhat.com> Reviewed-by: Sage Weil <sage@redhat.com>
* ceph: send client metadata to MDSJohn Spray2014-10-141-1/+70
| | | | | | | | | | Implement version 2 of CEPH_MSG_CLIENT_SESSION syntax, which includes additional client metadata to allow the MDS to report on clients by user-sensible names like hostname. Signed-off-by: John Spray <john.spray@redhat.com> Reviewed-by: Yan, Zheng <zyan@redhat.com>
* ceph: move ceph_find_inode() outside the s_mutexYan, Zheng2014-10-141-3/+4
| | | | | | | | | ceph_find_inode() may wait on freeing inode, using it inside the s_mutex may cause deadlock. (the freeing inode is waiting for OSD read reply, but dispatch thread is blocked by the s_mutex) Signed-off-by: Yan, Zheng <zyan@redhat.com> Reviewed-by: Sage Weil <sage@redhat.com>
* ceph: make sure request isn't in any waiting list when kicking request.Yan, Zheng2014-10-141-0/+1
| | | | | | | we may corrupt waiting list if a request in the waiting list is kicked. Signed-off-by: Yan, Zheng <zyan@redhat.com> Reviewed-by: Sage Weil <sage@redhat.com>
* ceph: protect kick_requests() with mdsc->mutexYan, Zheng2014-10-141-2/+3
| | | | | Signed-off-by: Yan, Zheng <zyan@redhat.com> Reviewed-by: Sage Weil <sage@redhat.com>