summaryrefslogtreecommitdiffstats
path: root/fs/ceph
Commit message (Collapse)AuthorAgeFilesLines
* ceph: disable use of dcache for readdir etc.Sage Weil2011-12-291-26/+3
| | | | | | | | | Ceph attempts to use the dcache to satisfy negative lookups and readdir when the entire directory contents are in cache. Disable this behavior until lingering bugs in this code are shaken out; we'll re-enable these hooks once things are fully stable. Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: add missing spin_unlock at ceph_mdsc_build_path()Yehuda Sadeh2011-12-131-0/+1
| | | | | | one of the paths was missing spin_unlock Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
* ceph: fix SEEK_CUR, SEEK_SET regressionSage Weil2011-12-131-1/+2
| | | | | | | | | | | | Commit 06222e491e663dac939f04b125c9dc52126a75c4 got the if wrong so that it always evaluates as true. This is semantically harmless, but makes SEEK_CUR and SEEK_SET needlessly query the server. Rewrite the if to explicitly enumerate the cases we DO need a valid i_size to make this code less fragile. Reported-by: Roel Kluin <roel.kluin@gmail.com> Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: use i_ceph_lock instead of i_lockSage Weil2011-12-0711-207/+212
| | | | | | | | | | | | | | | | | We have been using i_lock to protect all kinds of data structures in the ceph_inode_info struct, including lists of inodes that we need to iterate over while avoiding races with inode destruction. That requires grabbing a reference to the inode with the list lock protected, but igrab() now takes i_lock to check the inode flags. Changing the list lock ordering would be a painful process. However, using a ceph-specific i_ceph_lock in the ceph inode instead of i_lock is a simple mechanical change and avoids the ordering constraints imposed by igrab(). Reported-by: Amon Ott <a.ott@m-privacy.de> Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: fix rasize reporting by ceph_show_optionsSage Weil2011-12-021-1/+1
| | | | | | | Fix typo. Reported-by: mowang da <whooya.xxl@gmail.com> Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: initialize root dentrySage Weil2011-11-112-3/+5
| | | | | | | | | | | Set up d_fsdata on the root dentry. This fixes a NULL pointer dereference in ceph_d_prune on umount. It also means we can eventually strip out all of the conditional checks on d_fsdata because it is now set unconditionally (prior to setting up the d_ops). Fix the ceph_d_prune debug print while we're here. Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: fix iput race when queueing inode workSage Weil2011-11-051-3/+6
| | | | | | | | | | | | | | | | | | | | | If we queue a work item that calls iput(), make sure we ihold() before attempting to queue work. Otherwise our queued work might miraculously run before we notice the queue_work() succeeded and call ihold(), allowing the inode to be destroyed. That is, instead of if (queue_work(...)) ihold(); we need to do ihold(); if (!queue_work(...)) iput(); Reported-by: Amon Ott <a.ott@m-privacy.de> Signed-off-by: Sage Weil <sage@newdream.net>
* ceph/super.c: quiet sparse noiseH Hartley Sweeten2011-11-051-2/+2
| | | | | | | | | | | | Quiet the sparse noise: warning: symbol 'create_fs_client' was not declared. Should it be static? warning: symbol 'destroy_fs_client' was not declared. Should it be static? Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com> Cc: Sage Weil <sage@newdream.net> ceph-devel@vger.kernel.org Signed-off-by: Sage Weil <sage@newdream.net>
* ceph/mds_client.c: quiet sparse noiseH Hartley Sweeten2011-11-051-2/+2
| | | | | | | | | | | | | Quiet the following sparse noise: warning: symbol 'get_nonsnap_parent' was not declared. Should it be static? warning: symbol 'done_closing_sessions' was not declared. Should it be static? Local functions don't need external visability. Make them static. Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com> Cc: Sage Weil <sage@newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: use new D_COMPLETE dentry flagSage Weil2011-11-055-23/+68
| | | | | | | | We used to use a flag on the directory inode to track whether the dcache contents for a directory were a complete cached copy. Switch to a dentry flag CEPH_D_COMPLETE that is safely updated by ->d_prune(). Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: clear parent D_COMPLETE flag when on dentry pruneSage Weil2011-11-032-0/+41
| | | | | | | | When the VFS prunes a dentry from the cache, clear the D_COMPLETE flag on the parent dentry. Do this for the live and snapshotted namespaces. Do not bother for the .snap dir contents, since we do not cache that. Signed-off-by: Sage Weil <sage@newdream.net>
* filesystems: add set_nlink()Miklos Szeredi2011-11-022-2/+2
| | | | | | | | | Replace remaining direct i_nlink updates with a new set_nlink() updater function. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Tested-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
* libceph: fix double-free of page vectorSage Weil2011-10-251-1/+0
| | | | | | | ceph_release_page_vector() kfrees the vector; we shouldn't do it here too. Reported-by: Jeff Wu <cpwu@tnsoft.com.cn> Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: fix 32-bit ino numbersAmon Ott2011-10-251-5/+6
| | | | | | Fix 32-bit ino generation to not always be 1. Signed-off-by: Amon Ott <a.ott@m-privacy.de>
* ceph: let the set_layout ioctl set single traitsGreg Farnum2011-10-251-6/+28
| | | | | | | | | | Previously we were validating the passed-in stripe unit, object size, and stripe count against each other (and not testing most other stuff). Instead, make sure that the composed previous layout and new values are valid, and only send the new values to the MDS. This lets users change the pool without setting the whole layout, for instance. Signed-off-by: Greg Farnum <gregory.farnum@dreamhost.com>
* Revert "ceph: don't truncate dirty pages in invalidate work thread"Sage Weil2011-10-251-45/+1
| | | | | | | | | | | | | | | | | | | | This reverts commit c9af9fb68e01eb2c2165e1bc45cfeeed510c64e6. We need to block and truncate all pages in order to reliably invalidate them. Otherwise, we could: - have some uptodate pages in the cache - queue an invalidate - write(2) locks some pages - invalidate_work skips them - write(2) only overwrites part of the page - page now dirty and uptodate -> partial leakage of invalidated data It's not entirely clear why we started skipping locked pages in the first place. I just ran this through fsx and didn't see any problems. Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: replace leading spaces with tabsNoah Watkins2011-10-251-20/+20
| | | | | | | Trivial formatting fix. Signed-off-by: Noah Watkins <noahwatkins@gmail.com> Signed-off-by: Sage Weil <sage@newdream.net>
* libceph: don't complain on msgpool alloc failuresSage Weil2011-10-252-6/+7
| | | | | | | | | | The pool allocation failures are masked by the pool; there is no need to spam the console about them. (That's the whole point of having the pool in the first place.) Mark msg allocations whose failure is safely handled as such. Signed-off-by: Sage Weil <sage@newdream.net>
* libceph: create messenger with clientSage Weil2011-10-251-3/+6
| | | | | | | This simplifies the init/shutdown paths, and makes client->msgr available during the rest of the setup process. Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: document ioctlsSage Weil2011-10-251-1/+54
| | | | | | ...after some prodding by Christoph. Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: implement (optional) max read sizeSage Weil2011-10-251-3/+12
| | | | | | | | The 'rsize' mount option limits the maximum size of an individual read(ahead) operation that is sent off to an OSD. This is distinct from 'rasize', which controls the size of the readahead window. Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: rename rsize -> rasizeSage Weil2011-10-252-6/+16
| | | | | | It controls readahead. Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: make readpages fully asyncSage Weil2011-10-251-70/+115
| | | | | | | | When we get a ->readpages() aop, submit async reads for all page ranges in the provided page list. Lock the pages immediately, so that VFS/MM will block until the reads complete. Signed-off-by: Sage Weil <sage@newdream.net>
* Merge branch 'for-linus' of git://ceph.newdream.net/git/ceph-clientLinus Torvalds2011-09-092-3/+3
|\ | | | | | | | | | | | | | | * 'for-linus' of git://ceph.newdream.net/git/ceph-client: libceph: fix leak of osd structs during shutdown ceph: fix memory leak ceph: fix encoding of ino only (not relative) paths libceph: fix msgpool
| * ceph: fix memory leakNoah Watkins2011-08-221-2/+2
| | | | | | | | | | | | | | | | kfree does not clean up indirect allocations in ceph_fs_client and ceph_options (e.g. snapdir_name). Signed-off-by: Noah Watkins <noahwatkins@gmail.com> Signed-off-by: Sage Weil <sage@newdream.net>
| * ceph: fix encoding of ino only (not relative) pathsSage Weil2011-08-151-1/+1
| | | | | | | | | | | | | | | | A 'path' consists of a starting ino and relative component. Encode even when there is no relative component. This is primarily needed by the NFS reexport code. Signed-off-by: Sage Weil <sage@newdream.net>
* | Merge branch 'for-linus' of ↵Linus Torvalds2011-07-2613-137/+249
|\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (23 commits) ceph: document unlocked d_parent accesses ceph: explicitly reference rename old_dentry parent dir in request ceph: document locking for ceph_set_dentry_offset ceph: avoid d_parent in ceph_dentry_hash; fix ceph_encode_fh() hashing bug ceph: protect d_parent access in ceph_d_revalidate ceph: protect access to d_parent ceph: handle racing calls to ceph_init_dentry ceph: set dir complete frag after adding capability rbd: set blk_queue request sizes to object size ceph: set up readahead size when rsize is not passed rbd: cancel watch request when releasing the device ceph: ignore lease mask ceph: fix ceph_lookup_open intent usage ceph: only link open operations to directory unsafe list if O_CREAT|O_TRUNC ceph: fix bad parent_inode calc in ceph_lookup_open ceph: avoid carrying Fw cap during write into page cache libceph: don't time out osd requests that haven't been received ceph: report f_bfree based on kb_avail rather than diffing. ceph: only queue capsnap if caps are dirty ceph: fix snap writeback when racing with writes ...
| * ceph: document unlocked d_parent accessesSage Weil2011-07-262-4/+11
| | | | | | | | | | | | | | | | | | For the most part we don't care about racing with rename when directing MDS requests; either the old or new parent is fine. Document that, and do some minor cleanup. Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>
| * ceph: explicitly reference rename old_dentry parent dir in requestSage Weil2011-07-264-11/+17
| | | | | | | | | | | | | | | | | | | | | | We carry a pin on the parent directory for the rename source and dest dentries. For the source it's r_locked_dir; we need to explicitly reference the old_dentry parent as well, since the dentry's d_parent may change between when the request was created and pinned and when it is freed. Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>
| * ceph: document locking for ceph_set_dentry_offsetSage Weil2011-07-261-1/+3
| | | | | | | | | | Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>
| * ceph: avoid d_parent in ceph_dentry_hash; fix ceph_encode_fh() hashing bugSage Weil2011-07-264-13/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Have caller pass in a safely-obtained reference to the parent directory for calculating a dentry's hash valud. While we're here, simpify the flow through ceph_encode_fh() so that there is a single exit point and cleanup. Also fix a bug with the dentry hash calculation: calculate the hash for the dentry we were given, not its parent. Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>
| * ceph: protect d_parent access in ceph_d_revalidateSage Weil2011-07-261-15/+17
| | | | | | | | | | | | | | | | Protect d_parent with d_lock. Carry a reference. Simplify the flow so that there is a single exit point and cleanup. Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>
| * ceph: protect access to d_parentSage Weil2011-07-266-15/+33
| | | | | | | | | | | | | | | | | | | | d_parent is protected by d_lock: use it when looking up a dentry's parent directory inode. Also take a reference and drop it in the caller to avoid a use-after-free. Reported-by: Al Viro <viro@ZenIV.linux.org.uk> Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>
| * ceph: handle racing calls to ceph_init_dentrySage Weil2011-07-261-9/+12
| | | | | | | | | | | | | | | | | | | | | | | | The ->lookup() and prepopulate_readdir() callers are working with unhashed dentries, so we don't have to worry. The export.c callers, though, need to initialize something they got back from d_obtain_alias() and are potentially racing with other callers. Make sure we don't return unless the dentry is properly initialized (by us or someone else). Reported-by: Al Viro <viro@ZenIV.linux.org.uk> Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>
| * ceph: set dir complete frag after adding capabilitySage Weil2011-07-261-13/+17
| | | | | | | | | | | | | | | | | | | | Curretly ceph_add_cap clears the complete bit if we are newly issued the FILE_SHARED cap, which is normally the case for a newly issue cap on a new directory. That means we clear the just-set bit. Move the check that sets the flag to after the cap is added/updated. Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>
| * ceph: set up readahead size when rsize is not passedYehuda Sadeh2011-07-261-0/+4
| | | | | | | | | | | | | | This should improve the default read performance, as without it readahead is practically disabled. Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
| * ceph: ignore lease maskSage Weil2011-07-263-18/+12
| | | | | | | | | | | | | | | | The lease mask is no longer used (and it changed a while back). Instead, use a non-zero duration to indicate that there is a lease being issued. Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>
| * ceph: fix ceph_lookup_open intent usageSage Weil2011-07-263-19/+37
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We weren't properly calling lookup_instantiate_filp when setting up the lookup intent, which could lead to file leakage on errors. So: - use separate helper for the hidden snapdir translation, immediately following the mds request - use ceph_finish_lookup for the final dentry/return value dance in the exit path - lookup_instantiate_filp on success Reported-by: Al Viro <viro@ZenIV.linux.org.uk> Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>
| * ceph: only link open operations to directory unsafe list if O_CREAT|O_TRUNCSage Weil2011-07-261-1/+2
| | | | | | | | | | | | | | | | We only need to put these on the directory unsafe list if they have side effects that fsync(2) should flush out. Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>
| * ceph: fix bad parent_inode calc in ceph_lookup_openSage Weil2011-07-261-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | We were always getting NULL here because the intent file f_dentry is always NULL at this point, which means we were always passing NULL to ceph_mdsc_do_request. In reality, this was fine, since this isn't currently ever a write operation that needs to get strung on the dir's unsafe list. Use the dir explicitly, and only pass it if this open has side-effects that a dir fsync should flush. Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>
| * ceph: avoid carrying Fw cap during write into page cacheSage Weil2011-07-261-3/+19
| | | | | | | | | | | | | | | | | | | | The generic_file_aio_write call may block on balance_dirty_pages while we flush data to the OSDs. If we hold a reference to the FILE_WR cap during that interval revocation by the MDS (e.g., to do a stat(2)) may be very slow. Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>
| * ceph: report f_bfree based on kb_avail rather than diffing.Greg Farnum2011-07-261-2/+1
| | | | | | | | | | Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Greg Farnum <gregory.farnum@dreamhost.com>
| * ceph: only queue capsnap if caps are dirtySage Weil2011-07-261-4/+2
| | | | | | | | | | | | | | | | | | | | | | | | We used to go into this branch if i_wrbuffer_ref_head was non-zero. This was an ancient check from before we were careful about dealing with all kinds of caps (and not just dirty pages). It is cleaner to only queue a capsnap if there is an actual dirty cap. If we are racing with... something...we will end up here with ci->i_wrbuffer_refs but no dirty caps. Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>
| * ceph: fix snap writeback when racing with writesSage Weil2011-07-261-3/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | There are two problems that come up when we try to queue a capsnap while a write is in progress: - The FILE_WR cap is held, but not yet dirty, so we may queue a capsnap with dirty == 0. That will crash later in __ceph_flush_snaps(). Or on the FILE_WR cap if a write is in progress. - We may not have i_head_snapc set, which causes problems pretty quickly. Look to the snaprealm in this case. Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>
| * ceph: use flag bit for at_end readdir flagSage Weil2011-07-262-6/+6
| | | | | | | | | | | | | | This saves us a word of memory per file. Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>
| * ceph: add F_SYNC file flag to force sync (non-O_DIRECT) ioSage Weil2011-07-264-2/+18
| | | | | | | | | | | | | | | | | | | | This allows us to force IO through the sync path which you normally only get when multiple clients are reading/writing to the same file or by mounting with -o sync. Among other things, this lets test programs verify correctness with a single mount. Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>
| * ceph: add flags field to file_infoSage Weil2011-07-261-1/+2
| | | | | | | | | | Reviewed-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>
* | fs: push i_mutex and filemap_write_and_wait down into ->fsync() handlersJosef Bacik2011-07-203-4/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Btrfs needs to be able to control how filemap_write_and_wait_range() is called in fsync to make it less of a painful operation, so push down taking i_mutex and the calling of filemap_write_and_wait() down into the ->fsync() handlers. Some file systems can drop taking the i_mutex altogether it seems, like ext3 and ocfs2. For correctness sake I just pushed everything down in all cases to make sure that we keep the current behavior the same for everybody, and then each individual fs maintainer can make up their mind about what to do from there. Thanks, Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | fs: handle SEEK_HOLE/SEEK_DATA properly in all fs's that define their own llseekJosef Bacik2011-07-202-3/+25
| | | | | | | | | | | | | | | | | | | | | | | | This converts everybody to handle SEEK_HOLE/SEEK_DATA properly. In some cases we just return -EINVAL, in others we do the normal generic thing, and in others we're simply making sure that the properly due-dilligence is done. For example in NFS/CIFS we need to make sure the file size is update properly for the SEEK_HOLE and SEEK_DATA case, but since it calls the generic llseek stuff itself that is all we have to do. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | don't open-code parent_ino() in assorted ->readdir()Al Viro2011-07-201-1/+1
| | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>