Commit message | Author | Age | Files | Lines
* Merge tag 'xfs-for-linus-v3.14-rc1-2' of git://oss.sgi.com/xfs/xfsLinus Torvalds2014-01-284-11/+32
|\ | | | | | | | | | | | | | | | | | | Pull second xfs update from Ben Myers: "Allow logical sector sized direct io on 'advanced format' 4k/512 disk" * tag 'xfs-for-linus-v3.14-rc1-2' of git://oss.sgi.com/xfs/xfs: xfs: allow logical-sector sized O_DIRECT xfs: rename xfs_buftarg structure members xfs: clean up xfs_buftarg
| * xfs: allow logical-sector sized O_DIRECTEric Sandeen2014-01-244-3/+26
| | Some time ago, mkfs.xfs started picking the storage physical sector size as the default filesystem "sector size" in order to avoid RMW costs incurred by doing IOs at logical sector size alignments.
| |
| | However, this means that for a filesystem made with e.g. a 4k sector size on an "advanced format" 4k/512 disk, 512-byte direct IOs are no longer allowed. This means that XFS has essentially turned this AF drive into a hard 4K device, from the filesystem on up.
| |
| | XFS's mkfs-specified "sector size" is really just controlling the minimum size & alignment of filesystem metadata. There is no real need to tightly couple XFS's minimal metadata size to the minimum allowed direct IO size; XFS can continue doing metadata in optimal sizes, but still allow smaller DIOs for apps which issue them, for whatever reason.
| |
| | This patch adds a new field to the xfs_buftarg, so that we now track 2 sizes:
| |
| | 1) The metadata sector size, which is the minimum unit and alignment of IO which will be performed by metadata operations.
| | 2) The device logical sector size
| |
| | The first is used internally by the file system for metadata alignment and IOs. The second is used for the minimum allowed direct IO alignment.
| |
| | This has passed xfstests on filesystems made with 4k sectors, including when run under the patch I sent to ignore XFS_IOC_DIOINFO, and issue 512 DIOs anyway. I also directly tested end of block behavior on preallocated, sparse, and existing files when we do a 512 IO into a 4k file on a 4k-sector filesystem, to be sure there were no unexpected behaviors.
| |
| | Signed-off-by: Eric Sandeen <sandeen@redhat.com>
| | Reviewed-by: Brian Foster <bfoster@redhat.com>
| | Reviewed-by: Dave Chinner <dchinner@redhat.com>
| | Signed-off-by: Ben Myers <bpm@sgi.com>
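The split between the two sizes described above can be illustrated with a small userspace sketch. This is not the XFS code: `struct buftarg_sizes`, `is_aligned()`, `dio_allowed()` and `metadata_io_allowed()` are hypothetical names, and the sketch assumes both sizes are powers of two.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the two sizes tracked per buffer target. */
struct buftarg_sizes {
    size_t meta_sectorsize;    /* minimum metadata IO unit, e.g. 4096 */
    size_t logical_sectorsize; /* device logical sector size, e.g. 512 */
};

/* True if offset and length are both multiples of a power-of-two size. */
static bool is_aligned(uint64_t offset, size_t len, size_t size)
{
    uint64_t mask = (uint64_t)size - 1;

    return ((offset | (uint64_t)len) & mask) == 0;
}

/* Direct IO only has to respect the device logical sector size. */
static bool dio_allowed(const struct buftarg_sizes *bt,
                        uint64_t offset, size_t len)
{
    return is_aligned(offset, len, bt->logical_sectorsize);
}

/* Metadata IO keeps using the (possibly larger) metadata sector size. */
static bool metadata_io_allowed(const struct buftarg_sizes *bt,
                                uint64_t offset, size_t len)
{
    return is_aligned(offset, len, bt->meta_sectorsize);
}
```

With sizes of 4096 and 512, a 512-byte request at offset 512 passes the direct IO check but would not pass the metadata check, which is the decoupling the commit message argues for.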
| * xfs: rename xfs_buftarg structure membersEric Sandeen2014-01-244-9/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | In preparation for adding new members to the structure, give these old ones more descriptive names: bt_ssize -> bt_meta_sectorsize bt_smask -> bt_meta_sectormask Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>
| * xfs: clean up xfs_buftargEric Sandeen2014-01-243-6/+4
| | | | | | | | | | | | | | | | | | | | | | | | Clean up the xfs_buftarg structure a bit: - remove bt_bsize which is never used - replace bt_sshift with bt_ssize; we only ever shift it back Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com>
* | ceph: Fix up after semantic merge conflictLinus Torvalds2014-01-284-108/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The previous ceph-client merge resulted in ceph not even building, because there was a merge conflict that wasn't visible as an actual data conflict: commit 7221fe4c2ed7 ("ceph: add acl for cephfs") added support for POSIX ACL's into Ceph, but unluckily we also had the VFS tree change a lot of the POSIX ACL helper functions to be much more helpful to filesystems (see for example commits 2aeccbe957d0 "fs: add generic xattr_acl handlers", 5bf3258fd2ac "fs: make posix_acl_chmod more useful" and 37bc15392a23 "fs: make posix_acl_create more useful") The reason this conflict wasn't obvious was many-fold: because it was a semantic conflict rather than a data conflict, it wasn't visible in the git merge as a conflict. And because the VFS tree hadn't been in linux-next, people hadn't become aware of it that way. And because I was at jury duty this morning, I was using my laptop and as a result not doing constant "allmodconfig" builds. Anyway, this fixes the build and generally removes a fair chunk of the Ceph POSIX ACL support code, since the improved helpers seem to match really well for Ceph too. But I don't actually have any way to *test* the end result, and I was really hoping for some ACK's for this. Oh, well. Not compiling certainly doesn't make things easier to test, so I'm committing this without the acks after having waited for four hours... Plus it's what I would have done for the merge had I noticed the semantic conflict.. Reported-by: Dave Jones <davej@redhat.com> Cc: Sage Weil <sage@inktank.com> Cc: Guangliang Zhao <lucienchao@gmail.com> Cc: Li Wang <li.wang@ubuntykylin.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | Merge branch 'for-linus' of ↵Linus Torvalds2014-01-2838-679/+2261
|\ \
| | | git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
| | |
| | | Pull ceph updates from Sage Weil:
| | | "This is a big batch. From Ilya we have:
| | |
| | | - rbd support for more than ~250 mapped devices (now uses same scheme that SCSI does for device major/minor numbering)
| | | - crush updates for new mapping behaviors (will be needed for coming erasure coding support, among other things)
| | | - preliminary support for tiered storage pools
| | |
| | | There is also a big series fixing a pile of cephfs bugs with clustered MDSs from Yan Zheng, ACL support for cephfs from Guangliang Zhao, ceph fscache improvements from Li Wang, improved behavior when we get ENOSPC from Josh Durgin, some readv/writev improvements from Majianpeng, and the usual mix of small cleanups"
| | |
| | | * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (76 commits)
| | |   ceph: cast PAGE_SIZE to size_t in ceph_sync_write()
| | |   ceph: fix dout() compile warnings in ceph_filemap_fault()
| | |   libceph: support CEPH_FEATURE_OSD_CACHEPOOL feature
| | |   libceph: follow redirect replies from osds
| | |   libceph: rename ceph_osd_request::r_{oloc,oid} to r_base_{oloc,oid}
| | |   libceph: follow {read,write}_tier fields on osd request submission
| | |   libceph: add ceph_pg_pool_by_id()
| | |   libceph: CEPH_OSD_FLAG_* enum update
| | |   libceph: replace ceph_calc_ceph_pg() with ceph_oloc_oid_to_pg()
| | |   libceph: introduce and start using oid abstraction
| | |   libceph: rename MAX_OBJ_NAME_SIZE to CEPH_MAX_OID_NAME_LEN
| | |   libceph: move ceph_file_layout helpers to ceph_fs.h
| | |   libceph: start using oloc abstraction
| | |   libceph: dout() is missing a newline
| | |   libceph: add ceph_kv{malloc,free}() and switch to them
| | |   libceph: support CEPH_FEATURE_EXPORT_PEER
| | |   ceph: add imported caps when handling cap export message
| | |   ceph: add open export target session helper
| | |   ceph: remove exported caps when handling cap import message
| | |   ceph: handle session flush message
| | |   ...
| * | ceph: cast PAGE_SIZE to size_t in ceph_sync_write()Ilya Dryomov2014-01-281-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | Use min_t(size_t, ...) instead of plain min(), which does strict type checking, to avoid compile warning on i386. Cc: Jianpeng Ma <majianpeng@gmail.com> Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
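For readers unfamiliar with the idiom, here is a simplified userspace imitation of what a `min_t()`-style macro does: it casts both operands to one named type before comparing, so operands of different integer types do not trip a strict type check. `sketch_min_t` and `bytes_this_round()` are illustrative names, not the kernel's definitions, and the macro relies on GNU C statement expressions (gcc or clang).

```c
#include <stddef.h>

/*
 * Simplified imitation of a min_t()-style macro: both operands are
 * converted to 'type' first, then compared. Uses a GNU C statement
 * expression so each argument is evaluated only once.
 */
#define sketch_min_t(type, x, y) ({       \
        type sketch_a_ = (type)(x);       \
        type sketch_b_ = (type)(y);       \
        sketch_a_ < sketch_b_ ? sketch_a_ : sketch_b_; })

/*
 * Hypothetical caller: clamp a size_t byte count to a page size that
 * is declared as unsigned long, as in the situation the commit fixes.
 */
static size_t bytes_this_round(size_t remaining, unsigned long page_size)
{
    return sketch_min_t(size_t, remaining, page_size);
}
```

On i386 `size_t` is `unsigned int` while the page-size constant is `unsigned long`, which is why a strictly type-checked `min()` warns there and an explicit cast to `size_t` does not.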
| * | ceph: fix dout() compile warnings in ceph_filemap_fault()Ilya Dryomov2014-01-281-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | PAGE_CACHE_SIZE is unsigned long on all architectures, however size_t is either unsigned int or unsigned long. Rather than change format strings, cast PAGE_CACHE_SIZE to size_t to be in line with dout()s in ceph_page_mkwrite(). Cc: Yan, Zheng <zheng.z.yan@intel.com> Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
| * | libceph: support CEPH_FEATURE_OSD_CACHEPOOL featureIlya Dryomov2014-01-271-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | Announce our (limited, see previous commit) support for CACHEPOOL feature. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
| * | libceph: follow redirect replies from osdsIlya Dryomov2014-01-272-9/+164
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Follow redirect replies from osds, for details see ceph.git commit fbbe3ad1220799b7bb00ea30fce581c5eadaf034. v1 (current) version of redirect reply consists of oloc and oid, which expands to pool, key, nspace, hash and oid. However, server-side code that would populate anything other than pool doesn't exist yet, and hence this commit adds support for pool redirects only. To make sure that future server-side updates don't break us, we decode all fields and, if any of key, nspace, hash or oid have a non-default value, error out with "corrupt osd_op_reply ..." message. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
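The "decode everything, accept only pool redirects" policy described above might look roughly like this. It is a sketch, not the libceph decoder: `struct sketch_redirect`, `sketch_check_redirect()` and the choice of -1 as the "unset" hash value are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed "not set" value for the hash field (illustrative only). */
#define SKETCH_HASH_UNSET (-1)

/* Hypothetical, already-decoded view of a v1 redirect reply. */
struct sketch_redirect {
    int64_t pool;      /* the only field acted upon */
    size_t key_len;    /* length of oloc key, 0 if unset */
    size_t nspace_len; /* length of oloc namespace, 0 if unset */
    int64_t hash;      /* SKETCH_HASH_UNSET if unset */
    size_t oid_len;    /* length of redirect oid, 0 if unset */
};

/*
 * Returns 0 and stores the new pool if this is a pure pool redirect,
 * or -1 if any other field carries a non-default value, in which case
 * the caller would treat the reply as corrupt.
 */
static int sketch_check_redirect(const struct sketch_redirect *r,
                                 int64_t *pool_out)
{
    if (r->key_len != 0 || r->nspace_len != 0 || r->oid_len != 0 ||
        r->hash != SKETCH_HASH_UNSET)
        return -1;

    *pool_out = r->pool;
    return 0;
}
```

Rejecting unexpected fields rather than ignoring them means a future server that starts populating them fails loudly instead of being silently misrouted.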
| * | libceph: rename ceph_osd_request::r_{oloc,oid} to r_base_{oloc,oid}Ilya Dryomov2014-01-274-23/+23
| | | | | | | | | | | | | | | | | | | | | | | | Rename ceph_osd_request::r_{oloc,oid} to r_base_{oloc,oid} before introducing r_target_{oloc,oid} needed for redirects. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
| * | libceph: follow {read,write}_tier fields on osd request submissionIlya Dryomov2014-01-273-5/+55
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Overwrite ceph_osd_request::r_oloc.pool with read_tier for read ops and write_tier for write and read+write ops (aka basic tiering support). {read,write}_tier are part of pg_pool_t since v9. This commit bumps our pg_pool_t decode compat version from v7 to v9, all new fields except for {read,write}_tier are ignored. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
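The pool selection rule stated above (read_tier for read ops, write_tier for write and read+write ops) can be sketched as follows. The flag values, `struct sketch_pool`, `sketch_target_pool()` and the convention that a negative tier means "no tier configured" are all assumptions for illustration, not the libceph definitions.

```c
#include <stdint.h>

/* Illustrative flag bits; not the real CEPH_OSD_FLAG_* values. */
#define SKETCH_FLAG_READ  0x1u
#define SKETCH_FLAG_WRITE 0x2u

/* Hypothetical subset of pool info relevant to tiering. */
struct sketch_pool {
    int64_t id;
    int64_t read_tier;  /* assumed: negative means no read tier */
    int64_t write_tier; /* assumed: negative means no write tier */
};

/*
 * Pick the pool a request should be sent to: write_tier for anything
 * that writes (including read+write), read_tier for pure reads, and
 * the base pool when no applicable tier is configured.
 */
static int64_t sketch_target_pool(const struct sketch_pool *pi,
                                  unsigned int flags)
{
    if (flags & SKETCH_FLAG_WRITE) {
        if (pi->write_tier >= 0)
            return pi->write_tier;
    } else if (flags & SKETCH_FLAG_READ) {
        if (pi->read_tier >= 0)
            return pi->read_tier;
    }
    return pi->id;
}
```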
| * | libceph: add ceph_pg_pool_by_id()Ilya Dryomov2014-01-272-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | "Lookup pool info by ID" function is hidden in osdmap.c. Expose it to the rest of libceph. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
| * | libceph: CEPH_OSD_FLAG_* enum updateIlya Dryomov2014-01-271-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | Update CEPH_OSD_FLAG_* enum. (We need CEPH_OSD_FLAG_IGNORE_OVERLAY to support tiering). Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
| * | libceph: replace ceph_calc_ceph_pg() with ceph_oloc_oid_to_pg()Ilya Dryomov2014-01-274-18/+30
| | | | | | | | | | | | | | | | | | | | | | | | Switch ceph_calc_ceph_pg() to new oloc and oid abstractions and rename it to ceph_oloc_oid_to_pg() to make its purpose more clear. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
| * | libceph: introduce and start using oid abstractionIlya Dryomov2014-01-275-25/+50
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In preparation for tiering support, which would require having two (base and target) object names for each osd request and also copying those names around, introduce struct ceph_object_id (oid) and a couple helpers to facilitate those copies and encapsulate the fact that object name is not necessarily a NUL-terminated string. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
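As a rough picture of why an explicit length matters, the sketch below models an object id as a byte buffer plus length and copies or compares it with `memcpy()`/`memcmp()` rather than string functions. The type, the helper names and the buffer size of 100 are hypothetical; the actual struct ceph_object_id and its helpers are not reproduced here.

```c
#include <stdbool.h>
#include <string.h>

/* Assumed maximum name length, for illustration only. */
#define SKETCH_MAX_OID_NAME_LEN 100

/* Hypothetical object id: raw bytes plus an explicit length. */
struct sketch_object_id {
    char name[SKETCH_MAX_OID_NAME_LEN];
    int name_len;
};

/* Copy by length, so embedded NUL bytes survive the copy. */
static void sketch_oid_copy(struct sketch_object_id *dest,
                            const struct sketch_object_id *src)
{
    memcpy(dest->name, src->name, (size_t)src->name_len);
    dest->name_len = src->name_len;
}

/* Compare by length and bytes, never with strcmp(). */
static bool sketch_oid_equal(const struct sketch_object_id *a,
                             const struct sketch_object_id *b)
{
    return a->name_len == b->name_len &&
           memcmp(a->name, b->name, (size_t)a->name_len) == 0;
}
```

Keeping the copy logic in one helper is what makes it cheap to carry two names (base and target) per request.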
| * | libceph: rename MAX_OBJ_NAME_SIZE to CEPH_MAX_OID_NAME_LENIlya Dryomov2014-01-273-6/+6
| | | | | | | | | | | | | | | | | | | | | | | | In preparation for adding oid abstraction, rename MAX_OBJ_NAME_SIZE to CEPH_MAX_OID_NAME_LEN. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
| * | libceph: move ceph_file_layout helpers to ceph_fs.hIlya Dryomov2014-01-272-27/+23
| | | | | | | | | | | | | | | | | | | | | Move ceph_file_layout helper macros and inline functions to ceph_fs.h. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
| * | libceph: start using oloc abstractionIlya Dryomov2014-01-274-10/+12
| | | Instead of relying on pool fields in ceph_file_layout (for mapping) and ceph_pg (for encoding), start using ceph_object_locator (oloc) abstraction. Note that userspace oloc currently consists of pool, key, nspace and hash fields, while this one contains only a pool. This is OK, because at this point we only send (i.e. encode) olocs and never have to receive (i.e. decode) them.
| | |
| | | This makes keeping a copy of ceph_file_layout in every osd request unnecessary, so ceph_osd_request::r_file_layout field is nuked.
| | |
| | | Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
| | | Reviewed-by: Sage Weil <sage@inktank.com>
| * | libceph: dout() is missing a newlineIlya Dryomov2014-01-261-2/+2
| | | | | | | | | | | | | | | | | | Add a missing newline to a dout() in __reset_osd(). Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
| * | libceph: add ceph_kv{malloc,free}() and switch to themIlya Dryomov2014-01-266-33/+35
| | | Encapsulate kmalloc vs vmalloc memory allocation and freeing logic into two helpers, ceph_kvmalloc() and ceph_kvfree(), and switch to them.
| | |
| | | ceph_kvmalloc() kmalloc()'s a maximum of 8 pages, anything bigger is vmalloc()'ed with __GFP_HIGHMEM set. This changes the existing behaviour:
| | |
| | | - for buffers (ceph_buffer_new()), from trying to kmalloc() everything and using vmalloc() just as a fallback
| | | - for messages (ceph_msg_new()), from going to vmalloc() for anything bigger than a page
| | | - for messages (ceph_msg_new()), from disallowing vmalloc() to use high memory
| | |
| | | Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
| | | Reviewed-by: Sage Weil <sage@inktank.com>
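The size threshold described above can be modelled in userspace as a pure decision function. This only shows which allocator would be chosen; it performs no allocation, the 4096-byte page size is an assumption, and `sketch_choose_allocator()` is a hypothetical name rather than part of libceph.

```c
#include <stddef.h>

/* Assumed page size for the sketch; real kernels use PAGE_SIZE. */
#define SKETCH_PAGE_SIZE 4096u
/* Largest request, in pages, that stays on the kmalloc() path. */
#define SKETCH_KMALLOC_MAX_PAGES 8u

enum sketch_alloc_kind {
    SKETCH_ALLOC_KMALLOC,
    SKETCH_ALLOC_VMALLOC,
};

/* Up to 8 pages: kmalloc(); anything bigger: vmalloc(). */
static enum sketch_alloc_kind sketch_choose_allocator(size_t size)
{
    if (size <= (size_t)SKETCH_KMALLOC_MAX_PAGES * SKETCH_PAGE_SIZE)
        return SKETCH_ALLOC_KMALLOC;
    return SKETCH_ALLOC_VMALLOC;
}
```

A matching free helper then only needs to tell the two kinds of address apart, which is why allocation and freeing are wrapped as a pair.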
| * | libceph: support CEPH_FEATURE_EXPORT_PEERYan, Zheng2014-01-211-1/+2
| | | | | | | | | | | | Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
| * | ceph: add imported caps when handling cap export messageYan, Zheng2014-01-213-82/+146
| | | Version 3 cap export message includes information about the imported caps. It allows us to add the imported caps if the corresponding cap import message still hasn't been received.
| | |
| | | This allows us to handle the situation where the importer MDS crashes and the cap import message is missing.
| | |
| | | Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
| * | ceph: add open export target session helperYan, Zheng2014-01-212-15/+38
| | | | | | | | | | | | Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
| * | ceph: remove exported caps when handling cap import messageYan, Zheng2014-01-212-28/+62
| | | Version 3 cap import message includes the ID of the exported caps. It allows us to remove the exported caps if we still haven't received the corresponding cap export message.
| | |
| | | We remove the exported caps because they are stale; keeping them can compromise consistency.
| | |
| | | Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
| * | ceph: handle session flush messageYan, Zheng2014-01-213-0/+23
| | | | | | | | | | | | Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
| * | ceph: check inode caps in ceph_d_revalidateYan, Zheng2014-01-213-3/+21
| | | Some inodes in a readdir reply may have no caps. A getattr MDS request for these inodes can return -ESTALE. The fix is to consider a dentry that links to an inode with no caps as invalid. An invalid dentry causes a lookup request to be sent to the MDS, and the MDS will send caps back.
| | |
| | | Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
| * | ceph: handle -ESTALE replyYan, Zheng2014-01-211-20/+11
| | | Send requests that operate on a path to the directory's auth MDS if mode == USE_AUTH_MDS. Always retry using the auth MDS if we got an -ESTALE reply from a non-auth MDS. Also clean up the code that handles auth MDS change.
| | |
| | | Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
| * | ceph: fix trim capsYan, Zheng2014-01-211-6/+11
| | | - don't trim auth cap if there are flushing caps
| | | - don't trim auth cap if any 'write' cap is wanted
| | | - allow trimming non-auth cap even if the inode is dirty
| | |
| | | Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
| * | ceph: fix cache revoke raceYan, Zheng2014-01-213-4/+8
| | | Handle the following sequence of events:
| | |
| | | - non-auth MDS revokes Fc cap. queue invalidate work
| | | - auth MDS issues Fc cap through request reply. i_rdcache_gen gets increased.
| | | - invalidate work runs. it finds i_rdcache_revoking != i_rdcache_gen, so it does nothing.
| | |
| | | Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
| * | ceph: use ceph_seq_cmp() to compare migrate_seqYan, Zheng2014-01-211-1/+1
| | | | | | | | | | | | Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
| * | ceph: handle cap export race in try_flush_caps()Yan, Zheng2014-01-211-8/+8
| | | | | | | | | | | | | | | | | | auth cap may change after releasing the i_ceph_lock Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
| * | ceph: trivial comment fixJ. Bruce Fields2014-01-161-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | "disconnected" is too easily confused with "DCACHE_DISCONNECTED". I think "unhashed" is the more precise term here. Signed-off-by: J. Bruce Fields <bfields@redhat.com> Reviewed-by: Sage Weil <sage@inktank.com>
| * | libceph: fix preallocation check in get_reply()Ilya Dryomov2014-01-142-4/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The check that makes sure that we have enough memory allocated to read in the entire header of the message in question is currently busted. It compares front_len of the incoming message with iov_len field of ceph_msg::front structure, which is used primarily to indicate the amount of data already read in, and not the size of the allocated buffer. Under certain conditions (e.g. a short read from a socket followed by that socket's shutdown and owning ceph_connection reset) this results in a warning similar to [85688.975866] libceph: get_reply front 198 > preallocated 122 (4#0) and, through another bug, leads to forever hung tasks and forced reboots. Fix this by comparing front_len with front_alloc_len field of struct ceph_msg, which stores the actual size of the buffer. Fixes: http://tracker.ceph.com/issues/5425 Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
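The difference between the broken and the fixed check can be shown with a toy message structure. `struct sketch_msg` and both helper names are hypothetical, and the allocation size of 256 in the usage note is an invented figure; only the 198 and 122 values come from the warning quoted above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical reduction of a message's front buffer bookkeeping. */
struct sketch_msg {
    size_t front_iov_len;   /* amount of data already read in */
    size_t front_alloc_len; /* actual size of the allocated buffer */
};

/* Broken check: compares against how much has been read so far. */
static bool sketch_fits_buggy(const struct sketch_msg *m, size_t front_len)
{
    return front_len <= m->front_iov_len;
}

/* Fixed check: compares against the real capacity of the buffer. */
static bool sketch_fits_fixed(const struct sketch_msg *m, size_t front_len)
{
    return front_len <= m->front_alloc_len;
}
```

With 122 bytes read so far into a buffer assumed to hold 256, an incoming front of 198 bytes is wrongly rejected by the first check and accepted by the second.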
| * | libceph: rename front to front_len in get_reply()Ilya Dryomov2014-01-141-4/+5
| | | | | | | | | | | | | | | | | | | | | | | | Rename front local variable to front_len in get_reply() to make its purpose more clear. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
| * | libceph: rename ceph_msg::front_max to front_alloc_lenIlya Dryomov2014-01-143-8/+8
| | | | | | | | | | | | | | | | | | | | | | | | Rename front_max field of struct ceph_msg to front_alloc_len to make its purpose more clear. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
| * | libceph: use CEPH_MON_PORT when the specified port is 0Ilya Dryomov2013-12-311-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | Similar to userspace, don't bail with "parse_ips bad ip ..." if the specified port is port 0, instead use port CEPH_MON_PORT (6789, the default monitor port). Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
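A minimal sketch of the port fallback described above, assuming the port has already been parsed into an integer. `sketch_effective_mon_port()` is a hypothetical helper, and the range check with a -1 error return is an assumption about how invalid ports are rejected; only the 6789 default comes from the commit message.

```c
/* Default monitor port, as stated in the commit message. */
#define CEPH_MON_PORT 6789

/*
 * Map a parsed port number to the port that should be used:
 * 0 falls back to the default monitor port, out-of-range values
 * are rejected with -1, everything else is used as given.
 */
static int sketch_effective_mon_port(int port)
{
    if (port < 0 || port > 65535)
        return -1;
    if (port == 0)
        return CEPH_MON_PORT;
    return port;
}
```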
| * | crush: support new indep mode and SET_* steps (crush v2) by defaultIlya Dryomov2013-12-311-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | Add CRUSH_V2 feature (new indep mode and SET_* steps) to a set of features supported by default. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
| * | crush: fix crush_choose_firstn commentIlya Dryomov2013-12-311-1/+5
| | | | | | | | | | | | | | | | | | | | | Reflects ceph.git commit 8b38f10bc2ee3643a33ea5f9545ad5c00e4ac5b4. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
| * | crush: attempts -> triesIlya Dryomov2013-12-311-8/+8
| | | | | | | | | | | | | | | | | | | | | Reflects ceph.git commit ea3a0bb8b773360d73b8b77fa32115ef091c9857. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
| * | crush: add set_choose_local_[fallback_]tries stepsIlya Dryomov2013-12-312-5/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This allows all of the tunables to be overridden by a specific rule. Reflects ceph.git commits d129e09e57fbc61cfd4f492e3ee77d0750c9d292, 0497db49e5973b50df26251ed0e3f4ac7578e66e. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
| * | crush: generalize descend_onceIlya Dryomov2013-12-311-11/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The legacy behavior is to make the normal number of tries for the recursive chooseleaf call. The descend_once tunable changed this to making a single try and bail if we get a reject (note that it is impossible to collide in the recursive case). The new set_chooseleaf_tries lets you select the number of recursive chooseleaf attempts for indep mode, or default to 1. Use the same behavior for firstn, except default to total_tries when the legacy tunables are set (for compatibility). This makes the rule step override the (new) default of 1 recursive attempt, keeping behavior consistent with indep mode. Reflects ceph.git commit 685c6950ef3df325ef04ce7c986e36ca2514c5f1. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
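The defaulting rules for the recursive chooseleaf tries, as read from the message above, can be summarised in a short sketch. The struct and function names are hypothetical and a rule value of 0 is assumed to mean "no set_chooseleaf_tries step in the rule"; this is not the crush mapper code.

```c
#include <stdbool.h>

/* Hypothetical subset of the map tunables involved. */
struct sketch_tunables {
    int choose_total_tries;       /* the normal number of tries */
    bool chooseleaf_descend_once; /* false means legacy tunables */
};

/*
 * firstn mode: a rule-specified value wins; otherwise use 1 recursive
 * try with descend_once, or total_tries under the legacy tunables.
 */
static int sketch_firstn_recurse_tries(const struct sketch_tunables *t,
                                       int rule_tries)
{
    if (rule_tries > 0)
        return rule_tries;
    if (t->chooseleaf_descend_once)
        return 1;
    return t->choose_total_tries;
}

/* indep mode: a rule-specified value wins; otherwise default to 1. */
static int sketch_indep_recurse_tries(int rule_tries)
{
    return rule_tries > 0 ? rule_tries : 1;
}
```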
| * | crush: CHOOSE_LEAF -> CHOOSELEAF throughoutIlya Dryomov2013-12-312-8/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This aligns the internal identifier names with the user-visible names in the decompiled crush map language. Reflects ceph.git commit caa0e22e15e4226c3671318ba1f61314bf6da2a6. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
| * | crush: add SET_CHOOSE_TRIES rule stepIlya Dryomov2013-12-312-1/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since we can specify the recursive retries in a rule, we may as well also specify the non-recursive tries too for completeness. Reflects ceph.git commit d1b97462cffccc871914859eaee562f2786abfd1. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
| * | crush: apply chooseleaf_tries to firstn mode tooIlya Dryomov2013-12-312-5/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Parameterize the attempts for the _firstn choose method, and apply the rule-specified tries count to firstn mode as well. Note that we have slightly different behavior here than with indep: If the firstn value is not specified for firstn, we pass through the normal attempt count. This maintains compatibility with legacy behavior. Note that this is usually *not* actually N^2 work, though, because of the descend_once tunable. However, descend_once is unfortunately *not* the same thing as 1 chooseleaf try because it is only checked on a reject but not on a collision. Sigh. In contrast, for indep, if tries is not specified we default to 1 recursive attempt, because that is simply more sane, and we have the option to do so. The descend_once tunable has no effect for indep. Reflects ceph.git commit 64aeded50d80942d66a5ec7b604ff2fcbf5d7b63. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
| * | crush: new SET_CHOOSE_LEAF_TRIES commandIlya Dryomov2013-12-312-10/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Explicitly control the number of sample attempts, and allow the number of tries in the recursive call to be explicitly controlled via the rule. This is important because the amount of time we want to spend looking for a solution may be rule dependent (e.g., higher for the wide indep pool than the rep pools). (We should do the same for the other tunables, by the way!) Reflects ceph.git commit c43c893be872f709c787bc57f46c0e97876ff681. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
| * | crush: pass parent r value for indep callIlya Dryomov2013-12-311-4/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pass down the parent's 'r' value so that we will sample different values in the recursive call when the parent tries multiple times. This avoids doing useless work (calling multiple times and trying the same values). Reflects ceph.git commit 2731d3030d7a3e80922b7f1b7756f9a4a124bac5. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
| * | crush: clarify numrep vs endposIlya Dryomov2013-12-311-8/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pass numrep (the width of the result) separately from the number of results we want *this* iteration. This makes things less awkward when we do a recursive call (for chooseleaf) and want only one item. Reflects ceph.git commit 1b567ee08972f268c11b43fc881e57b5984dd08b. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
| * | crush: strip firstn conditionals out of crush_choose, renameIlya Dryomov2013-12-311-55/+33
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now that indep is handled by crush_choose_indep, rename crush_choose to crush_choose_firstn and remove all the conditionals. This ends up stripping out *lots* of code. Note that it *also* makes it obvious that the shenanigans we were playing with r' for uniform buckets were broken for firstn mode. This appears to have happened waaaay back in commit dae8bec9 (or earlier)... 2007. Reflects ceph.git commit 94350996cb2035850bcbece6a77a9b0394177ec9. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
| * | crush: add note about r in recursive chooseIlya Dryomov2013-12-311-0/+8
| | | | | | | | | | | | | | | | | | | | | Reflects ceph.git commit 4551fee9ad89d0427ed865d766d0d44004d3e3e1. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>