linux.git - Linux kernel mainline tree

	Commit message (Collapse)	Author	Age	Files	Lines
*	libceph: osd_state is 32 bits wide in luminous	Ilya Dryomov	2017-07-07	1	-1/+1
\| \| \| \|	Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
*	libceph: pg_upmap[_items] infrastructure	Ilya Dryomov	2017-07-07	1	-0/+23
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	pg_temp and pg_upmap encodings are the same (PG -> array of osds), except for the incremental remove: it's an empty mapping in new_pg_temp for pg_temp and a separate old_pg_upmap set for pg_upmap. (This isn't to allow for empty pg_upmap mappings -- apparently, pg_temp just wasn't looked at as an example for pg_upmap encoding.) Reuse __decode_pg_temp() for decoding pg_upmap and new_pg_upmap. __decode_pg_temp() stores into pg_temp union member, but since pg_upmap union member is identical, reading through pg_upmap later is OK. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
*	libceph: respect RADOS_BACKOFF backoffs	Ilya Dryomov	2017-07-07	1	-0/+74
\| \| \| \|	Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
*	libceph: make sure need_resend targets reflect latest map	Ilya Dryomov	2017-07-07	1	-1/+1
\| \| \| \| \| \| \| \| \|	Otherwise we may miss events like PG splits, pool deletions, etc when we get multiple incremental maps at once. Because check_pool_dne() can now be fed an unlinked request, finish_request() needed to be taught to handle unlinked requests. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
*	libceph: introduce ceph_spg, ceph_pg_to_primary_shard()	Ilya Dryomov	2017-07-07	1	-1/+10
\| \| \| \| \| \|	Store both raw pgid and actual spgid in ceph_osd_request_target. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
*	libceph: add an epoch_barrier field to struct ceph_osd_client	Jeff Layton	2017-05-04	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Cephfs can get cap update requests that contain a new epoch barrier in them. When that happens we want to pause all OSD traffic until the right map epoch arrives. Add an epoch_barrier field to ceph_osd_client that is protected by the osdc->lock rwsem. When the barrier is set, and the current OSD map epoch is below that, pause the request target when submitting the request or when revisiting it. Add a way for upper layers (cephfs) to update the epoch_barrier as well. If we get a new map, compare the new epoch against the barrier before kicking requests and request another map if the map epoch is still lower than the one we want. If we get a map with a full pool, or at quota condition, then set the barrier to the current epoch value. Signed-off-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
*	libceph: remove req->r_replay_version	Jeff Layton	2017-05-04	1	-3/+1
\| \| \| \| \| \| \| \| \| \|	Nothing uses this anymore with the removal of the ack vs. commit code. Remove the field and just encode zeroes into place in the request encoding. Signed-off-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
*	libceph: rados pool namespace support	Yan, Zheng	2016-07-28	1	-2/+10
\| \| \| \| \| \| \| \| \| \| \| \| \|	Add pool namesapce pointer to struct ceph_file_layout and struct ceph_object_locator. Pool namespace is used by when mapping object to PG, it's also used when composing OSD request. The namespace pointer in struct ceph_file_layout is RCU protected. So libceph can read namespace without taking lock. Signed-off-by: Yan, Zheng <zyan@redhat.com> [idryomov@gmail.com: ceph_oloc_destroy(), misc minor changes] Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
*	libceph: support for subscribing to "mdsmap.<id>" maps	Ilya Dryomov	2016-05-26	1	-0/+1
\| \| \| \|	Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
*	libceph: take osdc->lock in osdmap_show() and dump flags in hex	Ilya Dryomov	2016-05-26	1	-5/+5
\| \| \| \| \| \| \|	There is now about a dozen CEPH_OSDMAP_* flags. This is a debugging interface, so just dump in hex instead of spelling each flag out. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
*	libceph: support for sending notifies	Ilya Dryomov	2016-05-26	1	-2/+3
\| \| \| \| \| \| \| \| \| \|	Implement ceph_osdc_notify() for sending notifies. Due to the fact that the current messenger can't do read-in into pagelists (it can only do write-out from them), I had to go with a page vector for a NOTIFY_COMPLETE payload, for now. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
*	libceph, rbd: ceph_osd_linger_request, watch/notify v2	Ilya Dryomov	2016-05-26	1	-0/+36
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This adds support and switches rbd to a new, more reliable version of watch/notify protocol. As with the OSD client update, this is mostly about getting the right structures linked into the right places so that reconnects are properly sent when needed. watch/notify v2 also requires sending regular pings to the OSDs - send_linger_ping(). A major change from the old watch/notify implementation is the introduction of ceph_osd_linger_request - linger requests no longer piggy back on ceph_osd_request. ceph_osd_event has been merged into ceph_osd_linger_request. All the details are now hidden within libceph, the interface consists of a simple pair of watch/unwatch functions and ceph_osdc_notify_ack(). ceph_osdc_watch() does return ceph_osd_linger_request, but only to keep the lifetime management simple. ceph_osdc_notify_ack() accepts an optional data payload, which is relayed back to the notifier. Portions of this patch are loosely based on work by Douglas Fuller <dfuller@redhat.com> and Mike Christie <michaelc@cs.wisc.edu>. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
*	libceph: a major OSD client update	Ilya Dryomov	2016-05-26	1	-8/+26
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This is a major sync up, up to ~Jewel. The highlights are: - per-session request trees (vs a global per-client tree) - per-session locking (vs a global per-client rwlock) - homeless OSD session - no ad-hoc global per-client lists - support for pool quotas - foundation for watch/notify v2 support - foundation for map check (pool deletion detection) support The switchover is incomplete: lingering requests can be setup and teared down but aren't ever reestablished. This functionality is restored with the introduction of the new lingering infrastructure (ceph_osd_linger_request, linger_work, etc) in a later commit. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
*	libceph: switch to calc_target(), part 2	Ilya Dryomov	2016-05-26	1	-23/+38
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The crux of this is getting rid of ceph_osdc_build_request(), so that MOSDOp can be encoded not before but after calc_target() calculates the actual target. Encoding now happens within ceph_osdc_start_request(). Also nuked is the accompanying bunch of pointers into the encoded buffer that was used to update fields on each send - instead, the entire front is re-encoded. If we want to support target->name_len != base->name_len in the future, there is no other way, because oid is surrounded by other fields in the encoded buffer. Encoding OSD ops and adding data items to the request message were mixed together in osd_req_encode_op(). While we want to re-encode OSD ops, we don't want to add duplicate data items to the message when resending, so all call to ceph_osdc_msg_data_add() are factored out into a new setup_request_data(). Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
*	libceph: switch to calc_target(), part 1	Ilya Dryomov	2016-05-26	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \|	Replace __calc_request_pg() and most of __map_request() with calc_target() and start using req->r_t. ceph_osdc_build_request() however still encodes base_oid, because it's called before calc_target() is and target_oid is empty at that point in time; a printf in osdc_show() also shows base_oid. This is fixed in "libceph: switch to calc_target(), part 2". Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
*	libceph: pi->min_size, pi->last_force_request_resend	Ilya Dryomov	2016-05-26	1	-4/+6
\| \| \| \| \| \| \|	Add and decode pi->min_size and pi->last_force_request_resend. These are going to be used by calc_target(). Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
*	libceph: variable-sized ceph_object_id	Ilya Dryomov	2016-05-26	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently ceph_object_id can hold object names of up to 100 (CEPH_MAX_OID_NAME_LEN) characters. This is enough for all use cases, expect one - long rbd image names: - a format 1 header is named "<imgname>.rbd" - an object that points to a format 2 header is named "rbd_id.<imgname>" We operate on these potentially long-named objects during rbd map, and, for format 1 images, during header refresh. (A format 2 header name is a small system-generated string.) Lift this 100 character limit by making ceph_object_id be able to point to an externally-allocated string. Apart from being able to work with almost arbitrarily-long named objects, this allows us to reduce the size of ceph_object_id from >100 bytes to 64 bytes. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
*	libceph: revamp subs code, switch to SUBSCRIBE2 protocol	Ilya Dryomov	2016-03-25	1	-6/+11
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	It is currently hard-coded in the mon_client that mdsmap and monmap subs are continuous, while osdmap sub is always "onetime". To better handle full clusters/pools in the osd_client, we need to be able to issue continuous osdmap subs. Revamp subs code to allow us to specify for each sub whether it should be continuous or not. Although not strictly required for the above, switch to SUBSCRIBE2 protocol while at it, eliminating the ambiguity between a request for "every map since X" and a request for "just the latest" when we don't have a map yet (i.e. have epoch 0). SUBSCRIBE2 feature bit is now required - it's been supported since pre-argonaut (2010). Move "got mdsmap" call to the end of ceph_mdsc_handle_map() - calling in before we validate the epoch and successfully install the new map can mess up mon_client sub state. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
*	libceph: expose client options through debugfs	Ilya Dryomov	2015-04-20	1	-0/+24
\| \| \| \| \| \|	Add a client_options attribute for showing libceph options. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
*	libceph: nuke pool op infrastructure	Ilya Dryomov	2015-02-19	1	-2/+0
\| \| \| \| \| \| \| \| \| \| \| \| \|	On Mon, Dec 22, 2014 at 5:35 PM, Sage Weil <sage@newdream.net> wrote: > On Mon, 22 Dec 2014, Ilya Dryomov wrote: >> Actually, pool op stuff has been unused for over two years - looks like >> it was added for rbd create_snap and that got ripped out in 2012. It's >> unlikely we'd ever need to manage pools or snaps from the kernel client >> so I think it makes sense to nuke it. Sage? > > Yep! Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
*	libceph: separate multiple ops with commas in debugfs output	Ilya Dryomov	2014-10-14	1	-1/+2
\| \| \| \| \| \| \| \|	For requests with multiple ops, separate ops with commas instead of \t, which is a field separator here. Signed-off-by: Ilya Dryomov <idryomov@redhat.com> Reviewed-by: Sage Weil <sage@redhat.com>
*	libceph: mon_get_version request infrastructure	Ilya Dryomov	2014-06-06	1	-0/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Add support for mon_get_version requests to libceph. This reuses much of the ceph_mon_generic_request infrastructure, with one exception. Older OSDs don't set mon_get_version reply hdr->tid even if the original request had a non-zero tid, which makes it impossible to lookup ceph_mon_generic_request contexts by tid in get_generic_reply() for such replies. As a workaround, we allocate a reply message on the reply path. This can probably interfere with revoke, but I don't see a better way. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
*	libceph: recognize poolop requests in debugfs	Ilya Dryomov	2014-06-06	1	-2/+4
\| \| \| \| \| \| \| \|	Recognize poolop requests in debugfs monc dump, fix prink format specifiers - tid is unsigned. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
*	libceph: dump pool {read,write}_tier to debugfs	Ilya Dryomov	2014-04-04	1	-3/+3
\| \| \| \| \| \| \|	Dump pool {read,write}_tier to debugfs. While at it, fixup printk type specifiers and remove the unnecessary cast to unsigned long long. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com>
*	libceph: primary_affinity infrastructure	Ilya Dryomov	2014-04-04	1	-2/+3
\| \| \| \| \| \| \| \| \| \| \| \| \|	Add primary_affinity infrastructure. primary_affinity values are stored in an max_osd-sized array, hanging off ceph_osdmap, similar to a osd_weight array. Introduce {get,set}_primary_affinity() helpers, primarily to return CEPH_OSD_DEFAULT_PRIMARY_AFFINITY when no affinity has been set and to abstract out osd_primary_affinity array allocation and initialization. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>
*	libceph: primary_temp infrastructure	Ilya Dryomov	2014-04-04	1	-0/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Add primary_temp mappings infrastructure. struct ceph_pg_mapping is overloaded, primary_temp mappings are stored in an rb-tree, rooted at ceph_osdmap, in a manner similar to pg_temp mappings. Dump primary_temp mappings to /sys/kernel/debug/ceph/<client>/osdmap, one 'primary_temp <pgid> <osd>' per line, e.g: primary_temp 2.6 4 Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>
*	libceph: generalize ceph_pg_mapping	Ilya Dryomov	2014-04-04	1	-2/+2
\| \| \| \| \| \| \| \|	In preparation for adding support for primary_temp mappings, generalize struct ceph_pg_mapping so it can hold mappings other than pg_temp. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>
*	libceph: dump pg_temp mappings to debugfs	Ilya Dryomov	2014-04-04	1	-0/+11
\| \| \| \| \| \| \| \| \| \|	Dump pg_temp mappings to /sys/kernel/debug/ceph/<client>/osdmap, one 'pg_temp <pgid> [<osd>, ..., <osd>]' per line, e.g: pg_temp 2.6 [2,3,4] Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>
*	libceph: do not prefix osd lines with \t in debugfs output	Ilya Dryomov	2014-04-04	1	-1/+1
\| \| \| \| \| \| \| \|	To save screen space in anticipation of more fields (e.g. primary affinity). Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>
*	libceph: refer to osdmap directly in osdmap_show()	Ilya Dryomov	2014-04-04	1	-12/+14
\| \| \| \| \| \| \|	To make it more readable and save screen space. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Alex Elder <elder@linaro.org>
*	libceph: rename ceph_osd_request::r_{oloc,oid} to r_base_{oloc,oid}	Ilya Dryomov	2014-01-27	1	-2/+2
\| \| \| \| \| \| \| \|	Rename ceph_osd_request::r_{oloc,oid} to r_base_{oloc,oid} before introducing r_target_{oloc,oid} needed for redirects. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
*	libceph: introduce and start using oid abstraction	Ilya Dryomov	2014-01-27	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \|	In preparation for tiering support, which would require having two (base and target) object names for each osd request and also copying those names around, introduce struct ceph_object_id (oid) and a couple helpers to facilitate those copies and encapsulate the fact that object name is not necessarily a NUL-terminated string. Signed-off-by: Ilya Dryomov <ilya.dryomov@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
*	libceph: keep source rather than message osd op array	Alex Elder	2013-05-01	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	An osd request keeps a pointer to the osd operations (ops) array that it builds in its request message. In order to allow each op in the array to have its own distinct data, we will need to keep track of each op's data, and that information does not go over the wire. As long as we're tracking the data we might as well just track the entire (source) op definition for each of the ops. And if we're doing that, we'll have no more need to keep a pointer to the wire-encoded version. This patch makes the array of source ops be kept with the osd request structure, and uses that instead of the version encoded in the message in places where that was previously used. The array will be embedded in the request structure, and the maximum number of ops we ever actually use is currently 2. So reduce CEPH_OSD_MAX_OP to 2 to reduce the size of the structure. The result of doing this sort of ripples back up, and as a result various function parameters and local variables become unnecessary. Make r_num_ops be unsigned, and move the definition of struct ceph_osd_req_op earlier to ensure it's defined where needed. It does not yet add per-op data, that's coming soon. This resolves: http://tracker.ceph.com/issues/4656 Signed-off-by: Alex Elder <elder@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
*	libceph: update osd request/reply encoding	Sage Weil	2013-02-26	1	-14/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Use the new version of the encoding for osd requests and replies. In the process, update the way we are tracking request ops and reply lengths and results in the struct ceph_osd_request. Update the rbd and fs/ceph users appropriately. The main changes are: - we keep pointers into the request memory for fields we need to update each time the request is sent out over the wire - we keep information about the result in an array in the request struct where the users can easily get at it. Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Alex Elder <elder@inktank.com>
*	ceph: update support for PGID64, PGPOOL3, OSDENC protocol features	Sage Weil	2013-02-26	1	-3/+3
\| \| \| \| \| \| \| \| \| \| \| \|	Support (and require) the PGID64, PGPOOL3, and OSDENC protocol features. These have been present in ceph.git since v0.42, Feb 2012. Require these features to simplify support; nobody is running older userspace. Note that the new request and reply encoding is still not in place, so the new code is not yet functional. Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Alex Elder <elder@inktank.com>
*	libceph: decode into cpu-native ceph_pg type	Sage Weil	2013-02-26	1	-3/+2
\| \| \| \| \| \| \| \| \|	Always decode data into our cpu-native ceph_pg type that has the correct field widths. Limit any remaining uses of ceph_pg_v1 to dealing with the legacy protocol. Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Alex Elder <elder@inktank.com>
*	libceph: delay debugfs initialization until we learn global_id	Sage Weil	2012-08-20	1	-0/+4
\| \| \| \| \| \| \| \| \| \| \| \|	The debugfs directory includes the cluster fsid and our unique global_id. We need to delay the initialization of the debug entry until we have learned both the fsid and our global_id from the monitor or else the second client can't create its debugfs entry and will fail (and multiple client instances aren't properly reflected in debugfs). Reported by: Yan, Zheng <zheng.z.yan@intel.com> Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
*	net: cleanup unsigned to unsigned int	Eric Dumazet	2012-04-15	1	-3/+3
\| \| \| \| \| \| \|	Use of "unsigned int" is preferred to bare "unsigned" in net tree. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
*	rbd: introduce rados block device (rbd), based on libceph	Yehuda Sadeh	2010-10-20	1	-2/+1
\| \| \| \| \| \| \| \| \| \| \| \|	The rados block device (rbd), based on osdblk, creates a block device that is backed by objects stored in the Ceph distributed object storage cluster. Each device consists of a single metadata object and data striped over many data objects. The rbd driver supports read-only snapshots. Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>
*	ceph: factor out libceph from Ceph file system	Yehuda Sadeh	2010-10-20	1	-0/+268
	This factors out protocol and low-level storage parts of ceph into a separate libceph module living in net/ceph and include/linux/ceph. This is mostly a matter of moving files around. However, a few key pieces of the interface change as well: - ceph_client becomes ceph_fs_client and ceph_client, where the latter captures the mon and osd clients, and the fs_client gets the mds client and file system specific pieces. - Mount option parsing and debugfs setup is correspondingly broken into two pieces. - The mon client gets a generic handler callback for otherwise unknown messages (mds map, in this case). - The basic supported/required feature bits can be expanded (and are by ceph_fs_client). No functional change, aside from some subtle error handling cases that got cleaned up in the refactoring process. Signed-off-by: Sage Weil <sage@newdream.net>