path: root/net/ceph
Commit message  [Author, Age, Files, Lines]
...
| * libceph, rbd: ignore addr->type while comparing in some cases  [Ilya Dryomov, 2020-12-14, files: 1, lines: -2/+4]
      For libceph, this ensures that libceph instance sharing (share option) continues to work. For rbd, this avoids blocklisting alive lock owners (locker addr is always LEGACY, while watcher addr is ANY in nautilus).

      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * libceph, ceph: get and handle cluster maps with addrvecs  [Ilya Dryomov, 2020-12-14, files: 4, lines: -55/+195]
      In preparation for msgr2, make the cluster send us maps with addrvecs including both LEGACY and MSGR2 addrs instead of a single LEGACY addr. This means advertising support for SERVER_NAUTILUS and also some older features: SERVER_MIMIC, MONENC and MONNAMES.

      MONNAMES and MONENC are actually pre-argonaut, we just never updated ceph_monmap_decode() for them. Decoding is unconditional, see commit 23c625ce3065 ("libceph: assume argonaut on the server side").

      SERVER_MIMIC doesn't bear any meaning for the kernel client.

      Since ceph_decode_entity_addrvec() is guarded by encoding version checks (and in msgr2 case it is guarded implicitly by the fact that server is speaking msgr2), we assume MSG_ADDR2 for it.

      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * libceph: factor out finish_auth()  [Ilya Dryomov, 2020-12-14, files: 1, lines: -22/+30]
      In preparation for msgr2, factor out finish_auth() so it is suitable for both existing MAuth message based authentication and upcoming msgr2 authentication exchange.

      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * libceph: drop ac->ops->name field  [Ilya Dryomov, 2020-12-14, files: 2, lines: -2/+0]
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * libceph: amend cephx init_protocol() and build_request()  [Ilya Dryomov, 2020-12-14, files: 2, lines: -28/+49]
      In msgr2, initial authentication happens with an exchange of msgr2 control frames -- MAuth message and struct ceph_mon_request_header aren't used. Make that optional.

      Stop reporting cephx protocol as "x". Use "cephx" instead.

      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * libceph, ceph: incorporate nautilus cephx changes  [Ilya Dryomov, 2020-12-14, files: 6, lines: -48/+194]
      - request service tickets together with auth ticket. Currently we get auth ticket via CEPHX_GET_AUTH_SESSION_KEY op and then request service tickets via CEPHX_GET_PRINCIPAL_SESSION_KEY op in a separate message. Since nautilus, desired service tickets are shared together with auth ticket in CEPHX_GET_AUTH_SESSION_KEY reply.

      - propagate session key and connection secret, if any. In preparation for msgr2, update handle_reply() and verify_authorizer_reply() auth ops to propagate session key and connection secret. Since nautilus, if secure mode is negotiated, connection secret is shared either in CEPHX_GET_AUTH_SESSION_KEY reply (for mons) or in a final authorizer reply (for osds and mdses).

      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * libceph: safer en/decoding of cephx requests and replies  [Ilya Dryomov, 2020-12-14, files: 1, lines: -21/+26]
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * libceph: more insight into ticket expiry and invalidation  [Ilya Dryomov, 2020-12-14, files: 1, lines: -14/+25]
      Make it clear that "need" is a union of "missing" and "have, but up for renewal" and dout when the ticket goes missing due to expiry or invalidation by client.

      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
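The "need" relationship described above can be sketched with bitmasks. This is a user-space illustration, not the kernel code: the function name, its parameters, and the one-bit-per-service encoding are assumptions made for the example.

```c
#include <stdint.h>

/*
 * Hypothetical helper: each bit stands for one service.
 *   want  - services the client wants tickets for
 *   have  - services for which a ticket is currently held
 *   renew - held tickets that are up for renewal
 * Returns the set of services whose tickets must be requested.
 */
uint32_t example_tickets_needed(uint32_t want, uint32_t have, uint32_t renew)
{
    uint32_t missing = want & ~have;               /* wanted, not held */
    uint32_t up_for_renewal = want & have & renew; /* held, but aging out */

    return missing | up_for_renewal;               /* "need" is the union */
}
```

For example, with want = 0x7, have = 0x3 and renew = 0x2, the missing set is 0x4 and the renewal set is 0x2, so the result is 0x6.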
| * libceph: move msgr1 protocol specific fields to its own struct  [Ilya Dryomov, 2020-12-14, files: 2, lines: -212/+216]
      A couple whitespace fixups, no functional changes.

      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * libceph: move msgr1 protocol implementation to its own file  [Ilya Dryomov, 2020-12-14, files: 3, lines: -1496/+1504]
      A pure move, no other changes.

      Note that ceph_tcp_recv{msg,page}() and ceph_tcp_send{msg,page}() helpers are also moved. msgr2 will bring its own, more efficient, variants based on iov_iter. Switching msgr1 to them was considered but decided against to avoid subtle regressions.

      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * libceph: separate msgr1 protocol implementation  [Ilya Dryomov, 2020-12-14, files: 1, lines: -50/+88]
      In preparation for msgr2, define internal messenger <-> protocol interface (as opposed to external messenger <-> client interface, which is struct ceph_connection_operations) consisting of try_read(), try_write(), revoke(), revoke_incoming(), opened(), reset_session() and reset_protocol() ops. The semantics are exactly the same as they are now.

      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * libceph: export remaining protocol independent infrastructure  [Ilya Dryomov, 2020-12-14, files: 1, lines: -82/+75]
      In preparation for msgr2, make all protocol independent functions in messenger.c global.

      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * libceph: export zero_page  [Ilya Dryomov, 2020-12-14, files: 1, lines: -8/+9]
      In preparation for msgr2, make zero_page global.

      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * libceph: rename and export con->flags bits  [Ilya Dryomov, 2020-12-14, files: 1, lines: -43/+34]
      In preparation for msgr2, move the defines to the header file.

      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * libceph: rename and export con->state states  [Ilya Dryomov, 2020-12-14, files: 1, lines: -51/+39]
      In preparation for msgr2, rename msgr1 specific states and move the defines to the header file.

      Also drop state transition comments. They don't cover all possible transitions (e.g. NEGOTIATING -> STANDBY, etc) and currently do more harm than good.

      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * libceph: make con->state an int  [Ilya Dryomov, 2020-12-14, files: 1, lines: -10/+6]
      unsigned long is a leftover from when con->state used to be a set of bits managed with set_bit(), clear_bit(), etc. Save a bit of memory.

      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * libceph: don't export ceph_messenger_{init_fini}() to modules  [Ilya Dryomov, 2020-12-14, files: 1, lines: -2/+0]
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * libceph: make sure our addr->port is zero and addr->nonce is non-zero  [Ilya Dryomov, 2020-12-14, files: 1, lines: -10/+17]
      Our messenger instance addr->port is normally zero -- anything else is nonsensical because as a client we connect to multiple servers and don't listen on any port. However, a user can supply an arbitrary addr:port via ip option and the port is currently preserved. Zero it.

      Conversely, make sure our addr->nonce is non-zero. A zero nonce is special: in combination with a zero port, it is used to blocklist the entire ip.

      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
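The two invariants above can be sketched in user-space C. This is an illustration under stated assumptions, not the kernel change itself: struct example_addr, example_sanitize_my_addr() and the use of rand() as the nonce source are all hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the client's own address. */
struct example_addr {
    uint16_t port;
    uint32_t nonce;
};

/*
 * Enforce the invariants from the commit message: a client never
 * listens, so its port is forced to zero, and the nonce is regenerated
 * until it is non-zero, because a zero nonce combined with a zero port
 * is what is used to blocklist an entire ip.
 */
void example_sanitize_my_addr(struct example_addr *addr)
{
    addr->port = 0;
    while (addr->nonce == 0)
        addr->nonce = (uint32_t)rand(); /* assumption: any random source */
}
```

A user-supplied port is discarded, and an existing non-zero nonce is left untouched.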
| * libceph: factor out ceph_con_get_out_msg()  [Ilya Dryomov, 2020-12-14, files: 1, lines: -20/+39]
      Move the logic of grabbing the next message from the queue into its own function. Like ceph_con_in_msg_alloc(), this is protocol independent.

      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * libceph: change ceph_con_in_msg_alloc() to take hdr  [Ilya Dryomov, 2020-12-14, files: 1, lines: -5/+6]
      ceph_con_in_msg_alloc() is protocol independent, but con->in_hdr (and struct ceph_msg_header in general) is msgr1 specific. While the struct is deeply ingrained inside and outside the messenger, con->in_hdr field can be separated.

      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * libceph: change ceph_msg_data_cursor_init() to take cursor  [Ilya Dryomov, 2020-12-14, files: 1, lines: -4/+3]
      Make it possible to have local cursors and embed them outside struct ceph_msg.

      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * libceph: handle discarding acked and requeued messages separately  [Ilya Dryomov, 2020-12-14, files: 1, lines: -20/+54]
      Make it easier to follow and remove dependency on msgr1 specific CEPH_MSGR_TAG_SEQ.

      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * libceph: drop msg->ack_stamp field  [Ilya Dryomov, 2020-12-14, files: 1, lines: -1/+0]
      It is set in process_ack() but never used.

      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * libceph: remove redundant session reset log message  [Ilya Dryomov, 2020-12-14, files: 1, lines: -4/+3]
      Stick with pr_info message because session reset isn't an error most of the time. When it is (i.e. if the server denies the reconnect attempt), we get a bunch of other pr_err messages.

      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * libceph: clear con->peer_global_seq on RESETSESSION  [Ilya Dryomov, 2020-12-14, files: 1, lines: -3/+3]
      con->peer_global_seq is part of session state. Clear it when the server tells us to reset, not just in ceph_con_close().

      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * libceph: rename reset_connection() to ceph_con_reset_session()  [Ilya Dryomov, 2020-12-14, files: 1, lines: -6/+4]
      With just session reset bits left, rename appropriately.

      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * libceph: split protocol reset bits out of reset_connection()  [Ilya Dryomov, 2020-12-14, files: 1, lines: -26/+24]
      Move protocol reset bits into ceph_con_reset_protocol(), leaving just session reset bits.

      Note that con->out_skip is now reset on faults. This fixes a crash in the case of a stateful session getting a fault while in the middle of revoking a message.

      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * libceph: don't call reset_connection() on version/feature mismatches  [Ilya Dryomov, 2020-12-14, files: 1, lines: -3/+0]
      A fault due to a version mismatch or a feature set mismatch used to be treated differently from other faults: the connection would get closed without trying to reconnect and there was a ->bad_proto() connection op for notifying about that.

      This changed a long time ago, see commits 6384bb8b8e88 ("libceph: kill bad_proto ceph connection op") and 0fa6ebc600bc ("libceph: fix protocol feature mismatch failure path"). Nowadays these aren't any different from other faults (i.e. we try to reconnect even though the mismatch won't resolve until the server is replaced). reset_connection() calls there are rather confusing because reset_connection() resets a session together with an individual instance of the protocol. This is cleaned up in the next patch.

      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * libceph: lower exponential backoff delay  [Ilya Dryomov, 2020-12-14, files: 1, lines: -3/+9]
      The current setting allows the backoff to climb up to 5 minutes. This is too high -- it becomes hard to tell whether the client is stuck on something or just in backoff.

      In userspace, ms_max_backoff is defaulted to 15 seconds. Let's do the same.

      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
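A capped exponential backoff of the kind described above can be sketched as follows. Only the 15 second cap comes from the commit message; the 250 ms starting delay, the macro names and the function name are assumptions chosen for the illustration.

```c
/* Assumed starting delay; the commit message does not state one. */
#define EXAMPLE_BASE_DELAY_MS 250UL
/* Cap taken from the message: match the userspace default of 15 seconds. */
#define EXAMPLE_MAX_DELAY_MS  15000UL

/*
 * Return the delay to use after another failed attempt: start at the
 * base delay, double on every subsequent failure, and never exceed
 * the cap.
 */
unsigned long example_next_delay_ms(unsigned long cur_ms)
{
    if (cur_ms == 0)
        return EXAMPLE_BASE_DELAY_MS;
    if (cur_ms >= EXAMPLE_MAX_DELAY_MS / 2)
        return EXAMPLE_MAX_DELAY_MS; /* doubling would reach or pass the cap */
    return cur_ms * 2;
}
```

Under these assumptions the delay sequence is 250, 500, 1000, 2000, 4000, 8000 and then 15000 ms for every later attempt, so a client in backoff retries at least every 15 seconds.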
| * libceph: include middle_len in process_message() dout  [Ilya Dryomov, 2020-12-14, files: 1, lines: -1/+2]
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* libceph: clear con->out_msg on Policy::stateful_server faults  [Ilya Dryomov, 2020-10-12, files: 1, lines: -0/+5]
    con->out_msg must be cleared on Policy::stateful_server (!CEPH_MSG_CONNECT_LOSSY) faults. Not doing so botches the reconnection attempt, because after writing the banner the messenger moves on to writing the data section of that message (either from where it got interrupted by the connection reset or from the beginning) instead of writing struct ceph_msg_connect.

    This results in a bizarre error message because the server sends CEPH_MSGR_TAG_BADPROTOVER but we think we wrote struct ceph_msg_connect:

      libceph: mds0 (1)172.21.15.45:6828 socket error on write
      ceph: mds0 reconnect start
      libceph: mds0 (1)172.21.15.45:6829 socket closed (con state OPEN)
      libceph: mds0 (1)172.21.15.45:6829 protocol version mismatch, my 32 != server's 32
      libceph: mds0 (1)172.21.15.45:6829 protocol version mismatch

    AFAICT this bug goes back to the dawn of the kernel client. The reason it survived for so long is that only MDS sessions are stateful and only two MDS messages have a data section: CEPH_MSG_CLIENT_RECONNECT (always, but reconnecting is rare) and CEPH_MSG_CLIENT_REQUEST (only when xattrs are involved). The connection has to get reset precisely when such message is being sent -- in this case it was the former.

    Cc: stable@vger.kernel.org
    Link: https://tracker.ceph.com/issues/47723
    Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
    Reviewed-by: Jeff Layton <jlayton@kernel.org>
* libceph: format ceph_entity_addr nonces as unsigned  [Ilya Dryomov, 2020-10-12, files: 1, lines: -3/+3]
    Match the server side logs.

    Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* libceph: move a dout in queue_con_delay()  [Ilya Dryomov, 2020-10-12, files: 1, lines: -1/+1]
    The queued con->work can start executing (and therefore logging) before we get to this "con->work has been queued" message, making the logs confusing. Move it up, with the meaning of "con->work is about to be queued".

    Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* libceph: switch to the new "osd blocklist add" command  [Ilya Dryomov, 2020-10-12, files: 1, lines: -15/+52]
    Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* libceph, rbd, ceph: "blacklist" -> "blocklist"  [Ilya Dryomov, 2020-10-12, files: 1, lines: -4/+4]
    Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* libceph: multiple workspaces for CRUSH computations  [Ilya Dryomov, 2020-10-12, files: 1, lines: -15/+151]
    Replace a global map->crush_workspace (protected by a global mutex) with a list of workspaces, up to the number of CPUs + 1.

    This is based on a patch from Robin Geuze <robing@nl.team.blue>. Robin and his team have observed a 10-20% increase in IOPS on all queue depths and lower CPU usage as well on a high-end all-NVMe 100GbE cluster.

    Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* libceph: use sendpage_ok() in ceph_tcp_sendpage()  [Coly Li, 2020-10-02, files: 1, lines: -1/+1]
    In libceph, ceph_tcp_sendpage() performs the following check before handing the page to the network layer's zero-copy sendpage method:

      if (page_count(page) >= 1 && !PageSlab(page))

    This check is exactly what sendpage_ok() does. This patch replaces the open-coded check with sendpage_ok() as a code cleanup.

    Signed-off-by: Coly Li <colyli@suse.de>
    Acked-by: Jeff Layton <jlayton@kernel.org>
    Cc: Ilya Dryomov <idryomov@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
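The predicate quoted above can be mimicked in user space to show what it tests. This is a mock, not kernel code: struct example_page, its fields and example_sendpage_ok() are hypothetical stand-ins for struct page, page_count(), PageSlab() and sendpage_ok().

```c
#include <stdbool.h>

/* Hypothetical stand-in for the two properties of a page being checked. */
struct example_page {
    int refcount; /* models page_count(page) */
    bool slab;    /* models PageSlab(page) */
};

/*
 * Mirrors the open-coded condition from the commit message: a page may
 * take the zero-copy path only if it holds at least one reference and
 * is not a slab page.
 */
bool example_sendpage_ok(const struct example_page *page)
{
    return page->refcount >= 1 && !page->slab;
}
```

A referenced, non-slab page passes. A slab page, or a page with a zero reference count, fails the check.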
* treewide: Use fallthrough pseudo-keyword  [Gustavo A. R. Silva, 2020-08-23, files: 5, lines: -16/+16]
    Replace the existing /* fall through */ comments and their variants with the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary fall-through markings when it is the case.

    [1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through

    Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
* libceph: replace HTTP links with HTTPS ones  [Alexander A. Klimov, 2020-08-03, files: 4, lines: -4/+4]
    Rationale: Reduces attack surface on kernel devs opening the links for MITM as HTTPS traffic is much harder to manipulate.

    Deterministic algorithm:
    For each file:
      If not .svg:
        For each line:
          If doesn't contain `\bxmlns\b`:
            For each link, `\bhttp://[^# \t\r\n]*(?:\w|/)`:
              If neither `\bgnu\.org/license`, nor `\bmozilla\.org/MPL\b`:
                If both the HTTP and HTTPS versions return 200 OK and serve the same content:
                  Replace HTTP with HTTPS.

    [ idryomov: Do the same for the CRUSH paper and replace ceph.newdream.net with ceph.io. ]

    Signed-off-by: Alexander A. Klimov <grandmaster@al2klimov.de>
    Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
    Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* libceph: just have osd_req_op_init() return a pointer  [Jeff Layton, 2020-08-03, files: 1, lines: -23/+16]
    The caller can just ignore the return. No need for this wrapper that just casts the other function to void.

    [ idryomov: argument alignment ]

    Signed-off-by: Jeff Layton <jlayton@kernel.org>
    Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
    Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* libceph: dump class and method names on method calls  [Ilya Dryomov, 2020-08-03, files: 1, lines: -0/+3]
    Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* libceph: use target_copy() in send_linger()  [Ilya Dryomov, 2020-08-03, files: 1, lines: -3/+1]
    Instead of copying just oloc, oid and flags, copy the entire linger target. This is more for consistency than anything else, as send_linger() -> submit_request() -> __submit_request() sends the request regardless of what calc_target() says (i.e. both on CALC_TARGET_NO_ACTION and CALC_TARGET_NEED_RESEND).

    Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
    Reviewed-by: Jeff Layton <jlayton@kernel.org>
* libceph: don't omit used_replica in target_copy()  [Ilya Dryomov, 2020-06-16, files: 1, lines: -0/+1]
    Currently target_copy() is used only for sending linger pings, so this doesn't come up, but generally omitting used_replica can hang the client as we wouldn't notice the acting set change (legacy_change in calc_target()) or trigger a warning in handle_reply().

    Fixes: 117d96a04f00 ("libceph: support for balanced and localized reads")
    Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
    Reviewed-by: Jeff Layton <jlayton@kernel.org>
* libceph: don't omit recovery_deletes in target_copy()  [Ilya Dryomov, 2020-06-16, files: 1, lines: -0/+1]
    Currently target_copy() is used only for sending linger pings, so this doesn't come up, but generally omitting recovery_deletes can result in unneeded resends (force_resend in calc_target()).

    Fixes: ae78dd8139ce ("libceph: make RECOVERY_DELETES feature create a new interval")
    Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
    Reviewed-by: Jeff Layton <jlayton@kernel.org>
* libceph: move away from global osd_req_flags  [Ilya Dryomov, 2020-06-16, files: 2, lines: -13/+8]
    osd_req_flags is overly general and doesn't suit its only user (read_from_replica option) well:

    - applying osd_req_flags in account_request() affects all OSD requests, including linger (i.e. watch and notify). However, linger requests should always go to the primary even though some of them are reads (e.g. notify has side effects but it is a read because it doesn't result in mutation on the OSDs).

    - calls to class methods that are reads are allowed to go to the replica, but most such calls issued for "rbd map" and/or exclusive lock transitions are requested to be resent to the primary via EAGAIN, doubling the latency.

    Get rid of global osd_req_flags and set read_from_replica flag only on specific OSD requests instead.

    Fixes: 8ad44d5e0d1e ("libceph: read_from_replica option")
    Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
    Reviewed-by: Jeff Layton <jlayton@kernel.org>
* Merge tag 'ceph-for-5.8-rc1' of git://github.com/ceph/ceph-client  [Linus Torvalds, 2020-06-08, files: 5, lines: -60/+490]
    Pull ceph updates from Ilya Dryomov:

    "The highlights are:

     - OSD/MDS latency and caps cache metrics infrastructure for the filesystem (Xiubo Li). Currently available through debugfs and will be periodically sent to the MDS in the future.

     - support for replica reads (balanced and localized reads) for rbd and the filesystem (myself). The default remains to always read from primary, users can opt-in with the new crush_location and read_from_replica options. Note that reading from replica is safe for general use only since Octopus.

     - support for RADOS allocation hint flags (myself). Currently used by rbd to propagate the compressible/incompressible hint given with the new compression_hint map option and ready for passing on more advanced hints, e.g. based on fadvise() from the filesystem.

     - support for efficient cross-quota-realm renames (Luis Henriques)

     - assorted cap handling improvements and cleanups, particularly untangling some of the locking (Jeff Layton)"

    * tag 'ceph-for-5.8-rc1' of git://github.com/ceph/ceph-client: (29 commits)
      rbd: compression_hint option
      libceph: support for alloc hint flags
      libceph: read_from_replica option
      libceph: support for balanced and localized reads
      libceph: crush_location infrastructure
      libceph: decode CRUSH device/bucket types and names
      libceph: add non-asserting rbtree insertion helper
      ceph: skip checking caps when session reconnecting and releasing reqs
      ceph: make sure mdsc->mutex is nested in s->s_mutex to fix dead lock
      ceph: don't return -ESTALE if there's still an open file
      libceph, rbd: replace zero-length array with flexible-array
      ceph: allow rename operation under different quota realms
      ceph: normalize 'delta' parameter usage in check_quota_exceeded
      ceph: ceph_kick_flushing_caps needs the s_mutex
      ceph: request expedited service on session's last cap flush
      ceph: convert mdsc->cap_dirty to a per-session list
      ceph: reset i_requested_max_size if file write is not wanted
      ceph: throw a warning if we destroy session with mutex still locked
      ceph: fix potential race in ceph_check_caps
      ceph: document what protects i_dirty_item and i_flushing_item
      ...
| * libceph: support for alloc hint flags  [Ilya Dryomov, 2020-06-01, files: 1, lines: -1/+7]
      Allow indicating future I/O pattern via flags. This is supported since Kraken (and bluestore persists flags together with expected_object_size and expected_write_size).

      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
      Reviewed-by: Jason Dillaman <dillaman@redhat.com>
| * libceph: read_from_replica option  [Ilya Dryomov, 2020-06-01, files: 2, lines: -1/+43]
      Expose replica reads through read_from_replica=balance and read_from_replica=localize. The default is to read from primary (read_from_replica=no).

      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
      Reviewed-by: Jeff Layton <jlayton@kernel.org>
| * libceph: support for balanced and localized reads  [Ilya Dryomov, 2020-06-01, files: 3, lines: -6/+189]
      OSD-side issues with reads from replica have been resolved in Octopus. Reading from replica should be safe wrt. unstable or uncommitted state now, so add support for balanced and localized reads.

      There are two cases when a read from replica can't be served:

      - OSD may silently drop the request, expecting the client to notice that the acting set has changed and resend via the usual means (handled with t->used_replica)

      - OSD may return EAGAIN, expecting the client to resend to the primary, ignoring replica read flags (see handle_reply())

      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
      Reviewed-by: Jeff Layton <jlayton@kernel.org>
| * libceph: crush_location infrastructure  [Ilya Dryomov, 2020-06-01, files: 2, lines: -0/+152]
      Allow expressing client's location in terms of CRUSH hierarchy as a set of (bucket type name, bucket name) pairs. The userspace syntax "crush_location = key1=value1 key2=value2" is incompatible with mount options and needed adaptation. Key-value pairs are separated by '|' and we use ':' instead of '=' to separate keys from values. So for:

        crush_location = host=foo rack=bar

      one would write:

        crush_location=host:foo|rack:bar

      As in userspace, "multipath" locations are supported, so indicating locality for parallel hierarchies is possible:

        crush_location=rack:foo1|rack:foo2|datacenter:bar

      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
      Reviewed-by: Jeff Layton <jlayton@kernel.org>
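The adapted syntax can be illustrated with a small user-space parser. This is a sketch under stated assumptions, not the kernel implementation: struct example_loc_pair, example_parse_crush_location() and the fixed 32-byte buffers are hypothetical.

```c
#include <string.h>

/* Hypothetical container for one (bucket type name, bucket name) pair. */
struct example_loc_pair {
    char type[32];
    char name[32];
};

/*
 * Parse "type:name|type:name|..." into out[0..max-1].
 * Returns the number of pairs parsed, or -1 on malformed input
 * (missing ':', empty type or name, empty item, too many pairs,
 * or a component that does not fit the buffers).
 */
int example_parse_crush_location(const char *s,
                                 struct example_loc_pair *out, int max)
{
    int n = 0;

    while (*s) {
        size_t item_len = strcspn(s, "|");            /* up to next '|' */
        const char *colon = memchr(s, ':', item_len); /* first ':' in item */
        size_t type_len, name_len;

        if (!colon || n == max)
            return -1;
        type_len = (size_t)(colon - s);
        name_len = item_len - type_len - 1;
        if (type_len == 0 || name_len == 0 ||
            type_len >= sizeof(out[n].type) ||
            name_len >= sizeof(out[n].name))
            return -1;

        memcpy(out[n].type, s, type_len);
        out[n].type[type_len] = '\0';
        memcpy(out[n].name, colon + 1, name_len);
        out[n].name[name_len] = '\0';
        n++;

        s += item_len;
        if (*s == '|') {
            s++;
            if (*s == '\0')
                return -1; /* trailing separator */
        }
    }
    return n;
}
```

With this sketch, the string host:foo|rack:bar yields two pairs, (host, foo) and (rack, bar). The userspace form host=foo is rejected because it has no ':' separator. Repeated types, as in rack:foo1|rack:foo2|datacenter:bar, are kept as separate pairs.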