summaryrefslogtreecommitdiffstats
path: root/net/ceph
Commit message (Collapse)AuthorAgeFilesLines
* libceph: don't access invalid memory in keepalive2 pathIlya Dryomov2015-09-171-4/+5
| | | | | | | | | | | | | | | | This struct ceph_timespec ceph_ts; ... con_out_kvec_add(con, sizeof(ceph_ts), &ceph_ts); wraps ceph_ts into a kvec and adds it to con->out_kvec array, yet ceph_ts becomes invalid on return from prepare_write_keepalive(). As a result, we send out bogus keepalive2 stamps. Fix this by encoding into a ceph_timespec member, similar to how acks are read and written. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Yan, Zheng <zyan@redhat.com>
* Merge branch 'for-linus' of ↵Linus Torvalds2015-09-116-66/+111
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull Ceph update from Sage Weil: "There are a few fixes for snapshot behavior with CephFS and support for the new keepalive protocol from Zheng, a libceph fix that affects both RBD and CephFS, a few bug fixes and cleanups for RBD from Ilya, and several small fixes and cleanups from Jianpeng and others" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: ceph: improve readahead for file holes ceph: get inode size for each append write libceph: check data_len in ->alloc_msg() libceph: use keepalive2 to verify the mon session is alive rbd: plug rbd_dev->header.object_prefix memory leak rbd: fix double free on rbd_dev->header_name libceph: set 'exists' flag for newly up osd ceph: cleanup use of ceph_msg_get ceph: no need to get parent inode in ceph_open ceph: remove the useless judgement ceph: remove redundant test of head->safe and silence static analysis warnings ceph: fix queuing inode to mdsdir's snaprealm libceph: rename con_work() to ceph_con_workfn() libceph: Avoid holding the zero page on ceph_msgr_slab_init errors libceph: remove the unused macro AES_KEY_SIZE ceph: invalidate dirty pages after forced umount ceph: EIO all operations after forced umount
| * libceph: check data_len in ->alloc_msg()Ilya Dryomov2015-09-092-40/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Only ->alloc_msg() should check data_len of the incoming message against the preallocated ceph_msg, doing it in the messenger is not right. The contract is that either ->alloc_msg() returns a ceph_msg which will fit all of the portions of the incoming message, or it returns NULL and possibly sets skip, signaling whether NULL is due to an -ENOMEM. ->alloc_msg() should be the only place where we make the skip/no-skip decision. I stumbled upon this while looking at con/osd ref counting. Right now, if we get a non-extent message with a larger data portion than we are prepared for, ->alloc_msg() returns a ceph_msg, and then, when we skip it in the messenger, we don't put the con/osd ref acquired in ceph_con_in_msg_alloc() (which is normally put in process_message()), so this also fixes a memory leak. An existing BUG_ON in ceph_msg_data_cursor_init() ensures we don't corrupt random memory should a buggy ->alloc_msg() return an unfit ceph_msg. While at it, I changed the "unknown tid" dout() to a pr_warn() to make sure all skips are seen and unified format strings. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Alex Elder <elder@linaro.org>
| * libceph: use keepalive2 to verify the mon session is aliveYan, Zheng2015-09-083-13/+84
| | | | | | | | | | Signed-off-by: Yan, Zheng <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * libceph: set 'exists' flag for newly up osdYan, Zheng2015-09-081-1/+1
| | | | | | | | | | | | Signed-off-by: Yan, Zheng <zyan@redhat.com> Reviewed-by: Sage Weil <sage@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * libceph: rename con_work() to ceph_con_workfn()Ilya Dryomov2015-09-081-3/+3
| | | | | | | | | | | | | | Even though it's static, con_work(), being a work func, shows up in various stacktraces a lot. Prefix it with ceph_. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * libceph: Avoid holding the zero page on ceph_msgr_slab_init errorsBenoît Canet2015-09-081-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | ceph_msgr_slab_init may fail due to a temporary ENOMEM. Delay a bit the initialization of zero_page in ceph_msgr_init and reorder its cleanup in _ceph_msgr_exit so it's done in reverse order of setup. BUG_ON() will not suffer to be postponed in case it is triggered. Signed-off-by: Benoît Canet <benoit.canet@nodalink.com> Reviewed-by: Alex Elder <elder@linaro.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * libceph: remove the unused macro AES_KEY_SIZENicholas Krause2015-09-081-4/+0
| | | | | | | | | | | | | | | | | | This removes the no longer used macro AES_KEY_SIZE as no functions use this macro anymore and thus this macro can be removed due it no longer being required. Signed-off-by: Nicholas Krause <xerofoify@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* | fs: create and use seq_show_option for escapingKees Cook2015-09-041-2/+5
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Many file systems that implement the show_options hook fail to correctly escape their output which could lead to unescaped characters (e.g. new lines) leaking into /proc/mounts and /proc/[pid]/mountinfo files. This could lead to confusion, spoofed entries (resulting in things like systemd issuing false d-bus "mount" notifications), and who knows what else. This looks like it would only be the root user stepping on themselves, but it's possible weird things could happen in containers or in other situations with delegated mount privileges. Here's an example using overlay with setuid fusermount trusting the contents of /proc/mounts (via the /etc/mtab symlink). Imagine the use of "sudo" is something more sneaky: $ BASE="ovl" $ MNT="$BASE/mnt" $ LOW="$BASE/lower" $ UP="$BASE/upper" $ WORK="$BASE/work/ 0 0 none /proc fuse.pwn user_id=1000" $ mkdir -p "$LOW" "$UP" "$WORK" $ sudo mount -t overlay -o "lowerdir=$LOW,upperdir=$UP,workdir=$WORK" none /mnt $ cat /proc/mounts none /root/ovl/mnt overlay rw,relatime,lowerdir=ovl/lower,upperdir=ovl/upper,workdir=ovl/work/ 0 0 none /proc fuse.pwn user_id=1000 0 0 $ fusermount -u /proc $ cat /proc/mounts cat: /proc/mounts: No such file or directory This fixes the problem by adding new seq_show_option and seq_show_option_n helpers, and updating the vulnerable show_option handlers to use them as needed. Some, like SELinux, need to be open coded due to unusual existing escape mechanisms. [akpm@linux-foundation.org: add lost chunk, per Kees] [keescook@chromium.org: seq_show_option should be using const parameters] Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Acked-by: Jan Kara <jack@suse.com> Acked-by: Paul Moore <paul@paul-moore.com> Cc: J. R. Okajima <hooanon05g@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* libceph: treat sockaddr_storage with uninitialized family as blankIlya Dryomov2015-07-091-7/+7
| | | | | | | | | | | | | | | | addr_is_blank() should return true if family is neither AF_INET nor AF_INET6. This is what its counterpart entity_addr_t::is_blank_ip() is doing and it is the right thing to do: in process_banner() we check if our address is blank and if it is "learn" it from our peer. As it is, we never learn our address and always send out a blank one. This goes way back to ceph.git commit dd732cbfc1c9 ("use sockaddr_storage; and some ipv6 support groundwork") from 2009. While at at, do not open-code ipv6_addr_any() and use INADDR_ANY constant instead of 0. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Sage Weil <sage@redhat.com>
* libceph: enable ceph in a non-default network namespaceIlya Dryomov2015-07-092-7/+19
| | | | | | | | | | | | | | Grab a reference on a network namespace of the 'rbd map' (in case of rbd) or 'mount' (in case of ceph) process and use that to open sockets instead of always using init_net and bailing if network namespace is anything but init_net. Be careful to not share struct ceph_client instances between different namespaces and don't add any code in the !CONFIG_NET_NS case. This is based on a patch from Hong Zhiguo <zhiguohong@tencent.com>. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Sage Weil <sage@redhat.com>
* Merge branch 'for-linus' of ↵Linus Torvalds2015-07-0210-119/+197
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull Ceph updates from Sage Weil: "We have a pile of bug fixes from Ilya, including a few patches that sync up the CRUSH code with the latest from userspace. There is also a long series from Zheng that fixes various issues with snapshots, inline data, and directory fsync, some simplification and improvement in the cap release code, and a rework of the caching of directory contents. To top it off there are a few small fixes and cleanups from Benoit and Hong" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (40 commits) rbd: use GFP_NOIO in rbd_obj_request_create() crush: fix a bug in tree bucket decode libceph: Fix ceph_tcp_sendpage()'s more boolean usage libceph: Remove spurious kunmap() of the zero page rbd: queue_depth map option rbd: store rbd_options in rbd_device rbd: terminate rbd_opts_tokens with Opt_err ceph: fix ceph_writepages_start() rbd: bump queue_max_segments ceph: rework dcache readdir crush: sync up with userspace crush: fix crash from invalid 'take' argument ceph: switch some GFP_NOFS memory allocation to GFP_KERNEL ceph: pre-allocate data structure that tracks caps flushing ceph: re-send flushing caps (which are revoked) in reconnect stage ceph: send TID of the oldest pending caps flush to MDS ceph: track pending caps flushing globally ceph: track pending caps flushing accurately libceph: fix wrong name "Ceph filesystem for Linux" ceph: fix directory fsync ...
| * crush: fix a bug in tree bucket decodeIlya Dryomov2015-07-011-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | struct crush_bucket_tree::num_nodes is u8, so ceph_decode_8_safe() should be used. -Wconversion catches this, but I guess it went unnoticed in all the noise it spews. The actual problem (at least for common crushmaps) isn't the u32 -> u8 truncation though - it's the advancement by 4 bytes instead of 1 in the crushmap buffer. Fixes: http://tracker.ceph.com/issues/2759 Cc: stable@vger.kernel.org Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Josh Durgin <jdurgin@redhat.com>
| * libceph: Fix ceph_tcp_sendpage()'s more boolean usageBenoît Canet2015-06-291-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | From struct ceph_msg_data_cursor in include/linux/ceph/messenger.h: bool last_piece; /* current is last piece */ In ceph_msg_data_next(): *last_piece = cursor->last_piece; A call to ceph_msg_data_next() is followed by: ret = ceph_tcp_sendpage(con->sock, page, page_offset, length, last_piece); while ceph_tcp_sendpage() is: static int ceph_tcp_sendpage(struct socket *sock, struct page *page, int offset, size_t size, bool more) The logic is inverted: correct it. Signed-off-by: Benoît Canet <benoit.canet@nodalink.com> Reviewed-by: Alex Elder <elder@linaro.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * libceph: Remove spurious kunmap() of the zero pageBenoît Canet2015-06-251-1/+0
| | | | | | | | | | | | | | | | | | ceph_tcp_sendpage already does the work of mapping/unmapping the zero page if needed. Signed-off-by: Benoît Canet <benoit.canet@nodalink.com> Reviewed-by: Alex Elder <elder@linaro.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * crush: sync up with userspaceIlya Dryomov2015-06-254-75/+115
| | | | | | | | | | | | | | | | | | | | | | | | .. up to ceph.git commit 1db1abc8328d ("crush: eliminate ad hoc diff between kernel and userspace"). This fixes a bunch of recently pulled coding style issues and makes includes a bit cleaner. A patch "crush:Make the function crush_ln static" from Nicholas Krause <xerofoify@gmail.com> is folded in as crush_ln() has been made static in userspace as well. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * crush: fix crash from invalid 'take' argumentIlya Dryomov2015-06-251-2/+9
| | | | | | | | | | | | | | | | | | Verify that the 'take' argument is a valid device or bucket. Otherwise ignore it (do not add the value to the working vector). Reflects ceph.git commit 9324d0a1af61e1c234cc48e2175b4e6320fff8f4. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * libceph: fix wrong name "Ceph filesystem for Linux"Hong Zhiguo2015-06-251-1/+1
| | | | | | | | | | | | | | | | modinfo libceph prints the module name "Ceph filesystem for Linux", which is same as the real fs module ceph. It's confusing. Signed-off-by: Hong Zhiguo <zhiguohong@tencent.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * libceph: a couple tweaks for wait loopsIlya Dryomov2015-06-252-5/+4
| | | | | | | | | | | | | | | | | | - return -ETIMEDOUT instead of -EIO in case of timeout - wait_event_interruptible_timeout() returns time left until timeout and since it can be almost LONG_MAX we had better assign it to long Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Alex Elder <elder@linaro.org>
| * libceph: store timeouts in jiffies, verify user inputIlya Dryomov2015-06-253-20/+47
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are currently three libceph-level timeouts that the user can specify on mount: mount_timeout, osd_idle_ttl and osdkeepalive. All of these are in seconds and no checking is done on user input: negative values are accepted, we multiply them all by HZ which may or may not overflow, arbitrarily large jiffies then get added together, etc. There is also a bug in the way mount_timeout=0 is handled. It's supposed to mean "infinite timeout", but that's not how wait.h APIs treat it and so __ceph_open_session() for example will busy loop without much chance of being interrupted if none of ceph-mons are there. Fix all this by verifying user input, storing timeouts capped by msecs_to_jiffies() in jiffies and using the new ceph_timeout_jiffies() helper for all user-specified waits to handle infinite timeouts correctly. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Alex Elder <elder@linaro.org>
| * libceph: use kvfree() instead of open-coding itIlya Dryomov2015-06-251-4/+1
| | | | | | | | | | | | | | This one sneaked in through vfs tree with commit 2b777c9dd9eb ("ceph_sync_read: stop poking into iov_iter guts"). Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * libceph: allow setting osd_req_op's flagsYan, Zheng2015-06-251-9/+15
| | | | | | | | | | Signed-off-by: Yan, Zheng <zyan@redhat.com> Reviewed-by: Alex Elder <elder@linaro.org>
| * libceph: properly release STAT request's raw_data_inYan, Zheng2015-06-251-0/+3
| | | | | | | | | | Signed-off-by: Yan, Zheng <zyan@redhat.com> Reviewed-by: Alex Elder <elder@linaro.org>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2015-06-011-13/+20
|\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Conflicts: drivers/net/phy/amd-xgbe-phy.c drivers/net/wireless/iwlwifi/Kconfig include/net/mac80211.h iwlwifi/Kconfig and mac80211.h were both trivial overlapping changes. The drivers/net/phy/amd-xgbe-phy.c file got removed in 'net-next' and the bug fix that happened on the 'net' side is already integrated into the rest of the amd-xgbe driver. Signed-off-by: David S. Miller <davem@davemloft.net>
| * Revert "libceph: clear r_req_lru_item in __unregister_linger_request()"Ilya Dryomov2015-05-201-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit ba9d114ec5578e6e99a4dfa37ff8ae688040fd64. .. which introduced a regression that prevented all lingering requests requeued in kick_requests() from ever being sent to the OSDs, resulting in a lot of missed notifies. In retrospect it's pretty obvious that r_req_lru_item item in the case of lingering requests can be used not only for notarget, but also for unsent linkage due to how tightly actual map and enqueue operations are coupled in __map_request(). The assertion that was being silenced is taken care of in the previous ("libceph: request a new osdmap if lingering request maps to no osd") commit: by always kicking homeless lingering requests we ensure that none of them ends up on the notarget list outside of the critical section guarded by request_mutex. Cc: stable@vger.kernel.org # 3.18+, needs b0494532214b "libceph: request a new osdmap if lingering request maps to no osd" Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Sage Weil <sage@redhat.com>
| * libceph: request a new osdmap if lingering request maps to no osdIlya Dryomov2015-05-201-11/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This commit does two things. First, if there are any homeless lingering requests, we now request a new osdmap even if the osdmap that is being processed brought no changes, i.e. if a given lingering request turned homeless in one of the previous epochs and remained homeless in the current epoch. Not doing so leaves us with a stale osdmap and as a result we may miss our window for reestablishing the watch and lose notifies. MON=1 OSD=1: # cat linger-needmap.sh #!/bin/bash rbd create --size 1 test DEV=$(rbd map test) ceph osd out 0 rbd map dne/dne # obtain a new osdmap as a side effect (!) sleep 1 ceph osd in 0 rbd resize --size 2 test # rbd info test | grep size -> 2M # blockdev --getsize $DEV -> 1M N.B.: Not obtaining a new osdmap in between "osd out" and "osd in" above is enough to make it miss that resize notify, but that is a bug^Wlimitation of ceph watch/notify v1. Second, homeless lingering requests are now kicked just like those lingering requests whose mapping has changed. This is mainly to recognize that a homeless lingering request makes no sense and to preserve the invariant that a registered lingering request is not sitting on any of r_req_lru_item lists. This spares us a WARN_ON, which commit ba9d114ec557 ("libceph: clear r_req_lru_item in __unregister_linger_request()") tried to fix the _wrong_ way. Cc: stable@vger.kernel.org # 3.10+ Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Sage Weil <sage@redhat.com>
* | net: Add a struct net parameter to sock_create_kernEric W. Biederman2015-05-111-2/+2
|/ | | | | | | | This is long overdue, and is part of cleaning up how we allocate kernel sockets that don't reference count struct net. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* crush: straw2 bucket type with an efficient 64-bit crush_ln()Ilya Dryomov2015-04-224-0/+306
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is an improved straw bucket that correctly avoids any data movement between items A and B when neither A nor B's weights are changed. Said differently, if we adjust the weight of item C (including adding it anew or removing it completely), we will only see inputs move to or from C, never between other items in the bucket. Notably, there is not intermediate scaling factor that needs to be calculated. The mapping function is a simple function of the item weights. The below commits were squashed together into this one (mostly to avoid adding and then yanking a ~6000 lines worth of crush_ln_table): - crush: add a straw2 bucket type - crush: add crush_ln to calculate nature log efficently - crush: improve straw2 adjustment slightly - crush: change crush_ln to provide 32 more digits - crush: fix crush_get_bucket_item_weight and bucket destroy for straw2 - crush/mapper: fix divide-by-0 in straw2 (with div64_s64() for draw = ln / w and INT64_MIN -> S64_MIN - need to create a proper compat.h in ceph.git) Reflects ceph.git commits 242293c908e923d474910f2b8203fa3b41eb5a53, 32a1ead92efcd351822d22a5fc37d159c65c1338, 6289912418c4a3597a11778bcf29ed5415117ad9, 35fcb04e2945717cf5cfe150b9fa89cb3d2303a1, 6445d9ee7290938de1e4ee9563912a6ab6d8ee5f, b5921d55d16796e12d66ad2c4add7305f9ce2353. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* crush: ensuring at most num-rep osds are selectedIlya Dryomov2015-04-221-4/+12
| | | | | | | | | | | | | Crush temporary buffers are allocated as per replica size configured by the user. When there are more final osds (to be selected as per rule) than the replicas, buffer overlaps and it causes crash. Now, it ensures that at most num-rep osds are selected even if more number of osds are allowed by the rule. Reflects ceph.git commits 6b4d1aa99718e3b367496326c1e64551330fabc0, 234b066ba04976783d15ff2abc3e81b6cc06fb10. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* crush: drop unnecessary include from mapper.cIlya Dryomov2015-04-221-1/+0
| | | | Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* libceph: expose client options through debugfsIlya Dryomov2015-04-201-0/+24
| | | | | | Add a client_options attribute for showing libceph options. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* libceph, ceph: split ceph_show_options()Ilya Dryomov2015-04-201-0/+37
| | | | | | | | | Split ceph_show_options() into two pieces and move the piece responsible for printing client (libceph) options into net/ceph. This way people adding a libceph option wouldn't have to remember to update code in fs/ceph. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* libceph: don't overwrite specific con error msgsIlya Dryomov2015-04-201-12/+13
| | | | | | | | | | | | | - specific con->error_msg messages (e.g. "protocol version mismatch") end up getting overwritten by a catch-all "socket error on read / write", introduced in commit 3a140a0d5c4b ("libceph: report socket read/write error message") - "bad message sequence # for incoming message" loses to "bad crc" due to the fact that -EBADMSG is used for both Fix it, and tidy up con->error_msg assignments and pr_errs while at it. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* Revert "libceph: use memalloc flags for net IO"Ilya Dryomov2015-04-071-8/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit 89baaa570ab0b476db09408d209578cfed700e9f. Dirty page throttling should be sufficient for us in the general case so there is no need to use __GFP_MEMALLOC - it would be needed only in the swap-over-rbd case, which we currently don't support. (It would probably take approximately the commit that is being reverted to add that support, but we would also need the "swap" option to distinguish from the general case and make sure swap ceph_client-s aren't shared with anything else.) See ceph-devel threads [1] and [2] for the details of why enabling pfmemalloc reserves for all cases is a bad thing. On top of potential system lockups related to drained emergency reserves, this turned out to cause ceph lockups in case peers are on the same host and communicating via loopback due to sk_filter() dropping pfmemalloc skbs on the receiving side because the receiving loopback socket is not tagged with SOCK_MEMALLOC. [1] "SOCK_MEMALLOC vs loopback" http://www.spinics.net/lists/ceph-devel/msg22998.html [2] "[PATCH] libceph: don't set memalloc flags in loopback case" http://www.spinics.net/lists/ceph-devel/msg23392.html Conflicts: net/ceph/messenger.c [ context: tcp_nodelay option ] Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: Mel Gorman <mgorman@suse.de> Cc: Sage Weil <sage@redhat.com> Cc: stable@vger.kernel.org # 3.18+, needs backporting Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Acked-by: Mike Christie <michaelc@cs.wisc.edu> Acked-by: Mel Gorman <mgorman@suse.de>
* Merge branch 'for-linus' of ↵Linus Torvalds2015-02-196-162/+54
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull Ceph changes from Sage Weil: "On the RBD side, there is a conversion to blk-mq from Christoph, several long-standing bug fixes from Ilya, and some cleanup from Rickard Strandqvist. On the CephFS side there is a long list of fixes from Zheng, including improved session handling, a few IO path fixes, some dcache management correctness fixes, and several blocking while !TASK_RUNNING fixes. The core code gets a few cleanups and Chaitanya has added support for TCP_NODELAY (which has been used on the server side for ages but we somehow missed on the kernel client). There is also an update to MAINTAINERS to fix up some email addresses and reflect that Ilya and Zheng are doing most of the maintenance for RBD and CephFS these days. Do not be surprised to see a pull request come from one of them in the future if I am unavailable for some reason" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (27 commits) MAINTAINERS: update Ceph and RBD maintainers libceph: kfree() in put_osd() shouldn't depend on authorizer libceph: fix double __remove_osd() problem rbd: convert to blk-mq ceph: return error for traceless reply race ceph: fix dentry leaks ceph: re-send requests when MDS enters reconnecting stage ceph: show nocephx_require_signatures and notcp_nodelay options libceph: tcp_nodelay support rbd: do not treat standalone as flatten ceph: fix atomic_open snapdir ceph: properly mark empty directory as complete client: include kernel version in client metadata ceph: provide seperate {inode,file}_operations for snapdir ceph: fix request time stamp encoding ceph: fix reading inline data when i_size > PAGE_SIZE ceph: avoid block operation when !TASK_RUNNING (ceph_mdsc_close_sessions) ceph: avoid block operation when !TASK_RUNNING (ceph_get_caps) ceph: avoid block operation when !TASK_RUNNING (ceph_mdsc_sync) rbd: fix error paths in rbd_dev_refresh() ...
| * libceph: kfree() in put_osd() shouldn't depend on authorizerIlya Dryomov2015-02-191-2/+3
| | | | | | | | | | | | | | | | | | | | | | a255651d4cad ("ceph: ensure auth ops are defined before use") made kfree() in put_osd() conditional on the authorizer. A mechanical mistake most likely - fix it. Cc: Alex Elder <elder@linaro.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Sage Weil <sage@redhat.com> Reviewed-by: Alex Elder <elder@linaro.org>
| * libceph: fix double __remove_osd() problemIlya Dryomov2015-02-191-8/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It turns out it's possible to get __remove_osd() called twice on the same OSD. That doesn't sit well with rb_erase() - depending on the shape of the tree we can get a NULL dereference, a soft lockup or a random crash at some point in the future as we end up touching freed memory. One scenario that I was able to reproduce is as follows: <osd3 is idle, on the osd lru list> <con reset - osd3> con_fault_finish() osd_reset() <osdmap - osd3 down> ceph_osdc_handle_map() <takes map_sem> kick_requests() <takes request_mutex> reset_changed_osds() __reset_osd() __remove_osd() <releases request_mutex> <releases map_sem> <takes map_sem> <takes request_mutex> __kick_osd_requests() __reset_osd() __remove_osd() <-- !!! A case can be made that osd refcounting is imperfect and reworking it would be a proper resolution, but for now Sage and I decided to fix this by adding a safe guard around __remove_osd(). Fixes: http://tracker.ceph.com/issues/8087 Cc: Sage Weil <sage@redhat.com> Cc: stable@vger.kernel.org # 3.9+: 7c6e6fc53e73: libceph: assert both regular and lingering lists in __remove_osd() Cc: stable@vger.kernel.org # 3.9+: cc9f1f518cec: libceph: change from BUG to WARN for __remove_osd() asserts Cc: stable@vger.kernel.org # 3.9+ Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Sage Weil <sage@redhat.com> Reviewed-by: Alex Elder <elder@linaro.org>
| * libceph: tcp_nodelay supportChaitanya Huilgol2015-02-192-2/+28
| | | | | | | | | | | | | | | | | | | | | | TCP_NODELAY socket option set on connection sockets, disables Nagle’s algorithm and improves latency characteristics. tcp_nodelay(default)/notcp_nodelay option flags provided to enable/disable setting the socket option. Signed-off-by: Chaitanya Huilgol <chaitanya.huilgol@sandisk.com> [idryomov@redhat.com: NO_TCP_NODELAY -> TCP_NODELAY, minor adjustments] Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
| * libceph: use mon_client.c/put_generic_request() moreIlya Dryomov2015-02-191-2/+2
| | | | | | | | Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
| * libceph: nuke pool op infrastructureIlya Dryomov2015-02-193-148/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | On Mon, Dec 22, 2014 at 5:35 PM, Sage Weil <sage@newdream.net> wrote: > On Mon, 22 Dec 2014, Ilya Dryomov wrote: >> Actually, pool op stuff has been unused for over two years - looks like >> it was added for rbd create_snap and that got ripped out in 2012. It's >> unlikely we'd ever need to manage pools or snaps from the kernel client >> so I think it makes sense to nuke it. Sage? > > Yep! Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
* | mm: gup: use get_user_pages_unlockedAndrea Arcangeli2015-02-111-4/+2
|/ | | | | | | | | | | | This allows those get_user_pages calls to pass FAULT_FLAG_ALLOW_RETRY to the page fault in order to release the mmap_sem during the I/O. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andres Lagar-Cavilla <andreslc@google.com> Cc: Peter Feiner <pfeiner@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* libceph: fix sparse endianness warningsIlya Dryomov2015-01-082-2/+2
| | | | | | | The only real issue is the one in auth_x.c and it came with 3.19-rc1 merge. Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
* libceph: specify position of extent operationYan, Zheng2014-12-171-13/+6
| | | | | | | | | allow specifying position of extent operation in multi-operations osd request. This is required for cephfs to convert inline data to normal data (compare xattr, then write object). Signed-off-by: Yan, Zheng <zyan@redhat.com> Reviewed-by: Ilya Dryomov <idryomov@redhat.com>
* libceph: add CREATE osd operation supportYan, Zheng2014-12-171-20/+22
| | | | | | | | Add CEPH_OSD_OP_CREATE support. Also change libceph to not treat CEPH_OSD_OP_DELETE as an extent op and add an assert to that end. Signed-off-by: Yan, Zheng <zyan@redhat.com> Reviewed-by: Ilya Dryomov <idryomov@redhat.com>
* libceph: add SETXATTR/CMPXATTR osd operations supportYan, Zheng2014-12-171-0/+47
| | | | | Signed-off-by: Yan, Zheng <zyan@redhat.com> Reviewed-by: Ilya Dryomov <idryomov@redhat.com>
* libceph: require cephx message signature by defaultYan, Zheng2014-12-171-0/+13
| | | | | Signed-off-by: Yan, Zheng <zyan@redhat.com> Reviewed-by: Ilya Dryomov <idryomov@redhat.com>
* libceph: message signature supportYan, Zheng2014-12-173-3/+103
| | | | Signed-off-by: Yan, Zheng <zyan@redhat.com>
* libceph: store session key in cephx authorizerYan, Zheng2014-12-172-7/+12
| | | | | | | Session key is required when calculating message signature. Save the session key in authorizer, this avoid lookup ticket handler for each message Signed-off-by: Yan, Zheng <zyan@redhat.com>
* libceph: nuke ceph_kvfree()Ilya Dryomov2014-12-173-11/+3
| | | | | | | | | Use kvfree() from linux/mm.h instead, which is identical. Also fix the ceph_buffer comment: we will allocate with kmalloc() up to 32k - the value of PAGE_ALLOC_COSTLY_ORDER, but that really is just an implementation detail so don't mention it at all. Signed-off-by: Ilya Dryomov <idryomov@redhat.com>
* libceph: change from BUG to WARN for __remove_osd() assertsIlya Dryomov2014-11-131-2/+2
| | | | | | | No reason to use BUG_ON for osd request list assertions. Signed-off-by: Ilya Dryomov <idryomov@redhat.com> Reviewed-by: Alex Elder <elder@linaro.org>