Commit message (Collapse)AuthorAgeFilesLines
* block, scsi: Change the preempt-only flag into a counterBart Van Assche2018-09-264-27/+43
| | | | | | | | | | | | | | | | | | | | | | | | | | The RQF_PREEMPT flag is used for three purposes: - In the SCSI core, for making sure that power management requests are executed even if a device is in the "quiesced" state. - For domain validation by SCSI drivers that use the parallel port. - In the IDE driver, for IDE preempt requests. Rename "preempt-only" into "pm-only" because the primary purpose of this mode is power management. Since the power management core may but does not have to resume a runtime suspended device before performing system-wide suspend and since a later patch will set "pm-only" mode as long as a block device is runtime suspended, make it possible to set "pm-only" mode from more than one context. Since with this change scsi_device_quiesce() is no longer idempotent, make that function return early if it is called for a quiesced queue. Signed-off-by: Bart Van Assche <bvanassche@acm.org> Acked-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Cc: Jianchao Wang <jianchao.w.wang@oracle.com> Cc: Johannes Thumshirn <jthumshirn@suse.de> Cc: Alan Stern <stern@rowland.harvard.edu> Signed-off-by: Jens Axboe <axboe@kernel.dk>
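The reason for moving from a flag to a counter can be sketched in userspace C. This is only an illustration under stated assumptions: `demo_queue` and the `demo_*` helpers are hypothetical names, not the kernel's, and C11 atomics stand in for whatever primitive the patch actually uses.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a request queue; not the kernel's struct. */
struct demo_queue {
    atomic_int pm_only; /* > 0: only power-management requests may pass */
};

/* Each context that needs pm-only mode takes one count. */
static void demo_set_pm_only(struct demo_queue *q)
{
    atomic_fetch_add(&q->pm_only, 1);
}

/* Drop one count; the mode ends only when the last holder clears it. */
static void demo_clear_pm_only(struct demo_queue *q)
{
    int prev = atomic_fetch_sub(&q->pm_only, 1);

    assert(prev > 0); /* an unbalanced clear would be a caller bug */
    (void)prev;
}

static bool demo_pm_only(struct demo_queue *q)
{
    return atomic_load(&q->pm_only) > 0;
}
```

With a single flag, the first context to clear the mode would also clear it for every other context. With a counter, set and clear calls must be balanced, which is why a function that used to be idempotent now needs an early return when the queue is already quiesced.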
* block: Move power management code into a new source fileBart Van Assche2018-09-2611-239/+264
| | | | | | | | | | | | | | | | | | | | Move the code for runtime power management from blk-core.c into the new source file blk-pm.c. Move the corresponding declarations from <linux/blkdev.h> into <linux/blk-pm.h>. For CONFIG_PM=n, leave out the declarations of the functions that are not used in that mode. This patch not only reduces the number of #ifdefs in the block layer core code but also reduces the size of header file <linux/blkdev.h> and hence should help to reduce the build time of the Linux kernel if CONFIG_PM is not defined. Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Jianchao Wang <jianchao.w.wang@oracle.com> Cc: Hannes Reinecke <hare@suse.com> Cc: Johannes Thumshirn <jthumshirn@suse.de> Cc: Alan Stern <stern@rowland.harvard.edu> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* xen: don't include <xen/xen.h> from <asm/io.h> and <asm/dma-mapping.h>Christoph Hellwig2018-09-269-7/+7
| | | | | | | | | Nothing Xen specific in these headers, which get included from a lot of code in the kernel. So prune the includes and move them to the Xen-specific files that actually use them instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: remove ARCH_BIOVEC_PHYS_MERGEABLEChristoph Hellwig2018-09-265-15/+3
| | | | | | | | Take the Xen check into the core code instead of delegating it to the architectures. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* xen: provide a prototype for xen_biovec_phys_mergeable in xen.hChristoph Hellwig2018-09-264-10/+4
| | | | | | | | Having multiple externs in arch headers is not a good way to provide a common interface. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* xen: remove the xen_biovec_phys_mergeable exportChristoph Hellwig2018-09-261-1/+0
| | | | | | | BIOVEC_PHYS_MERGEABLE is only called from core block code. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* arm: remove the unused BIOVEC_MERGEABLE defineChristoph Hellwig2018-09-261-7/+0
| | | | | Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: don't include bug.h from bio.hChristoph Hellwig2018-09-241-1/+0
| | | | | | | No need to pull in the BUG() definition. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: don't include io.h from bio.hChristoph Hellwig2018-09-241-3/+0
| | | | | | | | Now that we don't need an override for BIOVEC_PHYS_MERGEABLE there is no need to drag this header in. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: remove bvec_to_physChristoph Hellwig2018-09-243-8/+3
| | | | | | | | | | We only use it in biovec_phys_mergeable and a m68k paravirt driver, so just opencode it there. Also remove the pointless unsigned long cast for the offset in the opencoded instances. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: merge BIOVEC_SEG_BOUNDARY into biovec_phys_mergeableChristoph Hellwig2018-09-245-48/+17
| | | | | | | | These two checks should always be performed together, so merge them into a single helper. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
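A combined helper of this kind might look roughly like the sketch below. It assumes a simplified segment described by a physical start address and a length; `demo_vec` and `demo_phys_mergeable` are hypothetical names, and the Xen hook mentioned elsewhere in this series is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified segment: physical start address plus length. */
struct demo_vec {
    uint64_t phys;
    uint32_t len;
};

/*
 * Two segments can share one hardware segment only if
 *  1) the second starts exactly where the first ends, and
 *  2) the merged range does not cross a segment boundary, i.e. its first
 *     and last byte fall into the same (mask + 1)-sized window.
 */
static bool demo_phys_mergeable(uint64_t seg_boundary_mask,
                                const struct demo_vec *a,
                                const struct demo_vec *b)
{
    uint64_t first, last;

    assert(a->len > 0 && b->len > 0);

    first = a->phys;
    last = b->phys + b->len - 1;

    if (a->phys + a->len != b->phys)
        return false;
    return (first | seg_boundary_mask) == (last | seg_boundary_mask);
}
```

Keeping both tests in one function means a caller cannot perform the contiguity check and forget the boundary check.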
* block: add a missing BIOVEC_SEG_BOUNDARY check in bio_add_pc_pageChristoph Hellwig2018-09-241-1/+3
| | | | | | | | | The actual recalculation of segments in __blk_recalc_rq_segments will do this check, so there is no point in forcing it if we know it won't succeed. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: simplify BIOVEC_PHYS_MERGEABLEChristoph Hellwig2018-09-248-30/+28
| | | | | | | | | | | Turn the macro into an inline, move it to blk.h and simplify the arch hooks a bit. Also rename the function to biovec_phys_mergeable as there is no need to shout. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: move req_gap_back_merge to blk.hChristoph Hellwig2018-09-242-19/+19
| | | | | | | No need to expose these helpers outside the block layer. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: move req_gap_{back,front}_merge to blk-merge.cChristoph Hellwig2018-09-242-69/+65
| | | | | | | | Keep it close to the actual users instead of exposing the function to all drivers. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: move integrity_req_gap_{back,front}_merge to blk.hChristoph Hellwig2018-09-242-33/+33
| | | | | | | No need to expose these to drivers. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: Document the functions that iterate over requestsBart Van Assche2018-09-211-7/+64
| | | | | | | | | | | | | | Make it easier to understand the purpose of the functions that iterate over requests by documenting their purpose. Fix several minor spelling and grammar mistakes in comments in these functions. Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Cc: Jianchao Wang <jianchao.w.wang@oracle.com> Cc: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blkcg: rename blkg_try_get to blkg_trygetDennis Zhou (Facebook)2018-09-214-12/+9
| | | | | | | | | | | blkg reference counting now uses percpu_ref rather than atomic_t. Let's make this consistent with css_tryget. This renames blkg_try_get to blkg_tryget and now returns a bool rather than the blkg or NULL. Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
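The bool-returning tryget convention can be illustrated with a plain atomic counter. This is a sketch only: the patch uses percpu_ref, which this code does not model, and `demo_ref`, `demo_tryget` and `demo_put` are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical reference holder; a plain atomic counter, not percpu_ref. */
struct demo_ref {
    atomic_int count;
};

/* Take a reference unless the count already dropped to zero. */
static bool demo_tryget(struct demo_ref *r)
{
    int old = atomic_load(&r->count);

    while (old > 0) {
        /* On failure, 'old' is reloaded with the current value. */
        if (atomic_compare_exchange_weak(&r->count, &old, old + 1))
            return true;
    }
    return false;
}

/* Drop a reference; returns true if this was the last one. */
static bool demo_put(struct demo_ref *r)
{
    int prev = atomic_fetch_sub(&r->count, 1);

    assert(prev > 0);
    return prev == 1;
}
```

Returning a bool rather than the object or NULL lets callers write `if (!demo_tryget(&obj->ref))` and keep using the pointer they already hold.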
* blkcg: change blkg reference counting to use percpu_refDennis Zhou (Facebook)2018-09-212-35/+44
| | | | | | | | | | Now that every bio is associated with a blkg, this puts the use of blkg_get, blkg_try_get, and blkg_put on the hot path. This switches over the refcnt in blkg to use percpu_ref. Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blkcg: cleanup and make blk_get_rl use blkg_lookup_createDennis Zhou (Facebook)2018-09-211-13/+21
| | | | | | | | | | | | | | | | | | | blk_get_rl is responsible for identifying which request_list a request should be allocated to. Try get logic was added earlier, but semantically the logic was not changed. This patch makes better use of the bio already having a reference to the blkg in the hot path. The cold path uses a better fallback of blkg_lookup_create rather than just blkg_lookup and then falling back to the q->root_rl. If lookup_create fails with anything but -ENODEV, it falls back to q->root_rl. A clarifying comment is added to explain why q->root_rl is used rather than the root blkg's rl. Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blkcg: remove additional reference to the cssDennis Zhou (Facebook)2018-09-214-83/+81
| | | | | | | | | | | | | | | | The previous patch in this series removed carrying around a pointer to the css in blkg. However, the blkg association logic still relied on taking a reference on the css to ensure we wouldn't fail in getting a reference for the blkg. Here the implicit dependency on the css is removed. The association continues to rely on the tryget logic walking up the blkg tree. This streamlines the three ways that association can happen: normal, swap, and writeback. Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blkcg: remove bio->bi_css and instead use bio->bi_blkgDennis Zhou (Facebook)2018-09-218-62/+25
| | | | | | | | | | | | | | Prior patches ensured that all bios are now associated with some blkg. This now makes bio->bi_css unnecessary as blkg maintains a reference to the blkcg already. This patch removes the field bi_css and transfers corresponding uses to access via bi_blkg. Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blkcg: associate writeback bios with a blkgDennis Zhou (Facebook)2018-09-214-11/+14
| | | | | | | | | | | | One of the goals of this series is to remove a separate reference to the css of the bio. This can and should be accessed via bio_blkcg. In this patch, the wbc_init_bio call is changed such that it must be called after a queue has been associated with the bio. Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blkcg: associate a blkg for pages being evicted by swapDennis Zhou (Facebook)2018-09-213-28/+68
| | | | | | | | | | | | | A prior patch in this series added blkg association to bios issued by cgroups. There are two other paths that we want to attribute work back to the appropriate cgroup: swap and writeback. Here we modify the way swap tags bios to include the blkg. Writeback will be tackled in the next patch. Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blkcg: consolidate bio_issue_init to be a part of coreDennis Zhou (Facebook)2018-09-215-10/+13
| | | | | | | | | | | bio_issue_init among other things initializes the timestamp for an IO. Rather than have this logic handled by policies, this consolidates it to be on the init paths (normal, clone, bounce clone). Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Acked-by: Tejun Heo <tj@kernel.org> Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blkcg: always associate a bio with a blkgDennis Zhou (Facebook)2018-09-215-40/+46
| | | | | | | | | | | | | | | | | Previously, blkg's were only assigned as needed by blk-iolatency and blk-throttle. bio->css was also always being associated while blkg was being looked up and then thrown away in blkcg_bio_issue_check. This patch begins the cleanup of bio->css and bio->bi_blkg by always associating a blkg in blkcg_bio_issue_check. This tries to create the blkg, but if it is not possible, falls back to using the root_blkg of the request_queue. Therefore, a bio will always be associated with a blkg. The duplicate association logic is removed from blk-throttle and blk-iolatency. Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blkcg: convert blkg_lookup_create to find closest blkgDennis Zhou (Facebook)2018-09-213-15/+41
| | | | | | | | | | | | | | | | There are several scenarios where blkg_lookup_create can fail. Examples include the blkcg dying, request_queue is dying, or simply being OOM. At the end of the day, most handle this by simply falling back to the q->root_blkg and calling it a day. This patch implements the notion of closest blkg. During blkg_lookup_create, if it fails to create, return the closest blkg found or the q->root_blkg. blkg_try_get_closest is introduced and used during association so a bio is always attached to a blkg. Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
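The "closest blkg" idea amounts to walking up the hierarchy until a reference can be taken. The sketch below assumes a hypothetical `demo_blkg` with a parent pointer and a non-atomic reference count; it is not the kernel's blkg_try_get_closest and ignores locking and RCU entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical group node; refcnt == 0 means it is being torn down. */
struct demo_blkg {
    struct demo_blkg *parent; /* NULL for the root */
    int refcnt;
};

static bool demo_blkg_tryget(struct demo_blkg *g)
{
    if (g->refcnt == 0)
        return false;
    g->refcnt++;
    return true;
}

/*
 * Return the nearest ancestor (starting with g itself) that is still
 * alive, with a reference held. If none is found, fall back to the
 * root, which is assumed to outlive every caller.
 */
static struct demo_blkg *demo_tryget_closest(struct demo_blkg *g,
                                             struct demo_blkg *root)
{
    for (; g != NULL; g = g->parent) {
        if (demo_blkg_tryget(g))
            return g;
    }
    assert(root->refcnt > 0);
    root->refcnt++;
    return root;
}
```

Because the walk always ends at a live node, the caller never has to handle a NULL result, which matches the stated goal that a bio is always attached to some blkg.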
* blkcg: update blkg_lookup_create to do lockingDennis Zhou (Facebook)2018-09-213-5/+32
| | | | | | | | | | | | | | | | | | | To know when to create a blkg, the general pattern is to do a blkg_lookup and if that fails, lock and then do a lookup again and if that fails finally create. It doesn't make much sense for everyone who wants to do creation to write this themselves. This changes blkg_lookup_create to do locking and implement this pattern. The old blkg_lookup_create is renamed to __blkg_lookup_create. If a call site wants to do its own error handling or already owns the queue lock, they can use __blkg_lookup_create. This will be used in upcoming patches. Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Acked-by: Tejun Heo <tj@kernel.org> Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
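The lookup, lock, lookup again, create pattern described here can be sketched as follows. Everything in this block is hypothetical: `demo_table` is a tiny array-backed table, a C11 `atomic_flag` spinlock stands in for the queue lock, and the protection a real unlocked lookup needs is not modelled.

```c
#include <assert.h>
#include <stdatomic.h>

#define DEMO_SLOTS 8

/* Hypothetical table of integer keys guarded by a simple spinlock. */
struct demo_table {
    atomic_flag lock;
    int keys[DEMO_SLOTS];
    int used;
    int creations; /* how many entries were actually created */
};

static void demo_table_init(struct demo_table *t)
{
    atomic_flag_clear(&t->lock);
    t->used = 0;
    t->creations = 0;
}

/* Unlocked fast-path lookup; returns the slot index or -1. */
static int demo_lookup(const struct demo_table *t, int key)
{
    for (int i = 0; i < t->used; i++)
        if (t->keys[i] == key)
            return i;
    return -1;
}

/* Variant for callers that already hold the lock. */
static int demo_lookup_create_locked(struct demo_table *t, int key)
{
    int idx = demo_lookup(t, key); /* re-check: someone may have raced us */

    assert(t->used <= DEMO_SLOTS);
    if (idx >= 0)
        return idx;
    if (t->used == DEMO_SLOTS)
        return -1; /* table full */
    t->keys[t->used] = key;
    t->creations++;
    return t->used++;
}

/* Convenience wrapper that implements the whole pattern. */
static int demo_lookup_create(struct demo_table *t, int key)
{
    int idx = demo_lookup(t, key);

    if (idx >= 0)
        return idx;
    while (atomic_flag_test_and_set(&t->lock))
        ; /* spin until the lock is free */
    idx = demo_lookup_create_locked(t, key);
    atomic_flag_clear(&t->lock);
    return idx;
}
```

The split into a locking wrapper and a `_locked` variant mirrors the naming in the message: most callers use the wrapper, while callers that already own the lock or want their own error handling call the inner function directly.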
* blkcg: fix ref count issue with bio_blkcg using task_cssDennis Zhou (Facebook)2018-09-216-16/+107
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | The accessor function bio_blkcg either returns the blkcg associated with the bio or finds one in the current context. This can cause an issue when trying to associate a bio with a blkcg. Particularly, it's the third case that is problematic: return css_to_blkcg(task_css(current, io_cgrp_id)); As the above may race against task migration and the cgroup exiting, it is not always ok to take a reference on the blkcg returned from bio_blkcg. This patch adds association ahead of calling bio_blkcg rather than after. This makes association a required and explicit step along the code paths for calling bio_blkcg. blk_get_rl is modified as well to get a reference to the blkcg it may use and blk_put_rl will always put the reference back. Association is also moved above the bio_blkcg call to ensure it will not return NULL in blk-iolatency. BFQ and CFQ utilize this flaw, but due to the complexity, I do not want to address this in this series. I've created a private version of the function with notes not to use it describing the flaw. Hopefully soon, that code can be cleaned up. Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* Blk-throttle: update to use rbtree with leftmost node cachedLiu Bo2018-09-201-26/+15
| | | | | | | | As rbtree has native support of caching leftmost node, i.e. rb_root_cached, no need to do the caching by ourselves. Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: use bio_add_page in bio_iov_iter_get_pagesChristoph Hellwig2018-09-201-21/+20
| | | | | | | | | | | | Replace a nasty hack with a different nasty hack to prepare for multipage bio_vecs. By moving the temporary page array as far up as possible in the space allocated for the bio_vec array we can iterate forward over it and thus use bio_add_page. Using bio_add_page means we'll be able to merge physically contiguous pages once support for multipage bio_vecs is merged. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block, bfq: do not plug I/O if all queues are weight-raisedPaolo Valente2018-09-141-2/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | To reduce latency for interactive and soft real-time applications, bfq privileges the bfq_queues containing the I/O of these applications. These privileged queues, referred-to as weight-raised queues, get a much higher share of the device throughput w.r.t. non-privileged queues. To preserve this higher share, the I/O of any non-weight-raised queue must be plugged whenever a sync weight-raised queue, while being served, remains temporarily empty. To attain this goal, bfq simply plugs any I/O (from any queue), if a sync weight-raised queue remains empty while in service. Unfortunately, this plugging typically lowers throughput with random I/O, on devices with internal queueing (because it reduces the filling level of the internal queues of the device). This commit addresses this issue by restricting the cases where plugging is performed: if a sync weight-raised queue remains empty while in service, then I/O plugging is performed only if some of the active bfq_queues are *not* weight-raised (which is actually the only circumstance where plugging is needed to preserve the higher share of the throughput of weight-raised queues). This restriction proved able to boost throughput in really many use cases needing only maximum throughput. Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
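The restricted plugging rule can be written as a small predicate. This sketch covers only the weight-raising condition described in this message, not any other reason BFQ may have to idle, and `demo_bfqq` and `demo_should_plug` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily reduced view of a bfq_queue. */
struct demo_bfqq {
    bool sync;
    bool weight_raised;
    bool has_pending_io;
};

/*
 * Plug only if the in-service queue is sync, weight-raised and currently
 * empty, AND at least one active queue is not weight-raised. If every
 * active queue is weight-raised there is no higher share to protect,
 * so dispatching can continue.
 */
static bool demo_should_plug(const struct demo_bfqq *in_service,
                             const struct demo_bfqq *active, size_t n_active)
{
    assert(in_service != NULL);

    if (!in_service->sync || !in_service->weight_raised ||
        in_service->has_pending_io)
        return false;

    for (size_t i = 0; i < n_active; i++)
        if (!active[i].weight_raised)
            return true;
    return false;
}
```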
* block, bfq: inject other-queue I/O into seeky idle queues on NCQ flashPaolo Valente2018-09-142-6/+88
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The Achilles' heel of BFQ is its failing to reach a high throughput with sync random I/O on flash storage with internal queueing, in case the processes doing I/O have differentiated weights. The cause of this failure is as follows. If at least two processes do sync I/O, and have a different weight from each other, then BFQ plugs I/O dispatching every time one of these processes, while it is being served, remains temporarily without pending I/O requests. This plugging is necessary to guarantee that every process enjoys a bandwidth proportional to its weight; but it empties the internal queue(s) of the drive. And this kills throughput with random I/O. So, if some processes have differentiated weights and do both sync and random I/O, the end result is a throughput collapse. This commit tries to counter this problem by injecting the service of other processes, in a controlled way, while the process in service happens to have no I/O. This injection is performed only if the medium is non rotational and performs internal queueing, and the process in service does random I/O (service injection might be beneficial for sequential I/O too, we'll work on that). As an example of the benefits of this commit, on a PLEXTOR PX-256M5S SSD, and with five processes having differentiated weights and doing sync random 4KB I/O, this commit makes the throughput with bfq grow by 400%, from 25 to 100MB/s. This higher throughput is 10MB/s lower than that reached with none. As some less random I/O is added to the mix, the throughput becomes equal to or higher than that with none. This commit is a very first attempt to recover throughput without losing control, and certainly has many limitations. One is, e.g., that the processes whose service is injected are not chosen so as to distribute the extra bandwidth they receive in accordance to their weights. Thus there might be loss of weighted fairness in some cases. 
Anyway, this loss concerns extra service, which would not have been received at all without this commit. Other limitations and issues will probably show up with usage. Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block, bfq: correctly charge and reset entity service in all casesPaolo Valente2018-09-141-3/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | BFQ schedules entities (which represent either per-process queues or groups of queues) as a function of their timestamps. In particular, as a function of their (virtual) finish times. The finish time of an entity is computed as a function of the budget assigned to the entity, assuming, tentatively, that the entity, once in service, will receive an amount of service equal to its budget. Then, when the entity is expired because it finishes to be served, this finish time is updated as a function of the actual service received by the entity. This allows the entity to be correctly charged with only the service received, and then to be correctly re-scheduled. Yet an entity may receive service also while not being the entity in service (in the scheduling environment of its parent entity), for several reasons. If the entity remains with no backlog while receiving this 'unofficial' service, then it is expired. Also on such an expiration, the finish time of the entity should be updated to account for only the service actually received by the entity. Unfortunately, such an update is not performed for an entity expiring without being the entity in service. In a similar vein, the service counter of the entity in service is reset when the entity is expired, to be ready to be used for next service cycle. This reset too should be performed also in case an entity is expired because it remains empty after receiving service while not being the entity in service. But in this case the reset is not performed. This commit performs the above update of the finish time and reset of the service received, also for an entity expiring while not being the entity in service. Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
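The charging rule can be illustrated with a toy entity. The formula used here (finish = start + amount / weight, in integer arithmetic) is an assumption chosen for illustration, and `demo_entity` and its helpers are hypothetical; BFQ's real computation uses fixed-point scaling and more state.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scheduling entity with virtual timestamps. */
struct demo_entity {
    uint64_t start;   /* virtual start time */
    uint64_t finish;  /* virtual finish time */
    uint64_t budget;  /* service the entity is expected to receive */
    uint64_t service; /* service actually received so far */
    uint64_t weight;
};

/* Assumed formula: finish = start + amount / weight. */
static void demo_calc_finish(struct demo_entity *e, uint64_t amount)
{
    assert(e->weight > 0);
    e->finish = e->start + amount / e->weight;
}

/* On activation, tentatively assume the whole budget will be consumed. */
static void demo_activate(struct demo_entity *e)
{
    e->service = 0;
    demo_calc_finish(e, e->budget);
}

/*
 * On expiration, charge only the service actually received and reset the
 * counter. The commit's point is that both steps must also happen when the
 * entity expires without having been the entity in service.
 */
static void demo_expire(struct demo_entity *e)
{
    demo_calc_finish(e, e->service);
    e->service = 0;
}
```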
* blk-iolatency: remove set but not used variables 'changed' and 'blkiolat'YueHaibing2018-09-131-5/+0
| | | | | | | | | | | | | | | Fixes gcc '-Wunused-but-set-variable' warning: block/blk-iolatency.c: In function 'scale_change': block/blk-iolatency.c:301:7: warning: variable 'changed' set but not used [-Wunused-but-set-variable] block/blk-iolatency.c: In function 'iolatency_set_limit': block/blk-iolatency.c:765:24: warning: variable 'blkiolat' set but not used [-Wunused-but-set-variable] Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* rsxx: Remove unnecessary parenthesesNathan Chancellor2018-09-111-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | Clang warns when more than one set of parentheses is used for a single conditional statement: drivers/block/rsxx/cregs.c:279:15: warning: equality comparison with extraneous parentheses [-Wparentheses-equality] if ((cmd->op == CREG_OP_READ)) { ~~~~~~~~^~~~~~~~~~~~~~~ drivers/block/rsxx/cregs.c:279:15: note: remove extraneous parentheses around the comparison to silence this warning if ((cmd->op == CREG_OP_READ)) { ~ ^ ~ drivers/block/rsxx/cregs.c:279:15: note: use '=' to turn this equality comparison into an assignment if ((cmd->op == CREG_OP_READ)) { ^~ = 1 warning generated. Reported-by: Nick Desaulniers <ndesaulniers@google.com> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Nathan Chancellor <natechancellor@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: umem: replace spin_lock_bh with spin_lock in tasklet callbackjun qian2018-09-071-2/+2
| | | | | | | As you are already in a tasklet, it is unnecessary to call spin_lock_bh. Signed-off-by: jun qian <hangdianqj@163.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: remove bio_rewind_iter()Ming Lei2018-09-064-30/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | It has been pointed out that bio_rewind_iter() is a very bad API[1]: 1) bio size may not be restored after rewinding 2) it causes some bogus changes, such as 5151842b9d8732 (block: reset bi_iter.bi_done after splitting bio) 3) rewinding really makes things complicated wrt. bio splitting 4) unnecessary updating of .bi_done in fast path [1] https://marc.info/?t=153549924200005&r=1&w=2 So this patch takes Kent's suggestion to restore a bio to its original state by saving the bio iterator (struct bvec_iter) in bio_integrity_prep(), given that bio_rewind_iter() is now only used by the bio integrity code. Cc: Dmitry Monakhov <dmonakhov@openvz.org> Cc: Hannes Reinecke <hare@suse.com> Suggested-by: Kent Overstreet <kent.overstreet@gmail.com> Acked-by: Kent Overstreet <kent.overstreet@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
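Saving the iterator and copying it back, instead of trying to rewind it, can be sketched like this. `demo_iter` and `demo_payload` are hypothetical, reduced stand-ins for struct bvec_iter and the integrity payload, and the advance logic is deliberately simplified.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, reduced iterator: position, bytes left, vector index. */
struct demo_iter {
    uint64_t sector;
    uint32_t size; /* bytes remaining */
    uint32_t idx;  /* current vector index */
};

/* Hypothetical per-bio payload that keeps a copy of the iterator. */
struct demo_payload {
    struct demo_iter saved;
};

/* Consume 'bytes' from the iterator (one vector per call, for brevity). */
static void demo_advance(struct demo_iter *it, uint32_t bytes)
{
    assert(bytes <= it->size);
    it->sector += bytes >> 9; /* 512-byte sectors */
    it->size -= bytes;
    it->idx++;
}

/* Take the snapshot before any I/O advances the iterator. */
static void demo_prep(struct demo_payload *p, const struct demo_iter *it)
{
    p->saved = *it;
}

/* Restoring is a plain struct copy; no reverse arithmetic is needed. */
static void demo_restore(const struct demo_payload *p, struct demo_iter *it)
{
    *it = p->saved;
}
```

A struct copy cannot get the size wrong after a split and needs no extra bookkeeping field on the fast path, which addresses the objections listed in the message.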
* drbd: Convert from ahash to shashKees Cook2018-09-065-88/+76
| | | | | | | | | | | | | | | | | | | | In preparing to remove all stack VLA usage from the kernel[1], this removes the discouraged use of AHASH_REQUEST_ON_STACK in favor of the smaller SHASH_DESC_ON_STACK by converting from ahash-wrapped-shash to direct shash. By removing a layer of indirection this both improves performance and reduces stack usage. The stack allocation will be made a fixed size in a later patch to the crypto subsystem. The bulk of the lines in this change are simple s/ahash/shash/, but the main logic differences are in drbd_csum_ee() and drbd_csum_bio(), which externalizes the page walking with k(un)map_atomic() instead of using scattergather. [1] https://lkml.kernel.org/r/CA+55aFzCG-zNmZwX4A2FQpadafLfEzK6CC=qPXydAacU1RqZWA@mail.gmail.com Acked-by: Lars Ellenberg <lars.ellenberg@linbit.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* Merge tag 'for-linus-20180906' of git://git.kernel.dk/linux-blockLinus Torvalds2018-09-068-64/+111
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull block fixes from Jens Axboe: "Small collection of fixes that should go into this release. This contains: - Small series that fixes a race between blkcg teardown and writeback (Dennis Zhou) - Fix disallowing invalid block size settings from the nbd ioctl (me) - BFQ fix for a use-after-free on last release of a bfqg (Konstantin Khlebnikov) - Fix for the "don't warn for flush" fix (Mikulas)" * tag 'for-linus-20180906' of git://git.kernel.dk/linux-block: block: bfq: swap puts in bfqg_and_blkg_put block: don't warn when doing fsync on read-only devices nbd: don't allow invalid blocksize settings blkcg: use tryget logic when associating a blkg with a bio blkcg: delay blkg destruction until after writeback has finished Revert "blk-throttle: fix race between blkcg_bio_issue_check() and cgroup_rmdir()"
| * block: bfq: swap puts in bfqg_and_blkg_putKonstantin Khlebnikov2018-09-061-2/+2
| | | | | | | | | | | | | | | | | | Fix trivial use-after-free. This could be last reference to bfqg. Fixes: 8f9bebc33dd7 ("block, bfq: access and cache blkg data only when safe") Acked-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block: don't warn when doing fsync on read-only devicesMikulas Patocka2018-09-051-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It is possible to call fsync on a read-only handle (for example, fsck.ext2 does it when doing read-only check), and this call results in kernel warning. The patch b089cfd95d32 ("block: don't warn for flush on read-only device") attempted to disable the warning, but it is buggy and it doesn't (op_is_flush tests flags, but bio_op strips off the flags). Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Fixes: 721c7fc701c7 ("block: fail op_is_write() requests to read-only partitions") Cc: stable@vger.kernel.org # 4.18 Signed-off-by: Jens Axboe <axboe@kernel.dk>
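The bug described in parentheses can be reproduced with a toy encoding in which the opcode lives in the low bits and the flags sit above it. The constants and helper names below are hypothetical and only mirror the general layout; they are not the kernel's values.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical layout: low 8 bits hold the opcode, flags sit above. */
#define DEMO_OP_BITS  8
#define DEMO_OP_MASK  ((1u << DEMO_OP_BITS) - 1)
#define DEMO_OP_WRITE 1u
#define DEMO_PREFLUSH (1u << (DEMO_OP_BITS + 0))
#define DEMO_FUA      (1u << (DEMO_OP_BITS + 1))

static_assert((DEMO_PREFLUSH & DEMO_OP_MASK) == 0,
              "flag bits must not overlap the opcode bits");

/* Like bio_op(): keeps the opcode and strips every flag. */
static unsigned int demo_bio_op(unsigned int opf)
{
    return opf & DEMO_OP_MASK;
}

/* Like op_is_flush(): looks only at flag bits. */
static bool demo_op_is_flush(unsigned int opf)
{
    return (opf & (DEMO_PREFLUSH | DEMO_FUA)) != 0;
}
```

Under this layout, `demo_op_is_flush(demo_bio_op(opf))` can never be true, because the flag bits were masked off before the test. The check has to be applied to the full value that still carries the flags.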
| * nbd: don't allow invalid blocksize settingsJens Axboe2018-09-041-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | syzbot reports a divide-by-zero off the NBD_SET_BLKSIZE ioctl. We need proper validation of the input here. Not just if it's zero, but also if the value is a power-of-2 and in a valid range. Add that. Cc: stable@vger.kernel.org Reported-by: syzbot <syzbot+25dbecbec1e62c6b0dd4@syzkaller.appspotmail.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
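A validation helper along these lines might look like the sketch below. The message only says the value must be non-zero, a power of two and within a valid range; the concrete bounds used here (512 bytes and 4096 bytes as a stand-in for the page size) are assumptions, and the `demo_*` names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed bounds for illustration only. */
#define DEMO_MIN_BLKSIZE 512UL
#define DEMO_MAX_BLKSIZE 4096UL /* stands in for the page size */

static_assert(DEMO_MIN_BLKSIZE <= DEMO_MAX_BLKSIZE, "range must not be empty");

static bool demo_is_power_of_2(unsigned long n)
{
    return n != 0 && (n & (n - 1)) == 0;
}

/* Reject zero (the divide-by-zero case), non-powers of two and out-of-range sizes. */
static bool demo_is_valid_blksize(unsigned long blksize)
{
    if (!demo_is_power_of_2(blksize))
        return false;
    return blksize >= DEMO_MIN_BLKSIZE && blksize <= DEMO_MAX_BLKSIZE;
}
```

Validating at the ioctl boundary keeps the zero value away from any later division by the block size.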
| * blkcg: use tryget logic when associating a blkg with a bioDennis Zhou (Facebook)2018-08-312-3/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is a very small chance that a bio gets caught up in a really unfortunate race between a task migration, cgroup exiting, and itself trying to associate with a blkg. This is due to css offlining being performed after the css->refcnt is killed which triggers removal of blkgs that reach their blkg->refcnt of 0. To avoid this, association with a blkg should use tryget and fall back to using the root_blkg. Fixes: 08e18eab0c579 ("block: add bi_blkg to the bio for cgroups") Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Cc: Jiufei Xue <jiufei.xue@linux.alibaba.com> Cc: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Tejun Heo <tj@kernel.org> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * blkcg: delay blkg destruction until after writeback has finishedDennis Zhou (Facebook)2018-08-313-8/+94
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, blkcg destruction relies on a sequence of events: 1. Destruction starts. blkcg_css_offline() is called and blkgs release their reference to the blkcg. This immediately destroys the cgwbs (writeback). 2. With blkgs giving up their reference, the blkcg ref count should become zero and eventually call blkcg_css_free() which finally frees the blkcg. Jiufei Xue reported that there is a race between blkcg_bio_issue_check() and cgroup_rmdir(). To remedy this, blkg destruction becomes contingent on the completion of all writeback associated with the blkcg. A count of the number of cgwbs is maintained and once that goes to zero, blkg destruction can follow. This should prevent premature blkg destruction related to writeback. The new process for blkcg cleanup is as follows: 1. Destruction starts. blkcg_css_offline() is called which offlines writeback. Blkg destruction is delayed on the cgwb_refcnt count to avoid punting potentially large amounts of outstanding writeback to root while maintaining any ongoing policies. Here, the base cgwb_refcnt is put back. 2. When the cgwb_refcnt becomes zero, blkcg_destroy_blkgs() is called and handles destruction of blkgs. This is where the css reference held by each blkg is released. 3. Once the blkcg ref count goes to zero, blkcg_css_free() is called. This finally frees the blkcg. It seems in the past blk-throttle didn't do the most understandable things with taking data from a blkg while associating with current. So, the simplification and unification of what blk-throttle is doing caused this. 
Fixes: 08e18eab0c579 ("block: add bi_blkg to the bio for cgroups") Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Dennis Zhou <dennisszhou@gmail.com> Cc: Jiufei Xue <jiufei.xue@linux.alibaba.com> Cc: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Tejun Heo <tj@kernel.org> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * Revert "blk-throttle: fix race between blkcg_bio_issue_check() and cgroup_rmdir()" · Dennis Zhou (Facebook) · 2018-08-31 · 2 · -63/+16

This reverts commit 4c6994806f708559c2812b73501406e21ae5dcd0.

Destroying blkgs is tricky because of the nature of the relationship. A blkg should go away when either a blkcg or a request_queue goes away. However, blkgs pin the blkcg to ensure they remain valid. To break this cycle, when a blkcg is offlined, blkgs put back their css ref. This eventually lets css_free() get called which frees the blkcg.

The above commit (4c6994806f70) breaks this order of events by trying to destroy blkgs in css_free(). As the blkgs still hold references to the blkcg, css_free() is never called.

The race between blkcg_bio_issue_check() and cgroup_rmdir() will be addressed in the following patch by delaying destruction of a blkg until all writeback associated with the blkcg has been finished.

Fixes: 4c6994806f70 ("blk-throttle: fix race between blkcg_bio_issue_check() and cgroup_rmdir()")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Cc: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
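
The reference cycle this revert describes can be shown with a small user-space model. This is a sketch only, with hypothetical names (cycle_model, cycle_css_put, cycle_offline), and it assumes the blkgs hold the only css references: if they put those references back at offline time, the free step is reached; if the puts are deferred to the free step itself, the count never reaches zero.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the blkg/blkcg reference cycle; not kernel code. */
struct cycle_model {
	int css_refcnt;  /* one css reference per live blkg */
	bool freed;      /* set when css_free() would have run */
};

/* css_free() is only reachable once the last css reference is dropped. */
static void cycle_css_put(struct cycle_model *m)
{
	assert(m->css_refcnt > 0);
	if (--m->css_refcnt == 0)
		m->freed = true;
}

/*
 * Offline a blkcg that has nr_blkgs blkgs and report whether it gets freed.
 * put_refs_on_offline == true:  blkgs put back their css ref at offline.
 * put_refs_on_offline == false: the puts are left to css_free(), which can
 *                               only run after the count is already zero.
 */
static bool cycle_offline(int nr_blkgs, bool put_refs_on_offline)
{
	struct cycle_model m = { .css_refcnt = nr_blkgs, .freed = false };

	if (put_refs_on_offline) {
		for (int i = 0; i < nr_blkgs; i++)
			cycle_css_put(&m);
	}
	/* Otherwise nothing drops a reference, so m.freed stays false. */

	return m.freed;
}
```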
* | Merge tag 'apparmor-pr-2018-09-06' of git://git.kernel.org/pub/scm/linux/kernel/git/jj/linux-apparmor · Linus Torvalds · 2018-09-06 · 1 · -1/+0
|\ \

Pull apparmor fix from John Johansen:
 "A fix for an issue syzbot discovered last week:

   - Fix for bad debug check when converting secids to secctx"

* tag 'apparmor-pr-2018-09-06' of git://git.kernel.org/pub/scm/linux/kernel/git/jj/linux-apparmor:
  apparmor: fix bad debug check in apparmor_secid_to_secctx()
| * | apparmor: fix bad debug check in apparmor_secid_to_secctx() · John Johansen · 2018-09-03 · 1 · -1/+0

apparmor_secid_to_secctx() has a bad debug statement tripping on a condition handled by the code. When kconfig SECURITY_APPARMOR_DEBUG is enabled, the debug WARN_ON will trip when secdata is NULL, resulting in the following trace.

------------[ cut here ]------------
AppArmor WARN apparmor_secid_to_secctx: ((!secdata)):
WARNING: CPU: 0 PID: 14826 at security/apparmor/secid.c:82 apparmor_secid_to_secctx+0x2b5/0x2f0 security/apparmor/secid.c:82
Kernel panic - not syncing: panic_on_warn set ...
CPU: 0 PID: 14826 Comm: syz-executor1 Not tainted 4.19.0-rc1+ #193
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x1c9/0x2b4 lib/dump_stack.c:113
 panic+0x238/0x4e7 kernel/panic.c:184
 __warn.cold.8+0x163/0x1ba kernel/panic.c:536
 report_bug+0x252/0x2d0 lib/bug.c:186
 fixup_bug arch/x86/kernel/traps.c:178 [inline]
 do_error_trap+0x1fc/0x4d0 arch/x86/kernel/traps.c:296
 do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:316
 invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:993
RIP: 0010:apparmor_secid_to_secctx+0x2b5/0x2f0 security/apparmor/secid.c:82
Code: c7 c7 40 66 58 87 e8 6a 6d 0f fe 0f 0b e9 6c fe ff ff e8 3e aa 44 fe 48 c7 c6 80 67 58 87 48 c7 c7 a0 65 58 87 e8 4b 6d 0f fe <0f> 0b e9 3f fe ff ff 48 89 df e8 fc a7 83 fe e9 ed fe ff ff bb f4
RSP: 0018:ffff8801ba1bed10 EFLAGS: 00010286
RAX: 0000000000000000 RBX: ffff8801ba1beed0 RCX: ffffc9000227e000
RDX: 0000000000018482 RSI: ffffffff8163ac01 RDI: 0000000000000001
RBP: ffff8801ba1bed30 R08: ffff8801b80ec080 R09: ffffed003b603eca
R10: ffffed003b603eca R11: ffff8801db01f657 R12: 0000000000000001
R13: 0000000000000000 R14: 0000000000000000 R15: ffff8801ba1beed0
 security_secid_to_secctx+0x63/0xc0 security/security.c:1314
 ctnetlink_secctx_size net/netfilter/nf_conntrack_netlink.c:621 [inline]
 ctnetlink_nlmsg_size net/netfilter/nf_conntrack_netlink.c:659 [inline]
 ctnetlink_conntrack_event+0x303/0x1470 net/netfilter/nf_conntrack_netlink.c:706
 nf_conntrack_eventmask_report+0x55f/0x930 net/netfilter/nf_conntrack_ecache.c:151
 nf_conntrack_event_report include/net/netfilter/nf_conntrack_ecache.h:112 [inline]
 nf_ct_delete+0x33c/0x5d0 net/netfilter/nf_conntrack_core.c:601
 nf_ct_iterate_cleanup+0x48c/0x5e0 net/netfilter/nf_conntrack_core.c:1892
 nf_ct_iterate_cleanup_net+0x23c/0x2d0 net/netfilter/nf_conntrack_core.c:1974
 ctnetlink_flush_conntrack net/netfilter/nf_conntrack_netlink.c:1226 [inline]
 ctnetlink_del_conntrack+0x66c/0x850 net/netfilter/nf_conntrack_netlink.c:1258
 nfnetlink_rcv_msg+0xd88/0x1070 net/netfilter/nfnetlink.c:228
 netlink_rcv_skb+0x172/0x440 net/netlink/af_netlink.c:2454
 nfnetlink_rcv+0x1c0/0x4d0 net/netfilter/nfnetlink.c:560
 netlink_unicast_kernel net/netlink/af_netlink.c:1317 [inline]
 netlink_unicast+0x5a0/0x760 net/netlink/af_netlink.c:1343
 netlink_sendmsg+0xa18/0xfc0 net/netlink/af_netlink.c:1908
 sock_sendmsg_nosec net/socket.c:621 [inline]
 sock_sendmsg+0xd5/0x120 net/socket.c:631
 ___sys_sendmsg+0x7fd/0x930 net/socket.c:2114
 __sys_sendmsg+0x11d/0x290 net/socket.c:2152
 __do_sys_sendmsg net/socket.c:2161 [inline]
 __se_sys_sendmsg net/socket.c:2159 [inline]
 __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2159
 do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x457089
Code: fd b4 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 cb b4 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007f7bc6e03c78 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 00007f7bc6e046d4 RCX: 0000000000457089
RDX: 0000000000000000 RSI: 0000000020d65000 RDI: 0000000000000003
RBP: 00000000009300a0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00000000ffffffff
R13: 00000000004d4588 R14: 00000000004c8d5c R15: 0000000000000000
Dumping ftrace buffer:
   (ftrace buffer empty)
Kernel Offset: disabled
Rebooting in 86400 seconds..

CC: <stable@vger.kernel.org> #4.18
Fixes: c092921219d2 ("apparmor: add support for mapping secids and using secctxes")
Reported-by: syzbot+21016130b0580a9de3b5@syzkaller.appspotmail.com
Signed-off-by: John Johansen <john.johansen@canonical.com>
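
The condition the debug check tripped on, a NULL output pointer that the function body already handles, can be illustrated in user space. This is a sketch and not the AppArmor code: demo_id_to_ctx() and its parameters are hypothetical, and treating a NULL output pointer as a length-only query is an assumption suggested by the ctnetlink_secctx_size caller in the trace above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical stand-in for a secid-to-secctx style helper.
 * out == NULL is a valid call that only reports the length, so asserting
 * on out here would be the same kind of bad debug check the commit removes.
 * Returns 0 on success and -1 on failure.
 */
static int demo_id_to_ctx(const char *label, char **out, unsigned int *len)
{
	assert(len != NULL);   /* the length pointer is always required */

	if (label == NULL)
		return -1;

	*len = (unsigned int)strlen(label);
	if (out == NULL)       /* size query: handled, nothing to copy */
		return 0;

	*out = malloc(*len + 1);
	if (*out == NULL)
		return -1;
	memcpy(*out, label, *len + 1);  /* copy including the terminator */
	return 0;
}
```

A caller that only needs to size a message passes NULL for out and reads len; a caller that needs the string passes a pointer and frees the result afterwards.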
* | | Merge tag 'trace-v4.19-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace · Linus Torvalds · 2018-09-06 · 2 · -4/+8
|\ \ \

Pull tracing fixes from Steven Rostedt:
 "This fixes two annoying bugs:

   - The first one is a side effect caused by using SRCU for rcuidle tracepoints. It seems that perf was depending on the rcuidle tracepoints to make RCU watch when it wasn't. The real fix will be to have perf use SRCU instead of depending on RCU watching, but that can't be done until SRCU is safe to use in NMI context (Paul's working on that).

   - The second bug fix is for a bug that's been periodically making my tests fail randomly for some time. I haven't had time to track it down, but finally have. It has to do with stressing NMIs (via perf) while enabling or disabling ftrace function handling with lockdep enabled. If an interrupt happens and just as it returns, it sets lockdep back to "interrupts enabled" but before it returns an NMI is triggered, and if this happens while printk_nmi_enter has a breakpoint attached to it (because ftrace is converting it to or from nop to call fentry), the breakpoint trap also calls into lockdep, and since returning from the NMI to an interrupt handler, interrupts were disabled when the NMI went off, lockdep keeps its state as interrupts disabled when it returns back from the interrupt handler where interrupts are enabled. This causes lockdep_assert_irqs_enabled() to trigger a false positive"

* tag 'trace-v4.19-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  printk/tracing: Do not trace printk_nmi_enter()
  tracing: Add back in rcu_irq_enter/exit_irqson() for rcuidle tracepoints
| * | | printk/tracing: Do not trace printk_nmi_enter() · Steven Rostedt (VMware) · 2018-09-06 · 1 · -2/+2

I hit the following splat in my tests:

------------[ cut here ]------------
IRQs not enabled as expected
WARNING: CPU: 3 PID: 0 at kernel/time/tick-sched.c:982 tick_nohz_idle_enter+0x44/0x8c
Modules linked in: ip6t_REJECT nf_reject_ipv6 ip6table_filter ip6_tables ipv6
CPU: 3 PID: 0 Comm: swapper/3 Not tainted 4.19.0-rc2-test+ #2
Hardware name: MSI MS-7823/CSM-H87M-G43 (MS-7823), BIOS V1.6 02/22/2014
EIP: tick_nohz_idle_enter+0x44/0x8c
Code: ec 05 00 00 00 75 26 83 b8 c0 05 00 00 00 75 1d 80 3d d0 36 3e c1 00 75 14 68 94 63 12 c1 c6 05 d0 36 3e c1 01 e8 04 ee f8 ff <0f> 0b 58 fa bb a0 e5 66 c1 e8 25 0f 04 00 64 03 1d 28 31 52 c1 8b
EAX: 0000001c EBX: f26e7f8c ECX: 00000006 EDX: 00000007
ESI: f26dd1c0 EDI: 00000000 EBP: f26e7f40 ESP: f26e7f38
DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010296
CR0: 80050033 CR2: 0813c6b0 CR3: 2f342000 CR4: 001406f0
Call Trace:
 do_idle+0x33/0x202
 cpu_startup_entry+0x61/0x63
 start_secondary+0x18e/0x1ed
 startup_32_smp+0x164/0x168
irq event stamp: 18773830
hardirqs last enabled at (18773829): [<c040150c>] trace_hardirqs_on_thunk+0xc/0x10
hardirqs last disabled at (18773830): [<c040151c>] trace_hardirqs_off_thunk+0xc/0x10
softirqs last enabled at (18773824): [<c0ddaa6f>] __do_softirq+0x25f/0x2bf
softirqs last disabled at (18773767): [<c0416bbe>] call_on_stack+0x45/0x4b
---[ end trace b7c64aa79e17954a ]---

After a bit of debugging, I found what was happening. This would trigger when performing "perf" with a high NMI interrupt rate, while enabling and disabling function tracer. Ftrace uses breakpoints to convert the nops at the start of functions to calls to the function trampolines. The breakpoint traps disable interrupts and this makes calls into lockdep via the trace_hardirqs_off_thunk in the entry.S code. What happens is the following:

  do_idle {

    [interrupts enabled]

    <interrupt> [interrupts disabled]
        TRACE_IRQS_OFF [lockdep says irqs off]
        [...]
        TRACE_IRQS_IRET
            test if pt_regs say return to interrupts enabled [yes]
            TRACE_IRQS_ON [lockdep says irqs are on]

            <nmi>
                nmi_enter() {
                    printk_nmi_enter() [traced by ftrace]
                    [ hit ftrace breakpoint ]
                    <breakpoint exception>
                        TRACE_IRQS_OFF [lockdep says irqs off]
                        [...]
                        TRACE_IRQS_IRET [return from breakpoint]
                            test if pt_regs say interrupts enabled [no]
                            [iret back to interrupt]
            [iret back to code]

    tick_nohz_idle_enter() {

        lockdep_assert_irqs_enabled() [lockdep say no!]

Although interrupts are indeed enabled, lockdep thinks they are not, and since we now do asserts via lockdep, it gives a false warning. The issue here is that printk_nmi_enter() is called before lockdep_off(), which disables lockdep (for this reason) in NMIs. By simply not allowing ftrace to see printk_nmi_enter() (via notrace annotation) we keep lockdep from getting confused.

Cc: stable@vger.kernel.org
Fixes: 42a0bb3f71383 ("printk/nmi: generic solution for safe printk in NMI")
Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
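
The event sequence in this commit message can be replayed with a two-flag state model: one flag for what the CPU does and one for what lockdep believes. This is a user-space sketch under stated assumptions, not kernel code; irq_model, exc_enter(), exc_trace_iret() and run_sequence() are hypothetical names, and each exception entry is reduced to "disable interrupts and tell lockdep".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the sequence above; not kernel code. */
struct irq_model {
	bool hw_irqs_on;       /* actual interrupt state of the CPU */
	bool lockdep_irqs_on;  /* the state lockdep has recorded */
};

/* Exception entry: interrupts go off and TRACE_IRQS_OFF informs lockdep. */
static void exc_enter(struct irq_model *m)
{
	m->hw_irqs_on = false;
	m->lockdep_irqs_on = false;
}

/* TRACE_IRQS_IRET: report "on" only if the saved regs return to irqs-on. */
static void exc_trace_iret(struct irq_model *m, bool regs_irqs_on)
{
	if (regs_irqs_on)
		m->lockdep_irqs_on = true;
}

/* Returns true if lockdep's view matches the hardware at the end. */
static bool run_sequence(bool printk_nmi_enter_traced)
{
	struct irq_model m = { .hw_irqs_on = true, .lockdep_irqs_on = true };

	exc_enter(&m);             /* <interrupt> taken from do_idle */
	exc_trace_iret(&m, true);  /* interrupt is about to return to irqs-on code */

	/* <nmi> arrives before the iret, and lockdep_off() has not run yet. */
	if (printk_nmi_enter_traced) {
		bool saved = m.hw_irqs_on;  /* false: interrupts are still off */

		exc_enter(&m);              /* ftrace breakpoint exception */
		exc_trace_iret(&m, saved);  /* regs say irqs off: no TRACE_IRQS_ON */
	}

	m.hw_irqs_on = true;       /* iret back to the interrupted code */
	return m.lockdep_irqs_on == m.hw_irqs_on;
}
```

With the breakpoint path taken, the model ends with interrupts enabled but lockdep recording them as disabled, which is the false warning; with printk_nmi_enter() hidden from ftrace, the two views agree.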