linux.git - Linux kernel mainline tree

	Commit message (Collapse)	Author	Age	Files	Lines
*	Merge tag 'for-5.14/dm-changes' of ↵	Linus Torvalds	2021-06-30	32	-592/+2411
\|\ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mike Snitzer: - Various DM persistent-data library improvements and fixes that benefit both the DM thinp and cache targets. - A few small DM kcopyd efficiency improvements. - Significant zoned related block core, DM core and DM zoned target changes that culminate with adding zoned append emulation (which is required to properly fix DM crypt's zoned support). - Various DM writecache target changes that improve efficiency. Adds an optional "metadata_only" feature that only promotes bios flagged with REQ_META. But the most significant improvement is writecache's ability to pause writeback, for a confiurable time, if/when the working set is larger than the cache (and the cache is full) -- this ensures performance is no worse than the slower origin device. * tag 'for-5.14/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (35 commits) dm writecache: make writeback pause configurable dm writecache: pause writeback if cache full and origin being written directly dm io tracker: factor out IO tracker dm btree remove: assign new_root only when removal succeeds dm zone: fix dm_revalidate_zones() memory allocation dm ps io affinity: remove redundant continue statement dm writecache: add optional "metadata_only" parameter dm writecache: add "cleaner" and "max_age" to Documentation dm writecache: write at least 4k when committing dm writecache: flush origin device when writing and cache is full dm writecache: have ssd writeback wait if the kcopyd workqueue is busy dm writecache: use list_move instead of list_del/list_add in writecache_writeback() dm writecache: commit just one block, not a full page dm writecache: remove unused gfp_t argument from wc_add_block() dm crypt: Fix zoned block device support dm: introduce zone append emulation dm: rearrange core declarations for extended use from dm-zone.c block: introduce BIO_ZONE_WRITE_LOCKED bio flag block: introduce bio zone helpers block: improve handling of all zones reset operation ...
\| *	dm writecache: make writeback pause configurable	Mikulas Patocka	2021-06-28	2	-8/+52
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Commit 95b88f4d71cb953e02206be3c757083601391a0f ("dm writecache: pause writeback if cache full and origin being written directly") introduced a code that pauses cache flushing if we are issuing writes directly to the origin. Improve that initial commit by making the timeout code configurable (via the option "pause_writeback"). Also change the default from 1s to 3s because it performed better. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
\| *	dm writecache: pause writeback if cache full and origin being written directly	Mikulas Patocka	2021-06-25	1	-1/+20
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Implementation reuses dm_io_tracker, that until now was only used by dm-cache, to track if any writes were issued directly to the origin (due to cache being full) within the last second. If so writeback is paused for a second. This change improves performance for when the cache is full and IO is issued directly to the origin device (rather than through the cache). Depends-on: d53f1fafec9d ("dm writecache: do direct write if the cache is full") Suggested-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
\| *	dm io tracker: factor out IO tracker	Mike Snitzer	2021-06-25	2	-76/+75
\| \| \| \| \| \| \| \| \| \| \| \|	Allow other code to use dm_io_tracker. Signed-off-by: Mike Snitzer <snitzer@redhat.com>
\| *	dm btree remove: assign new_root only when removal succeeds	Hou Tao	2021-06-25	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	remove_raw() in dm_btree_remove() may fail due to IO read error (e.g. read the content of origin block fails during shadowing), and the value of shadow_spine::root is uninitialized, but the uninitialized value is still assign to new_root in the end of dm_btree_remove(). For dm-thin, the value of pmd->details_root or pmd->root will become an uninitialized value, so if trying to read details_info tree again out-of-bound memory may occur as showed below: general protection fault, probably for non-canonical address 0x3fdcb14c8d7520 CPU: 4 PID: 515 Comm: dmsetup Not tainted 5.13.0-rc6 Hardware name: QEMU Standard PC RIP: 0010:metadata_ll_load_ie+0x14/0x30 Call Trace: sm_metadata_count_is_more_than_one+0xb9/0xe0 dm_tm_shadow_block+0x52/0x1c0 shadow_step+0x59/0xf0 remove_raw+0xb2/0x170 dm_btree_remove+0xf4/0x1c0 dm_pool_delete_thin_device+0xc3/0x140 pool_message+0x218/0x2b0 target_message+0x251/0x290 ctl_ioctl+0x1c4/0x4d0 dm_ctl_ioctl+0xe/0x20 __x64_sys_ioctl+0x7b/0xb0 do_syscall_64+0x40/0xb0 entry_SYSCALL_64_after_hwframe+0x44/0xae Fixing it by only assign new_root when removal succeeds Signed-off-by: Hou Tao <houtao1@huawei.com> Cc: stable@vger.kernel.org Signed-off-by: Mike Snitzer <snitzer@redhat.com>
\| *	dm zone: fix dm_revalidate_zones() memory allocation	Damien Le Moal	2021-06-25	1	-2/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Make sure that the zone write pointer offset array is allocated with a vmalloc in dm_zone_revalidate_cb() by passing GFP_KERNEL gfp flag to kvcalloc(). However, since we do not want to trigger IOs while revalidating zones, change dm_revalidate_zones() to have the zone scan done in GFP_NOIO context using memalloc_noio_save/restore calls. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Fixes: bb37d77239af ("dm: introduce zone append emulation") Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
\| *	dm ps io affinity: remove redundant continue statement	Colin Ian King	2021-06-25	1	-1/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The continue statement at the end of a for-loop has no effect, remove it. Addresses-Coverity: ("Continue has no effect") Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
\| *	dm writecache: add optional "metadata_only" parameter	Mikulas Patocka	2021-06-25	1	-4/+12
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Add a "metadata_only" parameter that when present: only metadata is promoted to the cache. This option improves performance for heavier REQ_META workloads (e.g. device-mapper-test-suite's "git clone and checkout" benchmark improves from 341s to 312s). Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
\| *	dm writecache: write at least 4k when committing	Mikulas Patocka	2021-06-21	1	-1/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	SSDs perform badly with sub-4k writes (because they perfrorm read-modify-write internally), so make sure writecache writes at least 4k when committing. Fixes: 991bd8d7bc78 ("dm writecache: commit just one block, not a full page") Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
\| *	dm writecache: flush origin device when writing and cache is full	Mikulas Patocka	2021-06-16	1	-2/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Commit d53f1fafec9d086f1c5166436abefdaef30e0363 ("dm writecache: do direct write if the cache is full") changed dm-writecache, so that it writes directly to the origin device if the cache is full. Unfortunately, it doesn't forward flush requests to the origin device, so that there is a bug where flushes are being ignored. Fix this by adding missing flush forwarding. For PMEM mode, we fix this bug by disabling direct writes to the origin device, because it performs better. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Fixes: d53f1fafec9d ("dm writecache: do direct write if the cache is full") Cc: stable@vger.kernel.org # v5.7+ Signed-off-by: Mike Snitzer <snitzer@redhat.com>
\| *	dm writecache: have ssd writeback wait if the kcopyd workqueue is busy	Mikulas Patocka	2021-06-15	2	-0/+11
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Make dm-writecache wait if the kcopyd workqueue is busy (as will happen if waiting for page allocation or inside submit_bio). This change improves performance of "mkfs.ext2" by approximately 20% on one testbed. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
\| *	dm writecache: use list_move instead of list_del/list_add in ↵	Baokun Li	2021-06-14	1	-6/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	writecache_writeback() Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Baokun Li <libaokun1@huawei.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
\| *	dm writecache: commit just one block, not a full page	Mikulas Patocka	2021-06-14	1	-5/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Some architectures have pages larger than 4k and committing a full page causes needless overhead. Fix this by writing a single block when committing the superblock. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
\| *	dm writecache: remove unused gfp_t argument from wc_add_block()	Mikulas Patocka	2021-06-14	1	-3/+3
\| \| \| \| \| \| \| \| \| \|	Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
\| *	dm crypt: Fix zoned block device support	Damien Le Moal	2021-06-04	1	-5/+19
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Zone append BIOs (REQ_OP_ZONE_APPEND) always specify the start sector of the zone to be written instead of the actual sector location to write. The write location is determined by the device and returned to the host upon completion of the operation. This interface, while simple and efficient for writing into sequential zones of a zoned block device, is incompatible with the use of sector values to calculate a cypher block IV. All data written in a zone end up using the same IV values corresponding to the first sectors of the zone, but read operation will specify any sector within the zone resulting in an IV mismatch between encryption and decryption. To solve this problem, report to DM core that zone append operations are not supported. This result in the zone append operations being emulated using regular write operations. Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
\| *	dm: introduce zone append emulation	Damien Le Moal	2021-06-04	5	-54/+612
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	For zoned targets that cannot support zone append operations, implement an emulation using regular write operations. If the original BIO submitted by the user is a zone append operation, change its clone into a regular write operation directed at the target zone write pointer position. To do so, an array of write pointer offsets (write pointer position relative to the start of a zone) is added to struct mapped_device. All operations that modify a sequential zone write pointer (writes, zone reset, zone finish and zone append) are intersepted in __map_bio() and processed using the new functions dm_zone_map_bio(). Detection of the target ability to natively support zone append operations is done from dm_table_set_restrictions() by calling the function dm_set_zones_restrictions(). A target that does not support zone append operation, either by explicitly declaring it using the new struct dm_target field zone_append_not_supported, or because the device table contains a non-zoned device, has its mapped device marked with the new flag DMF_ZONE_APPEND_EMULATED. The helper function dm_emulate_zone_append() is introduced to test a mapped device for this new flag. Atomicity of the zones write pointer tracking and updates is done using a zone write locking mechanism based on a bitmap. This is similar to the block layer method but based on BIOs rather than struct request. A zone write lock is taken in dm_zone_map_bio() for any clone BIO with an operation type that changes the BIO target zone write pointer position. The zone write lock is released if the clone BIO is failed before submission or when dm_zone_endio() is called when the clone BIO completes. The zone write lock bitmap of the mapped device, together with a bitmap indicating zone types (conv_zones_bitmap) and the write pointer offset array (zwp_offset) are allocated and initialized with a full device zone report in dm_set_zones_restrictions() using the function dm_revalidate_zones(). For failed operations that may have modified a zone write pointer, the zone write pointer offset is marked as invalid in dm_zone_endio(). Zones with an invalid write pointer offset are checked and the write pointer updated using an internal report zone operation when the faulty zone is accessed again by the user. All functions added for this emulation have a minimal overhead for zoned targets natively supporting zone append operations. Regular device targets are also not affected. The added code also does not impact builds with CONFIG_BLK_DEV_ZONED disabled by stubbing out all dm zone related functions. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
\| *	dm: rearrange core declarations for extended use from dm-zone.c	Damien Le Moal	2021-06-04	2	-52/+59
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Move the definitions of struct dm_target_io, struct dm_io and the bits of the flags field of struct mapped_device from dm.c to dm-core.h to make them usable from dm-zone.c. For the same reason, declare dec_pending() in dm-core.h after renaming it to dm_io_dec_pending(). And for symmetry of the function names, introduce the inline helper dm_io_inc_pending() instead of directly using atomic_inc() calls. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
\| *	dm: Forbid requeue of writes to zones	Damien Le Moal	2021-06-04	3	-6/+41
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	A target map method requesting the requeue of a bio with DM_MAPIO_REQUEUE or completing it with DM_ENDIO_REQUEUE can cause unaligned write errors if the bio is a write operation targeting a sequential zone. If a zoned target request such a requeue, warn about it and kill the IO. The function dm_is_zone_write() is introduced to detect write operations to zoned targets. This change does not affect the target drivers supporting zoned devices and exposing a zoned device, namely dm-crypt, dm-linear and dm-flakey as none of these targets ever request a requeue. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
\| *	dm: Introduce dm_report_zones()	Damien Le Moal	2021-06-04	4	-14/+29
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	To simplify the implementation of the report_zones operation of a zoned target, introduce the function dm_report_zones() to set a target mapping start sector in struct dm_report_zones_args and call blkdev_report_zones(). This new function is exported and the report zones callback function dm_report_zones_cb() is not. dm-linear, dm-flakey and dm-crypt are modified to use dm_report_zones(). Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
\| *	dm: move zone related code to dm-zone.c	Damien Le Moal	2021-06-04	5	-89/+119
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Move core and table code used for zoned targets and conditionally defined with #ifdef CONFIG_BLK_DEV_ZONED to the new file dm-zone.c. This file is conditionally compiled depending on CONFIG_BLK_DEV_ZONED. The small helper dm_set_zones_restrictions() is introduced to initialize a mapped device request queue zone attributes in dm_table_set_restrictions(). Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
\| *	dm: cleanup device_area_is_invalid()	Damien Le Moal	2021-06-04	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In device_area_is_invalid(), use bdev_is_zoned() instead of open coding the test on the zoned model returned by bdev_zoned_model(). Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
\| *	dm: Fix dm_accept_partial_bio() relative to zone management commands	Damien Le Moal	2021-06-04	1	-2/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Fix dm_accept_partial_bio() to actually check that zone management commands are not passed as explained in the function documentation comment. Also, since a zone append operation cannot be split, add REQ_OP_ZONE_APPEND as a forbidden command. White lines are added around the group of BUG_ON() calls to make the code more legible. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
\| *	dm zoned: check zone capacity	Damien Le Moal	2021-06-04	1	-0/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The dm-zoned target cannot support zoned block devices with zones that have a capacity smaller than the zone size (e.g. NVMe zoned namespaces) due to the current chunk zone mapping implementation as it is assumed that zones and chunks have the same size with all blocks usable. If a zoned drive is found to have zones with a capacity different from the zone size, fail the target initialization. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Cc: stable@vger.kernel.org # v5.9+ Signed-off-by: Mike Snitzer <snitzer@redhat.com>
\| *	dm table: Constify static struct blk_ksm_ll_ops	Rikard Falkeborn	2021-06-04	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The only usage of dm_ksm_ll_ops is to make a copy of it to the ksm_ll_ops field in the blk_keyslot_manager struct. Make it const to allow the compiler to put it in read-only memory. Signed-off-by: Rikard Falkeborn <rikard.falkeborn@gmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
\| *	dm writecache: interrupt writeback if suspended	Mikulas Patocka	2021-06-04	1	-2/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	If the DM device is suspended, interrupt the writeback sequence so that there is no excessive suspend delay. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
\| *	dm writecache: don't split bios when overwriting contiguous cache content	Mikulas Patocka	2021-06-04	1	-8/+30
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	If dm-writecache overwrites existing cached data, it splits the incoming bio into many block-sized bios. The I/O scheduler does merge these bios into one large request but this needless splitting and merging causes performance degradation. Fix this by avoiding bio splitting if the cache target area that is being overwritten is contiguous. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
\| *	dm kcopyd: avoid spin_lock_irqsave from process context	Mikulas Patocka	2021-06-04	1	-9/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The functions "pop", "push_head", "do_work" can only be called from process context. Therefore, replace spin_lock_irq{save,restore} with spin_{lock,unlock}_irq. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
\| *	dm kcopyd: avoid useless atomic operations	Mikulas Patocka	2021-06-04	3	-12/+12
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The functions set_bit and clear_bit are atomic. We don't need atomicity when making flags for dm-kcopyd. So, change them to direct manipulation of the flags. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
\| *	dm space map disk: cache a small number of index entries	Joe Thornber	2021-06-04	2	-6/+96
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The disk space map stores it's index entries in a btree, these are accessed very frequently, so having a few cached makes a big difference to performance. With this change provisioning a new block takes roughly 20% less cpu. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
\| *	dm space maps: improve performance with inc/dec on ranges of blocks	Joe Thornber	2021-06-04	15	-245/+774
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When we break sharing on btree nodes we typically need to increment the reference counts to every value held in the node. This can cause a lot of repeated calls to the space maps. Fix this by changing the interface to the space map inc/dec methods to take ranges of adjacent blocks to be operated on. For installations that are using a lot of snapshots this will reduce cpu overhead of fundamental operations such as provisioning a new block, or deleting a snapshot, by as much as 10 times. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
\| *	dm space maps: don't reset space map allocation cursor when committing	Joe Thornber	2021-06-04	2	-2/+16
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Current commit code resets the place where the search for free blocks will begin back to the start of the metadata device. There are a couple of repercussions to this: - The first allocation after the commit is likely to take longer than normal as it searches for a free block in an area that is likely to have very few free blocks (if any). - Any free blocks it finds will have been recently freed. Reusing them means we have fewer old copies of the metadata to aid recovery from hardware error. Fix these issues by leaving the cursor alone, only resetting when the search hits the end of the metadata device. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
\| *	dm btree: improve btree residency	Joe Thornber	2021-06-04	3	-31/+439
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This commit improves the residency of btrees built in the metadata for dm-thin and dm-cache. When inserting a new entry into a full btree node the current code splits the node into two. This can result in very many half full nodes, particularly if the insertions are occurring in an ascending order (as happens in dm-thin with large writes). With this commit, when we insert into a full node we first try and move some entries to a neighbouring node that has space, failing that it tries to split two neighbouring nodes into three. Results are given below. 'Residency' is how full nodes are on average as a percentage. Average instruction counts for the operations are given to show the extra processing has little overhead. +--------------------------+--------------------------+ \| Before \| After \| +------------+-----------+-----------+--------------+-----------+--------------+ \| Test \| Phase \| Residency \| Instructions \| Residency \| Instructions \| +------------+-----------+-----------+--------------+-----------+--------------+ \| Ascending \| insert \| 50 \| 1876 \| 96 \| 1930 \| \| \| overwrite \| 50 \| 1789 \| 96 \| 1746 \| \| \| lookup \| 50 \| 778 \| 96 \| 778 \| \| Descending \| insert \| 50 \| 3024 \| 96 \| 3181 \| \| \| overwrite \| 50 \| 1789 \| 96 \| 1746 \| \| \| lookup \| 50 \| 778 \| 96 \| 778 \| \| Random \| insert \| 68 \| 3800 \| 84 \| 3736 \| \| \| overwrite \| 68 \| 4254 \| 84 \| 3911 \| \| \| lookup \| 68 \| 779 \| 84 \| 779 \| \| Runs \| insert \| 63 \| 2546 \| 82 \| 2815 \| \| \| overwrite \| 63 \| 2013 \| 82 \| 1986 \| \| \| lookup \| 63 \| 778 \| 82 \| 779 \| +------------+-----------+-----------+--------------+-----------+--------------+ Ascending - keys are inserted in ascending order. Descending - keys are inserted in descending order. Random - keys are inserted in random order. Runs - keys are split into ascending runs of ~20 length. Then the runs are shuffled. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Colin Ian King <colin.king@canonical.com> # contains_key() fix Signed-off-by: Mike Snitzer <snitzer@redhat.com>
* \|	Merge tag 'for-5.14/drivers-2021-06-29' of git://git.kernel.dk/linux-block	Linus Torvalds	2021-06-30	13	-83/+149
\|\ \ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Pull block driver updates from Jens Axboe: "Pretty calm round, mostly just NVMe and a bit of MD: - NVMe updates (via Christoph) - improve the APST configuration algorithm (Alexey Bogoslavsky) - look for StorageD3Enable on companion ACPI device (Mario Limonciello) - allow selecting the network interface for TCP connections (Martin Belanger) - misc cleanups (Amit Engel, Chaitanya Kulkarni, Colin Ian King, Christoph) - move the ACPI StorageD3 code to drivers/acpi/ and add quirks for certain AMD CPUs (Mario Limonciello) - zoned device support for nvmet (Chaitanya Kulkarni) - fix the rules for changing the serial number in nvmet (Noam Gottlieb) - various small fixes and cleanups (Dan Carpenter, JK Kim, Chaitanya Kulkarni, Hannes Reinecke, Wesley Sheng, Geert Uytterhoeven, Daniel Wagner) - MD updates (Via Song) - iostats rewrite (Guoqing Jiang) - raid5 lock contention optimization (Gal Ofri) - Fall through warning fix (Gustavo) - Misc fixes (Gustavo, Jiapeng)" * tag 'for-5.14/drivers-2021-06-29' of git://git.kernel.dk/linux-block: (78 commits) nvmet: use NVMET_MAX_NAMESPACES to set nn value loop: Fix missing discard support when using LOOP_CONFIGURE nvme.h: add missing nvme_lba_range_type endianness annotations nvme: remove zeroout memset call for struct nvme-pci: remove zeroout memset call for struct nvmet: remove zeroout memset call for struct nvmet: add ZBD over ZNS backend support nvmet: add Command Set Identifier support nvmet: add nvmet_req_bio put helper for backends nvmet: add req cns error complete helper block: export blk_next_bio() nvmet: remove local variable nvmet: use nvme status value directly nvmet: use u32 type for the local variable nsid nvmet: use u32 for nvmet_subsys max_nsid nvmet: use req->cmd directly in file-ns fast path nvmet: use req->cmd directly in bdev-ns fast path nvmet: make ver stable once connection established nvmet: allow mn change if subsys not discovered nvmet: make sn stable once connection was established ...
\| * \|	md/raid5: avoid device_lock in read_one_chunk()	Gal Ofri	2021-06-14	1	-6/+23
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	There is a lock contention on device_lock in read_one_chunk(). device_lock is taken to sync conf->active_aligned_reads and conf->quiesce. read_one_chunk() takes the lock, then waits for quiesce=0 (resumed) before incrementing active_aligned_reads. raid5_quiesce() takes the lock, sets quiesce=2 (in-progress), then waits for active_aligned_reads to be zero before setting quiesce=1 (suspended). Introduce a fast (lockless) path in read_one_chunk(): activate aligned read without taking device_lock. In case quiesce starts while activating the aligned-read in fast path, deactivate it and revert to old behavior (take device_lock and wait for quiesce to finish). Add smp store/load in raid5_quiesce()/read_one_chunk() respectively to gaurantee that read_one_chunk() does not miss an ongoing quiesce. My setups: 1. 8 local nvme drives (each up to 250k iops). 2. 8 ram disks (brd). Each setup with raid6 (6+2), 1024 io threads on a 96 cpu-cores (48 per socket) system. Record both iops and cpu spent on this contention with rand-read-4k. Record bw with sequential-read-128k. Note: in most cases cpu is still busy but due to "new" bottlenecks. nvme: \| iops \| cpu \| bw ----------------------------------------------- without patch \| 1.6M \| ~50% \| 5.5GB/s with patch \| 2M (throttled) \| 0% \| 16GB/s (throttled) ram (brd): \| iops \| cpu \| bw ----------------------------------------------- without patch \| 2M \| ~80% \| 24GB/s with patch \| 4M \| 0% \| 55GB/s CC: Song Liu <song@kernel.org> CC: Neil Brown <neilb@suse.de> Reviewed-by: NeilBrown <neilb@suse.de> Signed-off-by: Gal Ofri <gal.ofri@storing.io> Signed-off-by: Song Liu <song@kernel.org>
\| * \|	md: add comments in md_integrity_register	Guoqing Jiang	2021-06-14	1	-0/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Given it is not obvious for the error handling, let's try to add some comments here to make it clear. Signed-off-by: Guoqing Jiang <jiangguoqing@kylinos.cn> Signed-off-by: Song Liu <song@kernel.org>
\| * \|	md: check level before create and exit io_acct_set	Guoqing Jiang	2021-06-14	1	-6/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The bio_set (io_acct_set) is used by personalities to clone bio and trace the timestamp of bio. Some personalities such as raid1/10 don't need the bio_set, so add check to not create it unconditionally. Also update the comment for md_account_bio to make it more clear. Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Guoqing Jiang <jiangguoqing@kylinos.cn> Signed-off-by: Song Liu <song@kernel.org>
\| * \|	md: Constify attribute_group structs	Rikard Falkeborn	2021-06-14	4	-7/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The attribute_group structs are never modified, they're only passed to sysfs_create_group() and sysfs_remove_group(). Make them const to allow the compiler to put them in read-only memory. Signed-off-by: Rikard Falkeborn <rikard.falkeborn@gmail.com> Signed-off-by: Song Liu <song@kernel.org>
\| * \|	md: mark some personalities as deprecated	Guoqing Jiang	2021-06-14	4	-6/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Mark the three personalities (linear, fault and multipath) as deprecated because: 1. people can use dm multipath or nvme multipath. 2. linear is already deprecated in MODULE_ALIAS. 3. no one actively using fault. Signed-off-by: Guoqing Jiang <jiangguoqing@kylinos.cn> Signed-off-by: Song Liu <song@kernel.org>
\| * \|	md/raid10: enable io accounting	Guoqing Jiang	2021-06-14	2	-0/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	For raid10, we record the start time between split bio and clone bio, and finish the accounting in the final endio. Also introduce start_time in r10bio accordingly. Signed-off-by: Guoqing Jiang <jiangguoqing@kylinos.cn> Signed-off-by: Song Liu <song@kernel.org>
\| * \|	md/raid1: enable io accounting	Guoqing Jiang	2021-06-14	2	-0/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	For raid1, we record the start time between split bio and clone bio, and finish the accounting in the final endio. Also introduce start_time in r1bio accordingly. Signed-off-by: Guoqing Jiang <jiangguoqing@kylinos.cn> Signed-off-by: Song Liu <song@kernel.org>
\| * \|	md/raid1: rename print_msg with r1bio_existed	Guoqing Jiang	2021-06-14	1	-4/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The caller of raid1_read_request could pass NULL or a valid pointer for "struct r1bio *r1_bio", so it actually means whether r1_bio is existed or not. Signed-off-by: Guoqing Jiang <jiangguoqing@kylinos.cn> Signed-off-by: Song Liu <song@kernel.org>
\| * \|	md/raid5: avoid redundant bio clone in raid5_read_one_chunk	Guoqing Jiang	2021-06-14	1	-14/+15
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	After enable io accounting, chunk read bio could be cloned twice which is not good. To avoid such inefficiency, let's clone align_bio from io_acct_set too, then we need only call md_account_bio in make_request unconditionally. Signed-off-by: Guoqing Jiang <jiangguoqing@kylinos.cn> Signed-off-by: Song Liu <song@kernel.org>
\| * \|	md/raid5: move checking badblock before clone bio in raid5_read_one_chunk	Guoqing Jiang	2021-06-14	1	-7/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We don't need to clone bio if the relevant region has badblock. Signed-off-by: Guoqing Jiang <jiangguoqing@kylinos.cn> Signed-off-by: Song Liu <song@kernel.org>
\| * \|	md: add io accounting for raid0 and raid5	Guoqing Jiang	2021-06-14	4	-3/+68
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We introduce a new bioset (io_acct_set) for raid0 and raid5 since they don't own clone infrastructure to accounting io. And the bioset is added to mddev instead of to raid0 and raid5 layer, because with this way, we can put common functions to md.h and reuse them in raid0 and raid5. Also struct md_io_acct is added accordingly which includes io start_time, the origin bio and cloned bio. Then we can call bio_{start,end}_io_acct to get related io status. Signed-off-by: Guoqing Jiang <jiangguoqing@kylinos.cn> Signed-off-by: Song Liu <song@kernel.org>
\| * \|	md: revert io stats accounting	Guoqing Jiang	2021-06-14	2	-46/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The commit 41d2d848e5c0 ("md: improve io stats accounting") could cause double fault problem per the report [1], and also it is not correct to change ->bi_end_io if md don't own it, so let's revert it. And io stats accounting will be replemented in later commits. [1]. https://lore.kernel.org/linux-raid/3bf04253-3fad-434a-63a7-20214e38cf26@gmail.com/T/#t Fixes: 41d2d848e5c0 ("md: improve io stats accounting") Signed-off-by: Guoqing Jiang <jiangguoqing@kylinos.cn> Signed-off-by: Song Liu <song@kernel.org>
* \| \|	Merge tag 'for-5.14/block-2021-06-29' of git://git.kernel.dk/linux-block	Linus Torvalds	2021-06-30	5	-45/+26
\|\ \ \ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Pull core block updates from Jens Axboe: - disk events cleanup (Christoph) - gendisk and request queue allocation simplifications (Christoph) - bdev_disk_changed cleanups (Christoph) - IO priority improvements (Bart) - Chained bio completion trace fix (Edward) - blk-wbt fixes (Jan) - blk-wbt enable/disable fix (Zhang) - Scheduler dispatch improvements (Jan, Ming) - Shared tagset scheduler improvements (John) - BFQ updates (Paolo, Luca, Pietro) - BFQ lock inversion fix (Jan) - Documentation improvements (Kir) - CLONE_IO block cgroup fix (Tejun) - Remove of ancient and deprecated block dump feature (zhangyi) - Discard merge fix (Ming) - Misc fixes or followup fixes (Colin, Damien, Dan, Long, Max, Thomas, Yang) * tag 'for-5.14/block-2021-06-29' of git://git.kernel.dk/linux-block: (129 commits) block: fix discard request merge block/mq-deadline: Remove a WARN_ON_ONCE() call blk-mq: update hctx->dispatch_busy in case of real scheduler blk: Fix lock inversion between ioc lock and bfqd lock bfq: Remove merged request already in bfq_requests_merged() block: pass a gendisk to bdev_disk_changed block: move bdev_disk_changed block: add the events* attributes to disk_attrs block: move the disk events code to a separate file block: fix trace completion for chained bio block/partitions/msdos: Fix typo inidicator -> indicator block, bfq: reset waker pointer with shared queues block, bfq: check waker only for queues with no in-flight I/O block, bfq: avoid delayed merge of async queues block, bfq: boost throughput by extending queue-merging times block, bfq: consider also creation time in delayed stable merge block, bfq: fix delayed stable merge check block, bfq: let also stably merged queues enjoy weight raising blk-wbt: make sure throttle is enabled properly blk-wbt: introduce a new disable state to prevent false positive by rwb_enabled() ...
\| * \| \|	blk-mq: improve the blk_mq_init_allocated_queue interface	Christoph Hellwig	2021-06-11	1	-6/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Don't return the passed in request_queue but a normal error code, and drop the elevator_init argument in favor of just calling elevator_init_mq directly from dm-rq. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Link: https://lore.kernel.org/r/20210602065345.355274-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
\| * \| \|	block: move bd_mutex to struct gendisk	Christoph Hellwig	2021-06-01	1	-3/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Replace the per-block device bd_mutex with a per-gendisk open_mutex, thus simplifying locking wherever we deal with partitions. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com> Link: https://lore.kernel.org/r/20210525061301.2242282-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
\| * \| \|	md: convert to blk_alloc_disk/blk_cleanup_disk	Christoph Hellwig	2021-06-01	1	-16/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Convert the md driver to use the blk_alloc_disk and blk_cleanup_disk helpers to simplify gendisk and request_queue allocation. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org> Link: https://lore.kernel.org/r/20210521055116.1053587-15-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
\| * \| \|	dm: convert to blk_alloc_disk/blk_cleanup_disk	Christoph Hellwig	2021-06-01	1	-9/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Convert the dm driver to use the blk_alloc_disk and blk_cleanup_disk helpers to simplify gendisk and request_queue allocation. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org> Link: https://lore.kernel.org/r/20210521055116.1053587-14-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>