summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds2019-07-261-2/+2
|\ | | | | | | | | | | | | | | | | | | | | | | Pull vfs umount_tree() leak fix from Al Viro: "Fix braino introduced in 'switch the remnants of releasing the mountpoint away from fs_pin'. The most visible result is leaking struct mount when mounting btrfs, making it impossible to shut down" * 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fix the struct mount leak in umount_tree()
| * fix the struct mount leak in umount_tree()Al Viro2019-07-261-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | We need to drop everything we remove from the tree, whether mnt_has_parent() is true or not. Usually the bug manifests as a slow memory leak (leaked struct mount for initramfs); it becomes much more visible in mount_subtree() users, such as btrfs. There we leak a struct mount for btrfs superblock being mounted, which prevents fs shutdown on subsequent umount. Fixes: 56cbb429d911 ("switch the remnants of releasing the mountpoint away from fs_pin") Reported-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | Merge tag 'for-linus-20190726' of git://git.kernel.dk/linux-blockLinus Torvalds2019-07-2620-92/+224
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull block fixes from Jens Axboe: - Several io_uring fixes/improvements: - Blocking fix for O_DIRECT (me) - Latter page slowness for registered buffers (me) - Fix poll hang under certain conditions (me) - Defer sequence check fix for wrapped rings (Zhengyuan) - Mismatch in async inc/dec accounting (Zhengyuan) - Memory ordering issue that could cause stall (Zhengyuan) - Track sequential defer in bytes, not pages (Zhengyuan) - NVMe pull request from Christoph - Set of hang fixes for wbt (Josef) - Redundant error message kill for libahci (Ding) - Remove unused blk_mq_sched_started_request() and related ops (Marcos) - drbd dynamic alloc shash descriptor to reduce stack use (Arnd) - blkcg ->pd_stat() non-debug print (Tejun) - bcache memory leak fix (Wei) - Comment fix (Akinobu) - BFQ perf regression fix (Paolo) * tag 'for-linus-20190726' of git://git.kernel.dk/linux-block: (24 commits) io_uring: ensure ->list is initialized for poll commands Revert "nvme-pci: don't create a read hctx mapping without read queues" nvme: fix multipath crash when ANA is deactivated nvme: fix memory leak caused by incorrect subsystem free nvme: ignore subnqn for ADATA SX6000LNP drbd: dynamically allocate shash descriptor block: blk-mq: Remove blk_mq_sched_started_request and started_request bcache: fix possible memory leak in bch_cached_dev_run() io_uring: track io length in async_list based on bytes io_uring: don't use iov_iter_advance() for fixed buffers block: properly handle IOCB_NOWAIT for async O_DIRECT IO blk-mq: allow REQ_NOWAIT to return an error inline io_uring: add a memory barrier before atomic_read rq-qos: use a mb for got_token rq-qos: set ourself TASK_UNINTERRUPTIBLE after we schedule rq-qos: don't reset has_sleepers on spurious wakeups rq-qos: fix missed wake-ups in rq_qos_throttle wait: add wq_has_single_sleeper helper block, bfq: check also in-flight I/O in dispatch plugging block: fix sysfs module parameters directory path in comment ...
| * \ Merge branch 'nvme-5.3' of git://git.infradead.org/nvme into for-linusJens Axboe2019-07-254-17/+15
| |\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull NVMe fixes from Christoph. * 'nvme-5.3' of git://git.infradead.org/nvme: Revert "nvme-pci: don't create a read hctx mapping without read queues" nvme: fix multipath crash when ANA is deactivated nvme: fix memory leak caused by incorrect subsystem free nvme: ignore subnqn for ADATA SX6000LNP
| | * | Revert "nvme-pci: don't create a read hctx mapping without read queues"yangerkun2019-07-231-3/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit 0298d5435276e7795b0b939d74827f6e775e7009. With this patch, set 'poll_queues > hard queues' will lead to 'nr_read_queues = 0' in nvme_calc_irq_sets. Then poll_queues setting can fail since dev->tagset.nr_maps equals to 2 and nvme_pci_map_queues will not do map for poll queues. Signed-off-by: yangerkun <yangerkun@huawei.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
| | * | nvme: fix multipath crash when ANA is deactivatedMarta Rybczynska2019-07-232-7/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix a crash with multipath activated. It happends when ANA log page is larger than MDTS and because of that ANA is disabled. The driver then tries to access unallocated buffer when connecting to a nvme target. The signature is as follows: [ 300.433586] nvme nvme0: ANA log page size (8208) larger than MDTS (8192). [ 300.435387] nvme nvme0: disabling ANA support. [ 300.437835] nvme nvme0: creating 4 I/O queues. [ 300.459132] nvme nvme0: new ctrl: NQN "nqn.0.0.0", addr 10.91.0.1:8009 [ 300.464609] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008 [ 300.466342] #PF error: [normal kernel read fault] [ 300.467385] PGD 0 P4D 0 [ 300.467987] Oops: 0000 [#1] SMP PTI [ 300.468787] CPU: 3 PID: 50 Comm: kworker/u8:1 Not tainted 5.0.20kalray+ #4 [ 300.470264] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 [ 300.471532] Workqueue: nvme-wq nvme_scan_work [nvme_core] [ 300.472724] RIP: 0010:nvme_parse_ana_log+0x21/0x140 [nvme_core] [ 300.474038] Code: 45 01 d2 d8 48 98 c3 66 90 0f 1f 44 00 00 41 57 41 56 41 55 41 54 55 53 48 89 fb 48 83 ec 08 48 8b af 20 0a 00 00 48 89 34 24 <66> 83 7d 08 00 0f 84 c6 00 00 00 44 8b 7d 14 49 89 d5 8b 55 10 48 [ 300.477374] RSP: 0018:ffffa50e80fd7cb8 EFLAGS: 00010296 [ 300.478334] RAX: 0000000000000001 RBX: ffff9130f1872258 RCX: 0000000000000000 [ 300.479784] RDX: ffffffffc06c4c30 RSI: ffff9130edad4280 RDI: ffff9130f1872258 [ 300.481488] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000044 [ 300.483203] R10: 0000000000000220 R11: 0000000000000040 R12: ffff9130f18722c0 [ 300.484928] R13: ffff9130f18722d0 R14: ffff9130edad4280 R15: ffff9130f18722c0 [ 300.486626] FS: 0000000000000000(0000) GS:ffff9130f7b80000(0000) knlGS:0000000000000000 [ 300.488538] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 300.489907] CR2: 0000000000000008 CR3: 00000002365e6000 CR4: 00000000000006e0 [ 300.491612] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 300.493303] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 300.494991] Call Trace: [ 300.495645] nvme_mpath_add_disk+0x5c/0xb0 [nvme_core] [ 300.496880] nvme_validate_ns+0x2ef/0x550 [nvme_core] [ 300.498105] ? nvme_identify_ctrl.isra.45+0x6a/0xb0 [nvme_core] [ 300.499539] nvme_scan_work+0x2b4/0x370 [nvme_core] [ 300.500717] ? __switch_to_asm+0x35/0x70 [ 300.501663] process_one_work+0x171/0x380 [ 300.502340] worker_thread+0x49/0x3f0 [ 300.503079] kthread+0xf8/0x130 [ 300.503795] ? max_active_store+0x80/0x80 [ 300.504690] ? kthread_bind+0x10/0x10 [ 300.505502] ret_from_fork+0x35/0x40 [ 300.506280] Modules linked in: nvme_tcp nvme_rdma rdma_cm iw_cm ib_cm ib_core nvme_fabrics nvme_core xt_physdev ip6table_raw ip6table_mangle ip6table_filter ip6_tables xt_comment iptable_nat nf_nat_ipv4 nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_CHECKSUM iptable_mangle iptable_filter veth ebtable_filter ebtable_nat ebtables iptable_raw vxlan ip6_udp_tunnel udp_tunnel sunrpc joydev pcspkr virtio_balloon br_netfilter bridge stp llc ip_tables xfs libcrc32c ata_generic pata_acpi virtio_net virtio_console net_failover virtio_blk failover ata_piix serio_raw libata virtio_pci virtio_ring virtio [ 300.514984] CR2: 0000000000000008 [ 300.515569] ---[ end trace faa2eefad7e7f218 ]--- [ 300.516354] RIP: 0010:nvme_parse_ana_log+0x21/0x140 [nvme_core] [ 300.517330] Code: 45 01 d2 d8 48 98 c3 66 90 0f 1f 44 00 00 41 57 41 56 41 55 41 54 55 53 48 89 fb 48 83 ec 08 48 8b af 20 0a 00 00 48 89 34 24 <66> 83 7d 08 00 0f 84 c6 00 00 00 44 8b 7d 14 49 89 d5 8b 55 10 48 [ 300.520353] RSP: 0018:ffffa50e80fd7cb8 EFLAGS: 00010296 [ 300.521229] RAX: 0000000000000001 RBX: ffff9130f1872258 RCX: 0000000000000000 [ 300.522399] RDX: ffffffffc06c4c30 RSI: ffff9130edad4280 RDI: ffff9130f1872258 [ 300.523560] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000044 [ 300.524734] R10: 0000000000000220 R11: 0000000000000040 R12: ffff9130f18722c0 [ 300.525915] R13: ffff9130f18722d0 R14: ffff9130edad4280 R15: ffff9130f18722c0 [ 300.527084] FS: 0000000000000000(0000) GS:ffff9130f7b80000(0000) knlGS:0000000000000000 [ 300.528396] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 300.529440] CR2: 0000000000000008 CR3: 00000002365e6000 CR4: 00000000000006e0 [ 300.530739] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 300.531989] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 300.533264] Kernel panic - not syncing: Fatal exception [ 300.534338] Kernel Offset: 0x17c00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff) [ 300.536227] ---[ end Kernel panic - not syncing: Fatal exception ]--- Condition check refactoring from Christoph Hellwig. Signed-off-by: Marta Rybczynska <marta.rybczynska@kalray.eu> Tested-by: Jean-Baptiste Riaux <jbriaux@kalray.eu> Signed-off-by: Christoph Hellwig <hch@lst.de>
| | * | nvme: fix memory leak caused by incorrect subsystem freeLogan Gunthorpe2019-07-231-7/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When freeing the subsystem after finding another match with __nvme_find_get_subsystem(), use put_device() instead of __nvme_release_subsystem() which calls kfree() directly. Per the documentation, put_device() should always be used after device_initialization() is called. Otherwise, leaks like the one below which was detected by kmemleak may occur. Once the call of __nvme_release_subsystem() is removed it no longer makes sense to keep the helper, so fold it back into nvme_release_subsystem(). unreferenced object 0xffff8883d12bfbc0 (size 16): comm "nvme", pid 2635, jiffies 4294933602 (age 739.952s) hex dump (first 16 bytes): 6e 76 6d 65 2d 73 75 62 73 79 73 32 00 88 ff ff nvme-subsys2.... backtrace: [<000000007d8fc208>] __kmalloc_track_caller+0x16d/0x2a0 [<0000000081169e5f>] kvasprintf+0xad/0x130 [<0000000025626f25>] kvasprintf_const+0x47/0x120 [<00000000fa66ad36>] kobject_set_name_vargs+0x44/0x120 [<000000004881f8b3>] dev_set_name+0x98/0xc0 [<000000007124dae3>] nvme_init_identify+0x1995/0x38e0 [<000000009315020a>] nvme_loop_configure_admin_queue+0x4fa/0x5e0 [<000000001a63e766>] nvme_loop_create_ctrl+0x489/0xf80 [<00000000a46ecc23>] nvmf_dev_write+0x1a12/0x2220 [<000000002259b3d5>] __vfs_write+0x66/0x120 [<000000002f6df81e>] vfs_write+0x154/0x490 [<000000007e8cfc19>] ksys_write+0x10a/0x240 [<00000000ff5c7b85>] __x64_sys_write+0x73/0xb0 [<00000000fee6d692>] do_syscall_64+0xaa/0x470 [<00000000997e1ede>] entry_SYSCALL_64_after_hwframe+0x49/0xbe Fixes: ab9e00cc72fa ("nvme: track subsystems") Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
| | * | nvme: ignore subnqn for ADATA SX6000LNPMisha Nasledov2019-07-231-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The ADATA SX6000LNP NVMe SSDs have the same subnqn and, due to this, a system with more than one of these SSDs will only have one usable. [ 0.942706] nvme nvme1: ignoring ctrl due to duplicate subnqn (nqn.2018-05.com.example:nvme:nvm-subsystem-OUI00E04C). [ 0.943017] nvme nvme1: Removing after probe failure status: -22 02:00.0 Non-Volatile memory controller [0108]: Realtek Semiconductor Co., Ltd. Device [10ec:5762] (rev 01) 71:00.0 Non-Volatile memory controller [0108]: Realtek Semiconductor Co., Ltd. Device [10ec:5762] (rev 01) There are no firmware updates available from the vendor, unfortunately. Applying the NVME_QUIRK_IGNORE_DEV_SUBNQN quirk for these SSDs resolves the issue, and they all work after this patch: /dev/nvme0n1 2J1120050420 ADATA SX6000LNP [...] /dev/nvme1n1 2J1120050540 ADATA SX6000LNP [...] Signed-off-by: Misha Nasledov <misha@nasledov.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
| * | | io_uring: ensure ->list is initialized for poll commandsJens Axboe2019-07-251-0/+2
| |/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Daniel reports that when testing an http server that uses io_uring to poll for incoming connections, sometimes it hard crashes. This is due to an uninitialized list member for the io_uring request. Normally this doesn't trigger and none of the test cases caught it. Reported-by: Daniel Kozak <kozzi11@gmail.com> Tested-by: Daniel Kozak <kozzi11@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | drbd: dynamically allocate shash descriptorArnd Bergmann2019-07-231-2/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Building with clang and KASAN, we get a warning about an overly large stack frame on 32-bit architectures: drivers/block/drbd/drbd_receiver.c:921:31: error: stack frame size of 1280 bytes in function 'conn_connect' [-Werror,-Wframe-larger-than=] We already allocate other data dynamically in this function, so just do the same for the shash descriptor, which makes up most of this memory. Link: https://lore.kernel.org/lkml/20190617132440.2721536-1-arnd@arndb.de/ Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Roland Kammerer <roland.kammerer@linbit.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | block: blk-mq: Remove blk_mq_sched_started_request and started_requestMarcos Paulo de Souza2019-07-233-12/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | blk_mq_sched_completed_request is a function that checks if the elevator related to the request has started_request implemented, but currently, none of the available IO schedulers implement started_request, so remove both. Signed-off-by: Marcos Paulo de Souza <marcos.souza.org@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | bcache: fix possible memory leak in bch_cached_dev_run()Wei Yongjun2019-07-221-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | memory malloced in bch_cached_dev_run() and should be freed before leaving from the error handling cases, otherwise it will cause memory leak. Fixes: 0b13efecf5f2 ("bcache: add return value check to bch_cached_dev_run()") Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | io_uring: track io length in async_list based on bytesZhengyuan Liu2019-07-211-13/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We are using PAGE_SIZE as the unit to determine if the total len in async_list has exceeded max_pages, it's not fair for smaller io sizes. For example, if we are doing 1k-size io streams, we will never exceed max_pages since len >>= PAGE_SHIFT always gets zero. So use original bytes to make it more accurate. Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | io_uring: don't use iov_iter_advance() for fixed buffersJens Axboe2019-07-211-2/+38
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Hrvoje reports that when a large fixed buffer is registered and IO is being done to the latter pages of said buffer, the IO submission time is much worse: reading to the start of the buffer: 11238 ns reading to the end of the buffer: 1039879 ns In fact, it's worse by two orders of magnitude. The reason for that is how io_uring figures out how to setup the iov_iter. We point the iter at the first bvec, and then use iov_iter_advance() to fast-forward to the offset within that buffer we need. However, that is abysmally slow, as it entails iterating the bvecs that we setup as part of buffer registration. There's really no need to use this generic helper, as we know it's a BVEC type iterator, and we also know that each bvec is PAGE_SIZE in size, apart from possibly the first and last. Hence we can just use a shift on the offset to find the right index, and then adjust the iov_iter appropriately. After this fix, the timings are: reading to the start of the buffer: 10135 ns reading to the end of the buffer: 1377 ns Or about an 755x improvement for the tail page. Reported-by: Hrvoje Zeba <zeba.hrvoje@gmail.com> Tested-by: Hrvoje Zeba <zeba.hrvoje@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | block: properly handle IOCB_NOWAIT for async O_DIRECT IOJens Axboe2019-07-211-8/+50
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A caller is supposed to pass in REQ_NOWAIT if we can't block for any given operation, but O_DIRECT for block devices just ignore this. Hence we'll block for various resource shortages on the block layer side, like having to wait for requests. Use the new REQ_NOWAIT_INLINE to ask for this error to be returned inline, so we can handle it appropriately and return -EAGAIN to the caller. Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | blk-mq: allow REQ_NOWAIT to return an error inlineJens Axboe2019-07-212-3/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | By default, if a caller sets REQ_NOWAIT and we need to block, we'll return -EAGAIN through the bio->bi_end_io() callback. For some use cases, this makes it hard to use. Allow a caller to ask for inline return of errors related to blocking by also setting REQ_NOWAIT_INLINE. Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | io_uring: add a memory barrier before atomic_readZhengyuan Liu2019-07-181-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is a hang issue while using fio to do some basic test. The issue can be easily reproduced using the below script: while true do fio --ioengine=io_uring -rw=write -bs=4k -numjobs=1 \ -size=1G -iodepth=64 -name=uring --filename=/dev/zero done After several minutes (or more), fio would block at io_uring_enter->io_cqring_wait in order to waiting for previously committed sqes to be completed and can't return to user anymore until we send a SIGTERM to fio. After receiving SIGTERM, fio hangs at io_ring_ctx_wait_and_kill with a backtrace like this: [54133.243816] Call Trace: [54133.243842] __schedule+0x3a0/0x790 [54133.243868] schedule+0x38/0xa0 [54133.243880] schedule_timeout+0x218/0x3b0 [54133.243891] ? sched_clock+0x9/0x10 [54133.243903] ? wait_for_completion+0xa3/0x130 [54133.243916] ? _raw_spin_unlock_irq+0x2c/0x40 [54133.243930] ? trace_hardirqs_on+0x3f/0xe0 [54133.243951] wait_for_completion+0xab/0x130 [54133.243962] ? wake_up_q+0x70/0x70 [54133.243984] io_ring_ctx_wait_and_kill+0xa0/0x1d0 [54133.243998] io_uring_release+0x20/0x30 [54133.244008] __fput+0xcf/0x270 [54133.244029] ____fput+0xe/0x10 [54133.244040] task_work_run+0x7f/0xa0 [54133.244056] do_exit+0x305/0xc40 [54133.244067] ? get_signal+0x13b/0xbd0 [54133.244088] do_group_exit+0x50/0xd0 [54133.244103] get_signal+0x18d/0xbd0 [54133.244112] ? _raw_spin_unlock_irqrestore+0x36/0x60 [54133.244142] do_signal+0x34/0x720 [54133.244171] ? exit_to_usermode_loop+0x7e/0x130 [54133.244190] exit_to_usermode_loop+0xc0/0x130 [54133.244209] do_syscall_64+0x16b/0x1d0 [54133.244221] entry_SYSCALL_64_after_hwframe+0x49/0xbe The reason is that we had added a req to ctx->pending_async at the very end, but it didn't get a chance to be processed. How could this happen? fio#cpu0 wq#cpu1 io_add_to_prev_work io_sq_wq_submit_work atomic_read() <<< 1 atomic_dec_return() << 1->0 list_empty(); <<< true; list_add_tail() atomic_read() << 0 or 1? As atomic_ops.rst states, atomic_read does not guarantee that the runtime modification by any other thread is visible yet, so we must take care of that with a proper implicit or explicit memory barrier. This issue was detected with the help of Jackie's <liuyun01@kylinos.cn> Fixes: 31b515106428 ("io_uring: allow workqueue item to handle multiple buffered requests") Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | rq-qos: use a mb for got_tokenJosef Bacik2019-07-181-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Oleg noticed that our checking of data.got_token is unsafe in the cleanup case, and should really use a memory barrier. Use a wmb on the write side, and a rmb() on the read side. We don't need one in the main loop since we're saved by set_current_state(). Reviewed-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | rq-qos: set ourself TASK_UNINTERRUPTIBLE after we scheduleJosef Bacik2019-07-181-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | In case we get a spurious wakeup we need to make sure to re-set ourselves to TASK_UNINTERRUPTIBLE so we don't busy wait. Reviewed-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | rq-qos: don't reset has_sleepers on spurious wakeupsJosef Bacik2019-07-181-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If we raced with somebody else getting an inflight counter we could fail to get an inflight counter with no sleepers on the list, and thus need to go to sleep. In this case has_sleepers should be true because we are now relying on the waker to get our inflight counter for us. And in the case of spurious wakeups we'd still want this to be the case. So set has_sleepers to true if we went to sleep to make sure we're woken up the proper way. Reviewed-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | rq-qos: fix missed wake-ups in rq_qos_throttleJosef Bacik2019-07-181-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We saw a hang in production with WBT where there was only one waiter in the throttle path and no outstanding IO. This is because of the has_sleepers optimization that is used to make sure we don't steal an inflight counter for new submitters when there are people already on the list. We can race with our check to see if the waitqueue has any waiters (this is done locklessly) and the time we actually add ourselves to the waitqueue. If this happens we'll go to sleep and never be woken up because nobody is doing IO to wake us up. Fix this by checking if the waitqueue has a single sleeper on the list after we add ourselves, that way we have an uptodate view of the list. Reviewed-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | wait: add wq_has_single_sleeper helperJosef Bacik2019-07-181-0/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | rq-qos sits in the io path so we want to take locks as sparingly as possible. To accomplish this we try not to take the waitqueue head lock unless we are sure we need to go to sleep, and we have an optimization to make sure that we don't starve out existing waiters. Since we check if there are existing waiters locklessly we need to be able to update our view of the waitqueue list after we've added ourselves to the waitqueue. Accomplish this by adding this helper to see if there is more than just ourselves on the list. Reviewed-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | block, bfq: check also in-flight I/O in dispatch pluggingPaolo Valente2019-07-181-24/+43
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Consider a sync bfq_queue Q that remains empty while in service, and suppose that, when this happens, there is a fair amount of already in-flight I/O not belonging to Q. In such a situation, I/O dispatching may need to be plugged (until new I/O arrives for Q), for the following reason. The drive may decide to serve in-flight non-Q's I/O requests before Q's ones, thereby delaying the arrival of new I/O requests for Q (recall that Q is sync). If I/O-dispatching is not plugged, then, while Q remains empty, a basically uncontrolled amount of I/O from other queues may be dispatched too, possibly causing the service of Q's I/O to be delayed even longer in the drive. This problem gets more and more serious as the speed and the queue depth of the drive grow, because, as these two quantities grow, the probability to find no queue busy but many requests in flight grows too. If Q has the same weight and priority as the other queues, then the above delay is unlikely to cause any issue, because all queues tend to undergo the same treatment. So, since not plugging I/O dispatching is convenient for throughput, it is better not to plug. Things change in case Q has a higher weight or priority than some other queue, because Q's service guarantees may simply be violated. For this reason, commit 1de0c4cd9ea6 ("block, bfq: reduce idling only in symmetric scenarios") does plug I/O in such an asymmetric scenario. Plugging minimizes the delay induced by already in-flight I/O, and enables Q to recover the bandwidth it may lose because of this delay. Yet the above commit does not cover the case of weight-raised queues, for efficiency concerns. For weight-raised queues, I/O-dispatch plugging is activated simply if not all bfq_queues are weight-raised. But this check does not handle the case of in-flight requests, because a bfq_queue may become non busy *before* all its in-flight requests are completed. This commit performs I/O-dispatch plugging for weight-raised queues if there are some in-flight requests. As a practical example of the resulting recover of control, under write load on a Samsung SSD 970 PRO, gnome-terminal starts in 1.5 seconds after this fix, against 15 seconds before the fix (as a reference, gnome-terminal takes about 35 seconds to start with any of the other I/O schedulers). Fixes: 1de0c4cd9ea6 ("block, bfq: reduce idling only in symmetric scenarios") Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | block: fix sysfs module parameters directory path in commentAkinobu Mita2019-07-161-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | The runtime configurable module parameter files are located under /sys/module/MODULENAME/parameters, not /sys/module/MODULENAME. Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | blkcg: allow blkcg_policy->pd_stat() to print non-debug info tooTejun Heo2019-07-163-6/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, ->pd_stat() is called only when moduleparam blkcg_debug_stats is set which prevents it from printing non-debug policy-specific statistics. Let's move debug testing down so that ->pd_stat() can print non-debug stat too. This patch doesn't cause any visible behavior change. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | io_uring: fix counter inc/dec mismatch in async_listZhengyuan Liu2019-07-161-1/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We could queue a work for each req in defer and link list without increasing async_list->cnt, so we shouldn't decrease it while exiting from workqueue as well if we didn't process the req in async list. Thanks to Jens Axboe <axboe@kernel.dk> for his guidance. Fixes: 31b515106428 ("io_uring: allow workqueue item to handle multiple buffered requests") Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | ata: libahci_platform: remove redundant dev_err messageDing Xiang2019-07-161-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | devm_ioremap_resource already contains error message, so remove the redundant dev_err message Signed-off-by: Ding Xiang <dingxiang@cmss.chinamobile.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | io_uring: fix the sequence comparison in io_sequence_deferZhengyuan Liu2019-07-161-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | sq->cached_sq_head and cq->cached_cq_tail are both unsigned int. If cached_sq_head overflows before cached_cq_tail, then we may miss a barrier req. As cached_cq_tail always follows cached_sq_head, the NQ should be enough. Cc: stable@vger.kernel.org Fixes: de0617e46717 ("io_uring: add support for marking commands as draining") Signed-off-by: Zhengyuan Liu <liuzhengyuan@kylinos.cn> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | | Merge tag 'sound-5.3-rc2' of ↵Linus Torvalds2019-07-269-34/+65
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound Pull sound fixes from Takashi Iwai: "All relatively small changes: - a regression fix for PCM link code with CONFIG_REFCOUNT_FULL; stumbled on a slight difference between atomic_t and refcount_t - a couple of HD-audio stabilization patches addressing the too slow PM resume seen on some Intel chips - a series of ALSA compress-offload API fixes, including the regression by the previous capture stream support - trivial LINE6 USB-audio driver fixes, a new Conexant HD-audio chip coverage, and a fix in AC97 bus error path" * tag 'sound-5.3-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: ALSA: hda - Add a conexant codec entry to let mute led work ALSA: hda - Fix intermittent CORB/RIRB stall on Intel chips ALSA: ac97: Fix double free of ac97_codec_device ALSA: compress: Be more restrictive about when a drain is allowed ALSA: compress: Don't allow paritial drain operations on capture streams ALSA: compress: Prevent bypasses of set_params ALSA: compress: Fix regression on compressed capture streams ALSA: line6: Fix a typo ALSA: pcm: Fix refcount_inc() on zero usage ALSA: line6: Fix wrong altsetting for LINE6_PODHD500_1 ALSA: hda - Optimize resume for codecs without jack detection
| * | | ALSA: hda - Add a conexant codec entry to let mute led workHui Wang2019-07-251-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This conexant codec isn't in the supported codec list yet, the hda generic driver can drive this codec well, but on a Lenovo machine with mute/mic-mute leds, we need to apply CXT_FIXUP_THINKPAD_ACPI to make the leds work. After adding this codec to the list, the driver patch_conexant.c will apply THINKPAD_ACPI to this machine. Cc: stable@vger.kernel.org Signed-off-by: Hui Wang <hui.wang@canonical.com> Signed-off-by: Takashi Iwai <tiwai@suse.de>
| * | | ALSA: hda - Fix intermittent CORB/RIRB stall on Intel chipsTakashi Iwai2019-07-251-3/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It turned out that the recent Intel HD-audio controller chips show a significant stall during the system PM resume intermittently. It doesn't happen so often and usually it may read back successfully after one or more seconds, but in some rare worst cases the driver went into fallback mode. After trial-and-error, we found out that the communication stall seems covered by issuing the sync after each verb write, as already done for AMD and other chipsets. So this patch enables the write-sync flag for the recent Intel chips, Skylake and onward, as a workaround. Also, since Broxton and co have the very same driver flags as Skylake, refer to the Skylake driver flags instead of defining the same contents again for simplification. BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=201901 Reported-and-tested-by: Todd Brandt <todd.e.brandt@linux.intel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Takashi Iwai <tiwai@suse.de>
| * | | ALSA: ac97: Fix double free of ac97_codec_deviceDing Xiang2019-07-231-9/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | put_device will call ac97_codec_release to free ac97_codec_device and other resources, so remove the kfree and other redundant code. Fixes: 74426fbff66e ("ALSA: ac97: add an ac97 bus") Signed-off-by: Ding Xiang <dingxiang@cmss.chinamobile.com> Cc: <stable@vger.kernel.org> Signed-off-by: Takashi Iwai <tiwai@suse.de>
| * | | ALSA: compress: Be more restrictive about when a drain is allowedCharles Keepax2019-07-231-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Draining makes little sense in the situation of hardware overrun, as the hardware will have consumed all its available samples. Additionally, draining whilst the stream is paused would presumably get stuck as no data is being consumed on the DSP side. Signed-off-by: Charles Keepax <ckeepax@opensource.cirrus.com> Acked-by: Vinod Koul <vkoul@kernel.org> Signed-off-by: Takashi Iwai <tiwai@suse.de>
| * | | ALSA: compress: Don't allow paritial drain operations on capture streamsCharles Keepax2019-07-231-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Partial drain and next track are intended for gapless playback and don't really have an obvious interpretation for a capture stream, so makes sense to not allow those operations on capture streams. Signed-off-by: Charles Keepax <ckeepax@opensource.cirrus.com> Acked-by: Vinod Koul <vkoul@kernel.org> Signed-off-by: Takashi Iwai <tiwai@suse.de>
| * | | ALSA: compress: Prevent bypasses of set_paramsCharles Keepax2019-07-231-6/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, whilst in SNDRV_PCM_STATE_OPEN it is possible to call snd_compr_stop, snd_compr_drain and snd_compr_partial_drain, which allow a transition to SNDRV_PCM_STATE_SETUP. The stream should only be able to move to the setup state once it has received a SNDRV_COMPRESS_SET_PARAMS ioctl. Fix this issue by not allowing those ioctls whilst in the open state. Signed-off-by: Charles Keepax <ckeepax@opensource.cirrus.com> Acked-by: Vinod Koul <vkoul@kernel.org> Signed-off-by: Takashi Iwai <tiwai@suse.de>
| * | | ALSA: compress: Fix regression on compressed capture streamsCharles Keepax2019-07-232-9/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A previous fix to the stop handling on compressed capture streams causes some knock on issues. The previous fix updated snd_compr_drain_notify to set the state back to PREPARED for capture streams. This causes some issues however as the handling for snd_compr_poll differs between the two states and some user-space applications were relying on the poll failing after the stream had been stopped. To correct this regression whilst still fixing the original problem the patch was addressing, update the capture handling to skip the PREPARED state rather than skipping the SETUP state as it has done until now. Fixes: 4f2ab5e1d13d ("ALSA: compress: Fix stop handling on compressed capture streams") Signed-off-by: Charles Keepax <ckeepax@opensource.cirrus.com> Acked-by: Vinod Koul <vkoul@kernel.org> Signed-off-by: Takashi Iwai <tiwai@suse.de>
| * | | ALSA: line6: Fix a typoChristophe JAILLET2019-07-221-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | s/Vairax/Variax/ Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Signed-off-by: Takashi Iwai <tiwai@suse.de>
| * | | ALSA: pcm: Fix refcount_inc() on zero usageTakashi Iwai2019-07-191-4/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The recent rewrite of PCM link lock management introduced the refcount in snd_pcm_group object, managed by the kernel refcount_t API. This caused unexpected kernel warnings when the kernel is built with CONFIG_REFCOUNT_FULL=y. As the warning line indicates, the problem is obviously that we start with refcount=0 and do refcount_inc() for adding each PCM link, while refcount_t API doesn't like refcount_inc() performed on zero. For adapting the proper refcount_t usage, this patch changes the logic slightly: - The initial refcount is 1, assuming the single list entry - The refcount is incremented / decremented at each PCM link addition and deletion - ... which allows us concentrating only on the refcount as a release condition Fixes: f57f3df03a8e ("ALSA: pcm: More fine-grained PCM link locking") BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=204221 Reported-and-tested-by: Duncan Overbruck <kernel@duncano.de> Cc: <stable@vger.kernel.org> Signed-off-by: Takashi Iwai <tiwai@suse.de>
| * | | ALSA: line6: Fix wrong altsetting for LINE6_PODHD500_1Kai-Heng Feng2019-07-181-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 7b9584fa1c0b ("staging: line6: Move altsetting to properties") set a wrong altsetting for LINE6_PODHD500_1 during refactoring. Set the correct altsetting number to fix the issue. BugLink: https://bugs.launchpad.net/bugs/1790595 Fixes: 7b9584fa1c0b ("staging: line6: Move altsetting to properties") Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com> Cc: <stable@vger.kernel.org> Signed-off-by: Takashi Iwai <tiwai@suse.de>
| * | | ALSA: hda - Optimize resume for codecs without jack detectionTakashi Iwai2019-07-181-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The codecs without jack detection also don't have to be resumed forcibly because, obviously, they have no jack. Skip the forced resume in such a case as optimization as well. Reviewed-by: Kai Vehmanen <kai.vehmanen@linux.intel.com> Signed-off-by: Takashi Iwai <tiwai@suse.de>
* | | | Merge tag 'iommu-fixes-v5.3-rc1' of ↵Linus Torvalds2019-07-266-101/+220
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu Pull IOMMU fixes from Joerg Roedel: - revert an Intel VT-d patch that caused boot problems on some machines - fix AMD IOMMU interrupts with x2apic enabled - fix a potential crash when Intel VT-d domain allocation fails - fix crash in Intel VT-d driver when accessing a domain without a flush queue - formatting fix for new Intel VT-d debugfs code - fix for use-after-free bug in IOVA code - fix for a NULL-pointer dereference in Intel VT-d driver when PCI hotplug is used - compilation fix for one of the previous fixes * tag 'iommu-fixes-v5.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: iommu/amd: Add support for X2APIC IOMMU interrupts iommu/iova: Fix compilation error with !CONFIG_IOMMU_IOVA iommu/vt-d: Print pasid table entries MSB to LSB in debugfs iommu/iova: Remove stale cached32_node iommu/vt-d: Check if domain->pgd was allocated iommu/vt-d: Don't queue_iova() if there is no flush queue iommu/vt-d: Avoid duplicated pci dma alias consideration Revert "iommu/vt-d: Consolidate domain_init() to avoid duplication"
| * | | | iommu/amd: Add support for X2APIC IOMMU interruptsSuthikulpanit, Suravee2019-07-232-0/+99
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | AMD IOMMU requires IntCapXT registers to be setup in order to generate its own interrupts (for Event Log, PPR Log, and GA Log) with 32-bit APIC destination ID. Without this support, AMD IOMMU MSI interrupts will not be routed correctly when booting the system in X2APIC mode. Cc: Joerg Roedel <joro@8bytes.org> Fixes: 90fcffd9cf5e ('iommu/amd: Add support for IOMMU XT mode') Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Signed-off-by: Joerg Roedel <jroedel@suse.de>
| * | | | iommu/iova: Fix compilation error with !CONFIG_IOMMU_IOVAJoerg Roedel2019-07-231-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The stub function for !CONFIG_IOMMU_IOVA needs to be 'static inline'. Fixes: effa467870c76 ('iommu/vt-d: Don't queue_iova() if there is no flush queue') Signed-off-by: Joerg Roedel <jroedel@suse.de>
| * | | | iommu/vt-d: Print pasid table entries MSB to LSB in debugfsSai Praneeth Prakhya2019-07-221-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit dd5142ca5d24 ("iommu/vt-d: Add debugfs support to show scalable mode DMAR table internals") prints content of pasid table entries from LSB to MSB where as other entries are printed MSB to LSB. So, to maintain uniformity among all entries and to not confuse the user, print MSB first. Cc: Joerg Roedel <joro@8bytes.org> Cc: Lu Baolu <baolu.lu@linux.intel.com> Cc: Sohil Mehta <sohil.mehta@intel.com> Cc: Jacob Pan <jacob.jun.pan@linux.intel.com> Signed-off-by: Sai Praneeth Prakhya <sai.praneeth.prakhya@intel.com> Fixes: dd5142ca5d24 ("iommu/vt-d: Add debugfs support to show scalable mode DMAR table internals") Signed-off-by: Joerg Roedel <jroedel@suse.de>
| * | | | iommu/iova: Remove stale cached32_nodeChris Wilson2019-07-221-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since the cached32_node is allowed to be advanced above dma_32bit_pfn (to provide a shortcut into the limited range), we need to be careful to remove the to be freed node if it is the cached32_node. [ 48.477773] BUG: KASAN: use-after-free in __cached_rbnode_delete_update+0x68/0x110 [ 48.477812] Read of size 8 at addr ffff88870fc19020 by task kworker/u8:1/37 [ 48.477843] [ 48.477879] CPU: 1 PID: 37 Comm: kworker/u8:1 Tainted: G U 5.2.0+ #735 [ 48.477915] Hardware name: Intel Corporation NUC7i5BNK/NUC7i5BNB, BIOS BNKBL357.86A.0052.2017.0918.1346 09/18/2017 [ 48.478047] Workqueue: i915 __i915_gem_free_work [i915] [ 48.478075] Call Trace: [ 48.478111] dump_stack+0x5b/0x90 [ 48.478137] print_address_description+0x67/0x237 [ 48.478178] ? __cached_rbnode_delete_update+0x68/0x110 [ 48.478212] __kasan_report.cold.3+0x1c/0x38 [ 48.478240] ? __cached_rbnode_delete_update+0x68/0x110 [ 48.478280] ? __cached_rbnode_delete_update+0x68/0x110 [ 48.478308] __cached_rbnode_delete_update+0x68/0x110 [ 48.478344] private_free_iova+0x2b/0x60 [ 48.478378] iova_magazine_free_pfns+0x46/0xa0 [ 48.478403] free_iova_fast+0x277/0x340 [ 48.478443] fq_ring_free+0x15a/0x1a0 [ 48.478473] queue_iova+0x19c/0x1f0 [ 48.478597] cleanup_page_dma.isra.64+0x62/0xb0 [i915] [ 48.478712] __gen8_ppgtt_cleanup+0x63/0x80 [i915] [ 48.478826] __gen8_ppgtt_cleanup+0x42/0x80 [i915] [ 48.478940] __gen8_ppgtt_clear+0x433/0x4b0 [i915] [ 48.479053] __gen8_ppgtt_clear+0x462/0x4b0 [i915] [ 48.479081] ? __sg_free_table+0x9e/0xf0 [ 48.479116] ? kfree+0x7f/0x150 [ 48.479234] i915_vma_unbind+0x1e2/0x240 [i915] [ 48.479352] i915_vma_destroy+0x3a/0x280 [i915] [ 48.479465] __i915_gem_free_objects+0xf0/0x2d0 [i915] [ 48.479579] __i915_gem_free_work+0x41/0xa0 [i915] [ 48.479607] process_one_work+0x495/0x710 [ 48.479642] worker_thread+0x4c7/0x6f0 [ 48.479687] ? process_one_work+0x710/0x710 [ 48.479724] kthread+0x1b2/0x1d0 [ 48.479774] ? kthread_create_worker_on_cpu+0xa0/0xa0 [ 48.479820] ret_from_fork+0x1f/0x30 [ 48.479864] [ 48.479907] Allocated by task 631: [ 48.479944] save_stack+0x19/0x80 [ 48.479994] __kasan_kmalloc.constprop.6+0xc1/0xd0 [ 48.480038] kmem_cache_alloc+0x91/0xf0 [ 48.480082] alloc_iova+0x2b/0x1e0 [ 48.480125] alloc_iova_fast+0x58/0x376 [ 48.480166] intel_alloc_iova+0x90/0xc0 [ 48.480214] intel_map_sg+0xde/0x1f0 [ 48.480343] i915_gem_gtt_prepare_pages+0xb8/0x170 [i915] [ 48.480465] huge_get_pages+0x232/0x2b0 [i915] [ 48.480590] ____i915_gem_object_get_pages+0x40/0xb0 [i915] [ 48.480712] __i915_gem_object_get_pages+0x90/0xa0 [i915] [ 48.480834] i915_gem_object_prepare_write+0x2d6/0x330 [i915] [ 48.480955] create_test_object.isra.54+0x1a9/0x3e0 [i915] [ 48.481075] igt_shared_ctx_exec+0x365/0x3c0 [i915] [ 48.481210] __i915_subtests.cold.4+0x30/0x92 [i915] [ 48.481341] __run_selftests.cold.3+0xa9/0x119 [i915] [ 48.481466] i915_live_selftests+0x3c/0x70 [i915] [ 48.481583] i915_pci_probe+0xe7/0x220 [i915] [ 48.481620] pci_device_probe+0xe0/0x180 [ 48.481665] really_probe+0x163/0x4e0 [ 48.481710] device_driver_attach+0x85/0x90 [ 48.481750] __driver_attach+0xa5/0x180 [ 48.481796] bus_for_each_dev+0xda/0x130 [ 48.481831] bus_add_driver+0x205/0x2e0 [ 48.481882] driver_register+0xca/0x140 [ 48.481927] do_one_initcall+0x6c/0x1af [ 48.481970] do_init_module+0x106/0x350 [ 48.482010] load_module+0x3d2c/0x3ea0 [ 48.482058] __do_sys_finit_module+0x110/0x180 [ 48.482102] do_syscall_64+0x62/0x1f0 [ 48.482147] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 48.482190] [ 48.482224] Freed by task 37: [ 48.482273] save_stack+0x19/0x80 [ 48.482318] __kasan_slab_free+0x12e/0x180 [ 48.482363] kmem_cache_free+0x70/0x140 [ 48.482406] __free_iova+0x1d/0x30 [ 48.482445] fq_ring_free+0x15a/0x1a0 [ 48.482490] queue_iova+0x19c/0x1f0 [ 48.482624] cleanup_page_dma.isra.64+0x62/0xb0 [i915] [ 48.482749] __gen8_ppgtt_cleanup+0x63/0x80 [i915] [ 48.482873] __gen8_ppgtt_cleanup+0x42/0x80 [i915] [ 48.482999] __gen8_ppgtt_clear+0x433/0x4b0 [i915] [ 48.483123] __gen8_ppgtt_clear+0x462/0x4b0 [i915] [ 48.483250] i915_vma_unbind+0x1e2/0x240 [i915] [ 48.483378] i915_vma_destroy+0x3a/0x280 [i915] [ 48.483500] __i915_gem_free_objects+0xf0/0x2d0 [i915] [ 48.483622] __i915_gem_free_work+0x41/0xa0 [i915] [ 48.483659] process_one_work+0x495/0x710 [ 48.483704] worker_thread+0x4c7/0x6f0 [ 48.483748] kthread+0x1b2/0x1d0 [ 48.483787] ret_from_fork+0x1f/0x30 [ 48.483831] [ 48.483868] The buggy address belongs to the object at ffff88870fc19000 [ 48.483868] which belongs to the cache iommu_iova of size 40 [ 48.483920] The buggy address is located 32 bytes inside of [ 48.483920] 40-byte region [ffff88870fc19000, ffff88870fc19028) [ 48.483964] The buggy address belongs to the page: [ 48.484006] page:ffffea001c3f0600 refcount:1 mapcount:0 mapping:ffff8888181a91c0 index:0x0 compound_mapcount: 0 [ 48.484045] flags: 0x8000000000010200(slab|head) [ 48.484096] raw: 8000000000010200 ffffea001c421a08 ffffea001c447e88 ffff8888181a91c0 [ 48.484141] raw: 0000000000000000 0000000000120012 00000001ffffffff 0000000000000000 [ 48.484188] page dumped because: kasan: bad access detected [ 48.484230] [ 48.484265] Memory state around the buggy address: [ 48.484314] ffff88870fc18f00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 48.484361] ffff88870fc18f80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 48.484406] >ffff88870fc19000: fb fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc [ 48.484451] ^ [ 48.484494] ffff88870fc19080: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 48.484530] ffff88870fc19100: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=108602 Fixes: e60aa7b53845 ("iommu/iova: Extend rbtree node caching") Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Robin Murphy <robin.murphy@arm.com> Cc: Joerg Roedel <jroedel@suse.de> Cc: Joerg Roedel <joro@8bytes.org> Cc: <stable@vger.kernel.org> # v4.15+ Reviewed-by: Robin Murphy <robin.murphy@arm.com> Signed-off-by: Joerg Roedel <jroedel@suse.de>
| * | | | iommu/vt-d: Check if domain->pgd was allocatedDmitry Safonov2019-07-221-3/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is a couple of places where on domain_init() failure domain_exit() is called. While currently domain_init() can fail only if alloc_pgtable_page() has failed. Make domain_exit() check if domain->pgd present, before calling domain_unmap(), as it theoretically should crash on clearing pte entries in dma_pte_clear_level(). Cc: David Woodhouse <dwmw2@infradead.org> Cc: Joerg Roedel <joro@8bytes.org> Cc: Lu Baolu <baolu.lu@linux.intel.com> Cc: iommu@lists.linux-foundation.org Signed-off-by: Dmitry Safonov <dima@arista.com> Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Signed-off-by: Joerg Roedel <jroedel@suse.de>
| * | | | iommu/vt-d: Don't queue_iova() if there is no flush queueDmitry Safonov2019-07-223-5/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Intel VT-d driver was reworked to use common deferred flushing implementation. Previously there was one global per-cpu flush queue, afterwards - one per domain. Before deferring a flush, the queue should be allocated and initialized. Currently only domains with IOMMU_DOMAIN_DMA type initialize their flush queue. It's probably worth to init it for static or unmanaged domains too, but it may be arguable - I'm leaving it to iommu folks. Prevent queuing an iova flush if the domain doesn't have a queue. The defensive check seems to be worth to keep even if queue would be initialized for all kinds of domains. And is easy backportable. On 4.19.43 stable kernel it has a user-visible effect: previously for devices in si domain there were crashes, on sata devices: BUG: spinlock bad magic on CPU#6, swapper/0/1 lock: 0xffff88844f582008, .magic: 00000000, .owner: <none>/-1, .owner_cpu: 0 CPU: 6 PID: 1 Comm: swapper/0 Not tainted 4.19.43 #1 Call Trace: <IRQ> dump_stack+0x61/0x7e spin_bug+0x9d/0xa3 do_raw_spin_lock+0x22/0x8e _raw_spin_lock_irqsave+0x32/0x3a queue_iova+0x45/0x115 intel_unmap+0x107/0x113 intel_unmap_sg+0x6b/0x76 __ata_qc_complete+0x7f/0x103 ata_qc_complete+0x9b/0x26a ata_qc_complete_multiple+0xd0/0xe3 ahci_handle_port_interrupt+0x3ee/0x48a ahci_handle_port_intr+0x73/0xa9 ahci_single_level_irq_intr+0x40/0x60 __handle_irq_event_percpu+0x7f/0x19a handle_irq_event_percpu+0x32/0x72 handle_irq_event+0x38/0x56 handle_edge_irq+0x102/0x121 handle_irq+0x147/0x15c do_IRQ+0x66/0xf2 common_interrupt+0xf/0xf RIP: 0010:__do_softirq+0x8c/0x2df The same for usb devices that use ehci-pci: BUG: spinlock bad magic on CPU#0, swapper/0/1 lock: 0xffff88844f402008, .magic: 00000000, .owner: <none>/-1, .owner_cpu: 0 CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.19.43 #4 Call Trace: <IRQ> dump_stack+0x61/0x7e spin_bug+0x9d/0xa3 do_raw_spin_lock+0x22/0x8e _raw_spin_lock_irqsave+0x32/0x3a queue_iova+0x77/0x145 intel_unmap+0x107/0x113 intel_unmap_page+0xe/0x10 usb_hcd_unmap_urb_setup_for_dma+0x53/0x9d usb_hcd_unmap_urb_for_dma+0x17/0x100 unmap_urb_for_dma+0x22/0x24 __usb_hcd_giveback_urb+0x51/0xc3 usb_giveback_urb_bh+0x97/0xde tasklet_action_common.isra.4+0x5f/0xa1 tasklet_action+0x2d/0x30 __do_softirq+0x138/0x2df irq_exit+0x7d/0x8b smp_apic_timer_interrupt+0x10f/0x151 apic_timer_interrupt+0xf/0x20 </IRQ> RIP: 0010:_raw_spin_unlock_irqrestore+0x17/0x39 Cc: David Woodhouse <dwmw2@infradead.org> Cc: Joerg Roedel <joro@8bytes.org> Cc: Lu Baolu <baolu.lu@linux.intel.com> Cc: iommu@lists.linux-foundation.org Cc: <stable@vger.kernel.org> # 4.14+ Fixes: 13cf01744608 ("iommu/vt-d: Make use of iova deferred flushing") Signed-off-by: Dmitry Safonov <dima@arista.com> Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Signed-off-by: Joerg Roedel <jroedel@suse.de>
| * | | | iommu/vt-d: Avoid duplicated pci dma alias considerationLu Baolu2019-07-221-53/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As we have abandoned the home-made lazy domain allocation and delegated the DMA domain life cycle up to the default domain mechanism defined in the generic iommu layer, we needn't consider pci alias anymore when mapping/unmapping the context entries. Without this fix, we see kernel NULL pointer dereference during pci device hot-plug test. Cc: Ashok Raj <ashok.raj@intel.com> Cc: Jacob Pan <jacob.jun.pan@linux.intel.com> Cc: Kevin Tian <kevin.tian@intel.com> Fixes: fa954e6831789 ("iommu/vt-d: Delegate the dma domain to upper layer") Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com> Reported-and-tested-by: Xu Pengfei <pengfei.xu@intel.com> Signed-off-by: Joerg Roedel <jroedel@suse.de>
| * | | | Revert "iommu/vt-d: Consolidate domain_init() to avoid duplication"Joerg Roedel2019-07-221-36/+87
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit 123b2ffc376e1b3e9e015c75175b61e88a8b8518. This commit reportedly caused boot failures on some systems and needs to be reverted for now. Signed-off-by: Joerg Roedel <jroedel@suse.de>
* | | | | Merge branch 'for-linus-5.3' of ↵Linus Torvalds2019-07-262-2/+7
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/konrad/ibft Pull iscsi_ibft fix from Konrad Rzeszutek Wilk: "One tiny fix to enable iSCSI IBFT to be compiled under ARM" * 'for-linus-5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/ibft: iscsi_ibft: make ISCSI_IBFT depend on ACPI instead of ISCSI_IBFT_FIND