summaryrefslogtreecommitdiffstats
path: root/drivers/vfio/pci
Commit message (Collapse)AuthorAgeFilesLines
* Merge tag 'vfio-v6.3-rc1' of https://github.com/awilliam/linux-vfioLinus Torvalds2023-02-259-78/+321
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull VFIO updates from Alex Williamson: - Remove redundant resource check in vfio-platform (Angus Chen) - Use GFP_KERNEL_ACCOUNT for persistent userspace allocations, allowing removal of arbitrary kernel limits in favor of cgroup control (Yishai Hadas) - mdev tidy-ups, including removing the module-only build restriction for sample drivers, Kconfig changes to select mdev support, documentation movement to keep sample driver usage instructions with sample drivers rather than with API docs, remove references to out-of-tree drivers in docs (Christoph Hellwig) - Fix collateral breakages from mdev Kconfig changes (Arnd Bergmann) - Make mlx5 migration support match device support, improve source and target flows to improve pre-copy support and reduce downtime (Yishai Hadas) - Convert additional mdev sysfs case to use sysfs_emit() (Bo Liu) - Resolve copy-paste error in mdev mbochs sample driver Kconfig (Ye Xingchen) - Avoid propagating missing reset error in vfio-platform if reset requirement is relaxed by module option (Tomasz Duszynski) - Range size fixes in mlx5 variant driver for missed last byte and stricter range calculation (Yishai Hadas) - Fixes to suspended vaddr support and locked_vm accounting, excluding mdev configurations from the former due to potential to indefinitely block kernel threads, fix underflow and restore locked_vm on new mm (Steve Sistare) - Update outdated vfio documentation due to new IOMMUFD interfaces in recent kernels (Yi Liu) - Resolve deadlock between group_lock and kvm_lock, finally (Matthew Rosato) - Fix NULL pointer in group initialization error path with IOMMUFD (Yan Zhao) * tag 'vfio-v6.3-rc1' of https://github.com/awilliam/linux-vfio: (32 commits) vfio: Fix NULL pointer dereference caused by uninitialized group->iommufd docs: vfio: Update vfio.rst per latest interfaces vfio: Update the kdoc for vfio_device_ops vfio/mlx5: Fix range size calculation upon tracker creation vfio: no need to pass kvm pointer during device open vfio: fix deadlock between group lock and kvm lock vfio: revert "iommu driver notify callback" vfio/type1: revert "implement notify callback" vfio/type1: revert "block on invalid vaddr" vfio/type1: restore locked_vm vfio/type1: track locked_vm per dma vfio/type1: prevent underflow of locked_vm via exec() vfio/type1: exclude mdevs from VFIO_UPDATE_VADDR vfio: platform: ignore missing reset if disabled at module init vfio/mlx5: Improve the target side flow to reduce downtime vfio/mlx5: Improve the source side flow upon pre_copy vfio/mlx5: Check whether VF is migratable samples: fix the prompt about SAMPLE_VFIO_MDEV_MBOCHS vfio/mdev: Use sysfs_emit() to instead of sprintf() vfio-mdev: add back CONFIG_VFIO dependency ...
| * vfio/mlx5: Fix range size calculation upon tracker creationYishai Hadas2023-02-091-2/+2
| | | | | | | | | | | | | | | | | | | | | | Fix range size calculation to include the last byte of each range. In addition, log round up the length of the total ranges to be stricter. Fixes: c1d050b0d169 ("vfio/mlx5: Create and destroy page tracker object") Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://lore.kernel.org/r/20230208152234.32370-1-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
| * vfio/mlx5: Improve the target side flow to reduce downtimeYishai Hadas2023-01-302-12/+105
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Improve the target side flow to reduce downtime as of below. - Support reading an optional record which includes the expected stop_copy size. - Once the source sends this record data, which expects to be sent as part of the pre_copy flow, prepare the data buffers that may be large enough to hold the final stop_copy data. The above reduces the migration downtime as the relevant stuff that is needed to load the image data is prepared ahead as part of pre_copy. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://lore.kernel.org/r/20230124144955.139901-4-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
| * vfio/mlx5: Improve the source side flow upon pre_copyYishai Hadas2023-01-303-34/+151
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Improve the source side flow upon pre_copy as of below. - Prepare the stop_copy buffers as part of moving to pre_copy. - Send to the target a record that includes the expected stop_copy size to let it optimize its stop_copy flow as well. As for sending the target this new record type (i.e. MLX5_MIGF_HEADER_TAG_STOP_COPY_SIZE) we split the current 64 header flags bits into 32 flags bits and another 32 tag bits, each record may have a tag and a flag whether it's optional or mandatory. Optional records will be ignored in the target. The above reduces the downtime upon stop_copy as the relevant data stuff is prepared ahead as part of pre_copy. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://lore.kernel.org/r/20230124144955.139901-3-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
| * vfio/mlx5: Check whether VF is migratableShay Drory2023-01-302-0/+28
| | | | | | | | | | | | | | | | | | | | Add a check whether VF is migratable. Only if VF is migratable, mark the VFIO device as migration capable. Signed-off-by: Shay Drory <shayd@nvidia.com> Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://lore.kernel.org/r/20230124144955.139901-2-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
| * vfio/hisi: Use GFP_KERNEL_ACCOUNT for userspace persistent allocationsYishai Hadas2023-01-231-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Use GFP_KERNEL_ACCOUNT for userspace persistent allocations. The GFP_KERNEL_ACCOUNT option lets the memory allocator know that this is untrusted allocation triggered from userspace and should be a subject of kmem accounting, and as such it is controlled by the cgroup mechanism. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20230108154427.32609-5-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
| * vfio: Use GFP_KERNEL_ACCOUNT for userspace persistent allocationsJason Gunthorpe2023-01-235-12/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use GFP_KERNEL_ACCOUNT for userspace persistent allocations. The GFP_KERNEL_ACCOUNT option lets the memory allocator know that this is untrusted allocation triggered from userspace and should be a subject of kmem accounting, and as such it is controlled by the cgroup mechanism. The way to find the relevant allocations was for example to look at the close_device function and trace back all the kfrees to their allocations. Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20230108154427.32609-4-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
| * vfio/mlx5: Allow loading of larger images than 512 MBYishai Hadas2023-01-232-14/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Allow loading of larger images than 512 MB by dropping the arbitrary hard-coded value that we have today and move to use the max device loading value which is for now 4GB. As part of that we move to use the GFP_KERNEL_ACCOUNT option upon allocating the persistent data of mlx5 and rely on the cgroup to provide the memory limit for the given user. The GFP_KERNEL_ACCOUNT option lets the memory allocator know that this is untrusted allocation triggered from userspace and should be a subject of kmem accounting, and as such it is controlled by the cgroup mechanism. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20230108154427.32609-3-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
| * vfio/mlx5: Fix UBSAN noteYishai Hadas2023-01-231-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Prevent calling roundup_pow_of_two() with value of 0 as it causes the below UBSAN note. Move this code and its few extra related lines to be called only when it's really applicable. UBSAN: shift-out-of-bounds in ./include/linux/log2.h:57:13 shift exponent 64 is too large for 64-bit type 'long unsigned int' CPU: 15 PID: 1639 Comm: live_migration Not tainted 6.1.0-rc4 #1116 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x45/0x59 ubsan_epilogue+0x5/0x36 __ubsan_handle_shift_out_of_bounds.cold+0x61/0xef ? lock_is_held_type+0x98/0x110 ? rcu_read_lock_sched_held+0x3f/0x70 mlx5vf_create_rc_qp.cold+0xe4/0xf2 [mlx5_vfio_pci] mlx5vf_start_page_tracker+0x769/0xcd0 [mlx5_vfio_pci] vfio_device_fops_unl_ioctl+0x63f/0x700 [vfio] __x64_sys_ioctl+0x433/0x9a0 do_syscall_64+0x3d/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd </TASK> Fixes: 79c3cf279926 ("vfio/mlx5: Init QP based resources for dirty tracking") Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20230108154427.32609-2-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
* | mm: replace vma->vm_flags direct modifications with modifier callsSuren Baghdasaryan2023-02-091-1/+1
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Replace direct modifications to vma->vm_flags with calls to modifier functions to be able to track flag changes and to keep vma locking correctness. [akpm@linux-foundation.org: fix drivers/misc/open-dice.c, per Hyeonggon Yoo] Link: https://lkml.kernel.org/r/20230126193752.297968-5-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Acked-by: Sebastian Reichel <sebastian.reichel@collabora.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arjun Roy <arjunroy@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: David Rientjes <rientjes@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Joel Fernandes <joelaf@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Laurent Dufour <ldufour@linux.ibm.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Minchan Kim <minchan@google.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Peter Oskolkov <posk@google.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Punit Agrawal <punit.agrawal@bytedance.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Shakeel Butt <shakeelb@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Cc: Song Liu <songliubraving@fb.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* Merge tag 'vfio-v6.2-rc1' of https://github.com/awilliam/linux-vfioLinus Torvalds2022-12-156-244/+1222
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull VFIO updates from Alex Williamson: - Replace deprecated git://github.com link in MAINTAINERS (Palmer Dabbelt) - Simplify vfio/mlx5 with module_pci_driver() helper (Shang XiaoJing) - Drop unnecessary buffer from ACPI call (Rafael Mendonca) - Correct latent missing include issue in iova-bitmap and fix support for unaligned bitmaps. Follow-up with better fix through refactor (Joao Martins) - Rework ccw mdev driver to split private data from parent structure, better aligning with the mdev lifecycle and allowing us to remove a temporary workaround (Eric Farman) - Add an interface to get an estimated migration data size for a device, allowing userspace to make informed decisions, ex. more accurately predicting VM downtime (Yishai Hadas) - Fix minor typo in vfio/mlx5 array declaration (Yishai Hadas) - Simplify module and Kconfig through consolidating SPAPR/EEH code and config options and folding virqfd module into main vfio module (Jason Gunthorpe) - Fix error path from device_register() across all vfio mdev and sample drivers (Alex Williamson) - Define migration pre-copy interface and implement for vfio/mlx5 devices, allowing portions of the device state to be saved while the device continues operation, towards reducing the stop-copy state size (Jason Gunthorpe, Yishai Hadas, Shay Drory) - Implement pre-copy for hisi_acc devices (Shameer Kolothum) - Fixes to mdpy mdev driver remove path and error path on probe (Shang XiaoJing) - vfio/mlx5 fixes for incorrect return after copy_to_user() fault and incorrect buffer freeing (Dan Carpenter) * tag 'vfio-v6.2-rc1' of https://github.com/awilliam/linux-vfio: (42 commits) vfio/mlx5: error pointer dereference in error handling vfio/mlx5: fix error code in mlx5vf_precopy_ioctl() samples: vfio-mdev: Fix missing pci_disable_device() in mdpy_fb_probe() hisi_acc_vfio_pci: Enable PRE_COPY flag hisi_acc_vfio_pci: Move the dev compatibility tests for early check hisi_acc_vfio_pci: Introduce support for PRE_COPY state transitions hisi_acc_vfio_pci: Add support for precopy IOCTL vfio/mlx5: Enable MIGRATION_PRE_COPY flag vfio/mlx5: Fallback to STOP_COPY upon specific PRE_COPY error vfio/mlx5: Introduce multiple loads vfio/mlx5: Consider temporary end of stream as part of PRE_COPY vfio/mlx5: Introduce vfio precopy ioctl implementation vfio/mlx5: Introduce SW headers for migration states vfio/mlx5: Introduce device transitions of PRE_COPY vfio/mlx5: Refactor to use queue based data chunks vfio/mlx5: Refactor migration file state vfio/mlx5: Refactor MKEY usage vfio/mlx5: Refactor PD usage vfio/mlx5: Enforce a single SAVE command at a time vfio: Extend the device migration protocol with PRE_COPY ...
| * vfio/mlx5: error pointer dereference in error handlingDan Carpenter2022-12-121-1/+1
| | | | | | | | | | | | | | | | | | | | | | This code frees the wrong "buf" variable and results in an error pointer dereference. Fixes: 34e2f27143d1 ("vfio/mlx5: Introduce multiple loads") Signed-off-by: Dan Carpenter <error27@gmail.com> Reviewed-by: Yishai Hadas <yishaih@nvidia.com> Link: https://lore.kernel.org/r/Y5IKia5SaiVxYmG5@kili Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
| * vfio/mlx5: fix error code in mlx5vf_precopy_ioctl()Dan Carpenter2022-12-121-1/+4
| | | | | | | | | | | | | | | | | | | | | | The copy_to_user() function returns the number of bytes remaining to be copied but we want to return a negative error code here. Fixes: 0dce165b1adf ("vfio/mlx5: Introduce vfio precopy ioctl implementation") Signed-off-by: Dan Carpenter <error27@gmail.com> Reviewed-by: Yishai Hadas <yishaih@nvidia.com> Link: https://lore.kernel.org/r/Y5IKVknlf5Z5NPtU@kili Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
| * hisi_acc_vfio_pci: Enable PRE_COPY flagShameer Kolothum2022-12-061-1/+1
| | | | | | | | | | | | | | | | | | Now that we have everything to support the PRE_COPY state, enable it. Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> Link: https://lore.kernel.org/r/20221123113236.896-5-shameerali.kolothum.thodi@huawei.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
| * hisi_acc_vfio_pci: Move the dev compatibility tests for early checkShameer Kolothum2022-12-062-12/+8
| | | | | | | | | | | | | | | | | | | | Instead of waiting till data transfer is complete to perform dev compatibility, do it as soon as we have enough data to perform the check. This will be useful when we enable the support for PRE_COPY. Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> Link: https://lore.kernel.org/r/20221123113236.896-4-shameerali.kolothum.thodi@huawei.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
| * hisi_acc_vfio_pci: Introduce support for PRE_COPY state transitionsShameer Kolothum2022-12-061-3/+71
| | | | | | | | | | | | | | | | | | | | The saving_migf is open in PRE_COPY state if it is supported and reads initial device match data. hisi_acc_vf_stop_copy() is refactored to make use of common code. Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> Link: https://lore.kernel.org/r/20221123113236.896-3-shameerali.kolothum.thodi@huawei.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
| * hisi_acc_vfio_pci: Add support for precopy IOCTLShameer Kolothum2022-12-062-0/+53
| | | | | | | | | | | | | | | | | | PRECOPY IOCTL in the case of HiSiIicon ACC driver can be used to perform the device compatibility check earlier during migration. Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> Link: https://lore.kernel.org/r/20221123113236.896-2-shameerali.kolothum.thodi@huawei.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
| * vfio/mlx5: Enable MIGRATION_PRE_COPY flagShay Drory2022-12-061-0/+5
| | | | | | | | | | | | | | | | | | | | Now that everything has been set up for MIGRATION_PRE_COPY, enable it. Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://lore.kernel.org/r/20221206083438.37807-15-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
| * vfio/mlx5: Fallback to STOP_COPY upon specific PRE_COPY errorShay Drory2022-12-063-3/+32
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Before a SAVE command is issued, a QUERY command is issued in order to know the device data size. In case PRE_COPY is used, the above commands are issued while the device is running. Thus, it is possible that between the QUERY and the SAVE commands the state of the device will be changed significantly and thus the SAVE will fail. Currently, if a SAVE command is failing, the driver will fail the migration. In the above case, don't fail the migration, but don't allow for new SAVEs to be executed while the device is in a RUNNING state. Once the device will be moved to STOP_COPY, SAVE can be executed again and the full device state will be read. Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://lore.kernel.org/r/20221206083438.37807-14-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
| * vfio/mlx5: Introduce multiple loadsYishai Hadas2022-12-063-45/+257
| | | | | | | | | | | | | | | | | | | | | | | | | | | | In order to support PRE_COPY, mlx5 driver transfers multiple states (images) of the device. e.g.: the source VF can save and transfer multiple states, and the target VF will load them by that order. This patch implements the changes for the target VF to decompose the header for each state and to write and load multiple states. Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://lore.kernel.org/r/20221206083438.37807-13-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
| * vfio/mlx5: Consider temporary end of stream as part of PRE_COPYYishai Hadas2022-12-063-2/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | During PRE_COPY the migration data FD may have a temporary "end of stream" that is reached when the initial_bytes were read and no other dirty data exists yet. For instance, this may indicate that the device is idle and not currently dirtying any internal state. When read() is done on this temporary end of stream the kernel driver should return ENOMSG from read(). Userspace can wait for more data or consider moving to STOP_COPY. To not block the user upon read() and let it get ENOMSG we add a new state named MLX5_MIGF_STATE_PRE_COPY on the migration file. In addition, we add the MLX5_MIGF_STATE_SAVE_LAST state to block the read() once we call the last SAVE upon moving to STOP_COPY. Any further error will be marked with MLX5_MIGF_STATE_ERROR and the user won't be blocked. Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://lore.kernel.org/r/20221206083438.37807-12-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
| * vfio/mlx5: Introduce vfio precopy ioctl implementationYishai Hadas2022-12-062-0/+127
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | vfio precopy ioctl returns an estimation of data available for transferring from the device. Whenever a user is using VFIO_MIG_GET_PRECOPY_INFO, track the current state of the device, and if needed, append the dirty data to the transfer FD data. This is done by saving a middle state. As mlx5 runs the SAVE command asynchronously, make sure to query for incremental data only once there is no active save command. Running both in parallel, might end-up with a failure in the incremental query command on un-tracked vhca. Also, a middle state will be saved only after the previous state has finished its SAVE command and has been fully transferred, this prevents endless use resources. Co-developed-by: Shay Drory <shayd@nvidia.com> Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://lore.kernel.org/r/20221206083438.37807-11-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
| * vfio/mlx5: Introduce SW headers for migration statesYishai Hadas2022-12-063-4/+67
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As mentioned in the previous patches, mlx5 is transferring multiple states when the PRE_COPY protocol is used. This states mechanism requires the target VM to know the states' size in order to execute multiple loads. Therefore, add SW header, with the needed information, for each saved state the source VM is transferring to the target VM. This patch implements the source VM handling of the headers, following patch will implement the target VM handling of the headers. Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://lore.kernel.org/r/20221206083438.37807-10-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
| * vfio/mlx5: Introduce device transitions of PRE_COPYYishai Hadas2022-12-063-18/+184
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In order to support PRE_COPY, mlx5 driver is transferring multiple states (images) of the device. e.g.: the source VF can save and transfer multiple states, and the target VF will load them by that order. The device is saving three kinds of states: 1) Initial state - when the device moves to PRE_COPY state. 2) Middle state - during PRE_COPY phase via VFIO_MIG_GET_PRECOPY_INFO. There can be multiple states of this type. 3) Final state - when the device moves to STOP_COPY state. After moving to PRE_COPY state, user is holding the saving migf FD and can use it. For example: user can start transferring data via read() callback. Also, user can switch from PRE_COPY to STOP_COPY whenever he sees it fits. This will invoke saving of final state. This means that mlx5 VFIO device can be switched to STOP_COPY without transferring any data in PRE_COPY state. Therefore, when the device moves to STOP_COPY, mlx5 will store the final state on a dedicated queue entry on the list. Co-developed-by: Shay Drory <shayd@nvidia.com> Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://lore.kernel.org/r/20221206083438.37807-9-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
| * vfio/mlx5: Refactor to use queue based data chunksYishai Hadas2022-12-063-38/+136
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Refactor to use queue based data chunks on the migration file. The SAVE command adds a chunk to the tail of the queue while the read() API finds the required chunk and returns its data. In case the queue is empty but the state of the migration file is MLX5_MIGF_STATE_COMPLETE, read() may not be blocked but will return 0 to indicate end of file. This is a step towards maintaining multiple images and their meta data (i.e. headers) on the migration file as part of next patches from the series. Note: At that point, we still use a single chunk on the migration file but becomes ready to support multiple. Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://lore.kernel.org/r/20221206083438.37807-8-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
| * vfio/mlx5: Refactor migration file stateYishai Hadas2022-12-063-8/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Refactor migration file state to be an emum which is mutual exclusive. As of that dropped the 'disabled' state as 'error' is the same from functional point of view. Next patches from the series will extend this enum for other relevant states. Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://lore.kernel.org/r/20221206083438.37807-7-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
| * vfio/mlx5: Refactor MKEY usageYishai Hadas2022-12-063-113/+178
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch refactors MKEY usage such as its life cycle will be as of the migration file instead of allocating/destroying it upon each SAVE/LOAD command. This is a preparation step towards the PRE_COPY series where multiple images will be SAVED/LOADED. We achieve it by having a new struct named mlx5_vhca_data_buffer which holds the mkey and its related stuff as of sg_append_table, allocated_length, etc. The above fields were taken out from the migration file main struct, into mlx5_vhca_data_buffer dedicated struct with the proper helpers in place. For now we have a single mlx5_vhca_data_buffer per migration file. However, in coming patches we'll have multiple of them to support multiple images. Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://lore.kernel.org/r/20221206083438.37807-6-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
| * vfio/mlx5: Refactor PD usageYishai Hadas2022-12-063-31/+71
| | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch refactors PD usage such as its life cycle will be as of the migration file instead of allocating/destroying it upon each SAVE/LOAD command. This is a preparation step towards the PRE_COPY series where multiple images will be SAVED/LOADED and a single PD can be simply reused. Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://lore.kernel.org/r/20221206083438.37807-5-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
| * vfio/mlx5: Enforce a single SAVE command at a timeYishai Hadas2022-12-063-0/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Enforce a single SAVE command at a time. As the SAVE command is an asynchronous one, we must enforce running only a single command at a time. This will preserve ordering between multiple calls and protect from races on the migration file data structure. This is a must for the next patches from the series where as part of PRE_COPY we may have multiple images to be saved and multiple SAVE commands may be issued from different flows. Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://lore.kernel.org/r/20221206083438.37807-4-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
| * vfio: Remove CONFIG_VFIO_SPAPR_EEHJason Gunthorpe2022-12-051-3/+3
| | | | | | | | | | | | | | | | | | | | | | We don't need a kconfig symbol for this, just directly test CONFIG_EEH in the few places that need it. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/4-v5-fc5346cacfd4+4c482-vfio_modules_jgg@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
| * vfio/pci: Move all the SPAPR PCI specific logic to vfio_pci_core.koJason Gunthorpe2022-12-051-2/+9
| | | | | | | | | | | | | | | | | | | | | | | | The vfio_spapr_pci_eeh_open/release() functions are one line wrappers around an arch function. Just call them directly. This eliminates some weird exported symbols that don't need to exist. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Link: https://lore.kernel.org/r/1-v5-fc5346cacfd4+4c482-vfio_modules_jgg@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
| * vfio/mlx5: Fix a typo in mlx5vf_cmd_load_vhca_state()Yishai Hadas2022-11-141-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix a typo in mlx5vf_cmd_load_vhca_state() to use the 'load' memory layout. As in/out sizes are equal for save and load commands there wasn't any functional issue. Fixes: f1d98f346ee3 ("vfio/mlx5: Expose migration commands over mlx5 device") Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20221106174630.25909-3-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
| * vfio: Add an option to get migration data sizeYishai Hadas2022-11-143-1/+29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add an option to get migration data size by introducing a new migration feature named VFIO_DEVICE_FEATURE_MIG_DATA_SIZE. Upon VFIO_DEVICE_FEATURE_GET the estimated data length that will be required to complete STOP_COPY is returned. This option may better enable user space to consider before moving to STOP_COPY whether it can meet the downtime SLA based on the returned data. The patch also includes the implementation for mlx5 and hisi for this new option to make it feature complete for the existing drivers in this area. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Longfang Liu <liulongfang@huawei.com> Link: https://lore.kernel.org/r/20221106174630.25909-2-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
| * vfio: Remove vfio_free_deviceEric Farman2022-11-101-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With the "mess" sorted out, we should be able to inline the vfio_free_device call introduced by commit cb9ff3f3b84c ("vfio: Add helpers for unifying vfio_device life cycle") and remove them from driver release callbacks. Signed-off-by: Eric Farman <farman@linux.ibm.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Reviewed-by: Tony Krowiak <akrowiak@linux.ibm.com> # vfio-ap part Reviewed-by: Matthew Rosato <mjrosato@linux.ibm.com> Link: https://lore.kernel.org/r/20221104142007.1314999-8-farman@linux.ibm.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
| * vfio/mlx5: Switch to use module_pci_driver() macroShang XiaoJing2022-11-091-12/+1
| | | | | | | | | | | | | | | | | | | | | | Since pci provides the helper macro module_pci_driver(), we may replace the module_init/exit with it. Signed-off-by: Shang XiaoJing <shangxiaojing@huawei.com> Reviewed-by: Yishai Hadas <yishaih@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20220922123507.11222-1-shangxiaojing@huawei.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
* | Merge tag 'v6.1-rc7' into iommufd.git for-nextJason Gunthorpe2022-12-021-5/+5
|\ \ | | | | | | | | | | | | | | | | | | | | | Resolve conflicts in drivers/vfio/vfio_main.c by using the iommfd version. The rc fix was done a different way when iommufd patches reworked this code. Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
| * | vfio/pci: Check the device set open count on resetAnthony DeRossi2022-11-101-5/+5
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | vfio_pci_dev_set_needs_reset() inspects the open_count of every device in the set to determine whether a reset is allowed. The current device always has open_count == 1 within vfio_pci_core_disable(), effectively disabling the reset logic. This field is also documented as private in vfio_device, so it should not be used to determine whether other devices in the set are open. Checking for vfio_device_set_open_count() > 1 on the device set fixes both issues. After commit 2cd8b14aaa66 ("vfio/pci: Move to the device set infrastructure"), failure to create a new file for a device would cause the reset to be skipped due to open_count being decremented after calling close_device() in the error path. After commit eadd86f835c6 ("vfio: Remove calls to vfio_group_add_container_user()"), releasing a device would always skip the reset due to an ordering change in vfio_device_fops_release(). Failing to reset the device leaves it in an unknown state, potentially causing errors when it is accessed later or bound to a different driver. This issue was observed with a Radeon RX Vega 56 [1002:687f] (rev c3) assigned to a Windows guest. After shutting down the guest, unbinding the device from vfio-pci, and binding the device to amdgpu: [ 548.007102] [drm:psp_hw_start [amdgpu]] *ERROR* PSP create ring failed! [ 548.027174] [drm:psp_hw_init [amdgpu]] *ERROR* PSP firmware loading failed [ 548.027242] [drm:amdgpu_device_fw_loading [amdgpu]] *ERROR* hw_init of IP block <psp> failed -22 [ 548.027306] amdgpu 0000:0a:00.0: amdgpu: amdgpu_device_ip_init failed [ 548.027308] amdgpu 0000:0a:00.0: amdgpu: Fatal error during GPU init Fixes: 2cd8b14aaa66 ("vfio/pci: Move to the device set infrastructure") Fixes: eadd86f835c6 ("vfio: Remove calls to vfio_group_add_container_user()") Signed-off-by: Anthony DeRossi <ajderossi@gmail.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Link: https://lore.kernel.org/r/20221110014027.28780-4-ajderossi@gmail.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
* / vfio-iommufd: Support iommufd for physical VFIO devicesJason Gunthorpe2022-12-023-0/+12
|/ | | | | | | | | | | | | | | | | | | | | | | | This creates the iommufd_device for the physical VFIO drivers. These are all the drivers that are calling vfio_register_group_dev() and expect the type1 code to setup a real iommu_domain against their parent struct device. The design gives the driver a choice in how it gets connected to iommufd by providing bind_iommufd/unbind_iommufd/attach_ioas callbacks to implement as required. The core code provides three default callbacks for physical mode using a real iommu_domain. This is suitable for drivers using vfio_register_group_dev() Link: https://lore.kernel.org/r/6-v4-42cd2eb0e3eb+335a-vfio_iommufd_jgg@nvidia.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Alex Williamson <alex.williamson@redhat.com> Tested-by: Alex Williamson <alex.williamson@redhat.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Yi Liu <yi.l.liu@intel.com> Tested-by: Lixiao Yang <lixiao.yang@intel.com> Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> Tested-by: Yu He <yu.he@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
* vfio: Add vfio_file_is_group()Jason Gunthorpe2022-10-071-1/+1
| | | | | | | | | | | | | | | | This replaces uses of vfio_file_iommu_group() which were only detecting if the file is a VFIO file with no interest in the actual group. The only remaning user of vfio_file_iommu_group() is in KVM for the SPAPR stuff. It passes the iommu_group into the arch code through kvm for some reason. Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> Tested-by: Christian Borntraeger <borntraeger@de.ibm.com> Tested-by: Eric Farman <farman@linux.ibm.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/1-v2-15417f29324e+1c-vfio_group_disassociate_jgg@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
* hisi_acc_vfio_pci: Update some log and comment formatsLongfang Liu2022-09-272-12/+12
| | | | | | | | | | | | 1. Modify some annotation information formats to keep the entire driver annotation format consistent. 2. Modify some log description formats to be consistent with the format of the entire driver log. Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Longfang Liu <liulongfang@huawei.com> Link: https://lore.kernel.org/r/20220926093332.28824-6-liulongfang@huawei.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
* hisi_acc_vfio_pci: Remove useless macro definitionsLongfang Liu2022-09-271-1/+0
| | | | | | | | | | The QM_QUE_ISO_CFG macro definition is no longer used and needs to be deleted from the current driver. Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Longfang Liu <liulongfang@huawei.com> Link: https://lore.kernel.org/r/20220926093332.28824-5-liulongfang@huawei.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
* hisi_acc_vfio_pci: Remove useless function parameterLongfang Liu2022-09-271-3/+5
| | | | | | | | | | | Remove unused function parameters for vf_qm_fun_reset() and ensure the device is enabled before the reset operation is performed. Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Longfang Liu <liulongfang@huawei.com> Link: https://lore.kernel.org/r/20220926093332.28824-4-liulongfang@huawei.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
* hisi_acc_vfio_pci: Fix device data address combination problemLongfang Liu2022-09-271-4/+4
| | | | | | | | | | | The queue address of the accelerator device should be combined into a dma address in a way of combining the low and high bits. The previous combination is wrong and needs to be modified. Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Longfang Liu <liulongfang@huawei.com> Link: https://lore.kernel.org/r/20220926093332.28824-3-liulongfang@huawei.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
* hisi_acc_vfio_pci: Fixes error return code issueLongfang Liu2022-09-271-1/+1
| | | | | | | | | | | | | | | During the process of compatibility and matching of live migration device information, if the isolation status of the two devices is inconsistent, the live migration needs to be exited. The current driver does not return the error code correctly and needs to be fixed. Reviewed-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Longfang Liu <liulongfang@huawei.com> Link: https://lore.kernel.org/r/20220926093332.28824-2-liulongfang@huawei.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
* vfio/hisi_acc: Use the new device life cycle helpersYi Liu2022-09-212-73/+37
| | | | | | | | | | | | | | Tidy up @probe so all migration specific initialization logic is moved to migration specific @init callback. Remove vfio_pci_core_{un}init_device() given no user now. Signed-off-by: Yi Liu <yi.l.liu@intel.com> Signed-off-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> Link: https://lore.kernel.org/r/20220921104401.38898-5-kevin.tian@intel.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
* vfio/mlx5: Use the new device life cycle helpersYi Liu2022-09-211-14/+36
| | | | | | | | | | mlx5 has its own @init/@release for handling migration cap. Signed-off-by: Yi Liu <yi.l.liu@intel.com> Signed-off-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20220921104401.38898-4-kevin.tian@intel.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
* vfio/pci: Use the new device life cycle helpersYi Liu2022-09-212-10/+45
| | | | | | | | | | | | | Also introduce two pci core helpers as @init/@release for pci drivers: - vfio_pci_core_init_dev() - vfio_pci_core_release_dev() Signed-off-by: Yi Liu <yi.l.liu@intel.com> Signed-off-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20220921104401.38898-3-kevin.tian@intel.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
* vfio/mlx5: Set the driver DMA logging callbacksYishai Hadas2022-09-083-3/+14
| | | | | | | | | Now that everything is ready set the driver DMA logging callbacks if supported by the device. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://lore.kernel.org/r/20220908183448.195262-11-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
* vfio/mlx5: Manage error scenarios on trackerYishai Hadas2022-09-082-2/+61
| | | | | | | | | Handle async error events and health/recovery flow to safely stop the tracker upon error scenarios. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://lore.kernel.org/r/20220908183448.195262-10-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
* vfio/mlx5: Report dirty pages from trackerYishai Hadas2022-09-082-0/+195
| | | | | | | | | | | | | | | | | | | | | Report dirty pages from tracker. It includes: Querying for dirty pages in a given IOVA range, this is done by modifying the tracker into the reporting state and supplying the required range. Using the CQ event completion mechanism to be notified once data is ready on the CQ/QP to be processed. Once data is available turn on the corresponding bits in the bit map. This functionality will be used as part of the 'log_read_and_clear' driver callback in the next patches. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://lore.kernel.org/r/20220908183448.195262-9-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>