summaryrefslogtreecommitdiffstats
path: root/drivers/vfio
Commit message (Collapse)AuthorAgeFilesLines
* Merge tag 'vfio-v6.3-rc1' of https://github.com/awilliam/linux-vfioLinus Torvalds2023-02-2521-271/+562
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull VFIO updates from Alex Williamson: - Remove redundant resource check in vfio-platform (Angus Chen) - Use GFP_KERNEL_ACCOUNT for persistent userspace allocations, allowing removal of arbitrary kernel limits in favor of cgroup control (Yishai Hadas) - mdev tidy-ups, including removing the module-only build restriction for sample drivers, Kconfig changes to select mdev support, documentation movement to keep sample driver usage instructions with sample drivers rather than with API docs, remove references to out-of-tree drivers in docs (Christoph Hellwig) - Fix collateral breakages from mdev Kconfig changes (Arnd Bergmann) - Make mlx5 migration support match device support, improve source and target flows to improve pre-copy support and reduce downtime (Yishai Hadas) - Convert additional mdev sysfs case to use sysfs_emit() (Bo Liu) - Resolve copy-paste error in mdev mbochs sample driver Kconfig (Ye Xingchen) - Avoid propagating missing reset error in vfio-platform if reset requirement is relaxed by module option (Tomasz Duszynski) - Range size fixes in mlx5 variant driver for missed last byte and stricter range calculation (Yishai Hadas) - Fixes to suspended vaddr support and locked_vm accounting, excluding mdev configurations from the former due to potential to indefinitely block kernel threads, fix underflow and restore locked_vm on new mm (Steve Sistare) - Update outdated vfio documentation due to new IOMMUFD interfaces in recent kernels (Yi Liu) - Resolve deadlock between group_lock and kvm_lock, finally (Matthew Rosato) - Fix NULL pointer in group initialization error path with IOMMUFD (Yan Zhao) * tag 'vfio-v6.3-rc1' of https://github.com/awilliam/linux-vfio: (32 commits) vfio: Fix NULL pointer dereference caused by uninitialized group->iommufd docs: vfio: Update vfio.rst per latest interfaces vfio: Update the kdoc for vfio_device_ops vfio/mlx5: Fix range size calculation upon tracker creation vfio: no need to pass kvm pointer during device open vfio: fix deadlock between group lock and kvm lock vfio: revert "iommu driver notify callback" vfio/type1: revert "implement notify callback" vfio/type1: revert "block on invalid vaddr" vfio/type1: restore locked_vm vfio/type1: track locked_vm per dma vfio/type1: prevent underflow of locked_vm via exec() vfio/type1: exclude mdevs from VFIO_UPDATE_VADDR vfio: platform: ignore missing reset if disabled at module init vfio/mlx5: Improve the target side flow to reduce downtime vfio/mlx5: Improve the source side flow upon pre_copy vfio/mlx5: Check whether VF is migratable samples: fix the prompt about SAMPLE_VFIO_MDEV_MBOCHS vfio/mdev: Use sysfs_emit() to instead of sprintf() vfio-mdev: add back CONFIG_VFIO dependency ...
| * vfio: Fix NULL pointer dereference caused by uninitialized group->iommufdYan Zhao2023-02-221-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | group->iommufd is not initialized for the iommufd_ctx_put() [20018.331541] BUG: kernel NULL pointer dereference, address: 0000000000000000 [20018.377508] RIP: 0010:iommufd_ctx_put+0x5/0x10 [iommufd] ... [20018.476483] Call Trace: [20018.479214] <TASK> [20018.481555] vfio_group_fops_unl_ioctl+0x506/0x690 [vfio] [20018.487586] __x64_sys_ioctl+0x6a/0xb0 [20018.491773] ? trace_hardirqs_on+0xc5/0xe0 [20018.496347] do_syscall_64+0x67/0x90 [20018.500340] entry_SYSCALL_64_after_hwframe+0x4b/0xb5 Fixes: 9eefba8002c2 ("vfio: Move vfio group specific code into group.c") Cc: stable@vger.kernel.org Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Yi Liu <yi.l.liu@intel.com> Link: https://lore.kernel.org/r/20230222074938.13681-1-yan.y.zhao@intel.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
| * vfio/mlx5: Fix range size calculation upon tracker creationYishai Hadas2023-02-091-2/+2
| | | | | | | | | | | | | | | | | | | | | | Fix range size calculation to include the last byte of each range. In addition, log round up the length of the total ranges to be stricter. Fixes: c1d050b0d169 ("vfio/mlx5: Create and destroy page tracker object") Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://lore.kernel.org/r/20230208152234.32370-1-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
| * vfio: no need to pass kvm pointer during device openMatthew Rosato2023-02-093-7/+5
| | | | | | | | | | | | | | | | | | | | | | | | Nothing uses this value during vfio_device_open anymore so it's safe to remove it. Signed-off-by: Matthew Rosato <mjrosato@linux.ibm.com> Tested-by: Tony Krowiak <akrowiak@linux.ibm.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Yi Liu <yi.l.liu@intel.com> Link: https://lore.kernel.org/r/20230203215027.151988-3-mjrosato@linux.ibm.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
| * vfio: fix deadlock between group lock and kvm lockMatthew Rosato2023-02-093-14/+108
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After 51cdc8bc120e, we have another deadlock scenario between the kvm->lock and the vfio group_lock with two different codepaths acquiring the locks in different order. Specifically in vfio_open_device, vfio holds the vfio group_lock when issuing device->ops->open_device but some drivers (like vfio-ap) need to acquire kvm->lock during their open_device routine; Meanwhile, kvm_vfio_release will acquire the kvm->lock first before calling vfio_file_set_kvm which will acquire the vfio group_lock. To resolve this, let's remove the need for the vfio group_lock from the kvm_vfio_release codepath. This is done by introducing a new spinlock to protect modifications to the vfio group kvm pointer, and acquiring a kvm ref from within vfio while holding this spinlock, with the reference held until the last close for the device in question. Fixes: 51cdc8bc120e ("kvm/vfio: Fix potential deadlock on vfio group_lock") Reported-by: Anthony Krowiak <akrowiak@linux.ibm.com> Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Matthew Rosato <mjrosato@linux.ibm.com> Tested-by: Tony Krowiak <akrowiak@linux.ibm.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Yi Liu <yi.l.liu@intel.com> Link: https://lore.kernel.org/r/20230203215027.151988-2-mjrosato@linux.ibm.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
| * vfio: revert "iommu driver notify callback"Steve Sistare2023-02-092-12/+0
| | | | | | | | | | | | | | | | | | | | | | Revert this dead code: commit ec5e32940cc9 ("vfio: iommu driver notify callback") Signed-off-by: Steve Sistare <steven.sistare@oracle.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/1675184289-267876-8-git-send-email-steven.sistare@oracle.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
| * vfio/type1: revert "implement notify callback"Steve Sistare2023-02-091-15/+0
| | | | | | | | | | | | | | | | | | | | | | This is dead code. Revert it. commit 487ace134053 ("vfio/type1: implement notify callback") Signed-off-by: Steve Sistare <steven.sistare@oracle.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/1675184289-267876-7-git-send-email-steven.sistare@oracle.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
| * vfio/type1: revert "block on invalid vaddr"Steve Sistare2023-02-091-89/+5
| | | | | | | | | | | | | | | | | | | | | | Revert this dead code: commit 898b9eaeb3fe ("vfio/type1: block on invalid vaddr") Signed-off-by: Steve Sistare <steven.sistare@oracle.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/1675184289-267876-6-git-send-email-steven.sistare@oracle.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
| * vfio/type1: restore locked_vmSteve Sistare2023-02-091-0/+35
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a vfio container is preserved across exec or fork-exec, the new task's mm has a locked_vm count of 0. After a dma vaddr is updated using VFIO_DMA_MAP_FLAG_VADDR, locked_vm remains 0, and the pinned memory does not count against the task's RLIMIT_MEMLOCK. To restore the correct locked_vm count, when VFIO_DMA_MAP_FLAG_VADDR is used and the dma's mm has changed, add the dma's locked_vm count to the new mm->locked_vm, subject to the rlimit, and subtract it from the old mm->locked_vm. Fixes: c3cbab24db38 ("vfio/type1: implement interfaces to update vaddr") Cc: stable@vger.kernel.org Signed-off-by: Steve Sistare <steven.sistare@oracle.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/1675184289-267876-5-git-send-email-steven.sistare@oracle.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
| * vfio/type1: track locked_vm per dmaSteve Sistare2023-02-091-6/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | Track locked_vm per dma struct, and create a new subroutine, both for use in a subsequent patch. No functional change. Fixes: c3cbab24db38 ("vfio/type1: implement interfaces to update vaddr") Cc: stable@vger.kernel.org Signed-off-by: Steve Sistare <steven.sistare@oracle.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/1675184289-267876-4-git-send-email-steven.sistare@oracle.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
| * vfio/type1: prevent underflow of locked_vm via exec()Steve Sistare2023-02-091-27/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a vfio container is preserved across exec, the task does not change, but it gets a new mm with locked_vm=0, and loses the count from existing dma mappings. If the user later unmaps a dma mapping, locked_vm underflows to a large unsigned value, and a subsequent dma map request fails with ENOMEM in __account_locked_vm. To avoid underflow, grab and save the mm at the time a dma is mapped. Use that mm when adjusting locked_vm, rather than re-acquiring the saved task's mm, which may have changed. If the saved mm is dead, do nothing. locked_vm is incremented for existing mappings in a subsequent patch. Fixes: 73fa0d10d077 ("vfio: Type1 IOMMU implementation") Cc: stable@vger.kernel.org Signed-off-by: Steve Sistare <steven.sistare@oracle.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/1675184289-267876-3-git-send-email-steven.sistare@oracle.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
| * vfio/type1: exclude mdevs from VFIO_UPDATE_VADDRSteve Sistare2023-02-091-2/+42
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Disable the VFIO_UPDATE_VADDR capability if mediated devices are present. Their kernel threads could be blocked indefinitely by a misbehaving userland while trying to pin/unpin pages while vaddrs are being updated. Do not allow groups to be added to the container while vaddr's are invalid, so we never need to block user threads from pinning, and can delete the vaddr-waiting code in a subsequent patch. Fixes: c3cbab24db38 ("vfio/type1: implement interfaces to update vaddr") Cc: stable@vger.kernel.org Signed-off-by: Steve Sistare <steven.sistare@oracle.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/1675184289-267876-2-git-send-email-steven.sistare@oracle.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
| * vfio: platform: ignore missing reset if disabled at module initTomasz Duszynski2023-02-011-2/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | If reset requirement was relaxed via module parameter, errors caused by missing reset should not be propagated down to the vfio core. Otherwise initialization will fail. Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com> Fixes: 5f6c7e0831a1 ("vfio/platform: Use the new device life cycle helpers") Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Link: https://lore.kernel.org/r/20230131083349.2027189-1-tduszynski@marvell.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
| * vfio/mlx5: Improve the target side flow to reduce downtimeYishai Hadas2023-01-302-12/+105
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Improve the target side flow to reduce downtime as of below. - Support reading an optional record which includes the expected stop_copy size. - Once the source sends this record data, which expects to be sent as part of the pre_copy flow, prepare the data buffers that may be large enough to hold the final stop_copy data. The above reduces the migration downtime as the relevant stuff that is needed to load the image data is prepared ahead as part of pre_copy. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://lore.kernel.org/r/20230124144955.139901-4-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
| * vfio/mlx5: Improve the source side flow upon pre_copyYishai Hadas2023-01-303-34/+151
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Improve the source side flow upon pre_copy as of below. - Prepare the stop_copy buffers as part of moving to pre_copy. - Send to the target a record that includes the expected stop_copy size to let it optimize its stop_copy flow as well. As for sending the target this new record type (i.e. MLX5_MIGF_HEADER_TAG_STOP_COPY_SIZE) we split the current 64 header flags bits into 32 flags bits and another 32 tag bits, each record may have a tag and a flag whether it's optional or mandatory. Optional records will be ignored in the target. The above reduces the downtime upon stop_copy as the relevant data stuff is prepared ahead as part of pre_copy. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://lore.kernel.org/r/20230124144955.139901-3-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
| * vfio/mlx5: Check whether VF is migratableShay Drory2023-01-302-0/+28
| | | | | | | | | | | | | | | | | | | | Add a check whether VF is migratable. Only if VF is migratable, mark the VFIO device as migration capable. Signed-off-by: Shay Drory <shayd@nvidia.com> Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://lore.kernel.org/r/20230124144955.139901-2-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
| * vfio/mdev: Use sysfs_emit() to instead of sprintf()Bo Liu2023-01-301-1/+1
| | | | | | | | | | | | | | | | | | | | | | Follow the advice of the Documentation/filesystems/sysfs.rst and show() should only use sysfs_emit() or sysfs_emit_at() when formatting the value to be returned to user space. Signed-off-by: Bo Liu <liubo03@inspur.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20230129084117.2384-1-liubo03@inspur.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
| * vfio/platform: Use GFP_KERNEL_ACCOUNT for userspace persistent allocationsYishai Hadas2023-01-232-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Use GFP_KERNEL_ACCOUNT for userspace persistent allocations. The GFP_KERNEL_ACCOUNT option lets the memory allocator know that this is untrusted allocation triggered from userspace and should be a subject of kmem accounting, and as such it is controlled by the cgroup mechanism. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20230108154427.32609-7-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
| * vfio/fsl-mc: Use GFP_KERNEL_ACCOUNT for userspace persistent allocationsYishai Hadas2023-01-232-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Use GFP_KERNEL_ACCOUNT for userspace persistent allocations. The GFP_KERNEL_ACCOUNT option lets the memory allocator know that this is untrusted allocation triggered from userspace and should be a subject of kmem accounting, and as such it is controlled by the cgroup mechanism. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20230108154427.32609-6-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
| * vfio/hisi: Use GFP_KERNEL_ACCOUNT for userspace persistent allocationsYishai Hadas2023-01-231-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Use GFP_KERNEL_ACCOUNT for userspace persistent allocations. The GFP_KERNEL_ACCOUNT option lets the memory allocator know that this is untrusted allocation triggered from userspace and should be a subject of kmem accounting, and as such it is controlled by the cgroup mechanism. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20230108154427.32609-5-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
| * vfio: Use GFP_KERNEL_ACCOUNT for userspace persistent allocationsJason Gunthorpe2023-01-237-14/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use GFP_KERNEL_ACCOUNT for userspace persistent allocations. The GFP_KERNEL_ACCOUNT option lets the memory allocator know that this is untrusted allocation triggered from userspace and should be a subject of kmem accounting, and as such it is controlled by the cgroup mechanism. The way to find the relevant allocations was for example to look at the close_device function and trace back all the kfrees to their allocations. Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20230108154427.32609-4-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
| * vfio/mlx5: Allow loading of larger images than 512 MBYishai Hadas2023-01-232-14/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Allow loading of larger images than 512 MB by dropping the arbitrary hard-coded value that we have today and move to use the max device loading value which is for now 4GB. As part of that we move to use the GFP_KERNEL_ACCOUNT option upon allocating the persistent data of mlx5 and rely on the cgroup to provide the memory limit for the given user. The GFP_KERNEL_ACCOUNT option lets the memory allocator know that this is untrusted allocation triggered from userspace and should be a subject of kmem accounting, and as such it is controlled by the cgroup mechanism. Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20230108154427.32609-3-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
| * vfio/mlx5: Fix UBSAN noteYishai Hadas2023-01-231-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Prevent calling roundup_pow_of_two() with value of 0 as it causes the below UBSAN note. Move this code and its few extra related lines to be called only when it's really applicable. UBSAN: shift-out-of-bounds in ./include/linux/log2.h:57:13 shift exponent 64 is too large for 64-bit type 'long unsigned int' CPU: 15 PID: 1639 Comm: live_migration Not tainted 6.1.0-rc4 #1116 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x45/0x59 ubsan_epilogue+0x5/0x36 __ubsan_handle_shift_out_of_bounds.cold+0x61/0xef ? lock_is_held_type+0x98/0x110 ? rcu_read_lock_sched_held+0x3f/0x70 mlx5vf_create_rc_qp.cold+0xe4/0xf2 [mlx5_vfio_pci] mlx5vf_start_page_tracker+0x769/0xcd0 [mlx5_vfio_pci] vfio_device_fops_unl_ioctl+0x63f/0x700 [vfio] __x64_sys_ioctl+0x433/0x9a0 do_syscall_64+0x3d/0x90 entry_SYSCALL_64_after_hwframe+0x63/0xcd </TASK> Fixes: 79c3cf279926 ("vfio/mlx5: Init QP based resources for dirty tracking") Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20230108154427.32609-2-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
| * vfio-mdev: turn VFIO_MDEV into a selectable symbolChristoph Hellwig2023-01-231-7/+1
| | | | | | | | | | | | | | | | | | | | | | VFIO_MDEV is just a library with helpers for the drivers. Stop making it a user choice and just select it by the drivers that use the helpers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Tony Krowiak <akrowiak@linux.ibm.com> Link: https://lore.kernel.org/r/20230110091009.474427-3-hch@lst.de Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
| * vfio: platform: No need to check res againAngus Chen2023-01-231-3/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | In function vfio_platform_regions_init(),we did check res implied by using while loop, so no need to check whether res be null or not again. No functional change intended. Signed-off-by: Angus Chen <angus.chen@jaguarmicro.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Link: https://lore.kernel.org/r/20230107034721.2127-1-angus.chen@jaguarmicro.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
* | Merge tag 'for-linus-iommufd' of ↵Linus Torvalds2023-02-247-25/+41
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd Pull iommufd updates from Jason Gunthorpe: "Some polishing and small fixes for iommufd: - Remove IOMMU_CAP_INTR_REMAP, instead rely on the interrupt subsystem - Use GFP_KERNEL_ACCOUNT inside the iommu_domains - Support VFIO_NOIOMMU mode with iommufd - Various typos - A list corruption bug if HWPTs are used for attach" * tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd: iommufd: Do not add the same hwpt to the ioas->hwpt_list twice iommufd: Make sure to zero vfio_iommu_type1_info before copying to user vfio: Support VFIO_NOIOMMU with iommufd iommufd: Add three missing structures in ucmd_buffer selftests: iommu: Fix test_cmd_destroy_access() call in user_copy iommu: Remove IOMMU_CAP_INTR_REMAP irq/s390: Add arch_is_isolated_msi() for s390 iommu/x86: Replace IOMMU_CAP_INTR_REMAP with IRQ_DOMAIN_FLAG_ISOLATED_MSI genirq/msi: Rename IRQ_DOMAIN_MSI_REMAP to IRQ_DOMAIN_ISOLATED_MSI genirq/irqdomain: Remove unused irq_domain_check_msi_remap() code iommufd: Convert to msi_device_has_isolated_msi() vfio/type1: Convert to iommu_group_has_isolated_msi() iommu: Add iommu_group_has_isolated_msi() genirq/msi: Add msi_device_has_isolated_msi()
| * \ Merge tag 'v6.2' into iommufd.git for-nextJason Gunthorpe2023-02-211-11/+20
| |\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Resolve conflicts from the signature change in iommu_map: - drivers/infiniband/hw/usnic/usnic_uiom.c Switch iommu_map_atomic() to iommu_map(.., GFP_ATOMIC) - drivers/vfio/vfio_iommu_type1.c Following indenting change for GFP_KERNEL Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
| * \ \ Merge branch 'vfio-no-iommu' into iommufd.git for-nextJason Gunthorpe2023-02-036-12/+38
| |\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | Shared branch with VFIO for the no-iommu support. Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
| | * | | vfio: Support VFIO_NOIOMMU with iommufdJason Gunthorpe2023-02-036-12/+38
| | | |/ | | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add a small amount of emulation to vfio_compat to accept the SET_IOMMU to VFIO_NOIOMMU_IOMMU and have vfio just ignore iommufd if it is working on a no-iommu enabled device. Move the enable_unsafe_noiommu_mode module out of container.c into vfio_main.c so that it is always available even if VFIO_CONTAINER=n. This passes Alex's mini-test: https://github.com/awilliam/tests/blob/master/vfio-noiommu-pci-device-open.c Link: https://lore.kernel.org/r/0-v3-480cd64a16f7+1ad0-iommufd_noiommu_jgg@nvidia.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
| * | | Merge branch 'iommu-memory-accounting' of ↵Jason Gunthorpe2023-01-301-4/+5
| |\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/joro/iommu intoiommufd/for-next Jason Gunthorpe says: ==================== iommufd follows the same design as KVM and uses memory cgroups to limit the amount of kernel memory a iommufd file descriptor can pin down. The various internal data structures already use GFP_KERNEL_ACCOUNT to charge its own memory. However, one of the biggest consumers of kernel memory is the IOPTEs stored under the iommu_domain and these allocations are not tracked. This series is the first step in fixing it. The iommu driver contract already includes a 'gfp' argument to the map_pages op, allowing iommufd to specify GFP_KERNEL_ACCOUNT and then having the driver allocate the IOPTE tables with that flag will capture a significant amount of the allocations. Update the iommu_map() API to pass in the GFP argument, and fix all call sites. Replace iommu_map_atomic(). Audit the "enterprise" iommu drivers to make sure they do the right thing. Intel and S390 ignore the GFP argument and always use GFP_ATOMIC. This is problematic for iommufd anyhow, so fix it. AMD and ARM SMMUv2/3 are already correct. A follow up series will be needed to capture the allocations made when the iommu_domain itself is allocated, which will complete the job. ==================== * 'iommu-memory-accounting' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/joro/iommu: iommu/s390: Use GFP_KERNEL in sleepable contexts iommu/s390: Push the gfp parameter to the kmem_cache_alloc()'s iommu/intel: Use GFP_KERNEL in sleepable contexts iommu/intel: Support the gfp argument to the map_pages op iommu/intel: Add a gfp parameter to alloc_pgtable_page() iommufd: Use GFP_KERNEL_ACCOUNT for iommu_map() iommu/dma: Use the gfp parameter in __iommu_dma_alloc_noncontiguous() iommu: Add a gfp parameter to iommu_map_sg() iommu: Remove iommu_map_atomic() iommu: Add a gfp parameter to iommu_map() Link: https://lore.kernel.org/linux-iommu/0-v3-76b587fe28df+6e3-iommu_map_gfp_jgg@nvidia.com Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
| * | | | vfio/type1: Convert to iommu_group_has_isolated_msi()Jason Gunthorpe2023-01-111-13/+3
| | |/ / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Trivially use the new API. Link: https://lore.kernel.org/r/3-v3-3313bb5dd3a3+10f11-secure_msi_jgg@nvidia.com Tested-by: Matthew Rosato <mjrosato@linux.ibm.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
* | | | Merge tag 'iommu-updates-v6.3' of ↵Linus Torvalds2023-02-241-4/+5
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu Pull iommu updates from Joerg Roedel: - Consolidate iommu_map/unmap functions. There have been blocking and atomic variants so far, but that was problematic as this approach does not scale with required new variants which just differ in the GFP flags used. So Jason consolidated this back into single functions that take a GFP parameter. - Retire the detach_dev() call-back in iommu_ops - Arm SMMU updates from Will: - Device-tree binding updates: - Cater for three power domains on SM6375 - Document existing compatible strings for Qualcomm SoCs - Tighten up clocks description for platform-specific compatible strings - Enable Qualcomm workarounds for some additional platforms that need them - Intel VT-d updates from Lu Baolu: - Add Intel IOMMU performance monitoring support - Set No Execute Enable bit in PASID table entry - Two performance optimizations - Fix PASID directory pointer coherency - Fix missed rollbacks in error path - Cleanups - Apple t8110 DART support - Exynos IOMMU: - Implement better fault handling - Error handling fixes - Renesas IPMMU: - Add device tree bindings for r8a779g0 - AMD IOMMU: - Various fixes for handling on SNP-enabled systems and handling of faults with unknown request-ids - Cleanups and other small fixes - Various other smaller fixes and cleanups * tag 'iommu-updates-v6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (71 commits) iommu/amd: Skip attach device domain is same as new domain iommu: Attach device group to old domain in error path iommu/vt-d: Allow to use flush-queue when first level is default iommu/vt-d: Fix PASID directory pointer coherency iommu/vt-d: Avoid superfluous IOTLB tracking in lazy mode iommu/vt-d: Fix error handling in sva enable/disable paths iommu/amd: Improve page fault error reporting iommu/amd: Do not identity map v2 capable device when snp is enabled iommu: Fix error unwind in iommu_group_alloc() iommu/of: mark an unused function as __maybe_unused iommu: dart: DART_T8110_ERROR range should be 0 to 5 iommu/vt-d: Enable IOMMU perfmon support iommu/vt-d: Add IOMMU perfmon overflow handler support iommu/vt-d: Support cpumask for IOMMU perfmon iommu/vt-d: Add IOMMU perfmon support iommu/vt-d: Support Enhanced Command Interface iommu/vt-d: Retrieve IOMMU perfmon capability information iommu/vt-d: Support size of the register set in DRHD iommu/vt-d: Set No Execute Enable bit in PASID table entry iommu/vt-d: Remove sva from intel_svm_dev ...
| | \ \ \
| | \ \ \
| *-. \ \ \ Merge branches 'apple/dart', 'arm/exynos', 'arm/renesas', 'arm/smmu', ↵Joerg Roedel2023-02-181-4/+5
| |\ \ \ \ \ | | | |/ / / | | |/| / / | | | |/ / | | |_| / | |/| | 'x86/vt-d', 'x86/amd' and 'core' into next
| | | * iommu: Add a gfp parameter to iommu_map()Jason Gunthorpe2023-01-251-4/+5
| | |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The internal mechanisms support this, but instead of exposting the gfp to the caller it wrappers it into iommu_map() and iommu_map_atomic() Fix this instead of adding more variants for GFP_KERNEL_ACCOUNT. Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Mathieu Poirier <mathieu.poirier@linaro.org> Link: https://lore.kernel.org/r/1-v3-76b587fe28df+6e3-iommu_map_gfp_jgg@nvidia.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
* | | Merge tag 'mm-stable-2023-02-20-13-37' of ↵Linus Torvalds2023-02-231-1/+1
|\ \ \ | |/ / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Daniel Verkamp has contributed a memfd series ("mm/memfd: add F_SEAL_EXEC") which permits the setting of the memfd execute bit at memfd creation time, with the option of sealing the state of the X bit. - Peter Xu adds a patch series ("mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare") which addresses a rare race condition related to PMD unsharing. - Several folioification patch serieses from Matthew Wilcox, Vishal Moola, Sidhartha Kumar and Lorenzo Stoakes - Johannes Weiner has a series ("mm: push down lock_page_memcg()") which does perform some memcg maintenance and cleanup work. - SeongJae Park has added DAMOS filtering to DAMON, with the series "mm/damon/core: implement damos filter". These filters provide users with finer-grained control over DAMOS's actions. SeongJae has also done some DAMON cleanup work. - Kairui Song adds a series ("Clean up and fixes for swap"). - Vernon Yang contributed the series "Clean up and refinement for maple tree". - Yu Zhao has contributed the "mm: multi-gen LRU: memcg LRU" series. It adds to MGLRU an LRU of memcgs, to improve the scalability of global reclaim. - David Hildenbrand has added some userfaultfd cleanup work in the series "mm: uffd-wp + change_protection() cleanups". - Christoph Hellwig has removed the generic_writepages() library function in the series "remove generic_writepages". - Baolin Wang has performed some maintenance on the compaction code in his series "Some small improvements for compaction". - Sidhartha Kumar is doing some maintenance work on struct page in his series "Get rid of tail page fields". - David Hildenbrand contributed some cleanup, bugfixing and generalization of pte management and of pte debugging in his series "mm: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE on all architectures with swap PTEs". - Mel Gorman and Neil Brown have removed the __GFP_ATOMIC allocation flag in the series "Discard __GFP_ATOMIC". - Sergey Senozhatsky has improved zsmalloc's memory utilization with his series "zsmalloc: make zspage chain size configurable". - Joey Gouly has added prctl() support for prohibiting the creation of writeable+executable mappings. The previous BPF-based approach had shortcomings. See "mm: In-kernel support for memory-deny-write-execute (MDWE)". - Waiman Long did some kmemleak cleanup and bugfixing in the series "mm/kmemleak: Simplify kmemleak_cond_resched() & fix UAF". - T.J. Alumbaugh has contributed some MGLRU cleanup work in his series "mm: multi-gen LRU: improve". - Jiaqi Yan has provided some enhancements to our memory error statistics reporting, mainly by presenting the statistics on a per-node basis. See the series "Introduce per NUMA node memory error statistics". - Mel Gorman has a second and hopefully final shot at fixing a CPU-hog regression in compaction via his series "Fix excessive CPU usage during compaction". - Christoph Hellwig does some vmalloc maintenance work in the series "cleanup vfree and vunmap". - Christoph Hellwig has removed block_device_operations.rw_page() in ths series "remove ->rw_page". - We get some maple_tree improvements and cleanups in Liam Howlett's series "VMA tree type safety and remove __vma_adjust()". - Suren Baghdasaryan has done some work on the maintainability of our vm_flags handling in the series "introduce vm_flags modifier functions". - Some pagemap cleanup and generalization work in Mike Rapoport's series "mm, arch: add generic implementation of pfn_valid() for FLATMEM" and "fixups for generic implementation of pfn_valid()" - Baoquan He has done some work to make /proc/vmallocinfo and /proc/kcore better represent the real state of things in his series "mm/vmalloc.c: allow vread() to read out vm_map_ram areas". - Jason Gunthorpe rationalized the GUP system's interface to the rest of the kernel in the series "Simplify the external interface for GUP". - SeongJae Park wishes to migrate people from DAMON's debugfs interface over to its sysfs interface. To support this, we'll temporarily be printing warnings when people use the debugfs interface. See the series "mm/damon: deprecate DAMON debugfs interface". - Andrey Konovalov provided the accurately named "lib/stackdepot: fixes and clean-ups" series. - Huang Ying has provided a dramatic reduction in migration's TLB flush IPI rates with the series "migrate_pages(): batch TLB flushing". - Arnd Bergmann has some objtool fixups in "objtool warning fixes". * tag 'mm-stable-2023-02-20-13-37' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (505 commits) include/linux/migrate.h: remove unneeded externs mm/memory_hotplug: cleanup return value handing in do_migrate_range() mm/uffd: fix comment in handling pte markers mm: change to return bool for isolate_movable_page() mm: hugetlb: change to return bool for isolate_hugetlb() mm: change to return bool for isolate_lru_page() mm: change to return bool for folio_isolate_lru() objtool: add UACCESS exceptions for __tsan_volatile_read/write kmsan: disable ftrace in kmsan core code kasan: mark addr_has_metadata __always_inline mm: memcontrol: rename memcg_kmem_enabled() sh: initialize max_mapnr m68k/nommu: add missing definition of ARCH_PFN_OFFSET mm: percpu: fix incorrect size in pcpu_obj_full_size() maple_tree: reduce stack usage with gcc-9 and earlier mm: page_alloc: call panic() when memoryless node allocation fails mm: multi-gen LRU: avoid futile retries migrate_pages: move THP/hugetlb migration support check to simplify code migrate_pages: batch flushing TLB migrate_pages: share more code between _unmap and _move ...
| * | mm: replace vma->vm_flags direct modifications with modifier callsSuren Baghdasaryan2023-02-091-1/+1
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Replace direct modifications to vma->vm_flags with calls to modifier functions to be able to track flag changes and to keep vma locking correctness. [akpm@linux-foundation.org: fix drivers/misc/open-dice.c, per Hyeonggon Yoo] Link: https://lkml.kernel.org/r/20230126193752.297968-5-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Acked-by: Sebastian Reichel <sebastian.reichel@collabora.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arjun Roy <arjunroy@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: David Rientjes <rientjes@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Joel Fernandes <joelaf@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Laurent Dufour <ldufour@linux.ibm.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Minchan Kim <minchan@google.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Peter Oskolkov <posk@google.com> Cc: Peter Xu <peterx@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Punit Agrawal <punit.agrawal@bytedance.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Shakeel Butt <shakeelb@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Cc: Song Liu <songliubraving@fb.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* / vfio/type1: Respect IOMMU reserved regions in vfio_test_domain_fgsp()Niklas Schnelle2023-01-101-11/+20
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since commit cbf7827bc5dc ("iommu/s390: Fix potential s390_domain aperture shrinking") the s390 IOMMU driver uses reserved regions for the system provided DMA ranges of PCI devices. Previously it reduced the size of the IOMMU aperture and checked it on each mapping operation. On current machines the system denies use of DMA addresses below 2^32 for all PCI devices. Usually mapping IOVAs in a reserved regions is harmless until a DMA actually tries to utilize the mapping. However on s390 there is a virtual PCI device called ISM which is implemented in firmware and used for cross LPAR communication. Unlike real PCI devices this device does not use the hardware IOMMU but inspects IOMMU translation tables directly on IOTLB flush (s390 RPCIT instruction). If it detects IOVA mappings outside the allowed ranges it goes into an error state. This error state then causes the device to be unavailable to the KVM guest. Analysing this we found that vfio_test_domain_fgsp() maps 2 pages at DMA address 0 irrespective of the IOMMUs reserved regions. Even if usually harmless this seems wrong in the general case so instead go through the freshly updated IOVA list and try to find a range that isn't reserved, and fits 2 pages, is PAGE_SIZE * 2 aligned. If found use that for testing for fine grained super pages. Fixes: af029169b8fd ("vfio/type1: Check reserved region conflict and update iova list") Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com> Reviewed-by: Matthew Rosato <mjrosato@linux.ibm.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20230110164427.4051938-2-schnelle@linux.ibm.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
* Merge tag 'driver-core-6.2-rc1' of ↵Linus Torvalds2022-12-161-1/+1
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core updates from Greg KH: "Here is the set of driver core and kernfs changes for 6.2-rc1. The "big" change in here is the addition of a new macro, container_of_const() that will preserve the "const-ness" of a pointer passed into it. The "problem" of the current container_of() macro is that if you pass in a "const *", out of it can comes a non-const pointer unless you specifically ask for it. For many usages, we want to preserve the "const" attribute by using the same call. For a specific example, this series changes the kobj_to_dev() macro to use it, allowing it to be used no matter what the const value is. This prevents every subsystem from having to declare 2 different individual macros (i.e. kobj_const_to_dev() and kobj_to_dev()) and having the compiler enforce the const value at build time, which having 2 macros would not do either. The driver for all of this have been discussions with the Rust kernel developers as to how to properly mark driver core, and kobject, objects as being "non-mutable". The changes to the kobject and driver core in this pull request are the result of that, as there are lots of paths where kobjects and device pointers are not modified at all, so marking them as "const" allows the compiler to enforce this. So, a nice side affect of the Rust development effort has been already to clean up the driver core code to be more obvious about object rules. All of this has been bike-shedded in quite a lot of detail on lkml with different names and implementations resulting in the tiny version we have in here, much better than my original proposal. Lots of subsystem maintainers have acked the changes as well. Other than this change, included in here are smaller stuff like: - kernfs fixes and updates to handle lock contention better - vmlinux.lds.h fixes and updates - sysfs and debugfs documentation updates - device property updates All of these have been in the linux-next tree for quite a while with no problems" * tag 'driver-core-6.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (58 commits) device property: Fix documentation for fwnode_get_next_parent() firmware_loader: fix up to_fw_sysfs() to preserve const usb.h: take advantage of container_of_const() device.h: move kobj_to_dev() to use container_of_const() container_of: add container_of_const() that preserves const-ness of the pointer driver core: fix up missed drivers/s390/char/hmcdrv_dev.c class.devnode() conversion. driver core: fix up missed scsi/cxlflash class.devnode() conversion. driver core: fix up some missing class.devnode() conversions. driver core: make struct class.devnode() take a const * driver core: make struct class.dev_uevent() take a const * cacheinfo: Remove of_node_put() for fw_token device property: Add a blank line in Kconfig of tests device property: Rename goto label to be more precise device property: Move PROPERTY_ENTRY_BOOL() a bit down device property: Get rid of __PROPERTY_ENTRY_ARRAY_EL*SIZE*() kernfs: fix all kernel-doc warnings and multiple typos driver core: pass a const * into of_device_uevent() kobject: kset_uevent_ops: make name() callback take a const * kobject: kset_uevent_ops: make filter() callback take a const * kobject: make kobject_namespace take a const * ...
| * driver core: make struct class.devnode() take a const *Greg Kroah-Hartman2022-11-241-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The devnode() in struct class should not be modifying the device that is passed into it, so mark it as a const * and propagate the function signature changes out into all relevant subsystems that use this callback. Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Reinette Chatre <reinette.chatre@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: x86@kernel.org Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Cc: Jens Axboe <axboe@kernel.dk> Cc: Justin Sanders <justin@coraid.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Sumit Semwal <sumit.semwal@linaro.org> Cc: Benjamin Gaignard <benjamin.gaignard@collabora.com> Cc: Liam Mark <lmark@codeaurora.org> Cc: Laura Abbott <labbott@redhat.com> Cc: Brian Starkey <Brian.Starkey@arm.com> Cc: John Stultz <jstultz@google.com> Cc: "Christian König" <christian.koenig@amd.com> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Cc: Maxime Ripard <mripard@kernel.org> Cc: Thomas Zimmermann <tzimmermann@suse.de> Cc: David Airlie <airlied@gmail.com> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Leon Romanovsky <leon@kernel.org> Cc: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com> Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Sean Young <sean@mess.org> Cc: Frank Haverkamp <haver@linux.ibm.com> Cc: Jiri Slaby <jirislaby@kernel.org> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Cornelia Huck <cohuck@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Anton Vorontsov <anton@enomsg.org> Cc: Colin Cross <ccross@android.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Jaroslav Kysela <perex@perex.cz> Cc: Takashi Iwai <tiwai@suse.com> Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl> Cc: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Cc: Xie Yongji <xieyongji@bytedance.com> Cc: Gautam Dawar <gautam.dawar@xilinx.com> Cc: Dan Carpenter <error27@gmail.com> Cc: Eli Cohen <elic@nvidia.com> Cc: Parav Pandit <parav@nvidia.com> Cc: Maxime Coquelin <maxime.coquelin@redhat.com> Cc: alsa-devel@alsa-project.org Cc: dri-devel@lists.freedesktop.org Cc: kvm@vger.kernel.org Cc: linaro-mm-sig@lists.linaro.org Cc: linux-block@vger.kernel.org Cc: linux-input@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: linux-media@vger.kernel.org Cc: linux-rdma@vger.kernel.org Cc: linux-scsi@vger.kernel.org Cc: linux-usb@vger.kernel.org Cc: virtualization@lists.linux-foundation.org Link: https://lore.kernel.org/r/20221123122523.1332370-2-gregkh@linuxfoundation.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* | Merge tag 'vfio-v6.2-rc1' of https://github.com/awilliam/linux-vfioLinus Torvalds2022-12-1518-429/+1435
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull VFIO updates from Alex Williamson: - Replace deprecated git://github.com link in MAINTAINERS (Palmer Dabbelt) - Simplify vfio/mlx5 with module_pci_driver() helper (Shang XiaoJing) - Drop unnecessary buffer from ACPI call (Rafael Mendonca) - Correct latent missing include issue in iova-bitmap and fix support for unaligned bitmaps. Follow-up with better fix through refactor (Joao Martins) - Rework ccw mdev driver to split private data from parent structure, better aligning with the mdev lifecycle and allowing us to remove a temporary workaround (Eric Farman) - Add an interface to get an estimated migration data size for a device, allowing userspace to make informed decisions, ex. more accurately predicting VM downtime (Yishai Hadas) - Fix minor typo in vfio/mlx5 array declaration (Yishai Hadas) - Simplify module and Kconfig through consolidating SPAPR/EEH code and config options and folding virqfd module into main vfio module (Jason Gunthorpe) - Fix error path from device_register() across all vfio mdev and sample drivers (Alex Williamson) - Define migration pre-copy interface and implement for vfio/mlx5 devices, allowing portions of the device state to be saved while the device continues operation, towards reducing the stop-copy state size (Jason Gunthorpe, Yishai Hadas, Shay Drory) - Implement pre-copy for hisi_acc devices (Shameer Kolothum) - Fixes to mdpy mdev driver remove path and error path on probe (Shang XiaoJing) - vfio/mlx5 fixes for incorrect return after copy_to_user() fault and incorrect buffer freeing (Dan Carpenter) * tag 'vfio-v6.2-rc1' of https://github.com/awilliam/linux-vfio: (42 commits) vfio/mlx5: error pointer dereference in error handling vfio/mlx5: fix error code in mlx5vf_precopy_ioctl() samples: vfio-mdev: Fix missing pci_disable_device() in mdpy_fb_probe() hisi_acc_vfio_pci: Enable PRE_COPY flag hisi_acc_vfio_pci: Move the dev compatibility tests for early check hisi_acc_vfio_pci: Introduce support for PRE_COPY state transitions hisi_acc_vfio_pci: Add support for precopy IOCTL vfio/mlx5: Enable MIGRATION_PRE_COPY flag vfio/mlx5: Fallback to STOP_COPY upon specific PRE_COPY error vfio/mlx5: Introduce multiple loads vfio/mlx5: Consider temporary end of stream as part of PRE_COPY vfio/mlx5: Introduce vfio precopy ioctl implementation vfio/mlx5: Introduce SW headers for migration states vfio/mlx5: Introduce device transitions of PRE_COPY vfio/mlx5: Refactor to use queue based data chunks vfio/mlx5: Refactor migration file state vfio/mlx5: Refactor MKEY usage vfio/mlx5: Refactor PD usage vfio/mlx5: Enforce a single SAVE command at a time vfio: Extend the device migration protocol with PRE_COPY ...
| * | vfio/mlx5: error pointer dereference in error handlingDan Carpenter2022-12-121-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This code frees the wrong "buf" variable and results in an error pointer dereference. Fixes: 34e2f27143d1 ("vfio/mlx5: Introduce multiple loads") Signed-off-by: Dan Carpenter <error27@gmail.com> Reviewed-by: Yishai Hadas <yishaih@nvidia.com> Link: https://lore.kernel.org/r/Y5IKia5SaiVxYmG5@kili Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
| * | vfio/mlx5: fix error code in mlx5vf_precopy_ioctl()Dan Carpenter2022-12-121-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The copy_to_user() function returns the number of bytes remaining to be copied but we want to return a negative error code here. Fixes: 0dce165b1adf ("vfio/mlx5: Introduce vfio precopy ioctl implementation") Signed-off-by: Dan Carpenter <error27@gmail.com> Reviewed-by: Yishai Hadas <yishaih@nvidia.com> Link: https://lore.kernel.org/r/Y5IKVknlf5Z5NPtU@kili Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
| * | hisi_acc_vfio_pci: Enable PRE_COPY flagShameer Kolothum2022-12-061-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | Now that we have everything to support the PRE_COPY state, enable it. Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> Link: https://lore.kernel.org/r/20221123113236.896-5-shameerali.kolothum.thodi@huawei.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
| * | hisi_acc_vfio_pci: Move the dev compatibility tests for early checkShameer Kolothum2022-12-062-12/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Instead of waiting till data transfer is complete to perform dev compatibility, do it as soon as we have enough data to perform the check. This will be useful when we enable the support for PRE_COPY. Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> Link: https://lore.kernel.org/r/20221123113236.896-4-shameerali.kolothum.thodi@huawei.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
| * | hisi_acc_vfio_pci: Introduce support for PRE_COPY state transitionsShameer Kolothum2022-12-061-3/+71
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The saving_migf is open in PRE_COPY state if it is supported and reads initial device match data. hisi_acc_vf_stop_copy() is refactored to make use of common code. Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> Link: https://lore.kernel.org/r/20221123113236.896-3-shameerali.kolothum.thodi@huawei.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
| * | hisi_acc_vfio_pci: Add support for precopy IOCTLShameer Kolothum2022-12-062-0/+53
| | | | | | | | | | | | | | | | | | | | | | | | | | | PRECOPY IOCTL in the case of HiSiIicon ACC driver can be used to perform the device compatibility check earlier during migration. Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> Link: https://lore.kernel.org/r/20221123113236.896-2-shameerali.kolothum.thodi@huawei.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
| * | vfio/mlx5: Enable MIGRATION_PRE_COPY flagShay Drory2022-12-061-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now that everything has been set up for MIGRATION_PRE_COPY, enable it. Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://lore.kernel.org/r/20221206083438.37807-15-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
| * | vfio/mlx5: Fallback to STOP_COPY upon specific PRE_COPY errorShay Drory2022-12-063-3/+32
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Before a SAVE command is issued, a QUERY command is issued in order to know the device data size. In case PRE_COPY is used, the above commands are issued while the device is running. Thus, it is possible that between the QUERY and the SAVE commands the state of the device will be changed significantly and thus the SAVE will fail. Currently, if a SAVE command is failing, the driver will fail the migration. In the above case, don't fail the migration, but don't allow for new SAVEs to be executed while the device is in a RUNNING state. Once the device will be moved to STOP_COPY, SAVE can be executed again and the full device state will be read. Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://lore.kernel.org/r/20221206083438.37807-14-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
| * | vfio/mlx5: Introduce multiple loadsYishai Hadas2022-12-063-45/+257
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In order to support PRE_COPY, mlx5 driver transfers multiple states (images) of the device. e.g.: the source VF can save and transfer multiple states, and the target VF will load them by that order. This patch implements the changes for the target VF to decompose the header for each state and to write and load multiple states. Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://lore.kernel.org/r/20221206083438.37807-13-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
| * | vfio/mlx5: Consider temporary end of stream as part of PRE_COPYYishai Hadas2022-12-063-2/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | During PRE_COPY the migration data FD may have a temporary "end of stream" that is reached when the initial_bytes were read and no other dirty data exists yet. For instance, this may indicate that the device is idle and not currently dirtying any internal state. When read() is done on this temporary end of stream the kernel driver should return ENOMSG from read(). Userspace can wait for more data or consider moving to STOP_COPY. To not block the user upon read() and let it get ENOMSG we add a new state named MLX5_MIGF_STATE_PRE_COPY on the migration file. In addition, we add the MLX5_MIGF_STATE_SAVE_LAST state to block the read() once we call the last SAVE upon moving to STOP_COPY. Any further error will be marked with MLX5_MIGF_STATE_ERROR and the user won't be blocked. Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Yishai Hadas <yishaih@nvidia.com> Link: https://lore.kernel.org/r/20221206083438.37807-12-yishaih@nvidia.com Signed-off-by: Alex Williamson <alex.williamson@redhat.com>