summaryrefslogtreecommitdiffstats
path: root/kernel
Commit message (Collapse)AuthorAgeFilesLines
* Merge tag 'nfsd-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linuxLinus Torvalds2021-08-311-0/+42
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull nfsd updates from Chuck Lever: "New features: - Support for server-side disconnect injection via debugfs - Protocol definitions for new RPC_AUTH_TLS authentication flavor Performance improvements: - Reduce page allocator traffic in the NFSD splice read actor - Reduce CPU utilization in svcrdma's Send completion handler Notable bug fixes: - Stabilize lockd operation when re-exporting NFS mounts - Fix the use of %.*s in NFSD tracepoints - Fix /proc/sys/fs/nfs/nsm_use_hostnames" * tag 'nfsd-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (31 commits) nfsd: fix crash on LOCKT on reexported NFSv3 nfs: don't allow reexport reclaims lockd: don't attempt blocking locks on nfs reexports nfs: don't atempt blocking locks on nfs reexports Keep read and write fds with each nlm_file lockd: update nlm_lookup_file reexport comment nlm: minor refactoring nlm: minor nlm_lookup_file argument change lockd: lockd server-side shouldn't set fl_ops SUNRPC: Add documentation for the fail_sunrpc/ directory SUNRPC: Server-side disconnect injection SUNRPC: Move client-side disconnect injection SUNRPC: Add a /sys/kernel/debug/fail_sunrpc/ directory svcrdma: xpt_bc_xprt is already clear in __svc_rdma_free() nfsd4: Fix forced-expiry locking rpc: fix gss_svc_init cleanup on failure SUNRPC: Add RPC_AUTH_TLS protocol numbers lockd: change the proc_handler for nsm_use_hostnames sysctl: introduce new proc handler proc_dobool SUNRPC: Fix a NULL pointer deref in trace_svc_stats_latency() ...
| * sysctl: introduce new proc handler proc_doboolJia He2021-08-171-0/+42
| | | | | | | | | | | | | | | | | | | | | | | | This is to let bool variable could be correctly displayed in big/little endian sysctl procfs. sizeof(bool) is arch dependent, proc_dobool should work in all arches. Suggested-by: Pan Xinhui <xinhui@linux.vnet.ibm.com> Signed-off-by: Jia He <hejianet@gmail.com> [thuth: rebased the patch to the current kernel version] Signed-off-by: Thomas Huth <thuth@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
* | Merge tag 'for-5.15/io_uring-2021-08-30' of git://git.kernel.dk/linux-blockLinus Torvalds2021-08-301-1/+1
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull io_uring updates from Jens Axboe: - cancellation cleanups (Hao, Pavel) - io-wq accounting cleanup (Hao) - io_uring submit locking fix (Hao) - io_uring link handling fixes (Hao) - fixed file improvements (wangyangbo, Pavel) - allow updates of linked timeouts like regular timeouts (Pavel) - IOPOLL fix (Pavel) - remove batched file get optimization (Pavel) - improve reference handling (Pavel) - IRQ task_work batching (Pavel) - allow pure fixed file, and add support for open/accept (Pavel) - GFP_ATOMIC RT kernel fix - multiple CQ ring waiter improvement - funnel IRQ completions through task_work - add support for limiting async workers explicitly - add different clocksource support for timeouts - io-wq wakeup race fix - lots of cleanups and improvement (Pavel et al) * tag 'for-5.15/io_uring-2021-08-30' of git://git.kernel.dk/linux-block: (87 commits) io-wq: fix wakeup race when adding new work io-wq: wqe and worker locks no longer need to be IRQ safe io-wq: check max_worker limits if a worker transitions bound state io_uring: allow updating linked timeouts io_uring: keep ltimeouts in a list io_uring: support CLOCK_BOOTTIME/REALTIME for timeouts io-wq: provide a way to limit max number of workers io_uring: add build check for buf_index overflows io_uring: clarify io_req_task_cancel() locking io_uring: add task-refs-get helper io_uring: fix failed linkchain code logic io_uring: remove redundant req_set_fail() io_uring: don't free request to slab io_uring: accept directly into fixed file table io_uring: hand code io_accept() fd installing io_uring: openat directly into fixed fd table net: add accept helper not installing fd io_uring: fix io_try_cancel_userdata race for iowq io_uring: IRQ rw completion batching io_uring: batch task work locking ...
| * | io_uring: remove files pointer in cancellation functionsHao Xu2021-08-231-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When doing cancellation, we use a parameter to indicate where it's from do_exit or exec. So a boolean value is good enough for this, remove the struct files* as it is not necessary. Signed-off-by: Hao Xu <haoxu@linux.alibaba.com> [axboe: fixup io_uring_files_cancel for !CONFIG_IO_URING] Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | | Merge tag 'for-5.15/drivers-2021-08-30' of git://git.kernel.dk/linux-blockLinus Torvalds2021-08-301-0/+18
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull block driver updates from Jens Axboe: "Sitting on top of the core block changes, here are the driver changes for the 5.15 merge window: - NVMe updates via Christoph: - suspend improvements for devices with an HMB (Keith Busch) - handle double completions more gacefull (Sagi Grimberg) - cleanup the selects for the nvme core code a bit (Sagi Grimberg) - don't update queue count when failing to set io queues (Ruozhu Li) - various nvmet connect fixes (Amit Engel) - cleanup lightnvm leftovers (Keith Busch, me) - small cleanups (Colin Ian King, Hou Pu) - add tracing for the Set Features command (Hou Pu) - CMB sysfs cleanups (Keith Busch) - add a mutex_destroy call (Keith Busch) - remove lightnvm subsystem. It's served its purpose and ultimately led to zoned nvme support, we no longer need it (Christoph) - revert floppy O_NDELAY fix (Denis) - nbd fixes (Hou, Pavel, Baokun) - nbd locking fixes (Tetsuo) - nbd device removal fixes (Christoph) - raid10 rcu warning fix (Xiao) - raid1 write behind fix (Guoqing) - rnbd fixes (Gioh, Md Haris) - misc fixes (Colin)" * tag 'for-5.15/drivers-2021-08-30' of git://git.kernel.dk/linux-block: (42 commits) Revert "floppy: reintroduce O_NDELAY fix" raid1: ensure write behind bio has less than BIO_MAX_VECS sectors md/raid10: Remove unnecessary rcu_dereference in raid10_handle_discard nbd: remove nbd->destroy_complete nbd: only return usable devices from nbd_find_unused nbd: set nbd->index before releasing nbd_index_mutex nbd: prevent IDR lookups from finding partially initialized devices nbd: reset NBD to NULL when restarting in nbd_genl_connect nbd: add missing locking to the nbd_dev_add error path nvme: remove the unused NVME_NS_* enum nvme: remove nvm_ndev from ns nvme: Have NVME_FABRICS select NVME_CORE instead of transport drivers block: nbd: add sanity check for first_minor nvmet: check that host sqsize does not exceed ctrl MQES nvmet: avoid duplicate qid in connect cmd nvmet: pass back cntlid on successful completion nvme-rdma: don't update queue count when failing to set io queues nvme-tcp: don't update queue count when failing to set io queues nvme-tcp: pair send_mutex init with destroy nvme: allow user toggling hmb usage ...
| * | | params: lift param_set_uint_minmax to common codeSagi Grimberg2021-08-161-0/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It is a useful helper hence move it to common code so others can enjoy it. Suggested-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
* | | | Merge tag 'timers-core-2021-08-30' of ↵Linus Torvalds2021-08-3010-156/+398
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer updates from Thomas Gleixner: "Updates for timekeeping, timers and related drivers: Core code: - Cure a couple of correctness issues in the posix CPU timer code to prevent that the tick dependency for NOHZ full is kept alive for no reason. - Avoid expensive double reprogramming of the clockevent device in hrtimer_start_range_ns(). - Avoid pointless SMP function calls when the clock was set to avoid disturbing CPUs which do not have any affected timers queued. - Make the clocksource watchdog test work correctly when CONFIG_HZ is less than 100. Drivers: - Prefer the ARM architected timer over the Exynos timer which is way more expensive to access. - Add device tree bindings for new Ingenic SoCs - The usual improvements and cleanups all over the place" * tag 'timers-core-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (29 commits) clocksource: Make clocksource watchdog test safe for slow-HZ systems dt-bindings: timer: Add ABIs for new Ingenic SoCs clocksource/drivers/fttmr010: Pass around less pointers clocksource/drivers/mediatek: Optimize systimer irq clear flow on shutdown clocksource/drivers/ingenic: Use bitfield macro helpers clocksource/drivers/sh_cmt: Fix wrong setting if don't request IRQ for clock source channel dt-bindings: timer: convert rockchip,rk-timer.txt to YAML clocksource/drivers/exynos_mct: Mark MCT device as CLOCK_EVT_FEAT_PERCPU clocksource/drivers/exynos_mct: Prioritise Arm arch timer on arm64 hrtimer: Unbreak hrtimer_force_reprogram() hrtimer: Use raw_cpu_ptr() in clock_was_set() hrtimer: Avoid more SMP function calls in clock_was_set() hrtimer: Avoid unnecessary SMP function calls in clock_was_set() hrtimer: Add bases argument to clock_was_set() time/timekeeping: Avoid invoking clock_was_set() twice timekeeping: Distangle resume and clock-was-set events timerfd: Provide timerfd_resume() hrtimer: Force clock_was_set() handling for the HIGHRES=n, NOHZ=y case hrtimer: Ensure timerfd notification for HIGHRES=n hrtimer: Consolidate reprogramming code ...
| * | | | clocksource: Make clocksource watchdog test safe for slow-HZ systemsPaul E. McKenney2021-08-283-23/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The clocksource watchdog test sets a local JIFFIES_SHIFT macro and assumes that HZ is >= 100. For smaller HZ values this shift value is too large and causes undefined behaviour. Move the HZ-based definitions of JIFFIES_SHIFT from kernel/time/jiffies.c to kernel/time/tick-internal.h so the clocksource watchdog test can utilize them, which makes it work correctly with all HZ values. [ tglx: Resolved conflicts and massaged changelog ] Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/lkml/20210812000133.GA402890@paulmck-ThinkPad-P17-Gen-1/
| * | | | hrtimer: Unbreak hrtimer_force_reprogram()Thomas Gleixner2021-08-121-20/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since the recent consoliation of reprogramming functions, hrtimer_force_reprogram() is affected by a check whether the new expiry time is past the current expiry time. This breaks the NOHZ logic as that relies on the fact that the tick hrtimer is moved into the future. That means cpu_base->expires_next becomes stale and subsequent reprogramming attempts fail as well until the situation is cleaned up by an hrtimer interrupts. For some yet unknown reason this leads to a complete stall, so for now partially revert the offending commit to a known working state. The root cause for the stall is still investigated and will be fixed in a subsequent commit. Fixes: b14bca97c9f5 ("hrtimer: Consolidate reprogramming code") Reported-by: Mike Galbraith <efault@gmx.de> Reported-by: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Mike Galbraith <efault@gmx.de> Link: https://lore.kernel.org/r/8735recskh.ffs@tglx
| * | | | hrtimer: Use raw_cpu_ptr() in clock_was_set()Thomas Gleixner2021-08-121-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | clock_was_set() can be invoked from preemptible context. Use raw_cpu_ptr() to check whether high resolution mode is active or not. It does not matter whether the task migrates after acquiring the pointer. Fixes: e71a4153b7c2 ("hrtimer: Force clock_was_set() handling for the HIGHRES=n, NOHZ=y case") Reported-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/875ywacsmb.ffs@tglx
| * | | | hrtimer: Avoid more SMP function calls in clock_was_set()Thomas Gleixner2021-08-101-9/+65
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | By unconditionally updating the offsets there are more indicators whether the SMP function calls on clock_was_set() can be avoided: - When the offset update already happened on the remote CPU then the remote update attempt will yield the same seqeuence number and no IPI is required. - When the remote CPU is currently handling hrtimer_interrupt(). In that case the remote CPU will reevaluate the timer bases before reprogramming anyway, so nothing to do. - After updating it can be checked whether the first expiring timer in the affected clock bases moves before the first expiring (softirq) timer of the CPU. If that's not the case then sending the IPI is not required. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20210713135158.887322464@linutronix.de
| * | | | hrtimer: Avoid unnecessary SMP function calls in clock_was_set()Marcelo Tosatti2021-08-101-2/+33
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Setting of clocks triggers an unconditional SMP function call on all online CPUs to reprogram the clock event device. However, only some clocks have their offsets updated and therefore potentially require a reprogram. That's CLOCK_REALTIME and CLOCK_TAI and in the case of resume (delayed sleep time injection) also CLOCK_BOOTTIME. Instead of sending an IPI unconditionally, check each per CPU hrtimer base whether it has active timers in the affected clock bases which are indicated by the caller in the @bases argument of clock_was_set(). If that's not the case, skip the IPI and update the offsets remotely which ensures that any subsequently armed timers on the affected clocks are evaluated with the correct offsets. [ tglx: Adopted to the new bases argument, removed the softirq_active check, added comment, fixed up stale comment ] Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20210713135158.787536542@linutronix.de
| * | | | hrtimer: Add bases argument to clock_was_set()Thomas Gleixner2021-08-103-10/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | clock_was_set() unconditionaly invokes retrigger_next_event() on all online CPUs. This was necessary because that mechanism was also used for resume from suspend to idle which is not longer the case. The bases arguments allows the callers of clock_was_set() to hand in a mask which tells clock_was_set() which of the hrtimer clock bases are affected by the clock setting. This mask will be used in the next step to check whether a CPU base has timers queued on a clock base affected by the event and avoid the SMP function call if there are none. Add a @bases argument, provide defines for the active bases masking and fixup all callsites. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20210713135158.691083465@linutronix.de
| * | | | time/timekeeping: Avoid invoking clock_was_set() twiceThomas Gleixner2021-08-101-8/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | do_adjtimex() might end up scheduling a delayed clock_was_set() via timekeeping_advance() and then invoke clock_was_set() directly which is pointless. Make timekeeping_advance() return whether an invocation of clock_was_set() is required and handle it at the call sites which allows do_adjtimex() to issue a single direct call if required. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20210713135158.580966888@linutronix.de
| * | | | timekeeping: Distangle resume and clock-was-set eventsThomas Gleixner2021-08-104-10/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Resuming timekeeping is a clock-was-set event and uses the clock-was-set notification mechanism. This is in the way of making the clock-was-set update for hrtimers selective so unnecessary IPIs are avoided when a CPU base does not have timers queued which are affected by the clock setting. Distangle it by invoking hrtimer_resume() on each unfreezing CPU and invoke the new timerfd_resume() function from timekeeping_resume() which is the only place where this is needed. Rename hrtimer_resume() to hrtimer_resume_local() to reflect the change. With this the clock_was_set*() functions are not longer required to IPI all CPUs unconditionally and can get some smarts to avoid them. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20210713135158.488853478@linutronix.de
| * | | | hrtimer: Force clock_was_set() handling for the HIGHRES=n, NOHZ=y caseThomas Gleixner2021-08-101-28/+59
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When CONFIG_HIGH_RES_TIMERS is disabled, but NOHZ is enabled then clock_was_set() is not doing anything. With HIGHRES=n the kernel relies on the periodic tick to update the clock offsets, but when NOHZ is enabled and active then CPUs which are in a deep idle sleep do not have a periodic tick which means the expiry of timers affected by clock_was_set() can be arbitrarily delayed up to the point where the CPUs are brought out of idle again. Make the clock_was_set() logic unconditionaly available so that idle CPUs are kicked out of idle to handle the update. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20210713135158.288697903@linutronix.de
| * | | | hrtimer: Ensure timerfd notification for HIGHRES=nThomas Gleixner2021-08-102-16/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If high resolution timers are disabled the timerfd notification about a clock was set event is not happening for all cases which use clock_was_set_delayed() because that's a NOP for HIGHRES=n, which is wrong. Make clock_was_set_delayed() unconditially available to fix that. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20210713135158.196661266@linutronix.de
| * | | | hrtimer: Consolidate reprogramming codePeter Zijlstra2021-08-101-43/+29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This code is mostly duplicated. The redudant store in the force reprogram case does no harm and the in hrtimer interrupt condition cannot be true for the force reprogram invocations. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20210713135158.054424875@linutronix.de
| * | | | hrtimer: Avoid double reprogramming in __hrtimer_start_range_ns()Thomas Gleixner2021-08-101-7/+53
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If __hrtimer_start_range_ns() is invoked with an already armed hrtimer then the timer has to be canceled first and then added back. If the timer is the first expiring timer then on removal the clockevent device is reprogrammed to the next expiring timer to avoid that the pending expiry fires needlessly. If the new expiry time ends up to be the first expiry again then the clock event device has to reprogrammed again. Avoid this by checking whether the timer is the first to expire and in that case, keep the timer on the current CPU and delay the reprogramming up to the point where the timer has been enqueued again. Reported-by: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20210713135157.873137732@linutronix.de
| * | | | posix-cpu-timers: Recalc next expiration when timer_settime() ends up not ↵Frederic Weisbecker2021-08-101-6/+35
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | queueing There are several scenarios that can result in posix_cpu_timer_set() not queueing the timer but still leaving the threadgroup cputime counter running or keeping the tick dependency around for a random amount of time. 1) If timer_settime() is called with a 0 expiration on a timer that is already disabled, the process wide cputime counter will be started and won't ever get a chance to be stopped by stop_process_timer() since no timer is actually armed to be processed. The following snippet is enough to trigger the issue. void trigger_process_counter(void) { timer_t id; struct itimerspec val = { }; timer_create(CLOCK_PROCESS_CPUTIME_ID, NULL, &id); timer_settime(id, TIMER_ABSTIME, &val, NULL); timer_delete(id); } 2) If timer_settime() is called with a 0 expiration on a timer that is already armed, the timer is dequeued but not really disarmed. So the process wide cputime counter and the tick dependency may still remain a while around. The following code snippet keeps this overhead around for one week after the timer deletion: void trigger_process_counter(void) { timer_t id; struct itimerspec val = { }; val.it_value.tv_sec = 604800; timer_create(CLOCK_PROCESS_CPUTIME_ID, NULL, &id); timer_settime(id, 0, &val, NULL); timer_delete(id); } 3) If the timer was initially deactivated, this call to timer_settime() with an early expiration may have started the process wide cputime counter even though the timer hasn't been queued and armed because it has fired early and inline within posix_cpu_timer_set() itself. As a result the process wide cputime counter may never stop until a new timer is ever armed in the future. The following code snippet can reproduce this: void trigger_process_counter(void) { timer_t id; struct itimerspec val = { }; signal(SIGALRM, SIG_IGN); timer_create(CLOCK_PROCESS_CPUTIME_ID, NULL, &id); val.it_value.tv_nsec = 1; timer_settime(id, TIMER_ABSTIME, &val, NULL); } 4) If the timer was initially armed with a former expiration value before this call to timer_settime() and the current call sets an early deadline that has already expired, the timer fires inline within posix_cpu_timer_set(). In this case it must have been dequeued before firing inline with its new expiration value, yet it hasn't been disarmed in this case. So the process wide cputime counter and the tick dependency may still be around for a while even after the timer fired. The following code snippet can reproduce this: void trigger_process_counter(void) { timer_t id; struct itimerspec val = { }; signal(SIGALRM, SIG_IGN); timer_create(CLOCK_PROCESS_CPUTIME_ID, NULL, &id); val.it_value.tv_sec = 100; timer_settime(id, TIMER_ABSTIME, &val, NULL); val.it_value.tv_sec = 0; val.it_value.tv_nsec = 1; timer_settime(id, TIMER_ABSTIME, &val, NULL); } Fix all these issues with triggering the related base next expiration recalculation on the next tick. This also implies to re-evaluate the need to keep around the process wide cputime counter and the tick dependency, in a similar fashion to disarm_timer(). Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20210726125513.271824-7-frederic@kernel.org
| * | | | posix-cpu-timers: Consolidate timer base accessorFrederic Weisbecker2021-08-101-15/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Remove the ad-hoc timer base accessors and provide a consolidated one. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20210726125513.271824-6-frederic@kernel.org
| * | | | posix-cpu-timers: Remove confusing return value overrideFrederic Weisbecker2021-08-101-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The end of the function cannot be reached with an error in variable ret. Unconfuse reviewers about that. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20210726125513.271824-5-frederic@kernel.org
| * | | | posix-cpu-timers: Force next expiration recalc after itimer resetFrederic Weisbecker2021-08-101-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When an itimer deactivates a previously armed expiration, it simply doesn't do anything. As a result the process wide cputime counter keeps running and the tick dependency stays set until it reaches the old ghost expiration value. This can be reproduced with the following snippet: void trigger_process_counter(void) { struct itimerval n = {}; n.it_value.tv_sec = 100; setitimer(ITIMER_VIRTUAL, &n, NULL); n.it_value.tv_sec = 0; setitimer(ITIMER_VIRTUAL, &n, NULL); } Fix this with resetting the relevant base expiration. This is similar to disarming a timer. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20210726125513.271824-4-frederic@kernel.org
| * | | | posix-cpu-timers: Force next_expiration recalc after timer deletionFrederic Weisbecker2021-08-101-1/+32
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A timer deletion only dequeues the timer but it doesn't shutdown the related costly process wide cputimer counter and the tick dependency. The following code snippet keeps this overhead around for one week after the timer deletion: void trigger_process_counter(void) { timer_t id; struct itimerspec val = { }; val.it_value.tv_sec = 604800; timer_create(CLOCK_PROCESS_CPUTIME_ID, NULL, &id); timer_settime(id, 0, &val, NULL); timer_delete(id); } Make sure the next target's tick recalculates the nearest expiration and clears the process wide counter and tick dependency if necessary. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20210726125513.271824-3-frederic@kernel.org
| * | | | posix-cpu-timers: Assert task sighand is locked while starting cputime counterFrederic Weisbecker2021-08-102-0/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Starting the process wide cputime counter needs to be done in the same sighand locking sequence than actually arming the related timer otherwise this races against concurrent timers setting/expiring in the same threadgroup. Detecting that the cputime counter is started without holding the sighand lock is a first step toward debugging such situations. Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20210726125513.271824-2-frederic@kernel.org
| * | | | posix-timers: Remove redundant initialization of variable retColin Ian King2021-08-101-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The variable ret is being initialized with a value that is never read, it is being updated later on. The assignment is redundant and can be removed. Addresses-Coverity: ("Unused value") Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20210721120147.109570-1-colin.king@canonical.com
| * | | | clocksource: Replace deprecated CPU-hotplug functions.Sebastian Andrzej Siewior2021-08-101-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The functions get_online_cpus() and put_online_cpus() have been deprecated during the CPU hotplug rework. They map directly to cpus_read_lock() and cpus_read_unlock(). Replace deprecated CPU-hotplug functions with the official version. The behavior remains unchanged. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20210803141621.780504-35-bigeasy@linutronix.de
* | | | | Merge tag 'irq-core-2021-08-30' of ↵Linus Torvalds2021-08-3013-50/+195
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq updates from Thomas Gleixner: "Updates to the interrupt core and driver subsystems: Core changes: - The usual set of small fixes and improvements all over the place, but nothing stands out MSI changes: - Further consolidation of the PCI/MSI interrupt chip code - Make MSI sysfs code independent of PCI/MSI and expose the MSI interrupts of platform devices in the same way as PCI exposes them. Driver changes: - Support for ARM GICv3 EPPI partitions - Treewide conversion to generic_handle_domain_irq() for all chained interrupt controllers - Conversion to bitmap_zalloc() throughout the irq chip drivers - The usual set of small fixes and improvements" * tag 'irq-core-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (57 commits) platform-msi: Add ABI to show msi_irqs of platform devices genirq/msi: Move MSI sysfs handling from PCI to MSI core genirq/cpuhotplug: Demote debug printk to KERN_DEBUG irqchip/qcom-pdc: Trim unused levels of the interrupt hierarchy irqdomain: Export irq_domain_disconnect_hierarchy() irqchip/gic-v3: Fix priority comparison when non-secure priorities are used irqchip/apple-aic: Fix irq_disable from within irq handlers pinctrl/rockchip: drop the gpio related codes gpio/rockchip: drop irq_gc_lock/irq_gc_unlock for irq set type gpio/rockchip: support next version gpio controller gpio/rockchip: use struct rockchip_gpio_regs for gpio controller gpio/rockchip: add driver for rockchip gpio dt-bindings: gpio: change items restriction of clock for rockchip,gpio-bank pinctrl/rockchip: add pinctrl device to gpio bank struct pinctrl/rockchip: separate struct rockchip_pin_bank to a head file pinctrl/rockchip: always enable clock for gpio controller genirq: Fix kernel doc indentation EDAC/altera: Convert to generic_handle_domain_irq() powerpc: Bulk conversion to generic_handle_domain_irq() nios2: Bulk conversion to generic_handle_domain_irq() ...
| * \ \ \ \ Merge tag 'irqchip-5.15' of ↵Thomas Gleixner2021-08-291-0/+1
| |\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms into irq/core Pull irqchip updates from Marc Zyngier: - API updates: - Treewide conversion to generic_handle_domain_irq() for anything that looks like a chained interrupt controller - Update the irqdomain documentation - Use of bitmap_zalloc() throughout the tree - New functionalities: - Support for GICv3 EPPI partitions - Fixes: - Qualcomm PDC hierarchy fixes - Yet another priority decoding fix for the GICv3 pseudo-NMIs - Fix the apple-aic driver irq_eoi() callback to always unmask the interrupt - Properly handle edge interrupts on loongson-pch-pic - Let the mtk-sysirq driver advertise IRQCHIP_SKIP_SET_WAKE Link: https://lore.kernel.org/r/20210828121013.2647964-1-maz@kernel.org
| | * | | | | irqdomain: Export irq_domain_disconnect_hierarchy()Maulik Shah2021-08-231-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Export irq_domain_disconnect_hierarchy() so irqchip module drivers can use it. Signed-off-by: Maulik Shah <mkshah@codeaurora.org> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/1629705880-27877-2-git-send-email-mkshah@codeaurora.org
| * | | | | | genirq/msi: Move MSI sysfs handling from PCI to MSI coreBarry Song2021-08-241-0/+134
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Move PCI's MSI sysfs code to the irq core so that other busses such as platform can reuse it. Signed-off-by: Barry Song <song.bao.hua@hisilicon.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Bjorn Helgaas <bhelgaas@google.com> Acked-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20210813035628.6844-2-21cnbao@gmail.com
| * | | | | | genirq/cpuhotplug: Demote debug printk to KERN_DEBUGLee Jones2021-08-241-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This sort of information is only generally useful when debugging. No need to have these sprinkled through the kernel log otherwise. Real world problem: During pre-release testing these have an affect on performance on real products. To the point where so much logging builds up, that it sets off the watchdog(s) on some high profile consumer devices. Signed-off-by: Lee Jones <lee.jones@linaro.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20210816134817.1503661-1-lee.jones@linaro.org
| * | | | | | genirq: Fix kernel doc indentationThomas Gleixner2021-08-131-3/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fixes: 61377ec14457 ("genirq: Clarify documentation for request_threaded_irq()") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
| * | | | | | genirq: Fix kernel-doc warnings in pm.c, msi.c and ipi.cRandy Dunlap2021-08-113-24/+29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix all kernel-doc warnings in these 3 files and do some simple editing (capitalize acronyms, capitalize Linux). kernel/irq/pm.c:235: warning: expecting prototype for irq_pm_syscore_ops(). Prototype was for irq_pm_syscore_resume() instead kernel/irq/msi.c:530: warning: expecting prototype for __msi_domain_free_irqs(). Prototype was for msi_domain_free_irqs() instead kernel/irq/msi.c:31: warning: No description found for return value of 'alloc_msi_entry' kernel/irq/msi.c:103: warning: No description found for return value of 'msi_domain_set_affinity' kernel/irq/msi.c:288: warning: No description found for return value of 'msi_create_irq_domain' kernel/irq/msi.c:499: warning: No description found for return value of 'msi_domain_alloc_irqs' kernel/irq/msi.c:545: warning: No description found for return value of 'msi_get_domain_info' kernel/irq/ipi.c:264: warning: expecting prototype for ipi_send_mask(). Prototype was for __ipi_send_mask() instead kernel/irq/ipi.c:25: warning: No description found for return value of 'irq_reserve_ipi' kernel/irq/ipi.c:116: warning: No description found for return value of 'irq_destroy_ipi' kernel/irq/ipi.c:163: warning: No description found for return value of 'ipi_get_hwirq' kernel/irq/ipi.c:222: warning: No description found for return value of '__ipi_send_single' kernel/irq/ipi.c:308: warning: No description found for return value of 'ipi_send_single' kernel/irq/ipi.c:329: warning: No description found for return value of 'ipi_send_mask' Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20210810234835.12547-1-rdunlap@infradead.org
| * | | | | | genirq/timings: Fix error return code in irq_timings_test_irqs()Zhen Lei2021-08-111-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Return a negative error code from the error handling case instead of 0, as done elsewhere in this function. Fixes: f52da98d900e ("genirq/timings: Add selftest for irqs circular buffer") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20210811093333.2376-1-thunder.leizhen@huawei.com
| * | | | | | genirq/matrix: Fix kernel doc warnings for irq_matrix_alloc_managed()Baokun Li2021-08-101-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Describe the arguments correctly. Fixes the following W=1 kernel build warning(s): kernel/irq/matrix.c:287: warning: Function parameter or member 'msk' not described in 'irq_matrix_alloc_managed' kernel/irq/matrix.c:287: warning: Function parameter or member 'mapped_cpu' not described in 'irq_matrix_alloc_managed' kernel/irq/matrix.c:287: warning: Excess function parameter 'cpu' description in 'irq_matrix_alloc_managed' Signed-off-by: Baokun Li <libaokun1@huawei.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20210605063413.684085-1-libaokun1@huawei.com
| * | | | | | genirq: Change force_irqthreads to a static keyTanner Love2021-08-102-7/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With CONFIG_IRQ_FORCED_THREADING=y, testing the boolean force_irqthreads could incur a cache line miss in invoke_softirq() and other places. Replace the test with a static key to avoid the potential cache miss. [ tglx: Dropped the IDE part, removed the export and updated blk-mq ] Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Tanner Love <tannerlove@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20210602180338.3324213-1-tannerlove.kernel@gmail.com
| * | | | | | genirq/generic_chip: Use struct_size() in kzalloc()Gustavo A. R. Silva2021-08-101-7/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Make use of the struct_size() helper instead of an open-coded version, in order to avoid any potential type mistakes or integer overflows that, in the worst scenario, could lead to heap overflows. This code was detected with the help of Coccinelle and, audited and fixed manually. Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20210513212729.GA214145@embeddedor
| * | | | | | genirq: Clarify documentation for request_threaded_irq()Joel Savitz2021-08-101-3/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Clarify wording and document commonly used IRQF_ONESHOT flag. Signed-off-by: Joel Savitz <jsavitz@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20210731050740.444454-1-jsavitz@redhat.com
| * | | | | | genirq/affinity: Replace deprecated CPU-hotplug functions.Sebastian Andrzej Siewior2021-08-101-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The functions get_online_cpus() and put_online_cpus() have been deprecated during the CPU hotplug rework. They map directly to cpus_read_lock() and cpus_read_unlock(). Replace deprecated CPU-hotplug functions with the official version. The behavior remains unchanged. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20210803141621.780504-26-bigeasy@linutronix.de
| * | | | | | Merge branch 'irq/urgent' into irq/coreThomas Gleixner2021-08-1014-204/+334
| |\ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | to pick up fixes on which further changes depend on.
| * | | | | | | genirq: Improve "hwirq" output in /proc and /sys/Cédric Le Goater2021-07-302-2/+2
| | |/ / / / / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The HW IRQ numbers generated by the PCI MSI layer can be quite large on a pSeries machine when running under the IBM Hypervisor and they appear as negative. Use '%lu' instead to show them correctly. Signed-off-by: Cédric Le Goater <clg@kaod.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* | | | | | | Merge tag 'locking-core-2021-08-30' of ↵Linus Torvalds2021-08-3020-1373/+3123
|\ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking and atomics updates from Thomas Gleixner: "The regular pile: - A few improvements to the mutex code - Documentation updates for atomics to clarify the difference between cmpxchg() and try_cmpxchg() and to explain the forward progress expectations. - Simplification of the atomics fallback generator - The addition of arch_atomic_long*() variants and generic arch_*() bitops based on them. - Add the missing might_sleep() invocations to the down*() operations of semaphores. The PREEMPT_RT locking core: - Scheduler updates to support the state preserving mechanism for 'sleeping' spin- and rwlocks on RT. This mechanism is carefully preserving the state of the task when blocking on a 'sleeping' spin- or rwlock and takes regular wake-ups targeted at the same task into account. The preserved or updated (via a regular wakeup) state is restored when the lock has been acquired. - Restructuring of the rtmutex code so it can be utilized and extended for the RT specific lock variants. - Restructuring of the ww_mutex code to allow sharing of the ww_mutex specific functionality for rtmutex based ww_mutexes. - Header file disentangling to allow substitution of the regular lock implementations with the PREEMPT_RT variants without creating an unmaintainable #ifdef mess. - Shared base code for the PREEMPT_RT specific rw_semaphore and rwlock implementations. Contrary to the regular rw_semaphores and rwlocks the PREEMPT_RT implementation is writer unfair because it is infeasible to do priority inheritance on multiple readers. Experience over the years has shown that real-time workloads are not the typical workloads which are sensitive to writer starvation. The alternative solution would be to allow only a single reader which has been tried and discarded as it is a major bottleneck especially for mmap_sem. Aside of that many of the writer starvation critical usage sites have been converted to a writer side mutex/spinlock and RCU read side protections in the past decade so that the issue is less prominent than it used to be. - The actual rtmutex based lock substitutions for PREEMPT_RT enabled kernels which affect mutex, ww_mutex, rw_semaphore, spinlock_t and rwlock_t. The spin/rw_lock*() functions disable migration across the critical section to preserve the existing semantics vs per-CPU variables. - Rework of the futex REQUEUE_PI mechanism to handle the case of early wake-ups which interleave with a re-queue operation to prevent the situation that a task would be blocked on both the rtmutex associated to the outer futex and the rtmutex based hash bucket spinlock. While this situation cannot happen on !RT enabled kernels the changes make the underlying concurrency problems easier to understand in general. As a result the difference between !RT and RT kernels is reduced to the handling of waiting for the critical section. !RT kernels simply spin-wait as before and RT kernels utilize rcu_wait(). - The substitution of local_lock for PREEMPT_RT with a spinlock which protects the critical section while staying preemptible. The CPU locality is established by disabling migration. The underlying concepts of this code have been in use in PREEMPT_RT for way more than a decade. The code has been refactored several times over the years and this final incarnation has been optimized once again to be as non-intrusive as possible, i.e. the RT specific parts are mostly isolated. It has been extensively tested in the 5.14-rt patch series and it has been verified that !RT kernels are not affected by these changes" * tag 'locking-core-2021-08-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (92 commits) locking/rtmutex: Return success on deadlock for ww_mutex waiters locking/rtmutex: Prevent spurious EDEADLK return caused by ww_mutexes locking/rtmutex: Dequeue waiter on ww_mutex deadlock locking/rtmutex: Dont dereference waiter lockless locking/semaphore: Add might_sleep() to down_*() family locking/ww_mutex: Initialize waiter.ww_ctx properly static_call: Update API documentation locking/local_lock: Add PREEMPT_RT support locking/spinlock/rt: Prepare for RT local_lock locking/rtmutex: Add adaptive spinwait mechanism locking/rtmutex: Implement equal priority lock stealing preempt: Adjust PREEMPT_LOCK_OFFSET for RT locking/rtmutex: Prevent lockdep false positive with PI futexes futex: Prevent requeue_pi() lock nesting issue on RT futex: Simplify handle_early_requeue_pi_wakeup() futex: Reorder sanity checks in futex_requeue() futex: Clarify comment in futex_requeue() futex: Restructure futex_requeue() futex: Correct the number of requeued waiters for PI futex: Remove bogus condition for requeue PI ...
| * | | | | | | locking/rtmutex: Return success on deadlock for ww_mutex waitersPeter Zijlstra2021-08-271-1/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ww_mutexes can legitimately cause a deadlock situation in the lock graph which is resolved afterwards by the wait/wound mechanics. The rtmutex chain walk can detect such a deadlock and returns EDEADLK which in turn skips the wait/wound mechanism and returns EDEADLK to the caller. That's wrong because both lock chains might get EDEADLK or the wrong waiter would back out. Detect that situation and return 'success' in case that the waiter which initiated the chain walk is a ww_mutex with context. This allows the wait/wound mechanics to resolve the situation according to the rules. [ tglx: Split it apart and added changelog ] Reported-by: Sebastian Siewior <bigeasy@linutronix.de> Fixes: add461325ec5 ("locking/rtmutex: Extend the rtmutex core to support ww_mutex") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/YSeWjCHoK4v5OcOt@hirez.programming.kicks-ass.net
| * | | | | | | locking/rtmutex: Prevent spurious EDEADLK return caused by ww_mutexesPeter Zijlstra2021-08-271-0/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | rtmutex based ww_mutexes can legitimately create a cycle in the lock graph which can be observed by a blocker which didn't cause the problem: P1: A, ww_A, ww_B P2: ww_B, ww_A P3: A P3 might therefore be trapped in the ww_mutex induced cycle and run into the lock depth limitation of rt_mutex_adjust_prio_chain() which returns -EDEADLK to the caller. Disable the deadlock detection walk when the chain walk observes a ww_mutex to prevent this looping. [ tglx: Split it apart and added changelog ] Reported-by: Sebastian Siewior <bigeasy@linutronix.de> Fixes: add461325ec5 ("locking/rtmutex: Extend the rtmutex core to support ww_mutex") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/YSeWjCHoK4v5OcOt@hirez.programming.kicks-ass.net
| * | | | | | | locking/rtmutex: Dequeue waiter on ww_mutex deadlockThomas Gleixner2021-08-251-1/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The rt_mutex based ww_mutex variant queues the new waiter first in the lock's rbtree before evaluating the ww_mutex specific conditions which might decide that the waiter should back out. This check and conditional exit happens before the waiter is enqueued into the PI chain. The failure handling at the call site assumes that the waiter, if it is the top most waiter on the lock, is queued in the PI chain and then proceeds to adjust the unmodified PI chain, which results in RB tree corruption. Dequeue the waiter from the lock waiter list in the ww_mutex error exit path to prevent this. Fixes: add461325ec5 ("locking/rtmutex: Extend the rtmutex core to support ww_mutex") Reported-by: Sebastian Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210825102454.042280541@linutronix.de
| * | | | | | | locking/rtmutex: Dont dereference waiter locklessThomas Gleixner2021-08-252-2/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The new rt_mutex_spin_on_onwer() loop checks whether the spinning waiter is still the top waiter on the lock by utilizing rt_mutex_top_waiter(), which is broken because that function contains a sanity check which dereferences the top waiter pointer to check whether the waiter belongs to the lock. That's wrong in the lockless spinwait case: CPU 0 CPU 1 rt_mutex_lock(lock) rt_mutex_lock(lock); queue(waiter0) waiter0 == rt_mutex_top_waiter(lock) rt_mutex_spin_on_onwer(lock, waiter0) { queue(waiter1) waiter1 == rt_mutex_top_waiter(lock) ... top_waiter = rt_mutex_top_waiter(lock) leftmost = rb_first_cached(&lock->waiters); -> signal dequeue(waiter1) destroy(waiter1) w = rb_entry(leftmost, ....) BUG_ON(w->lock != lock) <- UAF The BUG_ON() is correct for the case where the caller holds lock->wait_lock which guarantees that the leftmost waiter entry cannot vanish. For the lockless spinwait case it's broken. Create a new helper function which avoids the pointer dereference and just compares the leftmost entry pointer with current's waiter pointer to validate that currrent is still elegible for spinning. Fixes: 992caf7f1724 ("locking/rtmutex: Add adaptive spinwait mechanism") Reported-by: Sebastian Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210825102453.981720644@linutronix.de
| * | | | | | | locking/semaphore: Add might_sleep() to down_*() familyXiaoming Ni2021-08-201-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Semaphore is sleeping lock. Add might_sleep() to down*() family (with exception of down_trylock()) to detect atomic context sleep. Signed-off-by: Xiaoming Ni <nixiaoming@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20210809021215.19991-1-nixiaoming@huawei.com
| * | | | | | | locking/ww_mutex: Initialize waiter.ww_ctx properlySebastian Andrzej Siewior2021-08-201-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The consolidation of the debug code for mutex waiter intialization sets waiter::ww_ctx to a poison value unconditionally. For regular mutexes this is intended to catch the case where waiter_ww_ctx is dereferenced accidentally. For ww_mutex the poison value has to be overwritten either with a context pointer or NULL for ww_mutexes without context. The rework broke this as it made the store conditional on the context pointer instead of the argument which signals whether ww_mutex code should be compiled in or optiized out. As a result waiter::ww_ctx ends up with the poison pointer for contextless ww_mutexes which causes a later dereference of the poison pointer because it is != NULL. Use the build argument instead so for ww_mutex the poison value is always overwritten. Fixes: c0afb0ffc06e6 ("locking/ww_mutex: Gather mutex_waiter initialization") Reported-by: Guenter Roeck <linux@roeck-us.net> Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20210819193030.zpwrpvvrmy7xxxiy@linutronix.de
| * | | | | | | locking/spinlock/rt: Prepare for RT local_lockThomas Gleixner2021-08-171-2/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add the static and runtime initializer mechanics to support the RT variant of local_lock, which requires the lock type in the lockdep map to be set to LD_LOCK_PERCPU. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20210815211305.967526724@linutronix.de